Tim Fennell
2010-06-24 13:28:52 UTC
Hi All,
As we've been writing the Java implementation of the BAM indexing code, we've run into a puzzling question. I feel like I should probably reduce this to a test case, but I'm hoping that I'm just missing something conceptual and I'm wrong.
The spec says: In the linear index, for each tiling 16384bp window on the reference, we record the smallest file offset of the alignments that start in the window.
The implication in the spec, and the reality in the code, is that when using the index one accumulates the list of chunks for all bins that are relevant to the query, and then removes from consideration any chunk whose end virtual file offset is before the offset stored in the linear index for the window containing the start of the query range.
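To make the filtering step described above concrete, here is a minimal Python sketch of it. The names (Chunk, linear_index_window, filter_chunks) are hypothetical and chosen only for illustration; they are not taken from the C or Java code. The linear index is assumed to be a simple list with one virtual file offset per 16384bp window.

```python
from typing import List, NamedTuple

WINDOW_SHIFT = 14  # 2**14 = 16384 bp per linear-index window


class Chunk(NamedTuple):
    start: int  # virtual file offset where the chunk begins
    end: int    # virtual file offset where the chunk ends


def linear_index_window(pos: int) -> int:
    """Return the index of the 16384bp window containing a 0-based position."""
    return pos >> WINDOW_SHIFT


def filter_chunks(chunks: List[Chunk], linear_index: List[int],
                  query_start: int) -> List[Chunk]:
    """Drop every chunk that ends before the linear-index offset of the
    window containing query_start (the rule as read from the spec above)."""
    window = linear_index_window(query_start)
    # Assumption: a window past the end of the index imposes no lower bound.
    min_offset = linear_index[window] if window < len(linear_index) else 0
    return [c for c in chunks if c.end >= min_offset]
```

For example, with linear_index = [0, 400], a query starting at position 16384 falls in window 1, so a chunk ending at virtual offset 200 would be discarded while one ending at 900 would be kept.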
Firstly: can someone confirm that I'm reading the specification correctly?
Secondly: the C BAM indexing code is a little too dense for me to really read and understand. Is this what it is actually storing in the linear index?
Thinking about this, it would seem that this would usually work reasonably well for short reads with no split alignments. But what happens when you have a very long alignment that starts much earlier and spans the query range? Its end virtual file offset will be very low because it will be stored relatively early in the file, and will be lower than the virtual file offset of the first alignment to start in the window containing the query interval.
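The scenario can be sketched with a toy model in Python. Everything here is hypothetical: the function names, the alignment coordinates, and the virtual file offsets are made up for illustration, and the "overlap" rule is only the alternative being asked about, not a statement of what the C code does. Alignments are (start, end, voffset) tuples with 0-based, half-open coordinates.

```python
WINDOW_SHIFT = 14  # 2**14 = 16384 bp per linear-index window


def build_linear_index(alignments, n_windows, rule):
    """Build a toy linear index.

    rule == "start":   record an alignment only in the window it starts in.
    rule == "overlap": record it in every window it starts in, spans, or ends in.
    """
    index = [None] * n_windows
    for start, end, voffset in alignments:
        first = start >> WINDOW_SHIFT
        last = first if rule == "start" else (end - 1) >> WINDOW_SHIFT
        for w in range(first, last + 1):
            if index[w] is None or voffset < index[w]:
                index[w] = voffset
    return index


def chunk_survives(chunk_end, index, query_start):
    """Apply the filter: keep a chunk unless it ends before the window's offset."""
    min_offset = index[query_start >> WINDOW_SHIFT] or 0
    return chunk_end >= min_offset


# One long alignment stored early in the file, one short alignment stored later.
toy_alignments = [
    (1000, 40000, 10),     # spans windows 0, 1 and 2; its chunk ends at offset 20
    (35000, 35100, 5000),  # starts and ends in window 2
]
start_index = build_linear_index(toy_alignments, 3, "start")
overlap_index = build_linear_index(toy_alignments, 3, "overlap")
```

With these made-up numbers, a query starting at position 36000 falls in window 2. Under the "start" rule the window's offset is 5000, so the chunk ending at offset 20 that holds the long alignment is filtered out even though the alignment covers the query. Under the "overlap" rule the window's offset is 10 and the chunk is kept.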
Shouldn't the linear index be storing the virtual file offset of the first alignment to start in, end in, or span each 16kbp region on the reference? What am I missing?
-t