Discussion:
[Samtools-help] Understanding the linear index
Tim Fennell
2010-06-24 13:28:52 UTC
Permalink
Hi All,

As we've been writing the Java implementation of the BAM indexing code we've run into a puzzling question. I feel like I should probably reduce this to a test-case, but I'm hoping that i'm just missing something conceptual and I'm wrong.

The spec says: In the linear index, for each tiling 16384bp window on the reference, we record the smallest file offset of the alignments that start in the window.

The implication in the spec, and the reality in the code, is that when using the index one accumulates the list of chunks for all bins that are relevant for the query, and then removes from consideration any chunk who's end virtual file offset is before the offset stored in the linear index for the start of the query range.

Firstly: can someone confirm that I'm reading the specification correctly?

Secondly: the C BAM indexing code is a little too dense for me to really read and understand it, is this what it is actually storing in the linear index?

Thinking about this, it would seem that this would usually work reasonably well for short reads with no split alignments. But what happens when you have a very long alignment that starts much earlier and spans the query range? It's end virtual file offset will be very low because it will be stored relatively early in the file, and will be lower than the virtual file offset of the first alignment to start in the bin containing the query interval.

Shouldn't the linear index be storing the virtual file offset of the first alignment to start in, end in, or span each 16kbp region on the reference? What am I missing?

-t
Heng Li
2010-06-24 14:32:19 UTC
Permalink
Post by Tim Fennell
Hi All,
As we've been writing the Java implementation of the BAM indexing code we've run into a puzzling question. I feel like I should probably reduce this to a test-case, but I'm hoping that i'm just missing something conceptual and I'm wrong.
The spec says: In the linear index, for each tiling 16384bp window on the reference, we record the smallest file offset of the alignments that start in the window.
The spec is wrong here. We should store the file offset of the leftmost
alignment that *overlaps* (not starts in) the 16kb window. The
implementation is different from what the spec is saying.
Post by Tim Fennell
The implication in the spec, and the reality in the code, is that when using the index one accumulates the list of chunks for all bins that are relevant for the query, and then removes from consideration any chunk who's end virtual file offset is before the offset stored in the linear index for the start of the query range.
Firstly: can someone confirm that I'm reading the specification correctly?
Secondly: the C BAM indexing code is a little too dense for me to really read and understand it, is this what it is actually storing in the linear index?
See the latest code from SVN. The old code is actually buggy. It does
not affect the correctness, but it may affect efficiency (in terms of #
lseek calls) in some cases.
Post by Tim Fennell
Thinking about this, it would seem that this would usually work reasonably well for short reads with no split alignments. But what happens when you have a very long alignment that starts much earlier and spans the query range? It's end virtual file offset will be very low because it will be stored relatively early in the file, and will be lower than the virtual file offset of the first alignment to start in the bin containing the query interval.
If there is an alignment spanning the entire chromosome, the linear
indexing will be useless. We will fall back completely to the binning
index, which is correct but not efficient.
Post by Tim Fennell
Shouldn't the linear index be storing the virtual file offset of the first alignment to start in, end in, or span each 16kbp region on the reference?
I believe this is what I intend to mean and what is implemented in
samtools: the first alignment to overlap or span each 16kbp region.

Heng
Post by Tim Fennell
What am I missing?
-t
------------------------------------------------------------------------------
ThinkGeek and WIRED's GeekDad team up for the Ultimate
GeekDad Father's Day Giveaway. ONE MASSIVE PRIZE to the
http://p.sf.net/sfu/thinkgeek-promo
_______________________________________________
Samtools-help mailing list
https://lists.sourceforge.net/lists/listinfo/samtools-help
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
Loading...