Post by Brent Pedersen
2225 1 -2147483648 14896634174 936 628560
I'm wondering what that means. It was created with samtools 1.3.1
It means there is a bug, I suspect! Thanks for raising this.
The format fields are in the document, as you saw, and can also be seen in
the code at https://github.com/samtools/htslib/blob/develop/cram/cram_index.c#L568
The fields are: ref seq number (aka "tid" in BAM-world), ref seq start and
span (start+span-1 == end), file offset of the container start, offset of
the slice within the container, and slice size. These permit random access
from any given genome coordinate.
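
To make the six fields concrete, here is a minimal Python sketch that splits one index line into named fields and derives the end coordinate from start and span. The names `CraiEntry` and `parse_crai_line` are illustrative only and are not part of htslib; splitting on any whitespace is an assumption so that lines pasted into an email parse as well.

```python
from typing import NamedTuple


class CraiEntry(NamedTuple):
    """One line of a CRAM index, in the field order described above."""
    ref_id: int            # ref seq number ("tid" in BAM-world)
    start: int             # ref seq start
    span: int              # ref seq span
    container_offset: int  # file offset of container start
    slice_offset: int      # offset of slice within the container
    slice_size: int        # slice size

    @property
    def end(self) -> int:
        # start + span - 1 == end
        return self.start + self.span - 1


def parse_crai_line(line: str) -> CraiEntry:
    """Parse a whitespace-separated index line into a CraiEntry."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    return CraiEntry(*(int(f) for f in fields))


entry = parse_crai_line("35 88 36030 3185574033 1205 15021")
print(entry.ref_id, entry.start, entry.end)  # 35 88 36117
```

Applied to the reported line, the same parser gives a span of -2147483648, so the derived end coordinate is meaningless for that entry.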
-2147483648 is -2^31, also "INT_MIN" in C. This occurs in the
cram_index_build_multiref() function, which deals with indexing slices
where multiple references occur within the same container. That
method was added for handling excessively fragmented assemblies
(leading to potentially millions of containers), but is also used for
packing the tiny references together at the end of many human aligned
files. Were there several references around 2225 that also shared
14896634174 as the container offset?
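
A quick way to look for other entries like this is to scan the index text for negative spans and flag those equal to INT_MIN. The sketch below is a hypothetical helper (`find_bad_spans` is not an htslib or samtools function), and it assumes that a negative span is never legitimate.

```python
INT_MIN = -2**31  # -2147483648, the smallest 32-bit signed integer


def find_bad_spans(lines):
    """Return (line_number, ref_id, start, span, is_int_min) for negative spans."""
    bad = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        ref_id, start, span = (int(f) for f in fields[:3])
        if span < 0:
            bad.append((lineno, ref_id, start, span, span == INT_MIN))
    return bad


reported = ["2225 1 -2147483648 14896634174 936 628560"]
for lineno, ref_id, start, span, is_int_min in find_bad_spans(reported):
    print(f"line {lineno}: ref {ref_id} start {start} span {span} INT_MIN={is_int_min}")
```

As far as I know a .crai file is gzip-compressed text, so something like `gzip.open("in.cram.crai", "rt")` (file name hypothetical) should yield lines that can be passed straight to this function.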
It's unexpected to see INT_MIN make it through that code and out the
other side though! I'll need to study the code and work out how it
happened. It looks like it's somehow got references that aren't in
the headers, but I don't see how that can happen. If you have public
test data that causes this then it would be useful. If not, could you
tell me what the alignments are on this reference? I'm wondering if
it could happen if we have an unmapped but placed read as the only
read aligned to a reference. I'll do some tests...
James
PS. Normally for multi-ref containers I expect to see "span" filled
out correctly. E.g. in this example, where multiple references share the
same container/slice offset:
35 88 36030 3185574033 1205 15021
36 43 37439 3185590285 1044 63977
37 176 37857 3185590285 1044 63977
38 125 20358 3185590285 1044 63977
38 20404 18061 3185655334 1135 65872
39 103 38695 3185655334 1135 65872
40 175 39568 3185655334 1135 65872
41 7 6593 3185655334 1135 65872
41 6613 33242 3185722369 977 68659
42 40 39217 3185722369 977 68659
43 74 12165 3185722369 977 68659
43 12207 27760 3185792033 929 66582
44 60 19052 3185792033 929 66582
44 19131 21364 3185859572 1138 65580
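
To see which references share a container in the listing above, the entries can be grouped by the container offset (fourth field). This is only an illustrative Python sketch; `refs_per_container` is a made-up helper name, and the data are the lines quoted above.

```python
from collections import defaultdict

EXAMPLE = """\
35 88 36030 3185574033 1205 15021
36 43 37439 3185590285 1044 63977
37 176 37857 3185590285 1044 63977
38 125 20358 3185590285 1044 63977
38 20404 18061 3185655334 1135 65872
39 103 38695 3185655334 1135 65872
40 175 39568 3185655334 1135 65872
41 7 6593 3185655334 1135 65872
41 6613 33242 3185722369 977 68659
42 40 39217 3185722369 977 68659
43 74 12165 3185722369 977 68659
43 12207 27760 3185792033 929 66582
44 60 19052 3185792033 929 66582
44 19131 21364 3185859572 1138 65580
"""


def refs_per_container(text):
    """Map each container offset to the ref seq numbers indexed within it."""
    groups = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        ref_id, container_offset = int(fields[0]), int(fields[3])
        groups[container_offset].append(ref_id)
    return dict(groups)


for offset, refs in refs_per_container(EXAMPLE).items():
    kind = "multi-ref" if len(set(refs)) > 1 else "single-ref"
    print(offset, refs, kind)
```

For this listing, the containers at 3185590285, 3185655334, 3185722369 and 3185792033 each hold more than one reference, and every entry has a positive span.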
--
James Bonfield (***@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
| Plurima gyrabant gymbolitare vabo;
A Staden Package developer: | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.