James Bonfield
2017-01-10 15:19:52 UTC
I discovered the samtools view -m option today, which is quite
*baffling*.
The help and man page both state:
-m INT only include reads with number of CIGAR operations consuming
query sequence >= INT [0]
So firstly there is some confusion on the wording. To me "number of
CIGAR operations" implies 10M is 1 and 3M2I4M1D1M is 5, but this is
not what it means; both are value 10 (the total number of query
sequence bases matched by a cigar operation).
I struggled for a while to work out why it's phrased this way and not
simply "length of mapped query". It is not legal in SAM to have a
CIGAR string and query sequence with mismatched lengths except for
unmapped data, and if we're explicitly stating "CIGAR operations
consuming query sequence" then we're simply counting the sequence
length via a very contorted fashion. The code even calls this option
"min_qlen" internally so it was clearly the intention to check length.
Or so I thought... The code actually does this (commit fbb4b135, Mar
24 2012 by Heng Li, message "filter on query sequence length (not aln
len)"):
+ if ((bam_cigar_type(bam_cigar_op(cigar[k]))&1) || bam_cigar_op(cigar[k]) == BAM_CHARD_CLIP)
+ qlen += bam_cigar_oplen(cigar[k]);
So here it is explicitly cigar ops that consume query sequence *OR*
are hard clips. Why? This sails completely against the help and man
page, as hard clips are quite evidentally not consuming query
sequence.
Is this attempting to mean "length of original query sequence, pre
clipping"? If so it should state this, but I'm wondering what the
actual intention was as it's not clear in the minimal commit message.
The message implies there were earlier commits and this is a fix ("not
aln len"), but I can't find them so I assume this was squashed at some
stage with the earlier rationale not making it in.
So two questions.
1) For Heng: what task was this code originally added for? Is it an
error that hard clips are considered as query bases? Either the code
needs fixing or the help message, but I am unsure which is in error.
2) For all: do you use this option, if so what for? Were you aware of
hard-clips being counted?
Regards,
James
--
James Bonfield (***@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
| Plurima gyrabant gymbolitare vabo;
A Staden Package developer: | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.
*baffling*.
The help and man page both state:
-m INT only include reads with number of CIGAR operations consuming
query sequence >= INT [0]
So firstly there is some confusion on the wording. To me "number of
CIGAR operations" implies 10M is 1 and 3M2I4M1D1M is 5, but this is
not what it means; both are value 10 (the total number of query
sequence bases matched by a cigar operation).
I struggled for a while to work out why it's phrased this way and not
simply "length of mapped query". It is not legal in SAM to have a
CIGAR string and query sequence with mismatched lengths except for
unmapped data, and if we're explicitly stating "CIGAR operations
consuming query sequence" then we're simply counting the sequence
length via a very contorted fashion. The code even calls this option
"min_qlen" internally so it was clearly the intention to check length.
Or so I thought... The code actually does this (commit fbb4b135, Mar
24 2012 by Heng Li, message "filter on query sequence length (not aln
len)"):
+ if ((bam_cigar_type(bam_cigar_op(cigar[k]))&1) || bam_cigar_op(cigar[k]) == BAM_CHARD_CLIP)
+ qlen += bam_cigar_oplen(cigar[k]);
So here it is explicitly cigar ops that consume query sequence *OR*
are hard clips. Why? This sails completely against the help and man
page, as hard clips are quite evidentally not consuming query
sequence.
Is this attempting to mean "length of original query sequence, pre
clipping"? If so it should state this, but I'm wondering what the
actual intention was as it's not clear in the minimal commit message.
The message implies there were earlier commits and this is a fix ("not
aln len"), but I can't find them so I assume this was squashed at some
stage with the earlier rationale not making it in.
So two questions.
1) For Heng: what task was this code originally added for? Is it an
error that hard clips are considered as query bases? Either the code
needs fixing or the help message, but I am unsure which is in error.
2) For all: do you use this option, if so what for? Were you aware of
hard-clips being counted?
Regards,
James
--
James Bonfield (***@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
| Plurima gyrabant gymbolitare vabo;
A Staden Package developer: | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.