Discussion:
[Samtools-help] Samtools view -m
James Bonfield
2017-01-10 15:19:52 UTC
Permalink
I discovered the samtools view -m option today, which is quite
*baffling*.

The help and man page both state:

-m INT only include reads with number of CIGAR operations consuming
query sequence >= INT [0]

So firstly there is some confusion on the wording. To me "number of
CIGAR operations" implies 10M is 1 and 3M2I4M1D1M is 5, but this is
not what it means; both are value 10 (the total number of query
sequence bases matched by a cigar operation).

I struggled for a while to work out why it's phrased this way and not
simply "length of mapped query". It is not legal in SAM to have a
CIGAR string and query sequence with mismatched lengths except for
unmapped data, and if we're explicitly stating "CIGAR operations
consuming query sequence" then we're simply counting the sequence
length via a very contorted fashion. The code even calls this option
"min_qlen" internally so it was clearly the intention to check length.

Or so I thought... The code actually does this (commit fbb4b135, Mar
24 2012 by Heng Li, message "filter on query sequence length (not aln
len)"):

+ if ((bam_cigar_type(bam_cigar_op(cigar[k]))&1) || bam_cigar_op(cigar[k]) == BAM_CHARD_CLIP)
+ qlen += bam_cigar_oplen(cigar[k]);

So here it is explicitly cigar ops that consume query sequence *OR*
are hard clips. Why? This sails completely against the help and man
page, as hard clips are quite evidentally not consuming query
sequence.

Is this attempting to mean "length of original query sequence, pre
clipping"? If so it should state this, but I'm wondering what the
actual intention was as it's not clear in the minimal commit message.
The message implies there were earlier commits and this is a fix ("not
aln len"), but I can't find them so I assume this was squashed at some
stage with the earlier rationale not making it in.

So two questions.

1) For Heng: what task was this code originally added for? Is it an
error that hard clips are considered as query bases? Either the code
needs fixing or the help message, but I am unsure which is in error.

2) For all: do you use this option, if so what for? Were you aware of
hard-clips being counted?

Regards,

James
--
James Bonfield (***@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
| Plurima gyrabant gymbolitare vabo;
A Staden Package developer: | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
Devon Ryan
2017-01-10 15:46:43 UTC
Permalink
I once attempted to use this option to quickly remove most non-spliced
alignments from an RNAseq dataset. The code snippet explains why that
attempt failed spectacularly. I hadn't a clue that this included
hard-clipping operations, though I see alignments with them so rarely
that that wouldn't have been an issue.

Devon
--
Devon Ryan, Ph.D.
Email: ***@dpryan.com
Data Manager/Bioinformatician
Max Planck Institute of Immunobiology and Epigenetics
Stübeweg 51
79108 Freiburg
Germany
Post by James Bonfield
I discovered the samtools view -m option today, which is quite
*baffling*.
-m INT only include reads with number of CIGAR operations consuming
query sequence >= INT [0]
So firstly there is some confusion on the wording. To me "number of
CIGAR operations" implies 10M is 1 and 3M2I4M1D1M is 5, but this is
not what it means; both are value 10 (the total number of query
sequence bases matched by a cigar operation).
I struggled for a while to work out why it's phrased this way and not
simply "length of mapped query". It is not legal in SAM to have a
CIGAR string and query sequence with mismatched lengths except for
unmapped data, and if we're explicitly stating "CIGAR operations
consuming query sequence" then we're simply counting the sequence
length via a very contorted fashion. The code even calls this option
"min_qlen" internally so it was clearly the intention to check length.
Or so I thought... The code actually does this (commit fbb4b135, Mar
24 2012 by Heng Li, message "filter on query sequence length (not aln
+ if ((bam_cigar_type(bam_cigar_op(cigar[k]))&1) || bam_cigar_op(cigar[k]) == BAM_CHARD_CLIP)
+ qlen += bam_cigar_oplen(cigar[k]);
So here it is explicitly cigar ops that consume query sequence *OR*
are hard clips. Why? This sails completely against the help and man
page, as hard clips are quite evidentally not consuming query
sequence.
Is this attempting to mean "length of original query sequence, pre
clipping"? If so it should state this, but I'm wondering what the
actual intention was as it's not clear in the minimal commit message.
The message implies there were earlier commits and this is a fix ("not
aln len"), but I can't find them so I assume this was squashed at some
stage with the earlier rationale not making it in.
So two questions.
1) For Heng: what task was this code originally added for? Is it an
error that hard clips are considered as query bases? Either the code
needs fixing or the help message, but I am unsure which is in error.
2) For all: do you use this option, if so what for? Were you aware of
hard-clips being counted?
Regards,
James
--
| Plurima gyrabant gymbolitare vabo;
A Staden Package developer: | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
------------------------------------------------------------------------------
Developer Access Program for Intel Xeon Phi Processors
Access to Intel Xeon Phi processor-based developer platforms.
With one year of Intel Parallel Studio XE.
Training and support from Colfax.
Order your platform today. http://sdm.link/xeonphi
_______________________________________________
Samtools-devel mailing list
https://lists.sourceforge.net/lists/listinfo/samtools-devel
Loading...