[Samtools-help] bams with malformed region names lead to variant calls using -l to be incorrect

Discussion:

Liyang Diao

2017-03-24 03:24:33 UTC

Dear all,

I have a large number of bam files, where the number of reference "genomes"
is very large, about 1M (bacterial marker gene alignments). A small
fraction of these genomes is poorly named, resulting in the following error
when I run mpileup:

Could not parse the header line: ##contig=<ID=BADNAMES>"

Since this was a small fraction of the references and I am only interested
in a preliminary exploratory analysis, I went ahead and looked into the
VCFs that were generated, assuming (wrongly?) that alignments to these
areas would simply be ignored.

What I found, however, was that the variants called are incorrect--for
example, I have high-confidence SNPs found in regions of zero coverage.

So I am wondering if there is an easy workaround to this problem, or if I
will have to perform realignments of the data, removing or renaming the
culprit references.

I found that, for some reason, using the -r POSITION flag in mpileup
appears to give reasonable results, but that -l produces bad results as
before, but through searching this help archive I found that -r cannot
accept multiple positions in file format.

Any help would be greatly appreciated!
Thanks

Thomas W. Blackwell

2017-03-24 11:48:10 UTC

Permalink

Liyang -

First, you might try samtools version 1.4 from github or
sourceforge ... but I suspect the problem will remain.

I suspect that you are wanting to call variants across multiple
.bam files, using the -b option of samtools mpileup. If that's
true, then by far the easiest option is to fix every .bam file
using samtools reheader.

If the list of ~1M sequence contigs is the same in every .bam file,
then extract a copy of the header from one of the .bam files using
samtools view -H file.bam > tmp.header,
use a text editor to change the bad contig names in file tmp.header,
make a separate directory for the new .bam files and then use a
foreach loop something like (using csh syntax):

cd ../new.dir
foreach file ( ../old.dir/*.bam )
samtools reheader -P tmp.header $file > `basename $file`
end

This simple version loses any header information that changes between
one .bam file and the next (such as sample and library names), but for
a quick look that may be good enough. There's a more complicated version
in which the file tmp.header contains only the header lines that stay the
same for every .bam file and you append the variable information before
each call to samtools reheader. (samtools reheader will read from a pipe.)

- tom blackwell -

Post by Liyang Diao
Dear all,
I have a large number of bam files, where the number of reference "genomes"
is very large, about 1M (bacterial marker gene alignments). A small
fraction of these genomes is poorly named, resulting in the following error
Could not parse the header line: ##contig=<ID=BADNAMES>"
Since this was a small fraction of the references and I am only interested
in a preliminary exploratory analysis, I went ahead and looked into the
VCFs that were generated, assuming (wrongly?) that alignments to these
areas would simply be ignored.
What I found, however, was that the variants called are incorrect--for
example, I have high-confidence SNPs found in regions of zero coverage.
So I am wondering if there is an easy workaround to this problem, or if I
will have to perform realignments of the data, removing or renaming the
culprit references.
I found that, for some reason, using the -r POSITION flag in mpileup
appears to give reasonable results, but that -l produces bad results as
before, but through searching this help archive I found that -r cannot
accept multiple positions in file format.
Any help would be greatly appreciated!
Thanks

Shuoguo Wang

2017-03-24 16:29:22 UTC

Permalink

Liyang,

Probably a good idea to check if the contigs in the header of *all* your
BAM files, matched the reference contigs you provided for mpileup.

I suspect if the BAMs have different headers, it could be an issue.

Also, are there a fixed set of "bad" contigs among all BAMs, or in
different BAMs the "bad" contigs might be different?

when the -I option failed, does it only include "good" contigs you tested
with -r?

hope this helps,

Shuoguo

==
SW

Post by Thomas W. Blackwell
Liyang -
First, you might try samtools version 1.4 from github or
sourceforge ... but I suspect the problem will remain.
I suspect that you are wanting to call variants across multiple
.bam files, using the -b option of samtools mpileup. If that's
true, then by far the easiest option is to fix every .bam file
using samtools reheader.
If the list of ~1M sequence contigs is the same in every .bam file,
then extract a copy of the header from one of the .bam files using
samtools view -H file.bam > tmp.header,
use a text editor to change the bad contig names in file tmp.header,
make a separate directory for the new .bam files and then use a
cd ../new.dir
foreach file ( ../old.dir/*.bam )
samtools reheader -P tmp.header $file > `basename $file`
end
This simple version loses any header information that changes between
one .bam file and the next (such as sample and library names), but for
a quick look that may be good enough. There's a more complicated version
in which the file tmp.header contains only the header lines that stay the
same for every .bam file and you append the variable information before
each call to samtools reheader. (samtools reheader will read from a pipe.)
- tom blackwell -

Post by Liyang Diao
Dear all,
I have a large number of bam files, where the number of reference

"genomes"

Post by Liyang Diao
is very large, about 1M (bacterial marker gene alignments). A small
fraction of these genomes is poorly named, resulting in the following

error

Post by Liyang Diao
Could not parse the header line: ##contig=<ID=BADNAMES>"
Since this was a small fraction of the references and I am only

interested

Post by Liyang Diao
in a preliminary exploratory analysis, I went ahead and looked into the
VCFs that were generated, assuming (wrongly?) that alignments to these
areas would simply be ignored.
What I found, however, was that the variants called are incorrect--for
example, I have high-confidence SNPs found in regions of zero coverage.
So I am wondering if there is an easy workaround to this problem, or if I
will have to perform realignments of the data, removing or renaming the
culprit references.
I found that, for some reason, using the -r POSITION flag in mpileup
appears to give reasonable results, but that -l produces bad results as
before, but through searching this help archive I found that -r cannot
accept multiple positions in file format.
Any help would be greatly appreciated!
Thanks

------------------------------------------------------------
------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Samtools-help mailing list
https://lists.sourceforge.net/lists/listinfo/samtools-help

Liyang Diao

2017-03-24 17:20:00 UTC

Permalink

Sorry, I forgot to copy the help listserve to note that Thomas's suggestion
did indeed solve my problem, text below, along with Thomas's response:

~~~~~~~~~~~~~~~~~~~
Dear Thomas,

Thank you for your advice! I was not aware of the reheader command. I was
able to correct the headers and at first glance at least I am no longer
seeing SNPs called in known zero-coverage regions.

If you wouldn't mind, out of curiosity, does reheader actually only replace
the header of the bam file? None of the "bad" regions had any reads aligned
to them, but if they did, would they be referred to in output by the
corrected names or the original names?

Thanks again for your help! It saved me a lot of time.
Best,
Liyang
~~~~~~~~~~~~~~~~~~~~~~
Liyang -

Glad this helped.

Yes, samtools reheader only replaces the header text. The binary data
content of the .bam file is unchanged. In .bam format, each reference
sequence contig is identified by an integer (perhaps zero based). When
converting to ascii sam format (or in mpileup) the string value shown for
the ith reference sequence contig is the "SN:" value from the ith "@SQ"
line from the header. So, when replacing the header, it's essential to
keep the "@SQ" lines in the same order. Otherwise, the association between
integers and strings gets mixed up. (These are all details that I didn't
want to worry you with before.)

I suspect that with the old .bam files, the header lines with malformed
values were simply ignored, so the string values chosen for later reference
sequence contigs would be off by the number of preceding malformed header
lines. So the variant calls were correct -- they simply weren't given the
correct reference sequence names. But that's just a guess. I haven't
tested whether it's correct.

Post by Thomas W. Blackwell
Liyang -
First, you might try samtools version 1.4 from github or sourceforge ...
but I suspect the problem will remain.
I suspect that you are wanting to call variants across multiple .bam
files, using the -b option of samtools mpileup. If that's true, then by
far the easiest option is to fix every .bam file using samtools reheader.
If the list of ~1M sequence contigs is the same in every .bam file, then
extract a copy of the header from one of the .bam files using
samtools view -H file.bam > tmp.header, use a text editor to
change the bad contig names in file tmp.header, make a separate directory
for the new .bam files and then use a foreach loop something like (using
cd ../new.dir
foreach file ( ../old.dir/*.bam )
samtools reheader -P tmp.header $file > `basename $file`
end
This simple version loses any header information that changes between one
.bam file and the next (such as sample and library names), but for a quick
look that may be good enough. There's a more complicated version in which
the file tmp.header contains only the header lines that stay the same for
every .bam file and you append the variable information before each call to
samtools reheader. (samtools reheader will read from a pipe.)
- tom blackwell -
Dear all,

Post by Liyang Diao
I have a large number of bam files, where the number of reference "genomes"
is very large, about 1M (bacterial marker gene alignments). A small
fraction of these genomes is poorly named, resulting in the following error
Could not parse the header line: ##contig=<ID=BADNAMES>"
Since this was a small fraction of the references and I am only interested
in a preliminary exploratory analysis, I went ahead and looked into the
VCFs that were generated, assuming (wrongly?) that alignments to these
areas would simply be ignored.
What I found, however, was that the variants called are incorrect--for
example, I have high-confidence SNPs found in regions of zero coverage.
So I am wondering if there is an easy workaround to this problem, or if I
will have to perform realignments of the data, removing or renaming the
culprit references.
I found that, for some reason, using the -r POSITION flag in mpileup
appears to give reasonable results, but that -l produces bad results as
before, but through searching this help archive I found that -r cannot
accept multiple positions in file format.
Any help would be greatly appreciated!
Thanks