Discussion:
[Samtools-help] FW: samtools sort for paired-end .bam cannot find chromosome name in text header
Holbrook J.
2017-01-03 17:05:07 UTC
Permalink
Resent after confirming my email to the subscribers list

From: "Holbrook J." <***@soton.ac.uk<mailto:***@soton.ac.uk>>
Date: Tuesday, 3 January 2017 at 16:21
To: "samtools-***@lists.sourceforge.net<mailto:samtools-***@lists.sourceforge.net>" <samtools-***@lists.sourceforge.net<mailto:samtools-***@lists.sourceforge.net>>
Subject: samtools sort for paired-end .bam cannot find chromosome name in text header

Dear Samtools help members

Happy 2017! I would be very grateful for your help:

I am trying to manipulate .bam files created by ernebs5 (http://erne.sourceforge.net) aligning against hg19.
I am running samtools 1.3.1
I have ~ 1.5 mil singleton and ~ 100 mil paired end reads from each sample (the singletons are when one of the pair failed QC).
For the singleton alignments, I was able to use Samtools to sort, index, filter, index again and calculate coverage relative to my .bed file.
However, I can not sort my paired-end read alignments.
I am running:
samtools sort -T /dev/shm/jostemp -@ 8 -m 4G -o sample1b_paired_sorted.bam Sample1b_unmasked.bam

I get an error message that starts:
[bam_sort_core] merging from 16 files……
[E::trans_tbl_add_sq] @SQ SN (chr1) found in binary header but not text header.
[E::trans_tbl_add_sq] @SQ SN (chr10) found in binary header but not text header.
…..
There is a line for every chromosome name in my file.

I did a diff for the headers between my singleton .bam file and my paired end .bam file and there was no difference for the @SQ lines.

There is enough space in the /dev/shm/jostemp/ path for the temp files and my HPC administrator says my resource usage is well under limits.
The flagstat for the input .bam file looks fine to my inexperienced eye and it converts fine to a .sam file which again looks OK.

I have posted this problem on BioStars: https://www.biostars.org/p/228119/#229683
I’ve also checked out this which seems similar but not the same: https://github.com/samtools/samtools/issues/548

Thanks for any suggestions

Jo
Robert Davies
2017-01-04 09:39:39 UTC
Permalink
Post by Holbrook J.
Dear Samtools help members
I am trying to manipulate .bam files created by ernebs5 (http://erne.sourceforge.net) aligning against hg19.
I am running samtools 1.3.1
I have ~ 1.5 mil singleton and ~ 100 mil paired end reads from each sample (the singletons are when one of the pair failed QC).
For the singleton alignments, I was able to use Samtools to sort, index, filter, index again and calculate coverage relative to my .bed file.
However, I can not sort my paired-end read alignments.
[bam_sort_core] merging from 16 files……
…..
There is a line for every chromosome name in my file.
I suspect that the difference is your singleton file is much smaller than
the paired ends one. If it's small enough to fit in memory, samtools will
sort it all in a single chunk and won't run the merge step (which is where
the header check happens). Did you see a "merging from ... files" message
for the singletons?
Post by Holbrook J.
There is enough space in the /dev/shm/jostemp/ path for the temp files and my HPC administrator says my resource usage is well under limits.
The flagstat for the input .bam file looks fine to my inexperienced eye and it converts fine to a .sam file which again looks OK.
I have posted this problem on BioStars: https://www.biostars.org/p/228119/#229683
I’ve also checked out this which seems similar but not the same: https://github.com/samtools/samtools/issues/548
Are you sure you're running samtools 1.3.1? The message about "found in
binary header but not text header" was removed when issue #548 was fixed,
so samtools 1.3.1 shouldn't be able to print the message above. Instead
it should silently fix the problem for you.

Ideally, erne should be putting the @SQ lines in the headers itself. If
it's not, you might want to get in touch with its developers and ask them
to add this feature.

Rob Davies ***@sanger.ac.uk
The Sanger Institute http://www.sanger.ac.uk/
Hinxton, Cambs., Tel. +44 (1223) 834244
CB10 1SA, U.K. Fax. +44 (1223) 494919
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
John Marshall
2017-01-04 10:09:01 UTC
Permalink
Post by Holbrook J.
I am trying to manipulate .bam files created by ernebs5 (http://erne.sourceforge.net) aligning against hg19.
I am running samtools 1.3.1
[...]
Post by Holbrook J.
[bam_sort_core] merging from 16 files……
This error message does not exist in samtools 1.3.1. So it is in fact an earlier version of samtools that is being run, and it would be instructive to use the samtools-1.3.1 you think you have installed

/full/path/to/desired/version/of/samtools sort -T etc etc

and see if this version accepts these files.
Post by Holbrook J.
I’ve also checked out this which seems similar but not the same: https://github.com/samtools/samtools/issues/548
Issue 548 is about BAM files generated with RSEM, which writes BAM files without textual headers, and led to the change in samtools 1.3.1 to just deal with the situation rather than producing the error messages above. It is not unlikely that ernebs5 similarly writes BAM files without textual @SQ headers and actually using samtools-1.3.1 will solve your problem.

It would be useful if you could post the first 100K or so of Sample1b_unmasked.bam somewhere we can download it and examine the raw headers.

TL;DR i.e. what Rob said. But note that when you output SAM with samtools view, samtools appends basic @SQ headers if there aren't already any. So Rob's `samtools view -H ... | grep '^@SQ'` will display some headers whether the file contains textual headers or not. Removing the grep will allow some educated guesses to be made about whether @SQ headers seen with samtools view are synthetic or really in the input file: if there are any other (e.g. @RG) headers *after* the block of @SQ headers, or if the @SQ headers have any fields beyond SN and LN, then they are definitely real.

John
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
Robert Davies
2017-01-04 10:24:05 UTC
Permalink
Post by John Marshall
TL;DR i.e. what Rob said. But note that when you output SAM with
headers whether the file contains textual headers or not. Removing the
headers seen with samtools view are synthetic or really in the input
they are definitely real.
Yes, I've just spotted this - samtools view fixes the problem too.

Looking at the bam file with 'zless' might be the quickest way of checking
for @SQ lines. The text header is fairly readable at the start.

Rob Davies ***@sanger.ac.uk
The Sanger Institute http://www.sanger.ac.uk/
Hinxton, Cambs., Tel. +44 (1223) 834244
CB10 1SA, U.K. Fax. +44 (1223) 494919
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
Holbrook J.
2017-01-04 10:58:22 UTC
Permalink
Dear Rob and John

You are right! When I use zless I don't think I have any @SQ headers (attached).
The headers that I saw using samtools view header or the grep command only had SN: and LN: fields.

So...
Erne is not outputting @SQ headers.
It just looks that way when I use samtools view because samtools automatically generates basic ones.
No @SQ headers causes samtools sort to fail, but only when the alignments a big enough to need to merge (hence the singletons worked but the paired did not)
Samtools 1.3.1 will not have this problem but I must be somehow running the biobuild samtools 1.3 on my HPC.

I have just set off samtools sort jobs specifying the direct path to samtools 1.3.1 - they should finish by the end of today.
If they work I will contact the erne authors to ask them to output @SQ headers in future.
If they don't work I will be back in touch.

Thank-you so much for this help. You can tell I am a newbie but I will try to use this mailing list wisely and sparingly.

Jo
-----Original Message-----
From: Robert Davies [mailto:***@sanger.ac.uk]
Sent: 04 January 2017 10:24
To: John Marshall <***@sanger.ac.uk>
Cc: Holbrook J. <***@soton.ac.uk>; samtools-***@lists.sourceforge.net
Subject: Re: [Samtools-help] samtools sort for paired-end .bam cannot find chromosome name in text header
Post by John Marshall
TL;DR i.e. what Rob said. But note that when you output SAM with
display some headers whether the file contains textual headers or not.
Removing the grep will allow some educated guesses to be made about
the input
@SQ headers, or if the @SQ headers have any fields beyond SN and LN,
then they are definitely real.
Yes, I've just spotted this - samtools view fixes the problem too.

Looking at the bam file with 'zless' might be the quickest way of checking for @SQ lines. The text header is fairly readable at the start.

Rob Davies ***@sanger.ac.uk
The Sanger Institute http://www.sanger.ac.uk/
Hinxton, Cambs., Tel. +44 (1223) 834244
CB10 1SA, U.K. Fax. +44 (1223) 494919


--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
Holbrook J.
2017-01-04 17:09:06 UTC
Permalink
Dear Robert, John and Sam

I just wanted to say thanks. The new jobs finished successfully. I will contact the erne authors about outputting some @SQ headers.

Jo

From: "Holbrook J." <***@soton.ac.uk<mailto:***@soton.ac.uk>>
Date: Wednesday, 4 January 2017 at 10:58
To: Robert Davies <***@sanger.ac.uk<mailto:***@sanger.ac.uk>>, John Marshall <***@sanger.ac.uk<mailto:***@sanger.ac.uk>>
Cc: "samtools-***@lists.sourceforge.net<mailto:samtools-***@lists.sourceforge.net>" <samtools-***@lists.sourceforge.net<mailto:samtools-***@lists.sourceforge.net>>
Subject: RE: [Samtools-help] samtools sort for paired-end .bam cannot find chromosome name in text header

Dear Rob and John

You are right! When I use zless I don’t think I have any @SQ headers (attached).
The headers that I saw using samtools view header or the grep command only had SN: and LN: fields.

So...
Erne is not outputting @SQ headers.
It just looks that way when I use samtools view because samtools automatically generates basic ones.
No @SQ headers causes samtools sort to fail, but only when the alignments a big enough to need to merge (hence the singletons worked but the paired did not)
Samtools 1.3.1 will not have this problem but I must be somehow running the biobuild samtools 1.3 on my HPC.

I have just set off samtools sort jobs specifying the direct path to samtools 1.3.1 - they should finish by the end of today.
If they work I will contact the erne authors to ask them to output @SQ headers in future.
If they don’t work I will be back in touch.

Thank-you so much for this help. You can tell I am a newbie but I will try to use this mailing list wisely and sparingly.

Jo
-----Original Message-----
From: Robert Davies [mailto:***@sanger.ac.uk]
Sent: 04 January 2017 10:24
To: John Marshall <***@sanger.ac.uk<mailto:***@sanger.ac.uk>>
Cc: Holbrook J. <***@soton.ac.uk<mailto:***@soton.ac.uk>>; samtools-***@lists.sourceforge.net<mailto:samtools-***@lists.sourceforge.net>
Subject: Re: [Samtools-help] samtools sort for paired-end .bam cannot find chromosome name in text header

On Wed, 4 Jan 2017, John Marshall wrote:


TL;DR i.e. what Rob said. But note that when you output SAM with
samtools view, samtools appends basic @SQ headers if there aren't
already any. So Rob's `samtools view -H ... | grep '^@SQ'` will
display some headers whether the file contains textual headers or not.
Removing the grep will allow some educated guesses to be made about
whether @SQ headers seen with samtools view are synthetic or really in
the input
file: if there are any other (e.g. @RG) headers *after* the block of
@SQ headers, or if the @SQ headers have any fields beyond SN and LN,
then they are definitely real.

Yes, I've just spotted this - samtools view fixes the problem too.

Looking at the bam file with 'zless' might be the quickest way of checking for @SQ lines. The text header is fairly readable at the start.

Rob Davies ***@sanger.ac.uk<mailto:***@sanger.ac.uk>
The Sanger Institute http://www.sanger.ac.uk/
Hinxton, Cambs., Tel. +44 (1223) 834244
CB10 1SA, U.K. Fax. +44 (1223) 494919


--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
Loading...