[Samtools-help] samtools merge issue

Discussion:

Margres, Mark J

2017-05-12 16:08:13 UTC

Hello,

I am having an issue with merging different bam assemblies using samtools. In my pipeline, I merge reads using flash2, and then perform reference-based assemblies using bwa, one assembly uses just the merged reads and a second assembly uses just the unmerged reads. I then attempt to merge the assemblies using samtools and get the following output:

[W::bam_merge_core2] No @HD tag found.

I am assuming this is due to the unmerged read assembly having a different header than the merged reads, but I am not sure why (and also not sure if this is a bwa issue or a samtools issue). Is there a way for me to merge and then sort the assemblies with maintaining the correct header tags? I have the original assemblies to work with, but have assembled abut a dozen genomes and would prefer to not re-assemble if possible. Code is below. Thanks.

# Run BWA to map PE samples to reference genome
# -t number of threads -R read group header
logFile = jp(resultsDir, sample + '_mapping.log')
cmd = ' '.join(["/home/mark.margres/bwa-0.7.15/bwa mem -t 50 -R '@RG\tID:bwa\tSM:" + sample + "\tPL:ILLUMINA'",
bwaIndex, jp(resultsDir, sample + "_cleaned_PE1.fastq.gz"),
jp(resultsDir, sample + "_cleaned_PE2.fastq.gz"), "| samtools view -bS -@ 50 -o", jp(bamFolder, sample + "PE.bam"),
"2>", logFile])
log(cmd, logCommands)
os.system(cmd)
# Run BWA to map SE samples to reference genome
cmd = ' '.join(["/home/mark.margres/bwa-0.7.15/bwa mem -t 50 -R '@RG\tID:bwa\tSM:" + sample + "\tPL:ILLUMINA'",
bwaIndex, jp(resultsDir, sample + "_cleaned_SE.fastq.gz"), "| samtools view -bS -@ 50 -o", jp(bamFolder, sample + "SE.bam"),
"2>>", logFile])
log(cmd, logCommands)
os.system(cmd)
# merge and sort
cmd = ' '.join(['samtools merge -c', jp(bamFolder, sample + ".bam"), jp(bamFolder, sample + "PE.bam"), jp(bamFolder, sample + "SE.bam")])
log(cmd, logCommands)
os.system(cmd)
# sort bam file; -@ number of threads
cmd = ' '.join(['samtools sort -o', jp(bamFolder, sample) + "sorted.bam", ' -@ 50', jp(bamFolder, sample + ".bam")])
log(cmd, logCommands)
os.system(cmd)
# Mark and remove PCR duplicates
cmd = ' '.join([picard + "MarkDuplicates INPUT=" + jp(bamFolder, sample + "sorted.bam"), ' OUTPUT=' + jp(bamFolder, sample + "_markdup.bam"),
' METRICS_FILE=' + jp(bamFolder, sample + ".metrics"), ' REMOVE_DUPLICATES=true ',
' ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT', '>>', logFile, '2>&1'])
log(cmd, logCommands)
os.system(cmd)
# Index bam file:
cmd = ' '.join(['samtools index', jp(bamFolder, sample) + "_markdup.bam"])

----------------------
Mark J. Margres, Ph.D.
Postdoc, Storfer Lab
School of Biological Sciences
Washington State University
Pullman, WA 99164-4236

Thomas W. Blackwell

2017-05-12 16:50:46 UTC

Permalink

Mark -

Just guessing: One may need to specify "merge -f" in case an
earlier invocation of the command has created an empty file with
the path generated by "jp(bamFolder, sample + ".bam")". Also may
need to specify "merge -n" if, in fact, the output from bwa is
already properly name-sorted. Again, just guessing.

- tom blackwell -

On Fri, 12 May 2017, Margres, Mark J wrote:

> Hello,
>
> I am having an issue with merging different bam assemblies using samtools. In my pipeline, I merge reads using flash2, and then perform reference-based assemblies using bwa, one assembly uses just the merged reads and a second assembly uses just the unmerged reads. I then attempt to merge the assemblies using samtools and get the following output:
>
>
> [W::bam_merge_core2] No @HD tag found.
>
> I am assuming this is due to the unmerged read assembly having a different header than the merged reads, but I am not sure why (and also not sure if this is a bwa issue or a samtools issue). Is there a way for me to merge and then sort the assemblies with maintaining the correct header tags? I have the original assemblies to work with, but have assembled abut a dozen genomes and would prefer to not re-assemble if possible. Code is below. Thanks.
>
> # Run BWA to map PE samples to reference genome
> # -t number of threads -R read group header
> logFile = jp(resultsDir, sample + '_mapping.log')
> cmd = ' '.join(["/home/mark.margres/bwa-0.7.15/bwa mem -t 50 -R '@RG\tID:bwa\tSM:" + sample + "\tPL:ILLUMINA'",
> bwaIndex, jp(resultsDir, sample + "_cleaned_PE1.fastq.gz"),
> jp(resultsDir, sample + "_cleaned_PE2.fastq.gz"), "| samtools view -bS -@ 50 -o", jp(bamFolder, sample + "PE.bam"),
> "2>", logFile])
> log(cmd, logCommands)
> os.system(cmd)
> # Run BWA to map SE samples to reference genome
> cmd = ' '.join(["/home/mark.margres/bwa-0.7.15/bwa mem -t 50 -R '@RG\tID:bwa\tSM:" + sample + "\tPL:ILLUMINA'",
> bwaIndex, jp(resultsDir, sample + "_cleaned_SE.fastq.gz"), "| samtools view -bS -@ 50 -o", jp(bamFolder, sample + "SE.bam"),
> "2>>", logFile])
> log(cmd, logCommands)
> os.system(cmd)
> # merge and sort
> cmd = ' '.join(['samtools merge -c', jp(bamFolder, sample + ".bam"), jp(bamFolder, sample + "PE.bam"), jp(bamFolder, sample + "SE.bam")])
> log(cmd, logCommands)
> os.system(cmd)
> # sort bam file; -@ number of threads
> cmd = ' '.join(['samtools sort -o', jp(bamFolder, sample) + "sorted.bam", ' -@ 50', jp(bamFolder, sample + ".bam")])
> log(cmd, logCommands)
> os.system(cmd)
> # Mark and remove PCR duplicates
> cmd = ' '.join([picard + "MarkDuplicates INPUT=" + jp(bamFolder, sample + "sorted.bam"), ' OUTPUT=' + jp(bamFolder, sample + "_markdup.bam"),
> ' METRICS_FILE=' + jp(bamFolder, sample + ".metrics"), ' REMOVE_DUPLICATES=true ',
> ' ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT', '>>', logFile, '2>&1'])
> log(cmd, logCommands)
> os.system(cmd)
> # Index bam file:
> cmd = ' '.join(['samtools index', jp(bamFolder, sample) + "_markdup.bam"])
>
> ----------------------
> Mark J. Margres, Ph.D.
> Postdoc, Storfer Lab
> School of Biological Sciences
> Washington State University
> Pullman, WA 99164-4236
>

Margres, Mark J

2017-05-12 17:18:26 UTC

Permalink

Hi Tom,

The "sample + .bam" file was not called until that command, so it should not be an empty file. As you can see below, the alignments appear to be sorted by chromosome, not read, so I think “merge -c” is appropriate. Do I need to sort prior to merging?

Input file one for merging:

[***@login-p1n01 02-Mapped]$ samtools view -H 165499PE.bam
@SQ SN:chr1_GL834412_random LN:2180112
@SQ SN:chr1_GL834413_random LN:373546
@SQ SN:chr1_GL834414_random LN:1075293
@SQ SN:chr1_GL834415_random LN:174856
@SQ SN:chr1_GL834416_random LN:817733
@SQ SN:chr1_GL834417_random LN:3669695
@SQ SN:chr1_GL834418_random LN:1117890
@SQ SN:chr1_GL834419_random LN:2046010

Input file two for merging:

[***@login-p1n01 02-Mapped]$ samtools view -H 165499SE.bam
@SQ SN:chr1_GL834412_random LN:2180112
@SQ SN:chr1_GL834413_random LN:373546
@SQ SN:chr1_GL834414_random LN:1075293
@SQ SN:chr1_GL834415_random LN:174856
@SQ SN:chr1_GL834416_random LN:817733
@SQ SN:chr1_GL834417_random LN:3669695
@SQ SN:chr1_GL834418_random LN:1117890
@SQ SN:chr1_GL834419_random LN:2046010

Merged file:

[***@login-p1n01 02-Mapped]$ samtools view -H 165499.bam
@SQ SN:chr1_GL834412_random LN:2180112
@SQ SN:chr1_GL834413_random LN:373546
@SQ SN:chr1_GL834414_random LN:1075293
@SQ SN:chr1_GL834415_random LN:174856
@SQ SN:chr1_GL834416_random LN:817733
@SQ SN:chr1_GL834417_random LN:3669695
@SQ SN:chr1_GL834418_random LN:1117890
@SQ SN:chr1_GL834419_random LN:2046010

Sorted file:

[***@login-p1n01 02-Mapped]$ samtools view -H 165499sorted.bam
@HD VN:1.3 SO:coordinate
@SQ SN:chr1_GL834412_random LN:2180112
@SQ SN:chr1_GL834413_random LN:373546
@SQ SN:chr1_GL834414_random LN:1075293
@SQ SN:chr1_GL834415_random LN:174856
@SQ SN:chr1_GL834416_random LN:817733
@SQ SN:chr1_GL834417_random LN:3669695
@SQ SN:chr1_GL834418_random LN:1117890
@SQ SN:chr1_GL834419_random LN:2046010

Thanks.

Mark

----------------------
Mark J. Margres, Ph.D.
Postdoc, Storfer Lab
School of Biological Sciences
Washington State University
Pullman, WA 99164-4236

________________________________________
From: Thomas W. Blackwell [***@umich.edu]
Sent: Friday, May 12, 2017 9:50 AM
To: Margres, Mark J
Cc: samtools-***@lists.sourceforge.net
Subject: Re: [Samtools-help] samtools merge issue

Mark -

Just guessing: One may need to specify "merge -f" in case an
earlier invocation of the command has created an empty file with
the path generated by "jp(bamFolder, sample + ".bam")". Also may
need to specify "merge -n" if, in fact, the output from bwa is
already properly name-sorted. Again, just guessing.

- tom blackwell -

On Fri, 12 May 2017, Margres, Mark J wrote:

> Hello,
>
> I am having an issue with merging different bam assemblies using samtools. In my pipeline, I merge reads using flash2, and then perform reference-based assemblies using bwa, one assembly uses just the merged reads and a second assembly uses just the unmerged reads. I then attempt to merge the assemblies using samtools and get the following output:
>
>
> [W::bam_merge_core2] No @HD tag found.
>
> I am assuming this is due to the unmerged read assembly having a different header than the merged reads, but I am not sure why (and also not sure if this is a bwa issue or a samtools issue). Is there a way for me to merge and then sort the assemblies with maintaining the correct header tags? I have the original assemblies to work with, but have assembled abut a dozen genomes and would prefer to not re-assemble if possible. Code is below. Thanks.
>
> # Run BWA to map PE samples to reference genome
> # -t number of threads -R read group header
> logFile = jp(resultsDir, sample + '_mapping.log')
> cmd = ' '.join(["/home/mark.margres/bwa-0.7.15/bwa mem -t 50 -R '@RG\tID:bwa\tSM:" + sample + "\tPL:ILLUMINA'",
> bwaIndex, jp(resultsDir, sample + "_cleaned_PE1.fastq.gz"),
> jp(resultsDir, sample + "_cleaned_PE2.fastq.gz"), "| samtools view -bS -@ 50 -o", jp(bamFolder, sample + "PE.bam"),
> "2>", logFile])
> log(cmd, logCommands)
> os.system(cmd)
> # Run BWA to map SE samples to reference genome
> cmd = ' '.join(["/home/mark.margres/bwa-0.7.15/bwa mem -t 50 -R '@RG\tID:bwa\tSM:" + sample + "\tPL:ILLUMINA'",
> bwaIndex, jp(resultsDir, sample + "_cleaned_SE.fastq.gz"), "| samtools view -bS -@ 50 -o", jp(bamFolder, sample + "SE.bam"),
> "2>>", logFile])
> log(cmd, logCommands)
> os.system(cmd)
> # merge and sort
> cmd = ' '.join(['samtools merge -c', jp(bamFolder, sample + ".bam"), jp(bamFolder, sample + "PE.bam"), jp(bamFolder, sample + "SE.bam")])
> log(cmd, logCommands)
> os.system(cmd)
> # sort bam file; -@ number of threads
> cmd = ' '.join(['samtools sort -o', jp(bamFolder, sample) + "sorted.bam", ' -@ 50', jp(bamFolder, sample + ".bam")])
> log(cmd, logCommands)
> os.system(cmd)
> # Mark and remove PCR duplicates
> cmd = ' '.join([picard + "MarkDuplicates INPUT=" + jp(bamFolder, sample + "sorted.bam"), ' OUTPUT=' + jp(bamFolder, sample + "_markdup.bam"),
> ' METRICS_FILE=' + jp(bamFolder, sample + ".metrics"), ' REMOVE_DUPLICATES=true ',
> ' ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT', '>>', logFile, '2>&1'])
> log(cmd, logCommands)
> os.system(cmd)
> # Index bam file:
> cmd = ' '.join(['samtools index', jp(bamFolder, sample) + "_markdup.bam"])
>
> ----------------------
> Mark J. Margres, Ph.D.
> Postdoc, Storfer Lab
> School of Biological Sciences
> Washington State University
> Pullman, WA 99164-4236
>

Thomas W. Blackwell

2017-05-12 17:34:47 UTC

Permalink

Mark -

"Sort order" refers to the ordering of records for individual sequence reads
in the body of a .bam file, not to the ordering of the reference sequence
contigs in the header. Bwa, by default, preserves the ordering of sequence
reads that's found in the input .fastq file(s). It's customary to sort each
.bam file separately into chromosome coordinate order before merging. In
that case, the default without "-n" is correct.

Separately, why do none of your *PE.bam or *SE.bam files begin with an
"@HD" line before the first "@SQ" ? I woould expect samtools to supply this.

- tom blackwell -

On Fri, 12 May 2017, Margres, Mark J wrote:

> Hi Tom,
>
> The "sample + .bam" file was not called until that command, so it should not be an empty file. As you can see below, the alignments appear to be sorted by chromosome, not read, so I think ?merge -c? is appropriate. Do I need to sort prior to merging?
>
> Input file one for merging:
>
> [***@login-p1n01 02-Mapped]$ samtools view -H 165499PE.bam
> @SQ SN:chr1_GL834412_random LN:2180112
> @SQ SN:chr1_GL834413_random LN:373546
> @SQ SN:chr1_GL834414_random LN:1075293
> @SQ SN:chr1_GL834415_random LN:174856
> @SQ SN:chr1_GL834416_random LN:817733
> @SQ SN:chr1_GL834417_random LN:3669695
> @SQ SN:chr1_GL834418_random LN:1117890
> @SQ SN:chr1_GL834419_random LN:2046010
>
> Input file two for merging:
>
> [***@login-p1n01 02-Mapped]$ samtools view -H 165499SE.bam
> @SQ SN:chr1_GL834412_random LN:2180112
> @SQ SN:chr1_GL834413_random LN:373546
> @SQ SN:chr1_GL834414_random LN:1075293
> @SQ SN:chr1_GL834415_random LN:174856
> @SQ SN:chr1_GL834416_random LN:817733
> @SQ SN:chr1_GL834417_random LN:3669695
> @SQ SN:chr1_GL834418_random LN:1117890
> @SQ SN:chr1_GL834419_random LN:2046010
>
> Merged file:
>
> [***@login-p1n01 02-Mapped]$ samtools view -H 165499.bam
> @SQ SN:chr1_GL834412_random LN:2180112
> @SQ SN:chr1_GL834413_random LN:373546
> @SQ SN:chr1_GL834414_random LN:1075293
> @SQ SN:chr1_GL834415_random LN:174856
> @SQ SN:chr1_GL834416_random LN:817733
> @SQ SN:chr1_GL834417_random LN:3669695
> @SQ SN:chr1_GL834418_random LN:1117890
> @SQ SN:chr1_GL834419_random LN:2046010
>
> Sorted file:
>
> [***@login-p1n01 02-Mapped]$ samtools view -H 165499sorted.bam
> @HD VN:1.3 SO:coordinate
> @SQ SN:chr1_GL834412_random LN:2180112
> @SQ SN:chr1_GL834413_random LN:373546
> @SQ SN:chr1_GL834414_random LN:1075293
> @SQ SN:chr1_GL834415_random LN:174856
> @SQ SN:chr1_GL834416_random LN:817733
> @SQ SN:chr1_GL834417_random LN:3669695
> @SQ SN:chr1_GL834418_random LN:1117890
> @SQ SN:chr1_GL834419_random LN:2046010
>
> Thanks.
>
> Mark
>
> ----------------------
> Mark J. Margres, Ph.D.
> Postdoc, Storfer Lab
> School of Biological Sciences
> Washington State University
> Pullman, WA 99164-4236
>
> ________________________________________
> From: Thomas W. Blackwell [***@umich.edu]
> Sent: Friday, May 12, 2017 9:50 AM
> To: Margres, Mark J
> Cc: samtools-***@lists.sourceforge.net
> Subject: Re: [Samtools-help] samtools merge issue
>
> Mark -
>
> Just guessing: One may need to specify "merge -f" in case an
> earlier invocation of the command has created an empty file with
> the path generated by "jp(bamFolder, sample + ".bam")". Also may
> need to specify "merge -n" if, in fact, the output from bwa is
> already properly name-sorted. Again, just guessing.
>
> - tom blackwell -
>
> On Fri, 12 May 2017, Margres, Mark J wrote:
>
>> Hello,
>>
>> I am having an issue with merging different bam assemblies using samtools. In my pipeline, I merge reads using flash2, and then perform reference-based assemblies using bwa, one assembly uses just the merged reads and a second assembly uses just the unmerged reads. I then attempt to merge the assemblies using samtools and get the following output:
>>
>>
>> [W::bam_merge_core2] No @HD tag found.
>>
>> I am assuming this is due to the unmerged read assembly having a different header than the merged reads, but I am not sure why (and also not sure if this is a bwa issue or a samtools issue). Is there a way for me to merge and then sort the assemblies with maintaining the correct header tags? I have the original assemblies to work with, but have assembled abut a dozen genomes and would prefer to not re-assemble if possible. Code is below. Thanks.
>>
>> # Run BWA to map PE samples to reference genome
>> # -t number of threads -R read group header
>> logFile = jp(resultsDir, sample + '_mapping.log')
>> cmd = ' '.join(["/home/mark.margres/bwa-0.7.15/bwa mem -t 50 -R '@RG\tID:bwa\tSM:" + sample + "\tPL:ILLUMINA'",
>> bwaIndex, jp(resultsDir, sample + "_cleaned_PE1.fastq.gz"),
>> jp(resultsDir, sample + "_cleaned_PE2.fastq.gz"), "| samtools view -bS -@ 50 -o", jp(bamFolder, sample + "PE.bam"),
>> "2>", logFile])
>> log(cmd, logCommands)
>> os.system(cmd)
>> # Run BWA to map SE samples to reference genome
>> cmd = ' '.join(["/home/mark.margres/bwa-0.7.15/bwa mem -t 50
-R '@RG\tID:bwa\tSM:" + sample + "\tPL:ILLUMINA'",
>> bwaIndex, jp(resultsDir, sample + "_cleaned_SE.fastq.gz"), "| samtools view -bS -@ 50 -o", jp(bamFolder, sample + "SE.bam"),
>> "2>>", logFile])
>> log(cmd, logCommands)
>> os.system(cmd)
>> # merge and sort
>> cmd = ' '.join(['samtools merge -c', jp(bamFolder, sample + ".bam"), jp(bamFolder, sample + "PE.bam"), jp(bamFolder, sample + "SE.bam")])
>> log(cmd, logCommands)
>> os.system(cmd)
>> # sort bam file; -@ number of threads
>> cmd = ' '.join(['samtools sort -o', jp(bamFolder, sample) + "sorted.bam", ' -@ 50', jp(bamFolder, sample + ".bam")])
>> log(cmd, logCommands)
>> os.system(cmd)
>> # Mark and remove PCR duplicates
>> cmd = ' '.join([picard + "MarkDuplicates INPUT=" + jp(bamFolder, sample + "sorted.bam"), ' OUTPUT=' + jp(bamFolder, sample + "_markdup.bam"),
>> ' METRICS_FILE=' + jp(bamFolder, sample + ".metrics"), ' REMOVE_DUPLICATES=true ',
>> ' ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT', '>>', logFile, '2>&1'])
>> log(cmd, logCommands)
>> os.system(cmd)
>> # Index bam file:
>> cmd = ' '.join(['samtools index', jp(bamFolder, sample) + "_markdup.bam"])
>>
>> ----------------------
>> Mark J. Margres, Ph.D.
>> Postdoc, Storfer Lab
>> School of Biological Sciences
>> Washington State University
>> Pullman, WA 99164-4236
>>

Margres, Mark J

2017-05-12 17:50:39 UTC

Permalink

Tom,

I am not entirely sure. I will try sorting each file prior to merging to see if this fixes the issue. Thanks.

Mark

----------------------
Mark J. Margres, Ph.D.
Postdoc, Storfer Lab
School of Biological Sciences
Washington State University
Pullman, WA 99164-4236

________________________________________
From: Thomas W. Blackwell [***@umich.edu]
Sent: Friday, May 12, 2017 10:34 AM
To: Margres, Mark J
Cc: samtools-***@lists.sourceforge.net
Subject: RE: [Samtools-help] samtools merge issue

Mark -

"Sort order" refers to the ordering of records for individual sequence reads
in the body of a .bam file, not to the ordering of the reference sequence
contigs in the header. Bwa, by default, preserves the ordering of sequence
reads that's found in the input .fastq file(s). It's customary to sort each
.bam file separately into chromosome coordinate order before merging. In
that case, the default without "-n" is correct.

Separately, why do none of your *PE.bam or *SE.bam files begin with an
"@HD" line before the first "@SQ" ? I woould expect samtools to supply this.

- tom blackwell -

On Fri, 12 May 2017, Margres, Mark J wrote:

> Hi Tom,
>
> The "sample + .bam" file was not called until that command, so it should not be an empty file. As you can see below, the alignments appear to be sorted by chromosome, not read, so I think ?merge -c? is appropriate. Do I need to sort prior to merging?
>
> Input file one for merging:
>
> [***@login-p1n01 02-Mapped]$ samtools view -H 165499PE.bam
> @SQ SN:chr1_GL834412_random LN:2180112
> @SQ SN:chr1_GL834413_random LN:373546
> @SQ SN:chr1_GL834414_random LN:1075293
> @SQ SN:chr1_GL834415_random LN:174856
> @SQ SN:chr1_GL834416_random LN:817733
> @SQ SN:chr1_GL834417_random LN:3669695
> @SQ SN:chr1_GL834418_random LN:1117890
> @SQ SN:chr1_GL834419_random LN:2046010
>
> Input file two for merging:
>
> [***@login-p1n01 02-Mapped]$ samtools view -H 165499SE.bam
> @SQ SN:chr1_GL834412_random LN:2180112
> @SQ SN:chr1_GL834413_random LN:373546
> @SQ SN:chr1_GL834414_random LN:1075293
> @SQ SN:chr1_GL834415_random LN:174856
> @SQ SN:chr1_GL834416_random LN:817733
> @SQ SN:chr1_GL834417_random LN:3669695
> @SQ SN:chr1_GL834418_random LN:1117890
> @SQ SN:chr1_GL834419_random LN:2046010
>
> Merged file:
>
> [***@login-p1n01 02-Mapped]$ samtools view -H 165499.bam
> @SQ SN:chr1_GL834412_random LN:2180112
> @SQ SN:chr1_GL834413_random LN:373546
> @SQ SN:chr1_GL834414_random LN:1075293
> @SQ SN:chr1_GL834415_random LN:174856
> @SQ SN:chr1_GL834416_random LN:817733
> @SQ SN:chr1_GL834417_random LN:3669695
> @SQ SN:chr1_GL834418_random LN:1117890
> @SQ SN:chr1_GL834419_random LN:2046010
>
> Sorted file:
>
> [***@login-p1n01 02-Mapped]$ samtools view -H 165499sorted.bam
> @HD VN:1.3 SO:coordinate
> @SQ SN:chr1_GL834412_random LN:2180112
> @SQ SN:chr1_GL834413_random LN:373546
> @SQ SN:chr1_GL834414_random LN:1075293
> @SQ SN:chr1_GL834415_random LN:174856
> @SQ SN:chr1_GL834416_random LN:817733
> @SQ SN:chr1_GL834417_random LN:3669695
> @SQ SN:chr1_GL834418_random LN:1117890
> @SQ SN:chr1_GL834419_random LN:2046010
>
> Thanks.
>
> Mark
>
> ----------------------
> Mark J. Margres, Ph.D.
> Postdoc, Storfer Lab
> School of Biological Sciences
> Washington State University
> Pullman, WA 99164-4236
>
> ________________________________________
> From: Thomas W. Blackwell [***@umich.edu]
> Sent: Friday, May 12, 2017 9:50 AM
> To: Margres, Mark J
> Cc: samtools-***@lists.sourceforge.net
> Subject: Re: [Samtools-help] samtools merge issue
>
> Mark -
>
> Just guessing: One may need to specify "merge -f" in case an
> earlier invocation of the command has created an empty file with
> the path generated by "jp(bamFolder, sample + ".bam")". Also may
> need to specify "merge -n" if, in fact, the output from bwa is
> already properly name-sorted. Again, just guessing.
>
> - tom blackwell -
>
> On Fri, 12 May 2017, Margres, Mark J wrote:
>
>> Hello,
>>
>> I am having an issue with merging different bam assemblies using samtools. In my pipeline, I merge reads using flash2, and then perform reference-based assemblies using bwa, one assembly uses just the merged reads and a second assembly uses just the unmerged reads. I then attempt to merge the assemblies using samtools and get the following output:
>>
>>
>> [W::bam_merge_core2] No @HD tag found.
>>
>> I am assuming this is due to the unmerged read assembly having a different header than the merged reads, but I am not sure why (and also not sure if this is a bwa issue or a samtools issue). Is there a way for me to merge and then sort the assemblies with maintaining the correct header tags? I have the original assemblies to work with, but have assembled abut a dozen genomes and would prefer to not re-assemble if possible. Code is below. Thanks.
>>
>> # Run BWA to map PE samples to reference genome
>> # -t number of threads -R read group header
>> logFile = jp(resultsDir, sample + '_mapping.log')
>> cmd = ' '.join(["/home/mark.margres/bwa-0.7.15/bwa mem -t 50 -R '@RG\tID:bwa\tSM:" + sample + "\tPL:ILLUMINA'",
>> bwaIndex, jp(resultsDir, sample + "_cleaned_PE1.fastq.gz"),
>> jp(resultsDir, sample + "_cleaned_PE2.fastq.gz"), "| samtools view -bS -@ 50 -o", jp(bamFolder, sample + "PE.bam"),
>> "2>", logFile])
>> log(cmd, logCommands)
>> os.system(cmd)
>> # Run BWA to map SE samples to reference genome
>> cmd = ' '.join(["/home/mark.margres/bwa-0.7.15/bwa mem -t 50
-R '@RG\tID:bwa\tSM:" + sample + "\tPL:ILLUMINA'",
>> bwaIndex, jp(resultsDir, sample + "_cleaned_SE.fastq.gz"), "| samtools view -bS -@ 50 -o", jp(bamFolder, sample + "SE.bam"),
>> "2>>", logFile])
>> log(cmd, logCommands)
>> os.system(cmd)
>> # merge and sort
>> cmd = ' '.join(['samtools merge -c', jp(bamFolder, sample + ".bam"), jp(bamFolder, sample + "PE.bam"), jp(bamFolder, sample + "SE.bam")])
>> log(cmd, logCommands)
>> os.system(cmd)
>> # sort bam file; -@ number of threads
>> cmd = ' '.join(['samtools sort -o', jp(bamFolder, sample) + "sorted.bam", ' -@ 50', jp(bamFolder, sample + ".bam")])
>> log(cmd, logCommands)
>> os.system(cmd)
>> # Mark and remove PCR duplicates
>> cmd = ' '.join([picard + "MarkDuplicates INPUT=" + jp(bamFolder, sample + "sorted.bam"), ' OUTPUT=' + jp(bamFolder, sample + "_markdup.bam"),
>> ' METRICS_FILE=' + jp(bamFolder, sample + ".metrics"), ' REMOVE_DUPLICATES=true ',
>> ' ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT', '>>', logFile, '2>&1'])
>> log(cmd, logCommands)
>> os.system(cmd)
>> # Index bam file:
>> cmd = ' '.join(['samtools index', jp(bamFolder, sample) + "_markdup.bam"])
>>
>> ----------------------
>> Mark J. Margres, Ph.D.
>> Postdoc, Storfer Lab
>> School of Biological Sciences
>> Washington State University
>> Pullman, WA 99164-4236
>>