Margres, Mark J
2017-05-12 16:08:13 UTC
Hello,
I am having an issue with merging different bam assemblies using samtools. In my pipeline, I merge reads using flash2, and then perform reference-based assemblies using bwa, one assembly uses just the merged reads and a second assembly uses just the unmerged reads. I then attempt to merge the assemblies using samtools and get the following output:
[W::bam_merge_core2] No @HD tag found.
I am assuming this is due to the unmerged read assembly having a different header than the merged reads, but I am not sure why (and also not sure if this is a bwa issue or a samtools issue). Is there a way for me to merge and then sort the assemblies with maintaining the correct header tags? I have the original assemblies to work with, but have assembled abut a dozen genomes and would prefer to not re-assemble if possible. Code is below. Thanks.
# Run BWA to map PE samples to reference genome
# -t number of threads -R read group header
logFile = jp(resultsDir, sample + '_mapping.log')
cmd = ' '.join(["/home/mark.margres/bwa-0.7.15/bwa mem -t 50 -R '@RG\tID:bwa\tSM:" + sample + "\tPL:ILLUMINA'",
bwaIndex, jp(resultsDir, sample + "_cleaned_PE1.fastq.gz"),
jp(resultsDir, sample + "_cleaned_PE2.fastq.gz"), "| samtools view -bS -@ 50 -o", jp(bamFolder, sample + "PE.bam"),
"2>", logFile])
log(cmd, logCommands)
os.system(cmd)
# Run BWA to map SE samples to reference genome
cmd = ' '.join(["/home/mark.margres/bwa-0.7.15/bwa mem -t 50 -R '@RG\tID:bwa\tSM:" + sample + "\tPL:ILLUMINA'",
bwaIndex, jp(resultsDir, sample + "_cleaned_SE.fastq.gz"), "| samtools view -bS -@ 50 -o", jp(bamFolder, sample + "SE.bam"),
"2>>", logFile])
log(cmd, logCommands)
os.system(cmd)
# merge and sort
cmd = ' '.join(['samtools merge -c', jp(bamFolder, sample + ".bam"), jp(bamFolder, sample + "PE.bam"), jp(bamFolder, sample + "SE.bam")])
log(cmd, logCommands)
os.system(cmd)
# sort bam file; -@ number of threads
cmd = ' '.join(['samtools sort -o', jp(bamFolder, sample) + "sorted.bam", ' -@ 50', jp(bamFolder, sample + ".bam")])
log(cmd, logCommands)
os.system(cmd)
# Mark and remove PCR duplicates
cmd = ' '.join([picard + "MarkDuplicates INPUT=" + jp(bamFolder, sample + "sorted.bam"), ' OUTPUT=' + jp(bamFolder, sample + "_markdup.bam"),
' METRICS_FILE=' + jp(bamFolder, sample + ".metrics"), ' REMOVE_DUPLICATES=true ',
' ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT', '>>', logFile, '2>&1'])
log(cmd, logCommands)
os.system(cmd)
# Index bam file:
cmd = ' '.join(['samtools index', jp(bamFolder, sample) + "_markdup.bam"])
----------------------
Mark J. Margres, Ph.D.
Postdoc, Storfer Lab
School of Biological Sciences
Washington State University
Pullman, WA 99164-4236
I am having an issue with merging different bam assemblies using samtools. In my pipeline, I merge reads using flash2, and then perform reference-based assemblies using bwa, one assembly uses just the merged reads and a second assembly uses just the unmerged reads. I then attempt to merge the assemblies using samtools and get the following output:
[W::bam_merge_core2] No @HD tag found.
I am assuming this is due to the unmerged read assembly having a different header than the merged reads, but I am not sure why (and also not sure if this is a bwa issue or a samtools issue). Is there a way for me to merge and then sort the assemblies with maintaining the correct header tags? I have the original assemblies to work with, but have assembled abut a dozen genomes and would prefer to not re-assemble if possible. Code is below. Thanks.
# Run BWA to map PE samples to reference genome
# -t number of threads -R read group header
logFile = jp(resultsDir, sample + '_mapping.log')
cmd = ' '.join(["/home/mark.margres/bwa-0.7.15/bwa mem -t 50 -R '@RG\tID:bwa\tSM:" + sample + "\tPL:ILLUMINA'",
bwaIndex, jp(resultsDir, sample + "_cleaned_PE1.fastq.gz"),
jp(resultsDir, sample + "_cleaned_PE2.fastq.gz"), "| samtools view -bS -@ 50 -o", jp(bamFolder, sample + "PE.bam"),
"2>", logFile])
log(cmd, logCommands)
os.system(cmd)
# Run BWA to map SE samples to reference genome
cmd = ' '.join(["/home/mark.margres/bwa-0.7.15/bwa mem -t 50 -R '@RG\tID:bwa\tSM:" + sample + "\tPL:ILLUMINA'",
bwaIndex, jp(resultsDir, sample + "_cleaned_SE.fastq.gz"), "| samtools view -bS -@ 50 -o", jp(bamFolder, sample + "SE.bam"),
"2>>", logFile])
log(cmd, logCommands)
os.system(cmd)
# merge and sort
cmd = ' '.join(['samtools merge -c', jp(bamFolder, sample + ".bam"), jp(bamFolder, sample + "PE.bam"), jp(bamFolder, sample + "SE.bam")])
log(cmd, logCommands)
os.system(cmd)
# sort bam file; -@ number of threads
cmd = ' '.join(['samtools sort -o', jp(bamFolder, sample) + "sorted.bam", ' -@ 50', jp(bamFolder, sample + ".bam")])
log(cmd, logCommands)
os.system(cmd)
# Mark and remove PCR duplicates
cmd = ' '.join([picard + "MarkDuplicates INPUT=" + jp(bamFolder, sample + "sorted.bam"), ' OUTPUT=' + jp(bamFolder, sample + "_markdup.bam"),
' METRICS_FILE=' + jp(bamFolder, sample + ".metrics"), ' REMOVE_DUPLICATES=true ',
' ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT', '>>', logFile, '2>&1'])
log(cmd, logCommands)
os.system(cmd)
# Index bam file:
cmd = ' '.join(['samtools index', jp(bamFolder, sample) + "_markdup.bam"])
----------------------
Mark J. Margres, Ph.D.
Postdoc, Storfer Lab
School of Biological Sciences
Washington State University
Pullman, WA 99164-4236