Discussion: [Samtools-help] Picard MarkDuplicates runtime/overview
Michael James Clark
2010-02-10 22:48:03 UTC
Hi,

I'm working with Picard MarkDuplicates on a whole-genome dataset for the
first time, and it has been running for a very long time (almost 24
hours) and still is not finished.

One of my files is 240GB in size, and currently the program is
outputting the results. In this case, I set REMOVE_DUPLICATES=TRUE and
VALIDATION_STRINGENCY=SILENT, and my file is pre-sorted.

I'm mainly curious what the program is doing that takes so long. Is
there a setting I should be using that might make it go faster?


Is there an overview of what the MarkDuplicates program does somewhere that
I could look at?

Thanks for your help,
--
Michael James Clark
Graduate Student
Dept. of Human Genetics, UCLA
Laboratory of Stanley Nelson
Gonda Bldg, Rm. 5554
Lab Ph. #: (310)825-7920
Cell Ph. #: (310)415-5207
Email: ***@ucla.edu
Keiran Raine
2010-02-10 23:03:15 UTC
Hi Michael,

In my experience you can expect a file of 240GB to take a couple of
days to mark duplicates.

A couple of things though:
1. Did you sort your file with Picard or Samtools? Picard may decide
that the sort order is not what it wants and re-sort the file first
(especially with files sorted by older versions of samtools).
2. The more memory you give it, the faster it will run, up to a point.
We tend to use -Xmx5g, and I'd also recommend setting -Xms5g so it
fails instantly if 5GB is not available. An example command is below.
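
As a rough sketch of the full command (the jar path and file names are
placeholders for whatever your install and data actually look like):

  java -Xms5g -Xmx5g -jar MarkDuplicates.jar \
      INPUT=input.bam OUTPUT=marked.bam METRICS_FILE=dup.metrics \
      REMOVE_DUPLICATES=TRUE VALIDATION_STRINGENCY=SILENT

You can also check what sort order the header claims before running:

  samtools view -H input.bam | grep '^@HD'

If the @HD line does not say SO:coordinate (older samtools sort may not
have written it), you are in the situation from point 1.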

Regards,

Keiran Raine
Senior Computer Biologist
The Cancer Genome Project
Ext: 2100
***@sanger.ac.uk

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
Michael James Clark
2010-02-10 23:18:12 UTC
Hi Keiran,

Thanks for your input.

I think that particular file was sorted with Samtools (it's a little
older file), but I'm testing it with files sorted with Picard, so we'll
see if it goes faster.

I'm also going ahead and trying to use 24GB of RAM with it. I'll let you
know how fast it goes.

Michael
Tim Fennell
2010-02-11 14:32:13 UTC
Hi Michael,

Unfortunately it is pretty slow, and I'm not sure there's a huge amount that can be done about it. As Keiran suggests, more memory and setting VALIDATION_STRINGENCY=SILENT will both help.

The main reason is that if you want to catch all duplicates, including cross-chromosomal ones, you really have to make multiple passes over the file. The first pass accumulates information about each read pair (its 5' coordinates, orientations, etc.) and sorts that into the order necessary to do the duplicate detection. This is a disk-based sort, using whatever directory is set as your TMP_DIR. It then traverses this information and builds a list of reads to be marked as duplicates, which involves another disk-based sort, since the list can get very large for large files and cannot simply be held in memory.

Finally, once it has figured out which reads to mark as duplicates, it makes a second pass over the input BAM file, marking or removing reads as requested, and writes out the final BAM file.

The only other suggestion I can make is to point the TMP_DIR parameter at the fastest filesystem you have with plenty of free space, as this will help too.
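
Something like this, with /fast/scratch standing in for whatever fast local storage you have (the other names are placeholders as before):

  java -Xms5g -Xmx5g -jar MarkDuplicates.jar \
      INPUT=input.bam OUTPUT=marked.bam METRICS_FILE=dup.metrics \
      TMP_DIR=/fast/scratch VALIDATION_STRINGENCY=SILENT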

-t
Michael James Clark
2010-02-11 22:42:54 UTC
Thanks so much for the explanation, Tim. Very helpful.

Are there plans to make it multi-threaded? Our 32GB nodes are 8-core
machines (probably true for most groups).

As for a few numbers:

A 254GB file using 8GB of RAM took 39 hours.
A 240GB file using 8GB of RAM took 37 hours.
A 118GB file using 24GB of RAM took 21 hours.
An 87GB file using 24GB of RAM took 15 hours.
A 97GB file using 24GB of RAM took 15 hours.
Another 97GB file using 24GB of RAM took 19 hours.

Nothing very informative here, except that it does indeed take a while
to run even when given a decent chunk of RAM to play with. Maybe a
solid-state drive for the temporary files would also help.

I'll advise my group to remove duplicates from the different libraries
of the same sample prior to merging them from here on out, so we can at
least parallelize it in that manner; a rough sketch is below.
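
This is roughly what I have in mind, assuming one pre-sorted BAM per
library (the library names and jar paths are just placeholders):

  for lib in libA libB; do
      java -Xmx8g -jar MarkDuplicates.jar \
          INPUT=${lib}.bam OUTPUT=${lib}.dedup.bam \
          METRICS_FILE=${lib}.metrics REMOVE_DUPLICATES=TRUE \
          VALIDATION_STRINGENCY=SILENT &
  done
  wait
  java -jar MergeSamFiles.jar \
      INPUT=libA.dedup.bam INPUT=libB.dedup.bam OUTPUT=sample.bam

Since each per-library run is independent, they could just as easily be
submitted as separate cluster jobs instead of background processes.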

Sincerely,
Michael
Tim Fennell
2010-02-12 14:41:41 UTC
Hi Michael,

I have no immediate plans to make it multi-threaded. We have a similar computing environment, but tend to submit these jobs to use 4GB of RAM each and one CPU core. I think that to get a lot out of multi-threading it would require threading the I/O and BAM decoding; when I've looked at this previously, a lot of the CPU time was spent in the gzip/gunzip routines that are used to decode and encode BAM records.
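
If you want a feel for how much of the time is pure BAM decoding on your own data, something as crude as this gives a lower bound (input.bam is a placeholder):

  time samtools view input.bam > /dev/null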

-t