Urs -
Just guessing, since I haven't actually read this part of the
samtools code. It's likely that the model used for genotype
calling with mutliple samples is that these samples are a
sample of unrelated individuals from a single, homogeneous,
randomly mating population. As a consequence, Hardy-Weinberg
proportions for genotypes at each locus are expected, and it
makes sense to estimate the population allele frequency from
the sample provided and to use an allele frequency prior for
genotype calling. This will strongly downweight hom-alt
genotype calls at a site (like this one) with very low estimated
allele frequency.
It's a bit clunky, but here's what I would do in order to avoid
this behavior. I would make a file showing the list of all sites
where a variant was called in the multi-sample calling that you've
already done. Then I would re-run mpileup on each sample one at
a time, supplying the site list with argument '-l' to get single
sample genotype calls. Then merge all 13 single sample .bcf or
.vcf files using bcftools. When calling a single sample showing
an apparent hom-alt genotype, the apparent allele frequency will
be 1.0 and there will be strong SUPPORT for the hom-alt genotype,
rather than population evidence against it.
That said, I not recommend doing analysis using the discrete
genotype calls from a region with only 3 or 4 covering reads.
Analysis should be done using the genotype likelihoods themselves.
- tom blackwell -
Hi,
I have used mpileup on a set of 13 samples and find occasionally a predicted GT that imo does not fit to the provided PL's.
So far I have seen this only in low coverage regions and I haven't analysed it systematically. Nevertheless I would like to understand what's
going on or if I miss something.
chr1 19202896 0 G A 65 PASS
DP=36;VDB=0.820274;SGB=6.18608;RPB=0.894839;MQB=1;MQSB=1;BQB=0.972604;MQ0F=0;ICB=0.0072327;HOB=0.00347222;AC=1;AN=24;DP4=14,13,0,4;MQ=60
GT:PL:DP:ADF:ADR:AD
0/0:0,3,60:1:1,0:0,0:1,0 0/0:0,3,33:1:0,0:1,0:1,0 0/0:0,12,182:4:3,0:1,0:4,0 0/0:0,3,37:1:1,0:0,0:1,0 0/0:0,6,64:2:2,0:0,0:2,0
./.:0,0,0:0:0,0:0,0:0,0 ./.:0,0,0:0:0,0:0,0:0,0 0/0:0,12,135:4:2,0:2,0:4,0 0/0:0,6,67:2:2,0:0,0:2,0 0/0:0,9,78:3:0,0:3,0:3,0
0/0:0,9,76:3:0,0:3,0:3,0 0/0:0,12,112:4:2,0:2,0:4,0 0/1:113,12,0:4:0,0:0,4:0,4 0/0:0,6,97:2:1,0:1,0:2,0
So, the second before last GT ist het (0/1), but PL is "113, 12, 0" and with that should be 1/1?!
The only explanation I currently have is: a present, yet not sequenced, reference allele is implied because of other samples are hom ref. If
this is true, please tell me which parameter turns it off as it would be stupid to use such a setting with my current sample collective. If it
is something else, please enlighten me :)
Best regards,
Urs