-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Redefine GQ #797
Comments
I find the whole thing confusing. What is the conditional here when it says "site". Do we mean that coordinate is variable across our population, ie it is a variant in one of the samples (even if not present in this specific file), or is it explicitly that the site is a variant in this sample? This makes a huge difference to the 0/0 case and I think this is the nub of your question about GQ being meaningless for 0/0. It's ambiguous, but given GQ issues around 0/0 I assume it was intended to be a population conditional rather than sample specific conditional. I'm pretty sure the intention was clear that for a call of 0/0, GQ is giving the likelihood that the 0/0 call is correct. You could reasonably argue that it is therefore unconditional call, but it could also be argued to be a conditional one where the condition is simply "if it's recorded then it's strongly likely to be a variable site" (which sounds compatible with GATK). That amounts to much the same thing frankly. Weaselly words, but avoiding spec changes too! |
Adding concrete data to the discussion. I looked at some gnomAD data and it contains things like this snippet:
Bcftools also gives me GQ of non-zero for 0/0:
Clearly GQ in practice does have meaning for the 0/0 case. This is both justified and useful. |
I agree with @jkbonfield. a strong (~60) 0/0 GQ doesn't mean that the genotype is actually variable. it means that the caller asserts that by using the assumption that the site is variable (in the population) the evidence supports that this genotype is reference. The dependent on site variability is required in order to avoid questions around non-variant sites that have non-reference alleles due to artifacts (mapping, LC and sequencing errors, etc.). In this way variant filtering can be distinct from genotype calling. |
In light of #756 and further discussion with implementations, it appears that the current definition of GQ is problematic.
GQ (Integer): Conditional genotype quality, encoded as a phred quality −10log10 p(genotype call is wrong, conditioned on the site’s being variant)
The
conditioned on the site’s being variant
is problematic as, in practice, the unconditioned genotype quality is a much more meaningful value. If 0/0 is unlikely then it's unlikely to change the result by a full integer value, and if 0/0 is not unlikely, then by conditioning on the site being variant (vague language is which also problematic) then the GQ is discarding information that is actually important to variant interpretation. Furthermore, if the site is actually 0/0 then GQ is meaningless as p(!A | !A) is always one.The question I have is, if we were to make this change, what's the best way to go about it? I see three approaches:
The most widely used variant callers do not follow the currently specifications (DRAGEN ignores the conditional, GATK compares against the next most likely genotype), but there may be variants callers out there than do and I don't want to penalise them for following the specs.
The text was updated successfully, but these errors were encountered: