Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redefine GQ #797

Open
d-cameron opened this issue Oct 10, 2024 · 3 comments
Open

Redefine GQ #797

d-cameron opened this issue Oct 10, 2024 · 3 comments
Labels

Comments

@d-cameron
Copy link
Contributor

d-cameron commented Oct 10, 2024

In light of #756 and further discussion with implementations, it appears that the current definition of GQ is problematic.

GQ (Integer): Conditional genotype quality, encoded as a phred quality −10log10 p(genotype call is wrong, conditioned on the site’s being variant)

The conditioned on the site’s being variant is problematic as, in practice, the unconditioned genotype quality is a much more meaningful value. If 0/0 is unlikely then it's unlikely to change the result by a full integer value, and if 0/0 is not unlikely, then by conditioning on the site being variant (vague language is which also problematic) then the GQ is discarding information that is actually important to variant interpretation. Furthermore, if the site is actually 0/0 then GQ is meaningless as p(!A | !A) is always one.

The question I have is, if we were to make this change, what's the best way to go about it? I see three approaches:

  • Implement in Vnext?
  • Errata into the latest version (4.5)?
  • Errata into all VCF version?

The most widely used variant callers do not follow the currently specifications (DRAGEN ignores the conditional, GATK compares against the next most likely genotype), but there may be variants callers out there than do and I don't want to penalise them for following the specs.

@jkbonfield
Copy link
Contributor

jkbonfield commented Oct 28, 2024

I find the whole thing confusing. What is the conditional here when it says "site". Do we mean that coordinate is variable across our population, ie it is a variant in one of the samples (even if not present in this specific file), or is it explicitly that the site is a variant in this sample? This makes a huge difference to the 0/0 case and I think this is the nub of your question about GQ being meaningless for 0/0. It's ambiguous, but given GQ issues around 0/0 I assume it was intended to be a population conditional rather than sample specific conditional.

I'm pretty sure the intention was clear that for a call of 0/0, GQ is giving the likelihood that the 0/0 call is correct. You could reasonably argue that it is therefore unconditional call, but it could also be argued to be a conditional one where the condition is simply "if it's recorded then it's strongly likely to be a variable site" (which sounds compatible with GATK). That amounts to much the same thing frankly. Weaselly words, but avoiding spec changes too!

@jkbonfield
Copy link
Contributor

Adding concrete data to the discussion.

I looked at some gnomAD data and it contains things like this snippet:

GT:GQ:DP:AD:MIN_DP:PGT:PID:PL:SB 0/0:20:13:.:13:.:.:.:.  0/0:40:23:.:21:.:.:.:.  0/0:20:10:.:9:.:.:.:.   ./.:.:.:.:.:.:.:.:.     0/0:30:51:.:47:.:.:.:.   0/1:33:4:1,3:.:./.:.:123,0,33:0,1,0,3

Bcftools also gives me GQ of non-zero for 0/0:

$ bcftools mpileup -f test/mpileup/mpileup.ref.fa test/mpileup/mpileup.?.bam | bcftools call -vm -f GQ -
... GT:PL:AD:GQ     0/0:0,21,185:7,0:26     0/0:0,27,222:9,0:32     0/1:35,0,51:2,2:29

Clearly GQ in practice does have meaning for the 0/0 case. This is both justified and useful.

@jkbonfield jkbonfield added the vcf label Nov 5, 2024
@yfarjoun
Copy link
Contributor

I agree with @jkbonfield. a strong (~60) 0/0 GQ doesn't mean that the genotype is actually variable. it means that the caller asserts that by using the assumption that the site is variable (in the population) the evidence supports that this genotype is reference.

The dependent on site variability is required in order to avoid questions around non-variant sites that have non-reference alleles due to artifacts (mapping, LC and sequencing errors, etc.). In this way variant filtering can be distinct from genotype calling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants