Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Phred scales vs qualities/scores #534

Open
jkbonfield opened this issue Dec 8, 2020 · 0 comments · May be fixed by #579
Open

Phred scales vs qualities/scores #534

jkbonfield opened this issue Dec 8, 2020 · 0 comments · May be fixed by #579

Comments

@jkbonfield
Copy link
Contributor

jkbonfield commented Dec 8, 2020

Having been confused by this just now and trying to find out which of 0 and 255 represents a likely event in VCF PL fields, I think we need to be more careful in our wording. Specifically "scale" meaning the logarithmic scale, but not the broad sweeping view that high-phred is high-quality and low-phred is low-quality as defined in the phred program. Our scales are sometimes reversed.

Going to source: (https://genetics.mgh.harvard.edu/goodman/doc/phrap.html)

On the basis of the trace characteristics, phred computes a probability p of an error in the base call at each position, and converts this to a quality value q using the transformation q = -10 log_10(p).

Note here "probability p of an error".

In SAM we define:

\item[Phred scale] Given a probability 0<p<=1, the phred scale of p equals -10 log_10(p), rounded to the closest integer.

So here we're saying "scale" meaning 10log10, but not defining at all what p is; specifically whether it's the probability of being true or of being false. We're simply defining the concept of a logarithmic scale. We then later on use the term "phred-scaled base QUALity", which may trip up the unwary reader, until they read the more detailed explanation a little later which then reverses the formula by explaining that p is the error probability.

Obviously I know how it works and in my brain (wrong!) "phred" always means P(error), but it should be spelt out earlier. I think we'd be better off describing what we want from the outset given AFAIK we never use phred-scale in SAM for anything other than errors. Eg:

"Phred-scaled quality: Given a probability 0<p<=1 of an event being correct, the phred-scaled quality q is defined as q = -10 log_10(1-p) rounded to the closest integer."

In VCF we define:

\item QUAL --- quality: Phred-scaled quality score for the assertion made in ALT. i.e. -10log_10(prob(call in ALT is wrong)).
...
\item GQ (Integer): Conditional genotype quality, encoded as a phred quality -10log_10( p(genotype call is wrong, conditioned on the site's being variant).

Both good and matches the Phrap documentation and my suggested revision for SAM spec too.

However then also in VCF:

\item PL (Integer): The phred-scaled genotype likelihoods rounded to the closest integer, and otherwise defined precisely as the GL field.
\item PQ (Integer): Phasing quality, the phred-scaled probability that alleles are ordered incorrectly in a heterozygote (against all other members in the phase set).

An incredibly subtle distinction between those two lines: "phred-scaled genotype likelihood" vs "phred-scaled probability that alleles are ordered incorrectly". You have to read carefully to see that is p vs 1-p.

For consistency PQ maybe should be rephrased as "phred-scaled quality score" given it's 1-p and is the phrase we're using elsewhere.

Ultimately this is too subtle for people not versed in the art of decoding VCF to understand. The only formula listed in VCF documenting the phred scale is for p(X is wrong), but whenever it uses the term "scale" it is referring to the unwritten formula of p(X is correct) unless explicitly stated otherwise.

For PL we really ought to be explicit here. Saying "phred-scaled", while technically true, conjures up in my mind at least the standard of phred encoding errors, which is this case is wrong. At the very least the sentence should be followed by the explicit formula, and maybe also explaining the relationship to GL verbatim (from @pd3 personal communication: "In other words, PL = int(-10*GL)".

Edit: actually I realise this "in other words" is wrong. PL is not just -10*GL. It's then normalised so the most likely event has PL 0. This isn't true for GL.

@jkbonfield jkbonfield linked a pull request Jun 16, 2021 that will close this issue
@jkbonfield jkbonfield moved this to To do (backlog) in GA4GH File Formats Feb 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant