You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Having been confused by this just now and trying to find out which of 0 and 255 represents a likely event in VCF PL fields, I think we need to be more careful in our wording. Specifically "scale" meaning the logarithmic scale, but not the broad sweeping view that high-phred is high-quality and low-phred is low-quality as defined in the phred program. Our scales are sometimes reversed.
On the basis of the trace characteristics, phred computes a probability p of an error in the base call at each position, and converts this to a quality value q using the transformation q = -10 log_10(p).
Note here "probability p of an error".
In SAM we define:
\item[Phred scale] Given a probability 0<p<=1, the phred scale of p equals -10 log_10(p), rounded to the closest integer.
So here we're saying "scale" meaning 10log10, but not defining at all what p is; specifically whether it's the probability of being true or of being false. We're simply defining the concept of a logarithmic scale. We then later on use the term "phred-scaled base QUALity", which may trip up the unwary reader, until they read the more detailed explanation a little later which then reverses the formula by explaining that p is the error probability.
Obviously I know how it works and in my brain (wrong!) "phred" always means P(error), but it should be spelt out earlier. I think we'd be better off describing what we want from the outset given AFAIK we never use phred-scale in SAM for anything other than errors. Eg:
"Phred-scaled quality: Given a probability 0<p<=1 of an event being correct, the phred-scaled quality q is defined as q = -10 log_10(1-p) rounded to the closest integer."
In VCF we define:
\item QUAL --- quality: Phred-scaled quality score for the assertion made in ALT. i.e. -10log_10(prob(call in ALT is wrong)).
...
\item GQ (Integer): Conditional genotype quality, encoded as a phred quality -10log_10( p(genotype call is wrong, conditioned on the site's being variant).
Both good and matches the Phrap documentation and my suggested revision for SAM spec too.
However then also in VCF:
\item PL (Integer): The phred-scaled genotype likelihoods rounded to the closest integer, and otherwise defined precisely as the GL field.
\item PQ (Integer): Phasing quality, the phred-scaled probability that alleles are ordered incorrectly in a heterozygote (against all other members in the phase set).
An incredibly subtle distinction between those two lines: "phred-scaled genotype likelihood" vs "phred-scaled probability that alleles are ordered incorrectly". You have to read carefully to see that is p vs 1-p.
For consistency PQ maybe should be rephrased as "phred-scaled quality score" given it's 1-p and is the phrase we're using elsewhere.
Ultimately this is too subtle for people not versed in the art of decoding VCF to understand. The only formula listed in VCF documenting the phred scale is for p(X is wrong), but whenever it uses the term "scale" it is referring to the unwritten formula of p(X is correct) unless explicitly stated otherwise.
For PL we really ought to be explicit here. Saying "phred-scaled", while technically true, conjures up in my mind at least the standard of phred encoding errors, which is this case is wrong. At the very least the sentence should be followed by the explicit formula, and maybe also explaining the relationship to GL verbatim (from @pd3 personal communication: "In other words, PL = int(-10*GL)".
Edit: actually I realise this "in other words" is wrong. PL is not just -10*GL. It's then normalised so the most likely event has PL 0. This isn't true for GL.
The text was updated successfully, but these errors were encountered:
Having been confused by this just now and trying to find out which of 0 and 255 represents a likely event in VCF PL fields, I think we need to be more careful in our wording. Specifically "scale" meaning the logarithmic scale, but not the broad sweeping view that high-phred is high-quality and low-phred is low-quality as defined in the phred program. Our scales are sometimes reversed.
Going to source: (https://genetics.mgh.harvard.edu/goodman/doc/phrap.html)
Note here "probability p of an error".
In SAM we define:
So here we're saying "scale" meaning 10log10, but not defining at all what
p
is; specifically whether it's the probability of being true or of being false. We're simply defining the concept of a logarithmic scale. We then later on use the term "phred-scaled base QUALity", which may trip up the unwary reader, until they read the more detailed explanation a little later which then reverses the formula by explaining that p is the error probability.Obviously I know how it works and in my brain (wrong!) "phred" always means P(error), but it should be spelt out earlier. I think we'd be better off describing what we want from the outset given AFAIK we never use phred-scale in SAM for anything other than errors. Eg:
"Phred-scaled quality: Given a probability 0<p<=1 of an event being correct, the phred-scaled quality q is defined as q = -10 log_10(1-p) rounded to the closest integer."
In VCF we define:
Both good and matches the Phrap documentation and my suggested revision for SAM spec too.
However then also in VCF:
An incredibly subtle distinction between those two lines: "phred-scaled genotype likelihood" vs "phred-scaled probability that alleles are ordered incorrectly". You have to read carefully to see that is p vs 1-p.
For consistency PQ maybe should be rephrased as "phred-scaled quality score" given it's 1-p and is the phrase we're using elsewhere.
Ultimately this is too subtle for people not versed in the art of decoding VCF to understand. The only formula listed in VCF documenting the phred scale is for p(X is wrong), but whenever it uses the term "scale" it is referring to the unwritten formula of p(X is correct) unless explicitly stated otherwise.
For PL we really ought to be explicit here. Saying "phred-scaled", while technically true, conjures up in my mind at least the standard of phred encoding errors, which is this case is wrong. At the very least the sentence should be followed by the explicit formula, and maybe also explaining the relationship to GL verbatim (from @pd3 personal communication: "In other words, PL = int(-10*GL)".
Edit: actually I realise this "in other words" is wrong. PL is not just -10*GL. It's then normalised so the most likely event has PL 0. This isn't true for GL.
The text was updated successfully, but these errors were encountered: