Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.phased not correct for VCFs containing phased & unphased genotypes. #7

Open
rollf opened this issue Oct 1, 2024 · 1 comment
Open

Comments

@rollf
Copy link

rollf commented Oct 1, 2024

Hi,

I have a VCF that was processed with HiPhase. The file contains lines with "|" as well as "/" for the GT value. Example:

chrX	1189233	.	T	A	7.4	PASS	.	GT:GQ:DP:AD:VAF:PL:PS	0|1:7:4:1,3:0.75:6,0,19:1189207
chrX	1189258	.	G	A	30.7	PASS	.	GT:GQ:DP:AD:VAF:PL	1/1:13:4:0,4:1:30,12,0

nft-vcf incorrectly assumes that the file is not phased. Looking into the code I realize that a single slash in any line is sufficient to set phased to false (it's true by default and gets updated when reading the file). I rather think it should be the other way around: A single pipe ("|") should be sufficient to assume the VCF is phased (and set phased to false by default) - assuming the code stays roughly as is.

Further thoughts: The VCF 4.2 spec says that a PS field should be added for phased variants. Maybe one could check for this field instead of looping over the variants. The current approach assumes that a single unphased genotype means the whole VCF is unphased. That seems too strict (or even wrong). If the phasing information is something that is expected to change on a per-variant base, you might rather want to have a phased boolean array with a value for each variant.

@seppinho
Copy link
Owner

Thank you for bringing this up! In our primary use case (genotype imputation from microarray data), a single pipe isn’t sufficient to ensure the file is correctly phased. However, I agree that supporting a single-pipe configuration could be useful for other workflows.

If you’re interested, we’d love to see a Pull Request (PR) from you to implement this feature. My suggestion would be to add a new parameter to the load function to handle an alternative strategy.

See here:

public static VcfFile load(Path vcfFilename) throws IOException {

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants