Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indels can cause spurious site production #27

Open
bricoletc opened this issue Oct 5, 2021 · 2 comments
Open

Indels can cause spurious site production #27

bricoletc opened this issue Oct 5, 2021 · 2 comments
Assignees
Labels
bug Something isn't working

Comments

@bricoletc
Copy link
Member

bricoletc commented Oct 5, 2021

This is linked to #15 as it leads to production of ambiguous prgs (multiple paths give rise to same sequence)

Clustering code can conclude that there is no meaningful clustering of a set of sequences- e.g. puts each sequence in one cluster.

However the code that calls clustering can re-run prg-building for sets of sequences that only differ in alignment (i.e. gap positioning), not sequence (here)

This leads to spurious 'nested variants' and the following pathological example case:
msp6_ambig.pdf

Of the 4 paths between nodes labeled 55 and 56, two are identical. This cause gramtools to mis-genotype down the line.

I have a fix that I'm implementing and will PR in

@bricoletc bricoletc self-assigned this Oct 5, 2021
@bricoletc bricoletc added the bug Something isn't working label Oct 5, 2021
bricoletc added a commit that referenced this issue Oct 5, 2021
Change the result of main clustering function to express the option of
no meaningful clustering

Adapt the unit tests accordingly
bricoletc added a commit that referenced this issue Oct 5, 2021
Add tests to improve coverage- especially, add integration tests for
nested SNPs
bricoletc added a commit that referenced this issue Oct 5, 2021
Fix the issue, by only recursively building PRG when the sequence clustering
is meaningful
@bricoletc
Copy link
Member Author

Some good news, at the point of commit 68e032c (branch fix_27) I obtain the following number of AMBIG calls in a set of 14 P. falciparum validation samples:

../pf6_26_genes_7_13/Pf7G8/final.vcf.gz 4
../pf6_26_genes_7_13/PfCD01/final.vcf.gz        4
../pf6_26_genes_7_13/PfDd2/final.vcf.gz 3
../pf6_26_genes_7_13/PfGA01/final.vcf.gz        5
../pf6_26_genes_7_13/PfGB4/final.vcf.gz 4
../pf6_26_genes_7_13/PfGN01/final.vcf.gz        4
../pf6_26_genes_7_13/PfHB3/final.vcf.gz 4
../pf6_26_genes_7_13/PfIT/final.vcf.gz  4
../pf6_26_genes_7_13/PfKE01/final.vcf.gz        3
../pf6_26_genes_7_13/PfKH01/final.vcf.gz        5
../pf6_26_genes_7_13/PfKH02/final.vcf.gz        4
../pf6_26_genes_7_13/PfSD01/final.vcf.gz        3
../pf6_26_genes_7_13/PfSN01/final.vcf.gz        4
../pf6_26_genes_7_13/PfTG01/final.vcf.gz        4

Compared to tip of master:

./Pf7G8/final.vcf.gz    410              
./PfCD01/final.vcf.gz   396                      
./PfDd2/final.vcf.gz    379              
./PfGA01/final.vcf.gz   410                      
./PfGB4/final.vcf.gz    401              
./PfGN01/final.vcf.gz   390                      
./PfHB3/final.vcf.gz    362              
./PfIT/final.vcf.gz     383              
./PfKE01/final.vcf.gz   398                      
./PfKH01/final.vcf.gz   393                      
./PfKH02/final.vcf.gz   408                      
./PfSD01/final.vcf.gz   397                      
./PfSN01/final.vcf.gz   403                      
./PfTG01/final.vcf.gz   416 

I'm still evaluating the accuracy of calls on the new prgs, currently it looks like it improves some of my genes, but does worse in others. I'm investigating the worse ones

@bricoletc bricoletc mentioned this issue Oct 14, 2021
@bricoletc
Copy link
Member Author

Logging here another fix implemented in #28

imagine the msa program creates following alignment:

ATTTTTTA
A--TTTTA
ATTTT--A

the last two sequences are the same, but the msa program somehow aligned them differently, the two alignments are equally valid i believe.
then the attached graph can be created, in which paths 0-3 and 1-2 are the same seq (ATTTTA): dotgraph.pdf

NB: this is a MWE, but its not a theoretical point, this happens in practice in my pf data, albeit on longer super indel and repe
at rich alignments (in gene MSP6)

My fix is this: if in the alignment, you find at least one sequence (here, ATTTTA) with more than one gapped alignment (here the last two gapped seqs in the alignment), do not attempt seq clustering/recursive prg construction. just output the sequences as variants (here, ATTTTA vs ATTTTTTA)

The results in the comment above, reducing AMBIG calls, include this fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant