Indels can cause spurious site production #27
Some good news: at commit 68e032c (branch fix_27) I obtain the following number of AMBIG calls in a set of 14 P. falciparum validation samples:
Compared to tip of master:
I'm still evaluating the accuracy of calls on the new prgs. Currently it looks like it improves some of my genes but does worse on others. I'm investigating the worse ones.
Logging here another fix, implemented in #28. Imagine the MSA program creates the following alignment:
The last two sequences are the same, but the MSA program somehow aligned them differently; I believe the two alignments are equally valid. NB: this is a MWE, but it's not a theoretical point: this happens in practice in my pf data, albeit on longer super indel and repe

My fix is this: if, in the alignment, you find at least one sequence (here, ATTTTA) with more than one gapped alignment (here, the last two gapped seqs in the alignment), do not attempt seq clustering/recursive prg construction. Just output the sequences as variants (here, ATTTTA vs ATTTTTTA).

The results in the comment above, reducing
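A minimal sketch of the check described above, not the actual code from #28. The function names and the three-row alignment are hypothetical, chosen only so that ATTTTA appears with two different gap placements next to ATTTTTTA:

```python
from collections import defaultdict


def has_inconsistent_gapping(aligned_rows, gap="-"):
    """True if some ungapped sequence appears with more than one gapped form."""
    gapped_forms = defaultdict(set)
    for row in aligned_rows:
        # Key on the sequence with gaps removed; collect its distinct gapped rows.
        gapped_forms[row.replace(gap, "")].add(row)
    return any(len(forms) > 1 for forms in gapped_forms.values())


def plain_variants(aligned_rows, gap="-"):
    """Unique ungapped sequences, in order of first appearance."""
    seen = []
    for row in aligned_rows:
        seq = row.replace(gap, "")
        if seq not in seen:
            seen.append(seq)
    return seen


# Hypothetical alignment: the last two rows are both ATTTTA, gapped differently.
alignment = ["ATTTTTTA", "ATTTT--A", "A--TTTTA"]

if has_inconsistent_gapping(alignment):
    # Skip clustering/recursion and emit the sequences as flat variants.
    variants = plain_variants(alignment)
```

Here `variants` holds the two alleles to output directly, so no nested site is built from rows that differ only in gap placement.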
This is linked to #15 as it leads to production of ambiguous prgs (multiple paths give rise to same sequence)
Clustering code can conclude that there is no meaningful clustering of a set of sequences, e.g. it puts each sequence in one cluster.
However, the code that calls clustering can re-run prg-building for sets of sequences that differ only in alignment (i.e. gap positioning), not in sequence (here).
This leads to spurious 'nested variants' and the following pathological example case:
msp6_ambig.pdf
Of the 4 paths between nodes labeled 55 and 56, two are identical. This causes gramtools to mis-genotype down the line.
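To illustrate what "ambiguous" means here, the sketch below enumerates every path through a small nested prg and reports sequences spelled by more than one path. The nested-list representation and the toy prg are assumptions made for this example; they are not the project's data structures and not the msp6 graph from the PDF.

```python
from collections import Counter
from itertools import product


def spell_paths(prg):
    """All sequences spelled by a prg.

    A prg is a list of items: a str is invariant sequence, a list is a
    variant site whose entries are alternative sub-prgs.
    """
    per_item = []
    for item in prg:
        if isinstance(item, str):
            per_item.append([item])
        else:
            alternatives = []
            for alt in item:
                alternatives.extend(spell_paths(alt))
            per_item.append(alternatives)
    # One choice per item, concatenated, gives one path through the prg.
    return ["".join(parts) for parts in product(*per_item)]


def ambiguous_sequences(prg):
    """Sequences produced by more than one path, with their path counts."""
    counts = Counter(spell_paths(prg))
    return {seq: n for seq, n in counts.items() if n > 1}


# Toy prg with 4 paths through one site, two of which spell the same sequence.
toy_prg = [
    "A",
    [
        ["TTTT", [["TT"], [""]]],   # spells TTTTTT and TTTT
        [[["TTTT"], ["TTC"]]],      # spells TTTT and TTC
    ],
    "A",
]

duplicates = ambiguous_sequences(toy_prg)
```

A non-empty `duplicates` means at least two paths give rise to the same sequence, which is the situation described above.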
I have a fix that I'm implementing and will PR in