Interesting ideas. I'm not surprised to hear that the majority vote method used in gtdb_to_ncbi_majority_vote.py doesn't always produce the best NCBI taxonomy string. We aren't actively working on improving this script. Did you have some improvements that could be provided as a PR? Ideally, something that users could opt in to using if they don't want a strict majority vote.
Thanks for your feedback.
I have no clean code ready for a PR but I may work on it.
Here is what I have done so far:
Match GTDB with NCBI taxonomy using genomes from type material.
If no type-material genome is available, use RefSeq representative genomes.
Among 15,561 GTDB taxonomy entries, this strategy provides more precise annotation in 10% of cases.
This is a relatively small gain, but I think this curated dataset could be used for matching at higher taxonomic ranks in place of the entire NCBI GenBank.
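To make the strategy above concrete, here is a minimal sketch of the tiered selection. It is not code from gtdb_to_ncbi_majority_vote.py; the function name `pick_ncbi_taxonomy`, the dictionary keys (`ncbi_taxonomy`, `is_type_material`, `is_refseq_representative`), and the placeholder taxon labels are all hypothetical, and the final fallback to a plain majority vote over all genomes is an assumption about how the remaining cases would be handled.

```python
from collections import Counter


def pick_ncbi_taxonomy(genomes):
    """Pick one NCBI taxonomy string for the genomes of a single GTDB taxon.

    `genomes` is a list of dicts with hypothetical keys:
      - "ncbi_taxonomy": NCBI taxonomy string assigned to the genome
      - "is_type_material": True if the genome comes from type material
      - "is_refseq_representative": True if it is a RefSeq representative
    Returns None when the list is empty.
    """
    # Tiers are tried in order of trust; the last tier (all genomes)
    # is an assumed fallback equivalent to a plain majority vote.
    tiers = (
        [g for g in genomes if g.get("is_type_material")],
        [g for g in genomes if g.get("is_refseq_representative")],
        list(genomes),
    )
    for tier in tiers:
        if tier:
            votes = Counter(g["ncbi_taxonomy"] for g in tier)
            # On a tie, most_common keeps the first label encountered.
            return votes.most_common(1)[0][0]
    return None


# Placeholder data: three genomes carry one label, but the single
# type-material genome carries another and therefore wins.
example = [
    {"ncbi_taxonomy": "s__Species B", "is_type_material": False,
     "is_refseq_representative": False},
    {"ncbi_taxonomy": "s__Species B", "is_type_material": False,
     "is_refseq_representative": False},
    {"ncbi_taxonomy": "s__Species B", "is_type_material": False,
     "is_refseq_representative": True},
    {"ncbi_taxonomy": "s__Species A", "is_type_material": True,
     "is_refseq_representative": False},
]

result = pick_ncbi_taxonomy(example)
```

With a strict majority vote over all four genomes, the label shared by the three non-type genomes would be chosen; the tiered rule instead returns the label of the type-material genome.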
Hi,
The gtdb_to_ncbi_majority_vote.py script is great, but it is subject to bias when multiple genomes are incorrectly annotated at NCBI.
Have you considered implementing more complex rules such as:
I have run some tests, and this helped a lot in recovering the correct NCBI taxonomy at the species level.
Best,
Florian