Use type strains for GTDB to NCBI taxonomy translation #552

fplazaonate · 2023-10-16T12:32:51Z

Hi,

The gtdb_to_ncbi_majority_vote.py script is great, but its results are biased when multiple genomes are incorrectly annotated at NCBI.

Have you considered implementing more complex rules such as:

Give more weight to genomes representative of type strains?
Give more weight to genomes included in RefSeq?

I have run some tests, and these rules helped a lot to recover the correct NCBI taxonomy at the species level.
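
To make the weighting idea concrete, here is a minimal sketch of a weighted vote for a single GTDB taxon. It is not code from gtdb_to_ncbi_majority_vote.py: the function name, the category labels, and the weight values are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical weights (assumption): not taken from GTDB-Tk or any GTDB script.
WEIGHTS = {"type_strain": 5.0, "refseq": 2.0, "other": 1.0}


def weighted_vote(genomes, weights=WEIGHTS):
    """Pick the NCBI taxon with the highest total weight.

    genomes: iterable of (ncbi_taxon, category) pairs, one pair per genome
    assigned to the GTDB taxon being translated.
    """
    totals = defaultdict(float)
    for ncbi_taxon, category in genomes:
        # Unknown categories count as a single plain vote.
        totals[ncbi_taxon] += weights.get(category, 1.0)
    if not totals:
        return None
    # Iterate in sorted order so that ties resolve alphabetically.
    return max(sorted(totals), key=totals.get)


genomes = [
    ("s__A", "other"),
    ("s__A", "other"),
    ("s__A", "other"),
    ("s__B", "type_strain"),
]
uniform = {"type_strain": 1.0, "refseq": 1.0, "other": 1.0}
print(weighted_vote(genomes, uniform))  # plain majority vote
print(weighted_vote(genomes))           # type strain outweighs three plain votes
```

With uniform weights this reduces to the plain majority vote, so the weighting could stay opt-in.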

Best,
Florian

donovan-h-parks · 2023-10-16T16:33:59Z

Hi Florian,

Interesting ideas. I'm not surprised to hear that the majority vote method used in gtdb_to_ncbi_majority_vote.py doesn't always produce the best NCBI taxonomy string. We aren't actively working on improving this script. Did you have some improvements that could be provided as a PR? Ideally, something that users could opt in to using if they don't want a strict majority vote.

Thanks,
Donovan

fplazaonate · 2023-10-18T08:34:48Z

Hi Donovan,

Thanks for your feedback.
I have no clean code ready for a PR, but I may work on it.

Here is what I have done so far:

Match GTDB with NCBI taxonomy using genomes from type material.
If none are available, use RefSeq representative genomes.

Among 15,561 GTDB taxonomy entries, this strategy provides more precise annotation in 10% of cases.
This is a relatively small gain, but I think that this curated dataset could be used for matching at higher taxonomic ranks in place of the entire NCBI GenBank.
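
The two-step strategy above can be sketched roughly as follows. This is an illustration, not the code used for the tests: the field names `ncbi_taxon`, `is_type_material`, and `is_refseq_rep` are hypothetical, and the last tier (a majority vote over all genomes when neither curated set is available) is an assumption added so the function always has a fallback.

```python
from collections import Counter


def tiered_ncbi_taxon(genomes):
    """Return (ncbi_taxon, tier) for one GTDB taxon, or (None, None).

    genomes: list of dicts with the hypothetical keys 'ncbi_taxon',
    'is_type_material' and 'is_refseq_rep'.
    """
    tiers = (
        ("type_material", lambda g: g["is_type_material"]),
        ("refseq_representative", lambda g: g["is_refseq_rep"]),
        # Assumed final fallback: plain majority vote over all genomes.
        ("all_genomes", lambda g: True),
    )
    for tier_name, keep in tiers:
        votes = Counter(g["ncbi_taxon"] for g in genomes if keep(g))
        if votes:
            # Highest count wins; ties resolve alphabetically.
            taxon, _ = min(votes.items(), key=lambda kv: (-kv[1], kv[0]))
            return taxon, tier_name
    return None, None


example = [
    {"ncbi_taxon": "s__A", "is_type_material": False, "is_refseq_rep": False},
    {"ncbi_taxon": "s__A", "is_type_material": False, "is_refseq_rep": False},
    {"ncbi_taxon": "s__B", "is_type_material": True, "is_refseq_rep": True},
]
print(tiered_ncbi_taxon(example))
```

Returning the tier alongside the taxon makes it easy to report how many translations relied on type material versus the fallbacks.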

Best,
Florian
