
Taxonomy Check

thibaudnis edited this page Feb 7, 2020 · 9 revisions

The pipeline will verify the organism name provided on input when the pgap.py flags --taxcheck or --taxcheck-only are used.

The taxonomy check assesses whether the organism name provided in the YAML input file matches the input genome sequence. Using average nucleotide identity (ANI), it compares the input genome sequence to the genomes of the type strains in GenBank. First, the set of type assemblies to which the input sequence is most closely related is determined via k-mer analysis. This set of assemblies is then aligned to the input sequence with pairwise MegaBLAST. The percent identity of the resulting filtered reciprocal best hits is reported as the overall genome-to-type-assembly ANI.
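To make the idea of k-mer preselection concrete, the sketch below ranks candidate type assemblies by the Jaccard distance between their k-mer sets and the query's. This is a generic illustration only: the source does not say which k-mer method or distance the pipeline uses, and the function names, the default `k`, and the `top_n` cutoff are all assumptions made for this example.

```python
from __future__ import annotations


def kmers(seq: str, k: int = 21) -> set[str]:
    """Return the set of all length-k substrings of a sequence (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """1 - |A & B| / |A | B|; 0.0 means the two k-mer sets are identical."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def closest_types(query: str, types: dict[str, str],
                  k: int = 21, top_n: int = 5) -> list[tuple[str, float]]:
    """Rank type assemblies by k-mer distance to the query, closest first.

    Hypothetical helper: `types` maps an assembly label to its sequence.
    """
    q = kmers(query, k)
    scored = [(name, jaccard_distance(q, kmers(seq, k))) for name, seq in types.items()]
    return sorted(scored, key=lambda item: item[1])[:top_n]
```

Only the assemblies returned by such a preselection step would then need the more expensive pairwise alignment.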

For most species, we use an ANI threshold of 96% identity to declare that a query assembly matches a type assembly.

More information is available in this publication.

Possible ANI statuses

The status returned by the taxonomy check can be one of the following:

CONFIRMED: The submitted organism name has been confirmed by ANI. A species can be confirmed by the following methods:

  • The assembly matches a type and both are of the same species.
  • The assembly matches a type and at least one is a subspecies of the same species.
  • The assembly lacks a submitted full binomial name (i.e., the submitted organism is a "sp.", or at genus level), matches a type, and both share the same genus.
  • The assembly matches a type of a species that was added to a specialized synonymy list designed to cover difficult-to-handle cases of typing.

MISASSIGNED: The submitted organism name has been found to be misassigned to the query assembly.

  • The assembly matches a type for a different species.
  • If the submitted organism name is a "sp.", there is a mismatch at the genus level.

INCONCLUSIVE: The organism cannot be identified.

  • There is no type assembly available for the submitted organism name.
  • The assembly matches a type at the same species, but the ANI is below the species ANI threshold.
  • The assembly matches a type at a different species, but the ANI is below the species ANI threshold.
  • The assembly and closest type do not share enough sequence to make a determination.
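The rules for the three statuses above can be summarized in a simplified decision sketch. It is not the pipeline's implementation: the `TypeMatch` structure and the `ani_status` function are hypothetical, the order in which the rules are checked is an assumption, and the subspecies and synonymy-list cases are left out. The 96% default is the threshold the page gives for most species.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TypeMatch:
    """Closest type assembly found for the query (hypothetical structure)."""
    genus: str
    species: str          # full binomial name of the type, e.g. "Rickettsia japonica"
    ani: float            # percent identity between query and type
    enough_shared: bool   # False if too little sequence is shared to decide


def ani_status(submitted_genus: str,
               submitted_species: Optional[str],
               best: Optional[TypeMatch],
               threshold: float = 96.0) -> str:
    """Return CONFIRMED, MISASSIGNED or INCONCLUSIVE for a simplified case.

    submitted_species is None when the submitted name is a "sp." or genus-level name.
    """
    # No usable type assembly, or too little shared sequence: no call possible.
    if best is None or not best.enough_shared:
        return "INCONCLUSIVE"
    # Below the species threshold the match is not trusted in either direction.
    if best.ani < threshold:
        return "INCONCLUSIVE"
    if submitted_species is None:
        # Only the genus can be compared for "sp." or genus-level names.
        return "CONFIRMED" if best.genus == submitted_genus else "MISASSIGNED"
    return "CONFIRMED" if best.species == submitted_species else "MISASSIGNED"
```

For example, a query submitted as Rickettsia hoogstraalii whose closest type is Rickettsia japonica at 99.975% ANI would come out as MISASSIGNED under this sketch.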

CONTAMINATED: Contamination in genome assemblies will be reported if the following conditions are met:

  • We have a reference covering at least 50% of the assembly
  • We have a single taxon accounting for at least 10% of the coverage and at least half of the remaining sequence.

Description of the reports

The taxonomy check will produce three reports:

ani-tax-report.txt

This file provides the results of the taxonomy check in text format. It includes

  • Submitted organism name: the organism declared by the submitter, along with taxid, rank (ex: species), and taxonomic lineage.
  • Predicted organism name: the organism identity determined by ANI. This may be the same as the submitted organism name.
  • Submitted organism has type: possible values are Yes and No. Indicates whether there is a public genome assembly available for the type strain of the declared species.
  • Status: possible values are CONFIRMED, MISASSIGNED, INCONCLUSIVE or CONTAMINATED (see above).
  • Confidence: possible values are HIGH or LOW. Indicates the confidence level of the stated contamination. Confidence HIGH: the ANI meets the expected cutoff (96% for most prokaryotic taxa). Confidence LOW: the ANI does not meet the expected cutoff, but the result is the best prediction possible based on currently available data.
  • ANI statistics: A table with the following columns
    • Percent identity: the percent identity the submitted sequence has to a public type strain sequence of a different species.
    • (Query coverage, Subject coverage): the percent coverage of the query (submitted sequence) by the subject (public type strain), and the percent coverage of the subject (public type strain) by the query (submitted sequence), respectively.
    • Genbank assembly ID: identifier for the GenBank assembly used in comparison.
    • Organism name: The organism name of the public type strain used for comparison.
    • Assembly accession, assembly name: The assembly accession and assembly name of the public type strain used for comparison.
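For readers who want to post-process the text report, here is a minimal sketch that parses one row of the ANI statistics table. The row layout is inferred from the example reports on this page rather than from a format specification, so the regular expression, the `AniRow` type, and the `parse_ani_row` function are assumptions; the sketch also expects each row on a single line.

```python
import re
from typing import NamedTuple, Optional


class AniRow(NamedTuple):
    identity: float          # percent identity
    query_coverage: float    # percent of the submitted sequence covered
    subject_coverage: float  # percent of the type assembly covered
    assembly_id: int         # GenBank assembly ID
    organism: str            # organism name of the type strain
    accession: str           # assembly accession, e.g. GCA_000283595.1
    assembly_name: str       # assembly name, e.g. ASM28359v1


# Layout assumed from the examples:
# identity (qcov scov) id assembly organism (accession, name)
ROW = re.compile(
    r"^\s*(?P<identity>\d+(?:\.\d+)?)\s+"
    r"\((?P<qcov>\d+(?:\.\d+)?)\s+(?P<scov>\d+(?:\.\d+)?)\)\s+"
    r"(?P<assembly_id>\d+)\s+assembly\s+"
    r"(?P<organism>.+?)\s+"
    r"\((?P<accession>[^,()]+),\s*(?P<name>[^()]+)\)\s*$"
)


def parse_ani_row(line: str) -> Optional[AniRow]:
    """Parse one ANI statistics row; return None for any other line."""
    m = ROW.match(line)
    if m is None:
        return None
    return AniRow(
        identity=float(m["identity"]),
        query_coverage=float(m["qcov"]),
        subject_coverage=float(m["scov"]),
        assembly_id=int(m["assembly_id"]),
        organism=m["organism"],
        accession=m["accession"],
        assembly_name=m["name"],
    )
```

Header lines such as `Status: MISASSIGNED` do not match the pattern and yield `None`, so the function can be applied to every line of the report and the results filtered.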

ani-tax-report.xml

The same data as ani-tax-report.txt, but in XML format.

kmer-tax-report.xml

List of assemblies selected by k-mer analysis for ANI calculation, and their k-mer distance to the query assembly, in XML format.

Example reports

Example of a MISASSIGNED report:

ANI report for assembly: my_gc_assm_name
Submitted organism: Rickettsia hoogstraalii (taxid = 467174, rank = species, lineage = Bacteria; Proteobacteria; Alphaproteobacteria; Rickettsiales; Rickettsiaceae; Rickettsieae; Rickettsia; spotted fever group)
Predicted organism: Rickettsia japonica (taxid = 35790, rank = species, lineage = Bacteria; Proteobacteria; Alphaproteobacteria; Rickettsiales; Rickettsiaceae; Rickettsieae; Rickettsia; spotted fever group)
Submitted organism has type: Yes
Status: MISASSIGNED
Confidence: HIGH
99.975 (99.8 99.8)  406738 assembly  Rickettsia japonica YH (GCA_000283595.1, ASM28359v1)
99.985 (99.4 99.9)  864348 assembly  Rickettsia japonica YH (GCA_000302635.2, ASM30263v2)
97.722 (96.5 97.1)  320558 assembly  Rickettsia slovaca 13-B (GCA_000237845.1, ASM23784v1)
98.893 (95.3 84.4) 6004488 assembly  Rickettsia fournieri (GCA_900243065.1, PRJEB23962)
97.100 (96.3 91.7)  834068 assembly  Rickettsia gravesii BWI-1 (GCA_000485845.1, RicGra1.0)
97.246 (95.9 83.0) 1655938 assembly  Rickettsia raoultii (GCA_000940955.1, ASM94095v1)
97.114 (95.8 96.9) 3973378 assembly  Rickettsia rickettsii (GCA_001951015.1, ASM195101v1)
97.115 (95.8 96.9) 3973358 assembly  Rickettsia rickettsii (GCA_001950995.1, ASM195099v1)
97.115 (95.8 96.9) 1526588 assembly  Rickettsia rickettsii str. Iowa (GCA_000017445.3, ASM1744v3)
[...]

Example of a CONTAMINATED report:

ANI report for assembly: my_gc_assm_name
Submitted organism: Staphylococcus aureus (taxid = 1280, rank = species, lineage = Bacteria; Firmicutes; Bacilli; Bacillales; Staphylococcaceae; Staphylococcus)
Predicted organism: Staphylococcus aureus (taxid = 1280, rank = species, lineage = Bacteria; Firmicutes; Bacilli; Bacillales; Staphylococcaceae; Staphylococcus)
Submitted organism has type: Yes
Status: CONTAMINATED
Confidence: HIGH
99.045 (54.5 80.3) 4972758 assembly Ochrobactrum quorumnocens (GCA_XXXXXXXXXX.1, ASMXXXXXXXXv1)

In the above example, the organism was declared by the submitter to be Staphylococcus aureus. The predicted organism was in agreement, but contamination was found.

The contaminating organism was Ochrobactrum quorumnocens, which has 99.045% identity over 54.5% of the submitted sequence, representing 80.3% of the contaminating organism's genome.
