-
Notifications
You must be signed in to change notification settings - Fork 4
Evaluating taxonomic classification accuracy
Clade exclusion analysis is a method for evaluating a software's taxonomic classification performance 1,2. The goal of it is to simulate scenarios where the reference database does not adequately reflect the full diversity of a sample, or a set of query sequences. By starting with a database that at first represents the query sequences, it proceeds by iteratively removing cohesive taxonomic groups (i.e. a clade) from the reference database, classifying query sequences belonging to these clades, then replacing them before carrying on to the next one. These clades often represent not just different taxa of the same rank, but also taxa of different rank. Therefore, clade exclusion analysis allows one to approximate expected classification performances with real community data, where queries vary in relation to the database, in a controlled experiment.
The TreeSAPP command treesapp evaluate
uses clade exclusion analysis to estimate a reference package's taxonomic classification performance for any taxonomic rank. It is useful to know this once you're created a new reference package and before you use treesapp assign
.
Supported versions >=0.9.0
Here is the command-line usage and options, printed to the console by treesapp evaluate -h
:
usage: treesapp evaluate [-v] [-h] -i INPUT [-o OUTPUT] [--delete]
[--trim_align] [-w MIN_SEQ_LENGTH]
[-m {prot,dna,rrna}] -r PKG_PATH [PKG_PATH ...]
[--accession2taxid ACC_TO_TAXID] [-a ACC_TO_LIN]
[--overwrite] [-n NUM_THREADS]
[--taxonomic_ranks {domain,phylum,class,order,family,genus,species} [{domain,phylum,class,order,family,genus,species} ...]]
[--fresh] [--tool {treesapp,graftm,diamond}]
[-l LENGTH]
[--stage {continue,lineages,classify,calculate}]
Evaluate classification performance using clade-exclusion analysis.
Required parameters:
-i INPUT, --fastx_input INPUT
An input file containing DNA or protein sequences in
either FASTA or FASTQ format
-r PKG_PATH [PKG_PATH ...], --refpkg_path PKG_PATH [PKG_PATH ...]
Path to the reference package pickle (.pkl) file.
Optional options:
-o OUTPUT, --output OUTPUT
Path to an output directory [DEFAULT = ./output/]
--trim_align Flag to turn on position masking of the multiple
sequence alignment [DEFAULT = False]
-w MIN_SEQ_LENGTH, --min_seq_length MIN_SEQ_LENGTH
minimal sequence length after alignment trimming
[DEFAULT = 30]
-m {prot,dna,rrna}, --molecule {prot,dna,rrna}
Type of input sequences (prot = protein; dna =
nucleotide [DEFAULT]; rrna = rRNA)
--taxonomic_ranks {domain,phylum,class,order,family,genus,species} [{domain,phylum,class,order,family,genus,species} ...]
A list of the taxonomic ranks (space-separated) to
test. [ DEFAULT = class species ]
--fresh Recalculate a fresh phylogenetic tree with the target
clades removed instead of removing the leaves
corresponding to targets from the reference tree.
--tool {treesapp,graftm,diamond}
Classify using one of the tools: treesapp [DEFAULT],
graftm, or diamond.
-l LENGTH, --length LENGTH
Arbitrarily slice the input sequences to this length.
Useful for testing classification accuracy for
fragments.
--stage {continue,lineages,classify,calculate}
The stage(s) for TreeSAPP to execute [DEFAULT =
continue]
Taxonomic-lineage arguments:
--accession2taxid ACC_TO_TAXID
Path to an NCBI accession2taxid file for more rapid
accession-to-lineage mapping.
-a ACC_TO_LIN, --accession2lin ACC_TO_LIN
Path to a file that maps sequence accessions to
taxonomic lineages, possibly made by `treesapp
create`...
Miscellaneous options:
-v, --verbose Prints a more verbose runtime log
-h, --help Show this help message and exit
--delete Delete all intermediate files to save disk space.
--overwrite overwrites previously processed output folders
-n NUM_THREADS, --num_procs NUM_THREADS
The number of CPU threads or parallel processes to use
in various pipeline steps [DEFAULT = 2]
Summary of important arguments:
-
-i/--fastx_input
is a fasta file containing query sequences. It's best to use a set of queries with a broad taxonomic range, unless you're scope is small and focused on a particular taxon. Using the same fasta file provided totreesapp create
is ideal. -
-r/--refpkg_path
is the path to the reference package .pkl file. -
--taxonomic_ranks
dictates what ranks are evaluated. By default this is just 'class' and 'species', to approximate performance at both ends of the spectrum. -
-l/--length
can be used to estimate the classification performance of sequences that are randomly trimmed to a specific length. For instance, if you plan on classifying ORFs predicted from unassembled or merged Illumina reads, you should probably estimate the performance of query sequences truncated to 50 to ~200 characters (150 bp and 300 bp merged paired-end reads, respectively). -
--fresh
will trigger TreeSAPP to recreate the phylogeny for every round of clade exclusion analysis, instead of pruning the leaf nodes from the tree representing excluded clades. This will require significantly more time to complete and is generally not recommended. -
-a/--accession2lin
refers to a tab-separated value file with a one-to-one mapping between a query sequence's accession and its taxonomic lineage. This file is automatically created bytreesapp create
, but may be deleted if the--delete
flag was invoked.
treesapp evaluate
can be ran iteratively, such that if you first ran it with --taxonomic_ranks class
you would be able to later re-run with --taxonomic_ranks genus species
and the results would be retabulated for class, genus and species - no need to start over again! That is, unless you included either the --overwrite
or --delete
flags...
treesapp evaluate
begins by collecting the taxonomic lineage information for all query sequences. This can be effectively skipped by providing the accession2lin
table.
Once the taxonomic lineages are known, the next step determines the set of query sequences that are to be used to evaluate the reference package at each taxonomic rank specified. Not every taxon can be used as often times a taxon will only be represented by one child taxon (e.g. Methylosarcina barkeri is the only species representing the genus Methylosarcina). In these cases, if the child taxon is removed, the optimal taxonomic classification will not represent a genus-level relationship, but potentially family or even more distant. These taxa are skipped to avoid misrepresenting the relationship between accuracy and taxonomic rank.
Something to note: taxonomic rank is used here as a proxy for evolutionary divergance, or an approximation of how related the query sequences are to the reference database. If you're interested in knowing how treesapp assign
performs when the query sequences are distantly related to the reference package, as would be the case when classifying query sequences from an understudied environment, then use 'class' or 'order'. If you're just interested in knowing the performance when the queries are well represented by the reference package, use 'species'.
With the sequences that are to be used for evaluating the reference package clade-exclusion analysis begins. Briefly, a new reference package is created without the reference sequences belonging to a taxon. Query sequences that belong to this taxon are then classified with the new reference package and the taxonomic labels assigned to the query sequences are saved before moving on to the next taxon.
After exhaustively performing clade exclusion analysis for all taxa of all ranks the classifications are analysed. Here, the distance from the optimal taxonomic assignment are calculated by comparing the assigned label to the taxa that were in the reference package used for classification. For example, if a taxon ('d__Bacteria; p__Nitrospirae; c__Nitrospira; o__Nitrospirales; f__Nitrospiraceae; g__Nitrospira') was assigned the label 'd__Bacteria; p__Nitrospirae; c__Nitrospira' and the closest relative was 'd__Bacteria; p__Nitrospirae; c__Nitrospira; o__Nitrospirales; f__Nitrospiraceae' the distance would be 2.
Three files summarizing the clade-exclusion analysis results can be found in the final_ouputs/
directory:
- accuracy.tsv: Honestly, not super useful. Provides the percentage of sequences that were correctly classified at each taxonomic rank. For instance, if the percentage at Phylum was 99.8, that would mean that if the taxonomic labels were truncated to the rank of phylum, 99.8% of these labels would be correct.
- clade_exclusion_performance.tsv: Summarises the number of sequences that were correctly assigned within a taxonomic distance, as well as the sequences that were over- (towards species) and under-classified (towards Root). The taxonomic distance controls the acceptable error. A distance of seven includes all query sequences where as a distance of two includes only the sequences that were classified within two ranks of the perfect taxonomic classification.
- taxonomic_recall.tsv: Provides the recall for each taxon. Columns are 'Rank', 'Taxon' and 'Recall'.
- Brady, A., & Salzberg, S. L. (2009). Phymm and PhymmBL: Metagenomic phylogenetic classification with interpolated Markov models. Nature Methods, 6(9), 673–676. https://doi.org/10.1038/nmeth.1358
- Peabody, M. A., Van Rossum, T., Lo, R., & Brinkman, F. S. L. (2015). Evaluation of shotgun metagenomics sequence classification methods using in silico and in vitro simulated communities. BMC Bioinformatics, 16(1), 363. https://doi.org/10.1186/s12859-015-0788-5