Obtaining phylogenetic OTUs via tree clustering
Phylogenetic clustering can be a useful tool for delineating evolutionarily distinct sequences in a set of queries without a taxonomic framework.
These methods have been used to estimate OTUs automatically from both microbial 16S rRNA gene phylogenies [1] and concatenated marker gene trees [2]. What these methods share is that they infer a phylogeny that includes all query sequences, so that each query is a leaf node, and then cluster the leaves of the tree by some phylogenetic distance threshold.
Here, a similar process is used to infer the primary set of clusters from the leaf nodes, but a secondary clustering step also breaks internal nodes into clusters. This step is essential to support TreeSAPP's phylogenetic placement data, because query sequences that are only distantly represented in a reference package are often placed on internal nodes of a tree. Once the clusters are defined, query sequences are assigned to clusters based on their placement edge.
The value of these phylogenetic OTUs (pOTUs) is in providing users a quantitative measure of a sample's diversity (for a given reference package) regardless of how resolved the reference package's taxa are.
Note: the treesapp phylotu utility is currently in a 'beta' development phase and has been neither published nor rigorously tested. Significant changes may still come to this subcommand's function. Feedback is welcome.
Supported versions >=0.9.6
treesapp phylotu is meant to (1) partition a phylogeny into clusters based on evolutionary distance and (2) assign query sequences that were placed onto that phylogeny to those clusters.
The process begins by calculating a distance threshold to separate the clusters, if one is not provided. This value is the 95th percentile of the distances within monophyletic lineages. These distances are found by comparing leaves all-vs-all and finding the common ancestor of each leaf pair; if that common ancestor is monophyletic, the distance from one leaf to the other is calculated and collected. The threshold can be controlled by changing the taxonomic rank that the clusters should reflect. For example, when species-level clusters are requested as a command-line argument, the distance threshold may be 0.1, and all nodes separated by less than 0.1 cumulative branch length would be grouped into a cluster. If this parameter were changed to, say, family, the threshold would increase (to 0.5, for example) and more nodes would be grouped into each family-level cluster.
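The threshold estimate described above can be sketched on a toy tree. This is only an illustration, not TreeSAPP's code: the tree, the species labels, the helper names, the reading of "monophyletic" as "every leaf under the common ancestor carries the same label", and the nearest-rank percentile are all assumptions made for the example.

```python
import math
from itertools import combinations

# Hypothetical toy tree: child -> (parent, branch length to parent)
PARENT = {
    "A": ("n1", 0.02), "B": ("n1", 0.03),
    "C": ("n2", 0.05), "n1": ("n2", 0.04),
    "D": ("root", 0.30), "n2": ("root", 0.10),
}
# Hypothetical species label for each leaf
TAXON = {"A": "sp1", "B": "sp1", "C": "sp1", "D": "sp2"}


def path_to_root(node):
    """Return [(ancestor, distance from node to ancestor)], starting at node itself."""
    path, dist = [(node, 0.0)], 0.0
    while node in PARENT:
        node, length = PARENT[node]
        dist += length
        path.append((node, dist))
    return path


def leaves_below(node):
    """All leaves whose path to the root passes through node."""
    return [leaf for leaf in TAXON if node in dict(path_to_root(leaf))]


def monophyletic_distances():
    """Leaf-to-leaf distances for pairs whose common ancestor spans one taxon."""
    dists = []
    for a, b in combinations(sorted(TAXON), 2):
        up_a = dict(path_to_root(a))
        # The first ancestor of b that is also an ancestor of a is their common ancestor
        lca, dist_b = next((n, d) for n, d in path_to_root(b) if n in up_a)
        if len({TAXON[leaf] for leaf in leaves_below(lca)}) == 1:
            dists.append(up_a[lca] + dist_b)
    return dists


def percentile(values, pct):
    """Nearest-rank percentile (an assumed definition for this sketch)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


threshold = percentile(monophyletic_distances(), 95)
print(round(threshold, 3))
```

Only the pairs A-B, A-C and B-C contribute here, because any pair involving D has the root as its common ancestor and the root spans two species.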
Next, treesapp phylotu uses the max-diameter min-cut partitioning algorithm published by Balaban et al. (2019) for partitioning the tree nodes into clusters [3].
Briefly, the nodes in the phylogeny are traversed in post-order. If the distances of a node's children sum to more than the threshold, the child with the greater branch length is removed from the tree to form a cluster. If not, the parent node's distance is set to the sum of the distance to both children. The nodes are iterated over until all leaf nodes have been popped from the tree and thereby assigned to clusters.
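A minimal sketch of this kind of post-order partitioning is shown below. It follows the max-diameter formulation as published for TreeCluster, in which a node that is not cut keeps the longer of its two child paths; TreeSAPP's own implementation may differ in its bookkeeping. The toy tree and all function names are hypothetical.

```python
# Hypothetical rooted binary tree: node -> [(child, branch length), (child, branch length)]
CHILDREN = {
    "root": [("n1", 0.05), ("n2", 0.40)],
    "n1": [("A", 0.02), ("B", 0.03)],
    "n2": [("C", 0.04), ("D", 0.30)],
}


def leaves_under(node, cut):
    """Leaves reachable from node without crossing an already-cut edge."""
    if node not in CHILDREN:
        return [node]
    found = []
    for child, _ in CHILDREN[node]:
        if child not in cut:
            found.extend(leaves_under(child, cut))
    return found


def max_diameter_min_cut(root, threshold):
    """Cut the tree so that no cluster's leaf-to-leaf diameter exceeds threshold."""
    clusters, cut = [], set()

    def visit(node):
        # Returns the longest path from node down to a leaf that is still attached
        if node not in CHILDREN:
            return 0.0
        (left, w_left), (right, w_right) = CHILDREN[node]
        d_left = visit(left) + w_left
        d_right = visit(right) + w_right
        if d_left + d_right > threshold:
            # Detach the child on the longer side as its own cluster
            far, kept = (left, d_right) if d_left > d_right else (right, d_left)
            clusters.append(leaves_under(far, cut))
            cut.add(far)
            return kept
        return max(d_left, d_right)

    visit(root)
    # Whatever is still attached to the root forms the last cluster
    clusters.append(leaves_under(root, cut))
    return [cluster for cluster in clusters if cluster]


print(max_diameter_min_cut("root", 0.1))
```

With a threshold of 0.1 the long branches leading to D and to the n2 subtree are cut, leaving A and B together and C and D each alone.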
However, this algorithm produces clusters that each contain a leaf. The clusters at this stage are sets of connected nodes, and each node has either one or zero children. To create clusters that represent internal nodes (i.e. ancestral states), the clusters are iterated over again. As before, the nodes in each cluster are parsed from the distal node (e.g. leaf) to the proximal node (e.g. root). If the accumulated branch length of one or more nodes in a cluster exceeds the distance threshold, the cluster is divided again into two clusters.
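One way to picture this second pass is to treat a cluster as a chain of nodes ordered from distal to proximal and to start a new cluster whenever the running branch length would pass the threshold. The sketch below makes that assumption explicit; the function name, the chain representation and the choice to allow more than one split per chain are illustrative and are not taken from TreeSAPP.

```python
def split_chain(chain, threshold):
    """Split a chain of (node, branch length to its parent) pairs, ordered
    distal to proximal, wherever the accumulated length would exceed threshold."""
    pieces, current, accumulated = [], [], 0.0
    for node, length in chain:
        if current and accumulated + length > threshold:
            # Close the current piece and start a new one at this node
            pieces.append(current)
            current, accumulated = [], 0.0
        current.append(node)
        accumulated += length
    if current:
        pieces.append(current)
    return pieces


# Hypothetical cluster: a leaf followed by two internal nodes toward the root
chain = [("leaf_7", 0.04), ("n12", 0.05), ("n3", 0.08)]
print(split_chain(chain, 0.1))
```

Here the leaf and its parent stay together, while the deeper internal node n3 becomes a cluster of its own that a distantly related query could be assigned to.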
The final step involves assigning query sequences to these clusters based on their placement's edge number. The placements are parsed from the JPlace files generated by treesapp assign. While the queries are being assigned, the number of query sequences assigned to each cluster is tracked, and these counts become the data for the table. Simple!
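The assignment step can be illustrated with a small JPlace-style document. The field names follow the JPlace format, but the edge-to-cluster lookup, the cluster names, and the choice of each query's placement with the highest like_weight_ratio are assumptions made for this sketch rather than a description of TreeSAPP's internals.

```python
import json
from collections import Counter

# Minimal JPlace-style document; real files come from treesapp assign
JPLACE = json.loads("""
{
  "tree": "((A:0.02{0},B:0.03{1}):0.05{2},C:0.4{3}){4};",
  "fields": ["edge_num", "likelihood", "like_weight_ratio",
             "distal_length", "pendant_length"],
  "placements": [
    {"p": [[0, -100.0, 0.9, 0.01, 0.02], [2, -101.0, 0.1, 0.01, 0.02]],
     "n": ["query_1"]},
    {"p": [[3, -90.0, 1.0, 0.10, 0.05]], "n": ["query_2", "query_3"]}
  ],
  "version": 3,
  "metadata": {}
}
""")

# Hypothetical edge number -> cluster lookup produced by the partitioning step
EDGE_TO_CLUSTER = {0: "pOTU_1", 1: "pOTU_1", 2: "pOTU_1", 3: "pOTU_2"}


def count_queries(jplace, edge_to_cluster):
    """Count the query sequences that land in each cluster."""
    edge_col = jplace["fields"].index("edge_num")
    lwr_col = jplace["fields"].index("like_weight_ratio")
    counts = Counter()
    for placement in jplace["placements"]:
        # Assumption: use the placement with the highest like_weight_ratio
        best = max(placement["p"], key=lambda row: row[lwr_col])
        # Names are stored under "n", or under "nm" as [name, multiplicity] pairs
        names = placement.get("n") or [nm[0] for nm in placement.get("nm", [])]
        counts[edge_to_cluster[best[edge_col]]] += len(names)
    return counts


print(dict(count_queries(JPLACE, EDGE_TO_CLUSTER)))
```

Running one such count per sample and placing the results side by side gives the rows-by-samples layout of the output table.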
Here is the command-line usage and options:
usage: treesapp phylotu [-v] [-h] -r PKG_PATH [PKG_PATH ...] [-o OUTPUT]
                        [-j JPLACE [JPLACE ...] | --assign_output TS_OUT
                        [TS_OUT ...]] [-t {class,order,family,genus,species}]
                        [-a ALPHA]

A tool for sorting query sequences placed on a phylogeny into
phylogenetically-inferred clusters.

Required parameters:
  -r PKG_PATH [PKG_PATH ...], --refpkg_path PKG_PATH [PKG_PATH ...]
                        Path to the reference package pickle (.pkl) file.
  -j JPLACE [JPLACE ...], --jplace JPLACE [JPLACE ...]
                        Path to one or more JPlace files generated by
                        placement on a reference package's phylogeny.
  --assign_output TS_OUT [TS_OUT ...]
                        Path to one or more output directories of treesapp
                        assign.

Optional options:
  -o OUTPUT, --output OUTPUT
                        Path to an output directory [DEFAULT = ./output/]
  -t {class,order,family,genus,species}, --tax_rank {class,order,family,genus,species}
                        The taxonomic rank the cluster radius should
                        approximately represent. [ DEFAULT = 'species' ].
  -a ALPHA, --alpha ALPHA
                        The evolutionary distance threshold defining the
                        cluster boundaries. [ DEFAULT = auto ].

Miscellaneous options:
  -v, --verbose         Prints a more verbose runtime log
  -h, --help            Show this help message and exit
There are two required arguments for treesapp phylotu:
- refpkg_path: Path to a reference package .pkl. In future versions, just the name of the reference package will serve and TreeSAPP will look under the default reference package path in treesapp/data/.
- Either jplace or assign_output: Paths to one or more .jplace files (these store the phylogenetic placement information) or paths to one or more treesapp assign output directories. These arguments are mutually exclusive, so be sure to only provide one!
The optional arguments are:
- output: Path to an output directory. Two files, called phylotu_matrix.tsv and phylotu_taxa.tsv, will be written there. More on these later.
- tax_rank: A taxonomic rank. This rank (default being species) will be modelled by evolutionary distances in the reference package's phylogeny such that, under ideal circumstances, the query sequences in those clusters will all belong to the same taxon of the specified rank.
- alpha: An evolutionary distance that will be used to separate the different phylogenetic clusters. If no value is provided, this threshold will be estimated using the taxonomic rank, as explained above.
The file phylotu_matrix.tsv is a tab-delimited matrix with the OTU identifiers as the row names and the sample names as the column names. Each cell is the number of sequences from that sample that were assigned to that pOTU. The resulting data can be handled similarly to OTUs derived from 16S rRNA gene amplicons.
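As a small illustration of working with such a matrix, the sketch below reads a rows-by-samples table and reports pOTU richness per sample. The file contents, the header label of the first column and the function names are made up for the example; only the general layout (pOTUs in rows, samples in columns) comes from the description above.

```python
import csv
import io

# Hypothetical matrix contents: pOTUs in rows, samples in columns
TOY_MATRIX = "pOTU\tsample_A\tsample_B\n1\t12\t5\n2\t3\t7\n3\t0\t1\n"


def load_matrix(handle):
    """Return {sample: {pOTU: count}} from a tab-delimited matrix."""
    reader = csv.reader(handle, delimiter="\t")
    samples = next(reader)[1:]  # first header cell labels the row names
    table = {sample: {} for sample in samples}
    for row in reader:
        for sample, value in zip(samples, row[1:]):
            table[sample][row[0]] = int(value)
    return table


def richness(table):
    """Number of pOTUs observed at least once in each sample."""
    return {
        sample: sum(1 for count in counts.values() if count > 0)
        for sample, counts in table.items()
    }


table = load_matrix(io.StringIO(TOY_MATRIX))
print(richness(table))
```

To read a real file, the io.StringIO object would be replaced by an open handle on phylotu_matrix.tsv in the chosen output directory.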
The file phylotu_taxa.tsv holds a two-column, tab-delimited table mapping pOTUs to their taxonomic labels. These labels are derived from a naive LCA of all descendants in the tree. This label is less accurate than the taxonomic assignment that treesapp assign gives to query sequences, and is typically more poorly resolved. But... it's there if you want it!
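A naive LCA of this kind amounts to keeping the ranks that every descendant's lineage has in common, starting from the root of the taxonomy. The sketch below assumes lineages are written as "; "-separated strings; that format, the example lineages and the function name are assumptions for illustration.

```python
def naive_lca(lineages, sep="; "):
    """Return the longest leading run of ranks shared by every lineage."""
    split = [lineage.split(sep) for lineage in lineages]
    shared = []
    for ranks in zip(*split):
        # Stop at the first rank where the lineages disagree
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return sep.join(shared)


# Hypothetical lineages of the reference leaves under one pOTU
descendants = [
    "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
    "d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria",
]
print(naive_lca(descendants))
```

Because a single divergent descendant truncates the label for the whole cluster, the result tends to be coarser than a per-query assignment.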
1. Sharpton, T. J., Riesenfeld, S. J., Kembel, S. W., Ladau, J., O’Dwyer, J. P., Green, J. L., … Pollard, K. S. (2011). PhyLOTU: A high-throughput procedure quantifies microbial community diversity and resolves novel taxa from metagenomic data. PLoS Computational Biology, 7(1). https://doi.org/10.1371/journal.pcbi.1001061
2. Wu, D., Doroud, L., & Eisen, J. A. (2013). TreeOTU: Operational Taxonomic Unit Classification Based on Phylogenetic Trees. ArXiv.
3. Balaban, M., Moshiri, N., Mai, U., Jia, X., & Mirarab, S. (2019). TreeCluster: Clustering biological sequences using phylogenetic trees. PLoS ONE, 14(8), e0221068. https://doi.org/10.1371/journal.pone.0221068