Obtaining phylogenetic OTUs via tree clustering

Overview

Phylogenetic clustering can be a useful tool for delineating evolutionarily distinct sequences in a set of queries without a taxonomic framework.

These methods have been used to estimate OTUs automatically from both microbial 16S rRNA gene phylogenies¹ and concatenated marker gene trees². What these earlier methods share is that they infer a phylogeny that includes all query sequences (i.e. each query is a leaf node), then cluster the leaves of the tree by some phylogenetic distance threshold.

Here, a similar process is used to infer the primary set of clusters from the leaf nodes, but a secondary clustering step breaks internal nodes into clusters as well. This step is essential to support TreeSAPP's phylogenetic placement data, as query sequences that are only distantly represented in a reference package are often inserted into internal nodes of a tree. Once the clusters are defined, query sequences are assigned to clusters based on their placement edge.

The value of these phylogenetic OTUs (pOTUs) is in providing users a quantitative measure of a sample's diversity (for a given reference package) regardless of how resolved the reference package's taxa are.

Note: the treesapp phylotu utility is currently in a 'beta' development phase and has been neither published nor rigorously tested. Significant changes may still come to this subcommand's function. Feedback is welcome.

Supported versions >=0.9.6

Workflow description

treesapp phylotu is meant to (1) partition a phylogeny into clusters based on evolutionary distance and (2) assign query sequences that were placed onto that phylogeny to those clusters.

The process begins by calculating a distance threshold to separate the clusters, if one is not provided. This value is the 95th percentile of the distances within monophyletic lineages. The distances are found by comparing leaves all-vs-all: for each pair of leaf nodes the common ancestor is found and, if that ancestor is monophyletic, the distance from one leaf to the other is calculated and collected. The resulting threshold can be adjusted by changing the taxonomic rank the clusters should reflect. For example, when species-level clusters are requested as a command-line argument, the distance threshold may be 0.1, and all nodes separated by less than 0.1 cumulative branch length would be grouped into a cluster. If this parameter were changed to, say, family, the threshold would increase (to 0.5, for example) and more nodes would be grouped into each family-level cluster.
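The threshold estimate described above can be sketched on a toy tree. This is a minimal illustration, not TreeSAPP's code: the tree, the species labels, and the helper names (`path_to_root`, `leaves_under`, `monophyletic_distances`, `percentile`) are all hypothetical, and "monophyletic" is taken to mean that every leaf under the common ancestor carries the same label at the requested rank.

```python
from itertools import combinations

# Toy rooted tree (hypothetical data): child -> (parent, branch length).
PARENT = {
    "A": ("n1", 0.02), "B": ("n1", 0.03),
    "C": ("n2", 0.05), "D": ("n2", 0.04),
    "n1": ("root", 0.10), "n2": ("root", 0.20),
}
# Hypothetical species labels for the leaves.
TAXON = {"A": "sp1", "B": "sp1", "C": "sp2", "D": "sp2"}


def path_to_root(node):
    """Return [(ancestor, distance from node)], starting with the node itself."""
    path, dist = [(node, 0.0)], 0.0
    while node in PARENT:
        node, length = PARENT[node]
        dist += length
        path.append((node, dist))
    return path


def leaves_under(node):
    """All leaves whose path to the root passes through `node`."""
    return [leaf for leaf in TAXON if node in dict(path_to_root(leaf))]


def monophyletic_distances():
    """Leaf-to-leaf distances for pairs whose common ancestor is monophyletic."""
    dists = []
    for a, b in combinations(sorted(TAXON), 2):
        up_a = dict(path_to_root(a))
        for ancestor, dist_b in path_to_root(b):
            if ancestor in up_a:  # first shared ancestor is the common ancestor
                break
        if len({TAXON[leaf] for leaf in leaves_under(ancestor)}) == 1:
            dists.append(up_a[ancestor] + dist_b)
    return dists


def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


alpha = percentile(monophyletic_distances(), 95)
```

Only the A-B and C-D pairs contribute here, because every other pair meets at the root, which spans two species.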

Next, treesapp phylotu uses the max-diameter min-cut partitioning algorithm published by Balaban et al. (2019) to partition the tree nodes into clusters³. Briefly, the nodes in the phylogeny are traversed in post-order. If the distances of a node's children sum to more than the threshold, the child with the greater branch length is removed from the tree to form a cluster. If not, the parent node's distance is set to the sum of the distance to both children. The nodes are iterated over until all leaf nodes have been popped from the tree and thereby assigned to clusters.
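As a rough illustration of the partitioning step, here is a minimal sketch of max-diameter min-cut clustering on a strictly binary toy tree. It follows the published TreeCluster description, in which each node tracks the distance to its farthest still-connected leaf; TreeSAPP's own bookkeeping may differ. The `Node` class, the `max_diameter_clusters` function, and the toy tree are hypothetical.

```python
class Node:
    """Minimal binary tree node; `length` is the branch to the parent."""

    def __init__(self, name, length=0.0, children=()):
        self.name = name
        self.length = length
        self.children = list(children)


def max_diameter_clusters(root, alpha):
    """Cut edges so that no cluster has a leaf-to-leaf distance above alpha."""
    clusters, cut = [], set()

    def connected_leaves(node):
        # Leaves below `node` that have not been split off by an earlier cut.
        if not node.children:
            return [node.name]
        return [leaf for child in node.children if child not in cut
                for leaf in connected_leaves(child)]

    def visit(node):
        # Post-order: returns the distance to the farthest connected leaf.
        if not node.children:
            return 0.0
        left, right = node.children
        d_left = visit(left) + left.length
        d_right = visit(right) + right.length
        if d_left + d_right > alpha:
            # Remove the longer side as its own cluster, keep the shorter one.
            far, kept = (left, d_right) if d_left > d_right else (right, d_left)
            clusters.append(sorted(connected_leaves(far)))
            cut.add(far)
            return kept
        return max(d_left, d_right)

    visit(root)
    clusters.append(sorted(connected_leaves(root)))  # whatever is left at the root
    return clusters


toy = Node("root", children=[
    Node("n1", 0.10, [Node("A", 0.02), Node("B", 0.03)]),
    Node("n2", 0.20, [Node("C", 0.05), Node("D", 0.04)]),
])
loose = max_diameter_clusters(toy, alpha=0.1)
tight = max_diameter_clusters(toy, alpha=0.04)
```

With a threshold of 0.1 the two cherries stay intact and only the long edge above C and D is cut; with 0.04 every leaf ends up alone.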

However, this algorithm yields clusters that each contain a leaf. The clusters at this stage are connected nodes, and each one has either one or zero children. To create clusters that represent internal nodes (i.e. ancestral states), the clusters are iterated over again. As before, the nodes of each cluster are parsed from the distal node (e.g. leaf) to the proximal node (e.g. root). If the accumulated branch length of one or more nodes in a cluster exceeds the distance threshold, the cluster is divided again into two clusters.
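One possible reading of this second pass, sketched for a single cluster that forms a simple chain of nodes: walk from the distal end toward the root and start a new cluster whenever the accumulated branch length would exceed the threshold. The function `split_chain` and the example data are hypothetical, and the exact rule TreeSAPP applies may differ.

```python
def split_chain(chain, alpha):
    """Split a distal-to-proximal chain of (node name, branch length) pairs.

    A new cluster is started whenever adding the next node would push the
    accumulated branch length of the current cluster above alpha.
    """
    clusters, current, accumulated = [], [], 0.0
    for name, length in chain:
        if current and accumulated + length > alpha:
            clusters.append(current)
            current, accumulated = [], 0.0
        current.append(name)
        accumulated += length
    if current:
        clusters.append(current)
    return clusters


# Hypothetical cluster: one leaf followed by three ancestral nodes.
chain = [("leaf_A", 0.02), ("n1", 0.05), ("n2", 0.06), ("n3", 0.01)]
pieces = split_chain(chain, alpha=0.1)
```

Here the leaf and its parent remain together, while the two deeper nodes form a second cluster that contains no leaf, which is the kind of cluster a query placed on an internal edge can be assigned to.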

The final step involves assigning query sequences to these clusters based on their placement's edge number. The placements are parsed from the JPlace files generated by treesapp assign. While the queries are being assigned, the number of query sequences assigned to each cluster is tracked, and these counts become the data for the OTU table. Simple!
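The assignment step can be sketched with the standard JPlace layout (a `fields` list naming the columns of each placement row, and `n` or `nm` entries holding query names). This is an illustration only: the embedded JPlace text, the `EDGE_TO_CLUSTER` mapping, and the `count_potus` function are hypothetical, and picking the placement with the highest `like_weight_ratio` is an assumption rather than TreeSAPP's documented behaviour.

```python
import json
from collections import Counter

# Hypothetical, minimal JPlace content for one sample.
JPLACE_TEXT = """{
  "tree": "((A:0.02{0},B:0.03{1}):0.1{2},C:0.2{3}){4};",
  "fields": ["edge_num", "likelihood", "like_weight_ratio",
             "distal_length", "pendant_length"],
  "placements": [
    {"p": [[0, -100.0, 0.9, 0.01, 0.02], [2, -101.0, 0.1, 0.05, 0.02]],
     "n": ["query_1"]},
    {"p": [[3, -90.0, 1.0, 0.10, 0.03]], "n": ["query_2", "query_3"]}
  ],
  "version": 3,
  "metadata": {}
}"""

# Hypothetical result of the clustering steps: edge number -> pOTU identifier.
EDGE_TO_CLUSTER = {0: "pOTU_1", 1: "pOTU_1", 2: "pOTU_2", 3: "pOTU_3"}


def count_potus(jplace_text, edge_to_cluster):
    """Count the queries assigned to each cluster from their best placement."""
    data = json.loads(jplace_text)
    edge_col = data["fields"].index("edge_num")
    lwr_col = data["fields"].index("like_weight_ratio")
    counts = Counter()
    for placement in data["placements"]:
        # Assumption: keep only the placement with the highest weight ratio.
        best = max(placement["p"], key=lambda row: row[lwr_col])
        names = placement.get("n") or [name for name, _ in placement.get("nm", [])]
        counts[edge_to_cluster[best[edge_col]]] += len(names)
    return counts


sample_counts = count_potus(JPLACE_TEXT, EDGE_TO_CLUSTER)
```

Running this once per JPlace file gives one column of counts per sample, which is the shape of the OTU table.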

Usage

Here is the command-line usage and options:

usage: treesapp phylotu [-v] [-h] -r PKG_PATH [PKG_PATH ...] [-o OUTPUT]
                        [-j JPLACE [JPLACE ...] | --assign_output TS_OUT
                        [TS_OUT ...]] [-t {class,order,family,genus,species}]
                        [-a ALPHA]

A tool for sorting query sequences placed on a phylogeny into
phylogenetically-inferred clusters.

Required parameters:
  -r PKG_PATH [PKG_PATH ...], --refpkg_path PKG_PATH [PKG_PATH ...]
                        Path to the reference package pickle (.pkl) file.
  -j JPLACE [JPLACE ...], --jplace JPLACE [JPLACE ...]
                        Path to one or more JPlace files generated by
                        placement on a reference package's phylogeny.
  --assign_output TS_OUT [TS_OUT ...]
                        Path to one or more output directories of treesapp
                        assign.

Optional options:
  -o OUTPUT, --output OUTPUT
                        Path to an output directory [DEFAULT = ./output/]
  -t {class,order,family,genus,species}, --tax_rank {class,order,family,genus,species}
                        The taxonomic rank the cluster radius should
                        approximately represent. [ DEFAULT = 'species' ].
  -a ALPHA, --alpha ALPHA
                        The evolutionary distance threshold defining the
                        cluster boundaries. [ DEFAULT = auto ].

Miscellaneous options:
  -v, --verbose         Prints a more verbose runtime log
  -h, --help            Show this help message and exit

There are two required arguments for treesapp phylotu:

refpkg_path: Path to a reference package .pkl. In future versions, just the name of the reference package will serve and TreeSAPP will look under the default reference package path in treesapp/data/.
Either jplace or assign_output: Paths to one or more .jplace files (these store the phylogenetic placement information) or paths to one or more treesapp assign output directories. These arguments are mutually exclusive, so be sure to provide only one!

The optional arguments are:

output: Path to an output directory. In there, two files called phylotu_matrix.tsv and phylotu_taxa.tsv will be written. More on these later.
tax_rank: A taxonomic rank. This rank (default being species) will be modelled by evolutionary distances in the reference package's phylogeny such that, under ideal circumstances, the query sequences in those clusters will all belong to the same taxon of the specified rank.
alpha: An evolutionary distance that will be used to separate the different phylogenetic clusters. If no value is provided, this threshold will be estimated using the taxonomic rank, as explained above.

Outputs

OTU table

The file phylotu_matrix.tsv is a tab-delimited matrix with the OTU identifiers as row names and the sample names as column names. Each cell holds the number of sequences in that sample that were assigned to that pOTU. The resulting data can be handled similarly to OTUs derived from 16S rRNA gene amplicons.
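As an example of downstream handling, the sketch below reads a matrix of this shape and computes per-sample richness and Shannon diversity. The embedded table is made up, and the exact header layout of phylotu_matrix.tsv (here, a first header cell that is skipped, followed by sample names) and the integer counts are assumptions; `read_matrix`, `richness`, and `shannon` are hypothetical helper names.

```python
import csv
import io
import math

# Made-up stand-in for phylotu_matrix.tsv: rows are pOTUs, columns are samples.
MATRIX_TEXT = (
    "\tsample_1\tsample_2\n"
    "pOTU_1\t10\t0\n"
    "pOTU_2\t5\t5\n"
    "pOTU_3\t0\t5\n"
)


def read_matrix(handle):
    """Return {sample name: {pOTU identifier: count}} from a tab-delimited table."""
    reader = csv.reader(handle, delimiter="\t")
    samples = next(reader)[1:]  # skip the corner cell above the row names
    counts = {sample: {} for sample in samples}
    for row in reader:
        for sample, value in zip(samples, row[1:]):
            counts[sample][row[0]] = int(value)
    return counts


def richness(sample_counts):
    """Number of pOTUs observed at least once in the sample."""
    return sum(1 for count in sample_counts.values() if count > 0)


def shannon(sample_counts):
    """Shannon diversity (natural log) over the observed pOTUs."""
    total = sum(sample_counts.values())
    return -sum(count / total * math.log(count / total)
                for count in sample_counts.values() if count > 0)


table = read_matrix(io.StringIO(MATRIX_TEXT))
summary = {name: (richness(c), shannon(c)) for name, c in table.items()}
```

To run it on real output, replace the `io.StringIO` object with an open file handle for the matrix written to the output directory.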

Taxonomy

The file phylotu_taxa.tsv holds a two-column, tab-delimited table mapping pOTUs to their taxonomic labels. These labels are derived from a naive lowest common ancestor (LCA) of all descendants in the tree. This label is less accurate than the taxonomic assignment that treesapp assign makes for query sequences, and is typically more poorly resolved. But... it's there if you want it!
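A naive LCA over taxonomic lineages can be sketched as the longest shared prefix of ranks. The lineage strings, the "; " separator, and the function name `lineage_lca` are assumptions made for illustration and are not taken from TreeSAPP.

```python
def lineage_lca(lineages, sep="; "):
    """Return the ranks shared by every lineage, reading from the root down."""
    split_lineages = [lineage.split(sep) for lineage in lineages]
    shared = []
    for ranks in zip(*split_lineages):
        if len(set(ranks)) != 1:  # lineages disagree at this rank, stop here
            break
        shared.append(ranks[0])
    return sep.join(shared)


# Hypothetical lineages of the reference leaves descending from one pOTU.
descendants = [
    "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
    "d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria",
]
label = lineage_lca(descendants)
```

Because a single divergent descendant truncates the label, a pOTU spanning two classes is labelled only to phylum, which illustrates why these labels tend to be less resolved than per-query assignments.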

References

1. Sharpton, T. J., Riesenfeld, S. J., Kembel, S. W., Ladau, J., O’Dwyer, J. P., Green, J. L., … Pollard, K. S. (2011). PhyLOTU: A high-throughput procedure quantifies microbial community diversity and resolves novel taxa from metagenomic data. PLoS Computational Biology, 7(1). https://doi.org/10.1371/journal.pcbi.1001061
2. Wu, D., Doroud, L., & Eisen, J. A. (2013). TreeOTU: Operational Taxonomic Unit Classification Based on Phylogenetic Trees. ArXiv.
3. Balaban, M., Moshiri, N., Mai, U., Jia, X., & Mirarab, S. (2019). TreeCluster: Clustering biological sequences using phylogenetic trees. PLoS ONE, 14(8), e0221068. https://doi.org/10.1371/journal.pone.0221068
