A Phage from metagenomic bins (phamb) discovery approach used to isolate metagenome derived viromes and High-quality viral genomes. phamb is now published in Nature Communications, have a look and let us know if you have any questions.
The repository contains scripts and workflows used in our viral follow up study on the binning tool VAMB where we have benchmarked not only the quality and quantity of viral MAGs but also the viral overlap with metaviromes.
We have applied this approach to 3 different datasets and recovered up to 6,077 High-quality genomes from 1,024 viral populations, this is 200% more compared to only evaluation single-contigs. Similar to what we have observed for Bacterial bins, VAMB achieves high intra-VAMB-cluster ANI (>97.5%) also for viral bins, our best example here is accurate clustering of crAss-like bins found in the IBD Human Microbiome Project 2 dataset.
- Our (recommended) workflow is to isolate the virome search space prior to running viral evaluation/prediction tools. For this, we have trained a Random Forest model on viral bins established using paired metagenomic and metavirome datasets. This massively helps in reducing computational time especially on larger datasets.
- We strongly advise to only use Medium-quality and High-quality viral bins evaluated using the AAI-model in CheckV. We found the HMM-model is not currently well-suited for viral MAGs. Low-quality viral bins may likely represent fragmented/incomplete viruses or general contamination, we advice caution with using these.
In our analysis, CheckV has been important for assessing the actual gain of using viral MAGs relative to single-contig evaluation, a big kudos to Nayfach et al. for this great tool.
In order to run parallel annotations of contigs and running the Random Forest model you need snakemake and scikit-learn v. 1.0.2
. The snakemake workflows comes with conda-environments, thus dependencies and programmes are automatically installed. Phamb can now be installed via bioconda thanks to @jayramr!
### New dependencies *Recommended*
conda install -c conda-forge mamba
mamba create -n phamb python=3.9
conda activate phamb
mamba install -c conda-forge -c bioconda snakemake
mamba install -c conda-forge -c bioconda cython
mamba install -c conda-forge -c bioconda pygraphviz
mamba install -c conda-forge -c bioconda phamb
### Clone repository
git clone the repository https://github.com/RasmussenLab/phamb.git
### Alternative to bioconda - Quick install with pip
pip install -e .
### Test installation
mkdir -p testout
run_RF.py test/contigs.fna.gz test/clusters.tsv test testout
VAMB clusters and concatenated assemblies.
contigs.fna.gz #Concatenated assembly
vamb/clusters.tsv #Clustered contigs based on the above contigs.fna.gz file
Furthermore.
- VOGdb - untar
vog.hmm.tar.gz
to get all hmm files. Concatenate all the hmm-files these into anAllVOG.hmm
file [File path needs to be specified inconfig.yaml
] - Micomplete Bacterial HMMs [File path needs to be specified in
config.yaml
] - Clone DeepVirFinder git clone https://github.com/jessieren/DeepVirFinder
Copy the phamb repository, extract the mag_annotation
workflow and split contigs (using the provided script) to allow annotation to be run in parallel.
If you have relatively few contigs or have the patience to annotate all contigs in one batch you can skip the Snakemake part.
mkdir -p projectdir
cd projectdir
git clone the repository https://github.com/RasmussenLab/phamb.git
cp -r phamb/workflows/mag_annotation .
python split_contigs.py -c contigs.fna.gz
- Now the
contigs.fna.gz
is splitted into individual assemblies i.e.assembly/{sample}/{sample}.fna
- In addition, a
sample_table.txt
file is created with a line for each sample. Check thatsample_table.txt
contains sample identifiers corresponding to the ones you expect. The number of lines should correspond to the number of samples used to make the concatenated assembly (contigs.fna.gz
). - Now, specify paths for databases, vamb directory, location of assembly and computational resources in
mag_annotation/config.yaml
.
If everything is good and set, you can run the snakemake pipeline.
# Local
snakemake -s mag_annotation/Snakefile --use-conda -j <threads>
#Aggregate results
mkdir annotations
cat sample_annotation/*/*hmmMiComplete105.tbl > annotations/all.hmmMiComplete105.tbl
cat sample_annotation/*/*hmmVOG.tbl > annotations/all.hmmVOG.tbl
cat sample_annotation/*/*_dvf/*dvfpred.txt > annotations/DVF.predictions.txt
# Clean the DVF files for multiple headers.
head -n1 annotations/DVF.predictions.txt > DVF.header # get first header
grep -v 'pvalue' annotations/DVF.predictions.txt > DVF.predictions # get predictions
cat DVF.header DVF.predictions > annotations/all.DVF.predictions.txt # combine
Dependent on the number of samples, it may be relevant to run the Snake-flow on a High performance computing (HPC) server.
# HPC - this won't work unless you specify a legit group on your HPC in `config.yaml`
snakemake -s Snakefile --cluster qsub -j <threads> --use-conda
Make sure to have Prodigal, hmmer and DeepVirFinder depedencies installed. Check under mag_annotation/envs
for relevant conda environments.
mkdir annotations
gunzip contigs.fna.gz
python3 /user/DeepVirFinder/dvf.py -i contigs.fna -o DVF -l 2000 -c 1
mv DVF/contigs.fna_gt2000bp_dvfpred.txt annotations/all.DVF.predictions.txt
prodigal -i contigs.fna -d genes.fna -a proteins.faa -p meta -g 11
hmmsearch --cpu {threads} -E 1.0e-05 -o output.txt --tblout annotations/all.hmmMiComplete105.tbl <micompleteDB> proteins.faa
hmmsearch --cpu {threads} -E 1.0e-05 -o output.txt --tblout annotations/all.hmmVOG.tbl <VOGDB> proteins.faa
gzip contigs.fna
Running the provided script, the virome bins are written to a fasta file and bin-annotations are summarised in vambbins_aggregated_annotation.txt
.
run_RF.py contigs.fna.gz vamb/clusters.tsv annotations resultdir
ls resultsidr
resultdir/vambbins_aggregated_annotation.txt
resultdir/vambbins_RF_predictions.txt
resultsdir/vamb_bins #Concatenated predicted viral bins - writes bins in chunks to files so there might be several!
We recommend VAMB bins to be evaluated with a dedicated Viral evaluation tool like CheckV or VIBRANT to identify HQ viruses.
checkv end_to_end resultsdir/vamb_bins/vamb_bins.1.fna checkv_vamb_bins
- The RF model automates filtering of VAMB bins that are most likely bacterial and therefore provides a space of plausible viral entities for further validation. The RF-model has been trained on paired Metaviromes and Metagenomes to make precisde decisions based on simple parameteres as the ones below. Compared to a single contig viral prediction model, the RF approach is very accurate. The increased performance is likely explained by the RF model evaluating on bin-level where one sequence with a low viral score does NOT lead to a misprediction of the whole bin. Aggregated information (assuming the binning is really good!) from multiple-contigs improves prediction compared to single-contigs.
The RF model take few variables to make an accurate distinction.
binsize (bp) | nhallm | distinct_VOGs_factor | cluster_DVF_score |
---|---|---|---|
2.000.000 | 100 | 0.2 | 0.3 |
60.000 | 3 | 1.3 | 0.7 |
- Bacterial MAGs and viral MAGs from the same metagenome can be efficiently associated using crispr-spacer approaches and sequence alignment (recommended cutoffs can be found in the article). From this, Host-viral abundance dynamics and bacterial pangenome modulation can be studied. Downstream viral proteome analysis should be based on the
viral regions
found in thecontamination.tsv
file produced by CheckV to prevent contaminating bacterial genes to influence the analysis.