SPROF-GO is an alignment-free, sequence-based protein function predictor that combines a pretrained language model with homology-based label diffusion. It is easy to install and run, accurate (surpassing state-of-the-art sequence-based and even network-based methods), and fast: empirically, predicting all three ontologies for 1000 sequences with an average length of 500 takes only about 7 minutes on an Nvidia GeForce RTX 3090 GPU. If your input is small, you can also use our SPROF-GO web server.
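As a rough planning aid, the timing quoted above can be turned into a back-of-the-envelope estimate. The sketch below assumes runtime scales linearly with the number of sequences, on the same GPU and for sequences of similar average length. That linearity is an assumption, not a measured property of SPROF-GO, and `estimate_minutes` is a hypothetical helper, not part of this repository.

```python
# Reference point quoted in this README: ~7 minutes for 1000 sequences
# (average length 500, all three ontologies, Nvidia GeForce RTX 3090).
REFERENCE_SEQUENCES = 1000
REFERENCE_MINUTES = 7.0


def estimate_minutes(num_sequences):
    """Linearly extrapolate the reference timing (assumption: linear scaling)."""
    if num_sequences < 0:
        raise ValueError("num_sequences must be non-negative")
    return num_sequences * REFERENCE_MINUTES / REFERENCE_SEQUENCES


if __name__ == "__main__":
    for n in (100, 1000, 5000):
        print(f"{n} sequences: roughly {estimate_minutes(n):.1f} minutes")
```

Actual runtime will depend on your hardware, sequence lengths, and batch sizes, so treat the result as an order-of-magnitude guide only.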
SPROF-GO was developed under a Linux environment with:
python 3.8.5
numpy 1.19.1
scipy 1.5.2
torch 1.13.0
sentencepiece 0.1.96
transformers 4.17.0
tqdm 4.59.0
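To compare your own environment against the versions listed above, a small check like the following may help. It is a minimal sketch: it assumes the distribution names match the names in the list, and it ignores local build suffixes such as `+cu117` that some torch builds carry. `compare_versions` and `installed_version` are hypothetical helpers, not part of SPROF-GO.

```python
from importlib import metadata

# Versions listed in this README (distribution names assumed to match).
EXPECTED = {
    "numpy": "1.19.1",
    "scipy": "1.5.2",
    "torch": "1.13.0",
    "sentencepiece": "0.1.96",
    "transformers": "4.17.0",
    "tqdm": "4.59.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(expected, lookup=installed_version):
    """Map each package whose version differs to a (listed, found) pair."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        # Drop a local build suffix such as "+cu117" before comparing.
        base = found.split("+")[0] if found else None
        if base != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in compare_versions(EXPECTED).items():
        print(f"{package}: README lists {wanted}, found {found or 'not installed'}")
```

A mismatch does not necessarily mean SPROF-GO will fail; the listed versions are simply the ones the tool was developed with.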
- Clone this repository by
git clone https://github.com/biomed-AI/SPROF-GO.git
(~ 1.4 GB) or download the code as a ZIP archive (~ 630 MB)
- Download the pretrained ProtT5-XL-UniRef50 model (~ 5.3 GB)
- Set the path variable ProtTrans_path in ./script/predict.py to the location of the downloaded ProtT5-XL-UniRef50 model
- Make the DIAMOND binary executable by
chmod +x ./script/diamond
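Before the first run, the installation steps above can be sanity-checked with a short script. This is a sketch under stated assumptions: `check_setup` is a hypothetical helper, and you pass it the same location you assigned to ProtTrans_path. It only checks that the paths exist and that the DIAMOND binary has execute permission; it does not validate the model files themselves.

```python
import os


def check_setup(prottrans_path, diamond_path="./script/diamond"):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    # The model location is whatever was assigned to ProtTrans_path.
    if not os.path.exists(prottrans_path):
        problems.append(f"ProtTrans model path not found: {prottrans_path}")
    if not os.path.isfile(diamond_path):
        problems.append(f"DIAMOND binary not found: {diamond_path}")
    elif not os.access(diamond_path, os.X_OK):
        problems.append(f"DIAMOND binary is not executable: {diamond_path}")
    return problems


if __name__ == "__main__":
    # Placeholder path; replace it with your own model location.
    issues = check_setup("/path/to/ProtT5-XL-UniRef50")
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Setup looks fine.")
```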
Simply run:
python ./script/predict.py --fasta ./example/demo.fa --outpath ./example/
The prediction results will be saved in demo_top_preds.txt and demo_all_preds.txt under ./example/. We provide the corresponding canonical input and prediction results under ./example/ for your reference.
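When running on your own data, it can be useful to inspect the input FASTA file first, for example to see how many sequences it holds and how long they are. The sketch below is a plain FASTA reader written for illustration; `read_fasta` and `summarize` are hypothetical helpers and do not reflect how predict.py parses its input.

```python
def read_fasta(path):
    """Parse a FASTA file into a dict mapping each header (without '>') to its sequence.

    Records that share an identical header line are merged into one entry.
    """
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:]
                records.setdefault(header, [])
            elif header is not None:
                # Sequence lines may wrap; collect the pieces and join later.
                records[header].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


def summarize(records):
    """Return the sequence count plus mean and maximum sequence length."""
    lengths = [len(seq) for seq in records.values()]
    return {
        "count": len(lengths),
        "mean_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_length": max(lengths, default=0),
    }


if __name__ == "__main__":
    print(summarize(read_fasta("./example/demo.fa")))
```

Long sequences increase memory use during feature extraction, so the maximum length is worth a glance before choosing batch sizes.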
Other parameters:
--top Besides the full predictions, also show the terms with top K predictive scores, default=20
--feat_bs Batch size for ProtTrans feature extraction, default=8
--pred_bs Batch size for SPROF-GO prediction, default=8
--save_feat Save intermediate ProtTrans features
--gpu Use GPU for feature extraction and SPROF-GO prediction
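The options above can be combined in a single call. The following sketch assembles such a command line from Python. It assumes that --top, --feat_bs and --pred_bs take integer values and that --save_feat and --gpu are on/off switches without a value; check `python ./script/predict.py --help` if in doubt. `build_command` is a hypothetical helper, not part of SPROF-GO.

```python
import sys


def build_command(fasta, outpath, top=20, feat_bs=8, pred_bs=8,
                  save_feat=False, gpu=False):
    """Assemble the argument list for ./script/predict.py."""
    command = [
        sys.executable, "./script/predict.py",
        "--fasta", fasta,
        "--outpath", outpath,
        "--top", str(top),
        "--feat_bs", str(feat_bs),
        "--pred_bs", str(pred_bs),
    ]
    # Assumed to be boolean switches (no value follows the flag).
    if save_feat:
        command.append("--save_feat")
    if gpu:
        command.append("--gpu")
    return command


if __name__ == "__main__":
    cmd = build_command("./example/demo.fa", "./example/", top=10, gpu=True)
    print(" ".join(cmd))
    # To actually run it from the repository root:
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.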
We provide the datasets and the trained models here for those interested in reproducing our paper.
The protein function datasets used in this study are stored in ./datasets/ as ZIP archives. The trained SPROF-GO models can be found under ./model/.
Citation:
@article{10.1093/bib/bbad117,
author = {Yuan, Qianmu and Xie, Junjie and Xie, Jiancong and Zhao, Huiying and Yang, Yuedong},
title = "{Fast and accurate protein function prediction from sequence through pretrained language model and homology-based label diffusion}",
journal = {Briefings in Bioinformatics},
year = {2023},
month = {03},
issn = {1477-4054},
doi = {10.1093/bib/bbad117},
url = {https://doi.org/10.1093/bib/bbad117}
}
Contact:
Qianmu Yuan ([email protected])
Yuedong Yang ([email protected])