Introduction

SPROF-GO is an alignment-free, sequence-based protein function predictor that combines a pretrained language model with homology-based label diffusion. It is easy to install and run, accurate (surpassing state-of-the-art sequence-based and even network-based methods), and fast: empirically, predicting all three ontologies for 1000 sequences with an average length of 500 takes about 7 minutes on an Nvidia GeForce RTX 3090 GPU. For small inputs, you can also use our SPROF-GO web server.

System requirement

SPROF-GO was developed in a Linux environment with:
python 3.8.5
numpy 1.19.1
scipy 1.5.2
torch 1.13.0
sentencepiece 0.1.96
transformers 4.17.0
tqdm 4.59.0
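
To compare an existing environment against the versions listed above, a small check like the following may help. This is a minimal sketch, not part of SPROF-GO: the helper name check_versions is hypothetical, and it assumes each package is installed under the distribution name shown in the list. Other versions may well work; the check only reports differences from the list.

```python
from importlib import metadata

# Versions copied from the list above; keys are assumed to match the
# installed distribution names.
REQUIRED = {
    "numpy": "1.19.1",
    "scipy": "1.5.2",
    "torch": "1.13.0",
    "sentencepiece": "0.1.96",
    "transformers": "4.17.0",
    "tqdm": "4.59.0",
}


def check_versions(required=REQUIRED):
    """Return {package: (listed, installed or None)} for every mismatch."""
    mismatches = {}
    for name, wanted in required.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        # torch builds often carry a local suffix such as "+cu117"; ignore it.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in check_versions().items():
        print(f"{name}: listed {wanted}, found {found or 'not installed'}")
```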

Set up SPROF-GO

Clone this repository with git clone https://github.com/biomed-AI/SPROF-GO.git (~ 1.4 GB), or download the code as a ZIP archive (~ 630 MB).
Download the pretrained ProtT5-XL-UniRef50 model from here (~ 5.3 GB).
Set the path variable ProtTrans_path in ./script/predict.py to point to the downloaded model.
Make DIAMOND executable with chmod +x ./script/diamond
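
Before the first run, the setup steps above can be sanity-checked with a few lines of Python. This is a sketch under assumptions: check_setup is a hypothetical helper, and it assumes ProtTrans_path refers to a directory holding the downloaded model, which the steps above do not state explicitly.

```python
import os


def check_setup(prottrans_path, diamond_path="./script/diamond"):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    # Assumption: the ProtT5 model is unpacked into a directory.
    if not os.path.isdir(prottrans_path):
        problems.append(f"ProtTrans model directory not found: {prottrans_path}")
    if not os.path.isfile(diamond_path):
        problems.append(f"DIAMOND binary not found: {diamond_path}")
    elif not os.access(diamond_path, os.X_OK):
        # This is the case that chmod +x ./script/diamond fixes.
        problems.append(f"DIAMOND binary is not executable: {diamond_path}")
    return problems
```

Call it from the repository root with the same value you set for ProtTrans_path, and print any returned messages.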

Run SPROF-GO for prediction

Simply run:

python ./script/predict.py --fasta ./example/demo.fa --outpath ./example/

The prediction results will be saved to demo_top_preds.txt and demo_all_preds.txt under ./example/. For reference, we provide the corresponding canonical input and prediction results under ./example/.
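
For scripted runs, it can be handy to know where the results should appear. The sketch below assumes the output names follow the demo pattern (input file name without its extension, plus _top_preds.txt and _all_preds.txt); that pattern is inferred from the demo only and is not confirmed for other inputs. The helper expected_outputs is hypothetical.

```python
from pathlib import Path


def expected_outputs(fasta, outpath):
    """Guess the two result files for an input, following the demo naming."""
    stem = Path(fasta).stem  # "demo" for ./example/demo.fa
    out = Path(outpath)
    return [out / f"{stem}_top_preds.txt", out / f"{stem}_all_preds.txt"]


if __name__ == "__main__":
    for path in expected_outputs("./example/demo.fa", "./example/"):
        print(path, "exists" if path.exists() else "missing")
```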

Other parameters:

--top           Besides the full predictions, also show the terms with top K predictive scores, default=20
--feat_bs       Batch size for ProtTrans feature extraction, default=8
--pred_bs       Batch size for SPROF-GO prediction, default=8
--save_feat     Save intermediate ProtTrans features
--gpu           Use GPU for feature extraction and SPROF-GO prediction
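
The options above can also be assembled from Python, for example when running several FASTA files in a loop. This is a minimal sketch: build_command and run_prediction are hypothetical helpers, and it assumes --save_feat and --gpu are on/off switches that take no value (they are listed above without a default).

```python
import subprocess
import sys


def build_command(fasta, outpath, top=None, feat_bs=None, pred_bs=None,
                  save_feat=False, gpu=False, script="./script/predict.py"):
    """Build the argument list for predict.py from the documented options."""
    cmd = [sys.executable, script, "--fasta", fasta, "--outpath", outpath]
    # Options left as None are omitted, so predict.py uses its own defaults.
    for flag, value in (("--top", top), ("--feat_bs", feat_bs), ("--pred_bs", pred_bs)):
        if value is not None:
            cmd += [flag, str(value)]
    # Assumption: these two are boolean switches without an argument.
    if save_feat:
        cmd.append("--save_feat")
    if gpu:
        cmd.append("--gpu")
    return cmd


def run_prediction(fasta, outpath, **options):
    """Run predict.py and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_command(fasta, outpath, **options), check=True)
```

For instance, run_prediction("./example/demo.fa", "./example/", top=10, gpu=True) would correspond to the command shown earlier with --top 10 --gpu appended.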

Dataset and model

We provide the datasets and the trained models here for those interested in reproducing our paper.
The protein function datasets used in this study are stored in ./datasets/ as ZIP archives.
The trained SPROF-GO models can be found under ./model/.

Citation and contact

Citation:

@article{10.1093/bib/bbad117,
    author = {Yuan, Qianmu and Xie, Junjie and Xie, Jiancong and Zhao, Huiying and Yang, Yuedong},
    title = "{Fast and accurate protein function prediction from sequence through pretrained language model and homology-based label diffusion}",
    journal = {Briefings in Bioinformatics},
    year = {2023},
    month = {03},
    issn = {1477-4054},
    doi = {10.1093/bib/bbad117},
    url = {https://doi.org/10.1093/bib/bbad117}
}

Contact:
Qianmu Yuan ([email protected])
Yuedong Yang ([email protected])
