Protein Representation Learning by Geometric Structure Pretraining

< Paper Reproduction Just for Fun >

Protein Representation Learning by Geometric Structure Pretraining

Paper Link
Author: Zuobai Zhang, Minghao Xu, Arian Jamasb, Vijil Chenthamarakshan, Aurelie Lozano, Payel Das, Jian Tang
Reproduced by: Seungwoo Ryu

Suppose all the snippets below start from your own root directory.
- Downloaded folder name is assumed to be a 'GearNet'.

Pretraining Dataset

Instead of using AlphaFoldDB(805K) for pretraining, I used Swiss-Prot(540K) protein dataset.
Disparity of the pretraining dataset can make subtle (or considerable) difference b/w the result of original paper and that of mine.
Can download the data at Here, or by
```
wget https://ftp.ebi.ac.uk/pub/databases/alphafold/latest/swissprot_pdb_v3.tar -P ./
```
The expressions/schema of dataset might follow how doc1 or doc2 expresses each protein.

Downstream Dataset

Special Preprocessing on EC & GO
For EC Number Prediction and GO Term Prediction:
- First introducted by Paper
- Caution!
  - It is not possible to use their original data at all.
    - As this paper used contact map as a feature for the model, they didn't use explicit coordinate information of atoms. Therefore, their preprocessed files do not offer any info. about intact 3D coordinates which is essential on GearNet(-Variants). Even the .tfrecords files offered on Data section of the github page only contain information of contact map.
    - The code of the paper offers preprocessing code in preprocessing/data_collection.sh. However, the code in the 20th line
      wget https://cdn.rcsb.org/resources/sequence/clusters/bc-95.out -O $DATA_DIR/bc-95.out
      shows an error with the message Not Found. The requested URL was not found on this server.. Therefore, retrieving necessary information from original PDB file is impossible, and the command afterward is useless.
- My strategy is:
  1. Extract the pdb names from the data split given on the paper and gather all.
  2. Based on the collection of the name, download pdb file one by one from the web.
  3. Extract 3D coordinates information from the downloaded files.
  - After following these steps,
    EC: {'train': 4, 'valid': 3, 'test': 0} sets are inevitably omitted from the original dataset.
    GO: {'train': 18, 'valid': 1, 'test': 1} sets are inevitably omitted from the original dataset.
    whose coordinates are expressed awkward.
- Download the split info. of original paper by:
```
git clone https://github.com/flatironinstitute/DeepFRI
mkdir -p downstream/dataset/EC_GO
cp -r DeepFRI/preprocessing/data/* downstream/dataset/EC_GO/
```

For Fold Classification

First introduced by Paper

Can download the data at Here or by

wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1chZAkaZlEBaOcjHQ3OUOdiKZqIn36qar' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1chZAkaZlEBaOcjHQ3OUOdiKZqIn36qar" -O HomologyTAPE.zip && rm -rf /tmp/cookies.txt

For Reaction Classification

Was introduced in a same paper introduced in Fold Classification

Can download the data at Here or by

wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1udP6_90WYkwkvL1LwqIAzf9ibegBJ8rI' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1udP6_90WYkwkvL1LwqIAzf9ibegBJ8rI" -O ProtFunct.zip && rm -rf /tmp/cookies.txt

After running all the codes above, preparation for data is all done!

Preparation for the -

Environment


conda create -n GearNet python=3.8.5 
conda activate GearNet
pip install -r requirements.txt

Dataset

└ For Pretraining

mkdir -p uniprot/dataset
tar -xf swissprot_pdb_v3.tar -C ./uniprot/dataset

mkdir -p uniprot/interim
python GearNet/preprocess/preprocess_pt.py --data_dir ./uniprot/dataset --save_dir ./uniprot/interim

As mentioned before, the dataset the model is pretrained on is different from the original one.
- Swiss-Prot data does not have information about resolution (Appendix G).
- The only standard used for filtering: Incorrect records such as 53.353-100.177 at the position of coordinate information.
  - 4121 proteins among 542380 are excluded.
- Additionally, I excluded 3000 datasets for validation.
- So, the final number of data in train set is 535259.

└ For Downstream Task

Although datasets are already prepared in advance following published papers,
we need to pre-process more than those as we need 'coordinate' information for GearNet(-variants).
To extract coordinates info. from raw pdb files and make inputs for model, implement:
```
bash GearNet/preprocess/run_downstream.sh
```

└ Or You can download the preprocessed data from

Locate all the downloaded folders on the root directory.

https://drive.google.com/drive/folders/1aE3TPok3YfF-P5mchIbUmMe3195PlY9S?usp=sharing

Experiment

Following the original paper, all the experiments are set in a DistributedDataParallel(DDP) setting.

Pretraining

bash main.sh pretrain

Can manully change options on main.sh script for other options.
- For example, if you want to...
  
  Pretrain the GearNet-Edge model with MultiviewContrastiveLearning objective on GPU #0,1
  
  set options as
```
gpu="0 1"
enc_model="GearNet-Edge"
task_idx=0
```

Downstream

bash main.sh downstream

Can manually change options on main.sh script, likewise.
If you want to load pre-trained weights for inference, set load option to True
- Because I couldn't train a large model, I don't have any pretrained model to load which is trained on Pretraining objectives.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
criterions		criterions
datasets		datasets
metrics		metrics
models		models
preprocess		preprocess
trainers		trainers
utils		utils
.gitignore		.gitignore
main.py		main.py
main.sh		main.sh
readme.md		readme.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Protein Representation Learning by Geometric Structure Pretraining

Pretraining Dataset

Downstream Dataset

Preparation for the -

Environment

Dataset

└ For Pretraining

└ For Downstream Task

└ Or You can download the preprocessed data from

Experiment

Pretraining

Downstream

About

Releases

Packages

Languages

tryumanshow/GearNet-Reproduce

Folders and files

Latest commit

History

Repository files navigation

Protein Representation Learning by Geometric Structure Pretraining

Pretraining Dataset

Downstream Dataset

Preparation for the -

Environment

Dataset

└ For Pretraining

└ For Downstream Task

└ Or You can download the preprocessed data from

Experiment

Pretraining

Downstream

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages