BARTSmiles is a chemical language model based on BART, trained on 1.7 billion SMILES strings from the ZINC20 dataset.
BARTSmiles can be fine-tuned for chemical property prediction and for generative tasks such as chemical reaction prediction and retrosynthesis, and achieves state-of-the-art results on multiple tasks.
You can use the Hugging Face model from here.
Clone the BARTSmiles repo into the root directory:
git clone https://github.com/YerevaNN/BARTSmiles.git
Set up a conda environment:
conda env create --file=./BARTSmiles/environment.yml
conda activate bartsmiles
Clone and install Fairseq in the root directory:
cd ./
git clone https://github.com/facebookresearch/fairseq.git
cd ./fairseq
pip install --editable ./
You need to add add_if_not_exist=False to this line:
tokens = self.task.source_dictionary.encode_line(bpe_sentence, append_eos=False, add_if_not_exist=False)
of this file:
./fairseq/fairseq/models/bart/hub_interface.py
NOTE! Without this change, fairseq will add every unknown token to the vocabulary instead of mapping it to <unk>.
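As a quick sanity check (a sketch, assuming the model has already been loaded as shown further below), you can verify that encoding a string with unseen tokens no longer grows the dictionary:

# After the edit, unknown tokens should map to <unk> instead of being
# appended to the source dictionary.
vocab_size_before = len(bart.task.source_dictionary)
bart.encode("a string with tokens the model has never seen")
assert len(bart.task.source_dictionary) == vocab_size_before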
Download BARTSmiles pre-trained model and the vocabulary:
cd ..  # back to the root directory
mkdir -p ./chemical/tokenizer
cd ./chemical/tokenizer
wget http://public.storage.yerevann.com/BARTSmiles/chem.model
wget http://public.storage.yerevann.com/BARTSmiles/chem.vocab.fs
cd ../..  # back to the root directory
mkdir -p ./chemical/checkpoints/evaluation_data
cd ./chemical/checkpoints
wget http://public.storage.yerevann.com/BARTSmiles/pretrained.pt
cd ../..  # back to the root directory
mv ./BARTSmiles/data_name ./chemical/checkpoints/evaluation_data
cd ./BARTSmiles/
dict.txt is the vocabulary file without special tokens. You need to provide the structure of the data_name directories (an example for esol is shown below).
from fairseq.models.bart import BARTModel

model = "./chemical/checkpoints/evaluation_data/data_name/processed/input0"
bart = BARTModel.from_pretrained(
    model,
    checkpoint_file="./chemical/checkpoints/pretrained.pt",
    bpe="sentencepiece",
    sentencepiece_model="./chemical/tokenizer/chem.model",
)
Extract the last layer's features:
last_layer_features = bart.extract_features(bart.encode(smiles))
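For example (a minimal sketch, assuming bart has been loaded as shown above; the SMILES string below is only an illustration):

# Encode one SMILES string and extract the last layer's features;
# the result has shape (1, sequence_length, hidden_size).
smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, used only as an example
tokens = bart.encode(smiles)
last_layer_features = bart.extract_features(tokens)
print(last_layer_features.shape)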
Alternatively, you can use this script to extract features in batches:
python ./BARTSmiles/utils/extract_features.py --path [the path where your BARTSmiles folder is located] --dataset-name esol --batch-size 32 --output-path [where you want to locate the outputs]
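If you prefer to batch inside your own code instead of using the script, here is a minimal sketch (the example SMILES strings and the use of fairseq's collate_tokens helper are assumptions, not the repo's own code):

from fairseq.data.data_utils import collate_tokens

# Pad the encoded SMILES to a common length and run them as one batch.
smiles_batch = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]  # example inputs
tokens = collate_tokens(
    [bart.encode(s) for s in smiles_batch],
    pad_idx=bart.task.source_dictionary.pad(),
)
features = bart.extract_features(tokens)  # (batch_size, max_length, hidden_size)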
- Download and preprocess MoleculeNet datasets: Use the following command from the BARTSmiles folder:
python preprocess/process_datasets.py --dataset-name esol --is-MoleculeNet True --root [the path where your BARTSmiles folder is located]
This will create the following structure in the ./chemical/checkpoints/evaluation_data/esol directory:
esol
│
├───esol
│       train_esol.csv
│       valid_esol.csv
│       test_esol.csv
│
├───processed
│   ├───input0
│   │       dict.txt
│   │       preprocess.log
│   │       test.bin
│   │       train.bin
│   │       valid.bin
│   │       test.idx
│   │       valid.idx
│   │       train.idx
│   │
│   └───label
│           dict.txt
│           preprocess.log
│           test.bin
│           valid.bin
│           train.bin
│           test.idx
│           valid.idx
│           train.idx
│           test.label
│           valid.label
│           train.label
│
├───raw
│       test.input
│       test.target
│       valid.input
│       valid.target
│       train.input
│       train.target
│
└───tokenized
        test.input
        valid.input
        train.input
- Generate the grid of training hyperparameters by running the script ./BARTSmiles/fine-tuning/generate_grid_bartsmiles.py. This will write the grid search parameters to the ./BARTSmiles/fine-tuning/grid_search.csv file.
Command for the regression tasks:
python fine-tuning/generate_grid_bartsmiles.py --root [the path where your BARTSmiles folder is located] --dataset-name esol --single-task True --dataset-size 1128 --is-Regression True
Command for classification tasks with a single subtask:
python fine-tuning/generate_grid_bartsmiles.py --root [the path where your BARTSmiles folder is located] --dataset-name BBBP --single-task True --dataset-size 2039
Command for a specific subtask of a multilabel classification task:
python fine-tuning/generate_grid_bartsmiles.py --root [the path where your BARTSmiles folder is located] --dataset-name Tox21 --subtasks 12 --single-task False --dataset-size 7831
All required parameters for training are now in grid_search.csv, and you can start the training.
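If you want to inspect the generated grid before training, a small sketch with pandas (the exact column names depend on generate_grid_bartsmiles.py):

import pandas as pd

# Each row of grid_search.csv corresponds to one training configuration.
grid = pd.read_csv("./BARTSmiles/fine-tuning/grid_search.csv")
print(grid.head())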
- Log in to wandb: before starting the training you have to log in to wandb so that the runs are tracked. To log in, follow: https://docs.wandb.ai/ref/cli/wandb-login
- Train the models using the following command:
mkdir ./chemical/log
python fine-tuning/train_grid_bartsmiles.py --root [the path where your BARTSmiles folder is located] --disk [the path where you want to store your checkpoints] >> ./chemical/log/esol.log
This will produce a checkpoint in the disk/clintox_1_bs_16_dropout_0.1_lr_5e-6_totalNum_739_warmup_118/ folder.
- Write the wandb run URLs into the ./BARTSmiles/evaluation/wandb_url.csv file, for example:
url
gayanec/Fine_Tune_clintox_0/6p76cyzr
- Perform Stochastic Weight Averaging and evaluate from ./BARTSmiles/evaluation using the following command:
python evaluation/evaluate_swa_bartsmiles.py --root [the path where your BARTSmiles folder is located] --disk [the path where your checkpoints are located] --dataset-type [dataset type: train, valid or test]
This will produce a log file with the output in ./chemical/log/ and the averaged checkpoints in the disk/clintox_1_bs_16_dropout_0.1_lr_5e-6_totalNum_739_warmup_118/ folder.
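Conceptually, Stochastic Weight Averaging simply averages the parameters of several checkpoints. A rough sketch (not the repo's exact code; the checkpoint paths are hypothetical):

import torch

# Average the model parameters of several fine-tuned fairseq checkpoints.
paths = ["checkpoint_3.pt", "checkpoint_4.pt", "checkpoint_5.pt"]  # hypothetical
states = [torch.load(p, map_location="cpu")["model"] for p in paths]
averaged = {name: sum(s[name] for s in states) / len(states) for name in states[0]}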
If you want to fine-tune on another dataset, you have to add its details to the datasets.json file and add your preprocessing code to ./preprocess/process_datasets.py at line 103. The dataset key must not contain the '_' character unless the characters that follow it are numbers (as in clintox_1).
@article{chilingaryan2022bartsmiles,
  title={Bartsmiles: Generative masked language models for molecular representations},
  author={Chilingaryan, Gayane and Tamoyan, Hovhannes and Tevosyan, Ani and Babayan, Nelly and Khondkaryan, Lusine and Hambardzumyan, Karen and Navoyan, Zaven and Khachatrian, Hrant and Aghajanyan, Armen},
  journal={arXiv preprint arXiv:2211.16349},
  year={2022}
}