This repository contains the code for the EnCodon and DeCodon models, codon-resolution large language models pre-trained on the NCBI Genomes database, as described in the paper "A Suite of Foundation Models Captures the Contextual Interplay Between Codons".
Currently, the package can only be installed from source; a pip-installable version will be published soon. To install from source, run the following command:
```bash
pip install git+https://github.com/goodarzilab/cdsFM.git
```
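A quick import works as a sanity check that the installation succeeded (the class names come from the usage examples below):

```python
# Sanity check: these imports should succeed after installation.
from cdsFM import AutoEnCodon, AutoDeCodon
```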
Now that you have cdsFM installed, you can use the `AutoEnCodon` and `AutoDeCodon` classes, which serve as wrappers around the pre-trained models. Here are some examples of how to use them:
The following example shows how to use the EnCodon model to extract sequence embeddings:
```python
from cdsFM import AutoEnCodon

# Load your dataframe containing sequences
seqs = ...

# Load a pre-trained EnCodon model
model = AutoEnCodon.from_pretrained("goodarzilab/encodon-620M")

# Extract embeddings
embeddings = model.get_embeddings(seqs, batch_size=32)
```
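If you want a quick look at the result, the snippet below is a minimal sketch that assumes `get_embeddings` returns one embedding vector per input sequence as an array-like object (the exact return type and shape may differ):

```python
import numpy as np

# Assumption: embeddings is array-like with shape (num_sequences, hidden_dim).
embeddings = np.asarray(embeddings)
print(embeddings.shape)

# Example downstream use: cosine similarity between the first two sequences.
a, b = embeddings[0], embeddings[1]
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
print(f"cosine similarity between sequence 0 and 1: {cosine:.3f}")
```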
You can generate organism-specific coding sequences with DeCodon as follows:
```python
from cdsFM import AutoDeCodon

# Load a pre-trained DeCodon model
model = AutoDeCodon.from_pretrained("goodarzilab/DeCodon-200M")

# Generate!
gen_seqs = model.generate(
    taxid=9606,               # NCBI Taxonomy ID for Homo sapiens
    num_return_sequences=32,  # Number of sequences to return
    max_length=1024,          # Maximum length of the generated sequence
    batch_size=8,             # Batch size for generation
)
```
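To save the results for inspection, here is a minimal post-processing sketch that assumes `gen_seqs` is an iterable of DNA strings (adapt it if the actual return value is a DataFrame or token IDs):

```python
# Write the generated sequences to a FASTA file.
# Assumption: gen_seqs is an iterable of DNA strings.
with open("decodon_generated.fasta", "w") as fh:
    for i, seq in enumerate(gen_seqs):
        fh.write(f">decodon_taxid9606_{i}\n{seq}\n")
```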
EnCodon and DeCodon are pre-trained on coding sequences of up to 2048 codons (i.e. 6144 nucleotides), including the `<CLS>` token automatically prepended to the beginning of the sequence and the `<SEP>` token appended at the end. The tokenizer's vocabulary consists of the 64 codons and 5 special tokens: `<CLS>`, `<SEP>`, `<PAD>`, `<MASK>`, and `<UNK>`.
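For intuition, the sketch below shows how a coding sequence maps to codon-level tokens wrapped in the special tokens described above; this is an illustration of the scheme, not the library's actual tokenizer:

```python
def codon_tokenize(cds: str) -> list[str]:
    """Illustrative codon tokenization: split a CDS into 3-nt tokens and
    wrap it with <CLS>/<SEP>. Not the library's tokenizer, just a sketch."""
    cds = cds.upper().replace("U", "T")
    assert len(cds) % 3 == 0, "coding sequences are a multiple of 3 nt"
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return ["<CLS>"] + codons + ["<SEP>"]

print(codon_tokenize("ATGGCTTAA"))  # ['<CLS>', 'ATG', 'GCT', 'TAA', '<SEP>']
```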
A collection of pre-trained EnCodon & DeCodon checkpoints is available on HuggingFace 🤗. The following table lists the available models:
Model | Name | Num. params | Description | Weights |
---|---|---|---|---|
EnCodon | encodon-80M | 80M | Pre-trained checkpoint | 🤗 |
EnCodon | encodon-80M-euk | 80M | Eukaryotic-expert | 🤗 |
EnCodon | encodon-620M | 620M | Pre-trained checkpoint | 🤗 |
EnCodon | encodon-620M-euk | 620M | Eukaryotic-expert | 🤗 |
DeCodon | decodon-200M | 200M | Pre-trained checkpoint | 🤗 |
DeCodon | decodon-200M-euk | 200M | Eukaryotic-expert | 🤗 |
```bibtex
@article{Naghipourfar2024,
  title = {A Suite of Foundation Models Captures the Contextual Interplay Between Codons},
  url = {http://dx.doi.org/10.1101/2024.10.10.617568},
  doi = {10.1101/2024.10.10.617568},
  publisher = {Cold Spring Harbor Laboratory},
  author = {Naghipourfar, Mohsen and Chen, Siyu and Howard, Mathew and Macdonald, Christian and Saberi, Ali and Hagen, Timo and Mofrad, Mohammad and Coyote-Maestas, Willow and Goodarzi, Hani},
  year = {2024},
  month = oct
}
```