CLIP-based model for aligning epitopes to immune receptors. [preprint]
🚧 This repository is under active construction. 🚧
Epitope Encoder
- PEFT-adapted Protein Language Models (e.g. ESM-2, ESM-3)
- Default: LoRA (rank = 8) applied to the last 8 transformer layers
- Projection layer: fully connected linear layer (dim $d_{e} \rightarrow d_p$), where $d_{e}$ is the original PLM dimension and $d_p$ is the projection dimension (see the sketch after this list)
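As a rough illustration of the epitope encoder described above, here is a minimal PyTorch sketch of a LoRA-adapted ESM-2 backbone with a linear projection head. The checkpoint name, `lora_alpha`, the `query`/`value` target modules, and the mean-pooling strategy are assumptions for illustration, not the repository's actual configuration.

```python
# Hypothetical sketch of the epitope encoder: LoRA on the last 8 layers of an
# ESM-2 backbone, followed by a linear projection d_e -> d_p. Checkpoint name,
# lora_alpha, target modules, and pooling are assumptions.
import torch.nn as nn
from transformers import EsmModel
from peft import LoraConfig, get_peft_model


class EpitopeEncoder(nn.Module):
    def __init__(self, plm_name: str = "facebook/esm2_t33_650M_UR50D", proj_dim: int = 512):
        super().__init__()
        backbone = EsmModel.from_pretrained(plm_name)
        n_layers = backbone.config.num_hidden_layers
        lora_cfg = LoraConfig(
            r=8,                                # LoRA rank (matches the default above)
            lora_alpha=16,                      # assumed value
            target_modules=["query", "value"],  # attention projections in each ESM block
            layers_to_transform=list(range(n_layers - 8, n_layers)),  # last 8 layers only
        )
        self.plm = get_peft_model(backbone, lora_cfg)
        d_e = backbone.config.hidden_size       # original PLM dimension d_e
        self.proj = nn.Linear(d_e, proj_dim)    # d_e -> d_p

    def forward(self, input_ids, attention_mask):
        hidden = self.plm(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # mean-pool over non-padding tokens, then project into the shared space
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        return self.proj(pooled)
```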
Receptor Encoder
- PEFT-adapted Protein Language Models (e.g. ESM-2, ESM-3) or BCR/TCR Language Models (e.g. AbLang, TCR-BERT, etc.)
- Default: LoRA (rank = 8) applied to the last 4 transformer layers
- Projection layer: fully connected linear layer (dim $d_{r} \rightarrow d_p$), where $d_{r}$ is the original receptor LM dimension and $d_p$ is the projection dimension (the alignment objective over both projections is sketched below)
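Because both encoders project into the same $d_p$-dimensional space, training presumably uses a CLIP-style symmetric contrastive objective over in-batch epitope/receptor pairs. The sketch below is one standard formulation under that assumption; the temperature value (and whether it is learned) is not taken from the repository.

```python
# CLIP-style symmetric InfoNCE over in-batch pairs; matched epitope/receptor
# pairs sit on the diagonal of the similarity matrix. Temperature is assumed.
import torch
import torch.nn.functional as F


def clip_alignment_loss(epitope_emb: torch.Tensor, receptor_emb: torch.Tensor, temperature: float = 0.07):
    e = F.normalize(epitope_emb, dim=-1)   # unit-norm epitope projections
    r = F.normalize(receptor_emb, dim=-1)  # unit-norm receptor projections
    logits = e @ r.t() / temperature       # [batch, batch] cosine similarities
    targets = torch.arange(len(e), device=e.device)
    # average the epitope->receptor and receptor->epitope retrieval losses
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```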
- MixTCRPred Dataset (paper)
  - Contains a curated mixture of TCR-pMHC sequence data from IEDB, VDJdb, 10x Genomics, and McPAS-TCR (a hypothetical loading example follows below)
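As a hypothetical example of feeding such a table to the trainer, the snippet below wraps paired sequences in a torch `Dataset`. The file path and the column names (`epitope`, `cdr3_beta`) are guesses for illustration; the actual MixTCRPred schema may differ.

```python
# Hypothetical paired-sequence dataset; column names are illustrative only.
import pandas as pd
from torch.utils.data import Dataset


class EpitopeReceptorPairs(Dataset):
    def __init__(self, csv_path: str):
        df = pd.read_csv(csv_path)
        self.epitopes = df["epitope"].tolist()     # assumed column name
        self.receptors = df["cdr3_beta"].tolist()  # assumed column name

    def __len__(self):
        return len(self.epitopes)

    def __getitem__(self, idx):
        # tokenization is deferred to a collate_fn so each encoder can use its own tokenizer
        return self.epitopes[idx], self.receptors[idx]
```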
To run this application, set the following environment variable:

```
WANDB_OUTPUT_DIR=<path to output dir>
```

Additionally, if training on top of a custom in-house TCR model, the following path needs to be set:

```
INHOUSE_MODEL_CKPT_PATH=<path to custom model file>
```
```bash
# go to the root directory of the repo, and then run:
python -m src.training --run-id [RUN_ID] --dataset-path [PATH_TO_DATASET] --stage fit --max-epochs 100 \
    --receptor-model-name [esm2|tcrlang|tcrbert] --projection-dim 512 --gpus-used [GPU_IDX] --lr 1e-3 \
    --batch-size 8 --output-dir [CHECKPOINTS_OUTPUT_DIR] [--mask-seqs]
```
```bash
# currently, running the model in the test stage embeds the test-set epitope/receptor pairs with the fine-tuned model and saves them
python -m src.training --run-id [RUN_ID] --dataset-path [PATH_TO_DATASET] --stage test --from-checkpoint [CHECKPOINT_PATH] \
    --projection-dim 512 --receptor-model-name [esm2|tcrlang|tcrbert] --gpus-used [GPU_IDX] --batch-size 8 \
    --save-embed-path [PATH_FOR_SAVING_EMBEDS]
```
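One way to consume the saved test-set embeddings, assuming (hypothetically) that `--save-embed-path` writes a single torch file with aligned `epitope` and `receptor` tensors; the actual save format may differ.

```python
# Hedged example: nearest-neighbour retrieval over saved test embeddings.
# The file name and the {"epitope", "receptor"} layout are assumptions.
import torch
import torch.nn.functional as F

saved = torch.load("test_embeds.pt")                  # hypothetical output of --save-embed-path
e = F.normalize(saved["epitope"], dim=-1)
r = F.normalize(saved["receptor"], dim=-1)
sims = e @ r.t()                                      # cosine-similarity matrix
top1 = sims.argmax(dim=1)                             # best-matching receptor per epitope
acc = (top1 == torch.arange(len(e))).float().mean()
print(f"epitope->receptor top-1 retrieval accuracy: {acc:.3f}")
```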