- Create a Python 3.8 environment, with conda or otherwise:

  ```bash
  conda create -n cellotape python=3.8 -y
  conda activate cellotape
  ```
- Install dependencies:

  ```bash
  bash ./setup.sh
  ```

  You must have the CUDA toolkit and driver installed for the CUDA version you use, with the `CUDA_HOME` environment variable set (a minimal sketch follows this list). You must also have `unzip` installed (`sudo apt install unzip`).
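A minimal sketch of the CUDA setup, assuming CUDA 11.8 under the default `/usr/local` prefix (adjust the path and version to your installation):

```bash
# assumption: CUDA 11.8 installed under the default prefix; adjust to your setup
export CUDA_HOME=/usr/local/cuda-11.8
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"

# sanity check: nvcc should report a toolkit version your driver supports
nvcc --version
```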
Get (A) the original text attributes and (B) the GPT responses for all datasets at once:

```bash
bash ./download_scripts/download_all.sh
```

Or run the per-dataset script:

| Dataset | Download script |
| --- | --- |
| ogbn-arxiv | `bash download_scripts/ogbn_arxiv_orig_download_data.sh` |
| ogbn-products (subset) | `bash download_scripts/ogbn_products_download_data.sh` |
| arxiv_2023 | `bash download_scripts/arxiv_2023_download_data.sh` |
| Cora | `bash download_scripts/cora_download_data.sh` |
| PubMed | `bash download_scripts/pubmed_download_data.sh` |
(A) Original text attributes. To download manually instead, unzip each archive and move it to the listed directory (a sketch follows the table):

| Dataset | Description |
| --- | --- |
| ogbn-arxiv | The OGB provides the mapping from MAG paper IDs to the raw texts of titles and abstracts. Download the dataset here, unzip, and move it to `dataset/ogbn_arxiv_orig`. |
| ogbn-products (subset) | The dataset is located under `dataset/ogbn_products_orig`. |
| arxiv_2023 | Download the dataset here, unzip, and move it to `dataset/arxiv_2023_orig`. |
| Cora | Download the dataset here, unzip, and move it to `dataset/cora_orig`. |
| PubMed | Download the dataset here, unzip, and move it to `dataset/PubMed_orig`. |
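A minimal sketch of the manual route for one dataset; the archive name is hypothetical and depends on what the download link actually serves:

```bash
# hypothetical archive name; substitute the file the link serves
unzip cora_orig.zip -d dataset/
ls dataset/cora_orig  # the loaders expect the raw files under this path
```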
(B) GPT responses:

| Dataset | Description |
| --- | --- |
| ogbn-arxiv | Download the dataset here, unzip, and move it to `gpt_responses/ogbn-arxiv`. |
| ogbn-products (subset) | Download the dataset here, unzip, and move it to `gpt_responses/ogbn-products`. |
| arxiv_2023 | Download the dataset here, unzip, and move it to `gpt_responses/arxiv_2023`. |
| Cora | Download the dataset here, unzip, and move it to `gpt_responses/cora`. |
| PubMed | Download the dataset here, unzip, and move it to `gpt_responses/PubMed`. |
Alternatively, download all GPT responses at once from Google Drive with `gdown`:

```python
import gdown

# downloads the shared Google Drive folder with the GPT responses
gdown.download_folder(
    'https://drive.google.com/drive/folders/1hzTCaXh6qtZgoOC6_VPVZOBsA_fKcBft?usp=drive_link',
    quiet=False,
)
```
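The CLI that ships with the `gdown` package handles folders as well. Whether the shared folder's internal layout matches the expected `gpt_responses/<dataset>` directories is an assumption worth checking after the download:

```bash
# assumption: the Drive folder's contents map onto gpt_responses/<dataset>
gdown --folder 'https://drive.google.com/drive/folders/1hzTCaXh6qtZgoOC6_VPVZOBsA_fKcBft?usp=drive_link' -O gpt_responses
```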
Generate LM embeddings for a dataset:

```bash
# --dataset_name: one of ['cora', 'pubmed', 'ogbn-arxiv', 'arxiv_2023', 'ogbn-products']
python -m core.LMs.generate_embeddings \
    --dataset_name ogbn-arxiv \
    --lm_model_name Alibaba-NLP/gte-Qwen1.5-7B-instruct \
    --add_instruction graph-aware  # adds a task-specific instruction to the text
```
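A minimal sketch for sweeping the embedding step over every dataset; this is just a loop over the command above, not a separate repo script:

```bash
# run the same LM over all supported datasets
for ds in cora pubmed ogbn-arxiv arxiv_2023 ogbn-products; do
    python -m core.LMs.generate_embeddings \
        --dataset_name "$ds" \
        --lm_model_name Alibaba-NLP/gte-Qwen1.5-7B-instruct \
        --add_instruction graph-aware
done
```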
Finetune the LM and extract features:

```bash
# on the original text attributes
WANDB_DISABLED=True TOKENIZERS_PARALLELISM=False CUDA_VISIBLE_DEVICES=0,1,2,3 python -m core.trainLM dataset ogbn-arxiv

# on the GPT responses
WANDB_DISABLED=True TOKENIZERS_PARALLELISM=False CUDA_VISIBLE_DEVICES=0,1,2,3 python -m core.trainLM dataset ogbn-arxiv lm.train.use_gpt True
```
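The commands above assume four visible GPUs; a single-GPU run is just a matter of restricting `CUDA_VISIBLE_DEVICES`, assuming the trainer uses whatever devices are visible:

```bash
# restrict the run to one GPU; everything else stays the same
WANDB_DISABLED=True TOKENIZERS_PARALLELISM=False CUDA_VISIBLE_DEVICES=0 python -m core.trainLM dataset ogbn-arxiv
```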
Train the ensemble with different GNN backbones:

```bash
python -m core.trainEnsemble gnn.model.name MLP
python -m core.trainEnsemble gnn.model.name GCN
python -m core.trainEnsemble gnn.model.name SAGE
python -m core.trainEnsemble gnn.model.name RevGAT gnn.train.lr 0.002 gnn.train.dropout 0.75
```
```bash
# Our enriched features: TA_P_E = original text (TA) + predictions (P) + explanations (E)
python -m core.trainEnsemble gnn.train.feature_type TA_P_E

# Our individual features
python -m core.trainGNN gnn.train.feature_type TA  # title & abstract
python -m core.trainGNN gnn.train.feature_type E   # LLM explanations
python -m core.trainGNN gnn.train.feature_type P   # LLM predictions

# OGB features
python -m core.trainGNN gnn.train.feature_type ogb
```
For example, a single configuration (feature type, dataset, seed, and backbone):

```bash
python -m core.trainEnsemble gnn.train.feature_type TA dataset arxiv_2023 seed 42 gnn.model.name SAGE
```
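A sketch for averaging over several seeds; the seed values below are arbitrary for illustration, not necessarily the ones behind the published numbers:

```bash
# arbitrary seeds; the published results may use a different set
for s in 0 1 2 3; do
    python -m core.trainEnsemble gnn.train.feature_type TA dataset arxiv_2023 seed "$s" gnn.model.name SAGE
done
```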
Use `run.sh` to run the code and reproduce the published results.

This repository also provides the checkpoints for all trained models (`*.ckpt`) and the TAPE features (`*.emb`) used in the project. Please download them here.
The code for constructing and processing the arxiv-2023 dataset is provided here.
Run the test suite with:

```bash
PYTHONPATH=. pytest tests/
```
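During development you can narrow the run with pytest's standard selection flags; the keyword below is hypothetical:

```bash
# hypothetical keyword filter; match it to actual test names under tests/
PYTHONPATH=. pytest tests/ -k embeddings -v
```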