- Create a Python 3.8 environment, with conda or otherwise:

  ```bash
  conda create -n cellotape python=3.8 -y
  conda activate cellotape
  ```
- Install dependencies:

  ```bash
  bash ./setup.sh
  ```

  You must have the CUDA toolkit and driver installed for the CUDA version you use, with the `CUDA_HOME` environment variable set (a minimal sketch follows this list). You must also have `unzip` installed (`sudo apt install unzip`).
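A minimal sketch of the CUDA setup, assuming CUDA 11.8 under the default `/usr/local` prefix (adjust the path and version to your installation):

```bash
# assumption: CUDA 11.8 installed under the default prefix; adjust to your setup
export CUDA_HOME=/usr/local/cuda-11.8
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"

# sanity check: nvcc should report a toolkit version your driver supports
nvcc --version
```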
Get (A) the original text attributes and (B) the GPT responses for all datasets at once:

```bash
bash ./download_scripts/download_all.sh
```

Or run the per-dataset script:

| Dataset | Download script |
| --- | --- |
| ogbn-arxiv | `bash download_scripts/ogbn_arxiv_orig_download_data.sh` |
| ogbn-products (subset) | `bash download_scripts/ogbn_products_download_data.sh` |
| arxiv_2023 | `bash download_scripts/arxiv_2023_download_data.sh` |
| Cora | `bash download_scripts/cora_download_data.sh` |
| PubMed | `bash download_scripts/pubmed_download_data.sh` |
(A) Original text attributes. To download manually instead, unzip each archive and move it to the listed directory (a sketch follows the table):

| Dataset | Description |
| --- | --- |
| ogbn-arxiv | The OGB provides the mapping from MAG paper IDs to the raw texts of titles and abstracts. Download the dataset here, unzip, and move it to `dataset/ogbn_arxiv_orig`. |
| ogbn-products (subset) | The dataset is located under `dataset/ogbn_products_orig`. |
| arxiv_2023 | Download the dataset here, unzip, and move it to `dataset/arxiv_2023_orig`. |
| Cora | Download the dataset here, unzip, and move it to `dataset/cora_orig`. |
| PubMed | Download the dataset here, unzip, and move it to `dataset/PubMed_orig`. |
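A minimal sketch of the manual route for one dataset; the archive name is hypothetical and depends on what the download link actually serves:

```bash
# hypothetical archive name; substitute the file the link serves
unzip cora_orig.zip -d dataset/
ls dataset/cora_orig  # the loaders expect the raw files under this path
```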
(B) GPT responses:

| Dataset | Description |
| --- | --- |
| ogbn-arxiv | Download the dataset here, unzip, and move it to `gpt_responses/ogbn-arxiv`. |
| ogbn-products (subset) | Download the dataset here, unzip, and move it to `gpt_responses/ogbn-products`. |
| arxiv_2023 | Download the dataset here, unzip, and move it to `gpt_responses/arxiv_2023`. |
| Cora | Download the dataset here, unzip, and move it to `gpt_responses/cora`. |
| PubMed | Download the dataset here, unzip, and move it to `gpt_responses/PubMed`. |
Alternatively, download all GPT responses at once from Google Drive with `gdown`:

```python
import gdown

# downloads the shared Google Drive folder with the GPT responses
gdown.download_folder(
    'https://drive.google.com/drive/folders/1hzTCaXh6qtZgoOC6_VPVZOBsA_fKcBft?usp=drive_link',
    quiet=False,
)
```
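The CLI that ships with the `gdown` package handles folders as well. Whether the shared folder's internal layout matches the expected `gpt_responses/<dataset>` directories is an assumption worth checking after the download:

```bash
# assumption: the Drive folder's contents map onto gpt_responses/<dataset>
gdown --folder 'https://drive.google.com/drive/folders/1hzTCaXh6qtZgoOC6_VPVZOBsA_fKcBft?usp=drive_link' -O gpt_responses
```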
Generate LM embeddings for a dataset:

```bash
# --dataset_name: one of ['cora', 'pubmed', 'ogbn-arxiv', 'arxiv_2023', 'ogbn-products']
python -m core.LMs.generate_embeddings \
    --dataset_name ogbn-arxiv \
    --lm_model_name Alibaba-NLP/gte-Qwen1.5-7B-instruct \
    --add_instruction graph-aware  # adds a task-specific instruction to the text
```
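A minimal sketch for sweeping the embedding step over every dataset; this is just a loop over the command above, not a separate repo script:

```bash
# run the same LM over all supported datasets
for ds in cora pubmed ogbn-arxiv arxiv_2023 ogbn-products; do
    python -m core.LMs.generate_embeddings \
        --dataset_name "$ds" \
        --lm_model_name Alibaba-NLP/gte-Qwen1.5-7B-instruct \
        --add_instruction graph-aware
done
```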
Finetune the LM and extract features:

```bash
# on the original text attributes
WANDB_DISABLED=True TOKENIZERS_PARALLELISM=False CUDA_VISIBLE_DEVICES=0,1,2,3 python -m core.trainLM dataset ogbn-arxiv

# on the GPT responses
WANDB_DISABLED=True TOKENIZERS_PARALLELISM=False CUDA_VISIBLE_DEVICES=0,1,2,3 python -m core.trainLM dataset ogbn-arxiv lm.train.use_gpt True
```
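The commands above assume four visible GPUs; a single-GPU run is just a matter of restricting `CUDA_VISIBLE_DEVICES`, assuming the trainer uses whatever devices are visible:

```bash
# restrict the run to one GPU; everything else stays the same
WANDB_DISABLED=True TOKENIZERS_PARALLELISM=False CUDA_VISIBLE_DEVICES=0 python -m core.trainLM dataset ogbn-arxiv
```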
Train the ensemble with different GNN backbones:

```bash
python -m core.trainEnsemble gnn.model.name MLP
python -m core.trainEnsemble gnn.model.name GCN
python -m core.trainEnsemble gnn.model.name SAGE
python -m core.trainEnsemble gnn.model.name RevGAT gnn.train.lr 0.002 gnn.train.dropout 0.75
```
```bash
# Our enriched features: TA_P_E = original text (TA) + predictions (P) + explanations (E)
python -m core.trainEnsemble gnn.train.feature_type TA_P_E

# Our individual features
python -m core.trainGNN gnn.train.feature_type TA  # title & abstract
python -m core.trainGNN gnn.train.feature_type E   # LLM explanations
python -m core.trainGNN gnn.train.feature_type P   # LLM predictions

# OGB features
python -m core.trainGNN gnn.train.feature_type ogb
```
For example, a single configuration (feature type, dataset, seed, and backbone):

```bash
python -m core.trainEnsemble gnn.train.feature_type TA dataset arxiv_2023 seed 42 gnn.model.name SAGE
```
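A sketch for averaging over several seeds; the seed values below are arbitrary for illustration, not necessarily the ones behind the published numbers:

```bash
# arbitrary seeds; the published results may use a different set
for s in 0 1 2 3; do
    python -m core.trainEnsemble gnn.train.feature_type TA dataset arxiv_2023 seed "$s" gnn.model.name SAGE
done
```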
Use `run.sh` to run the code and reproduce the published results.

This repository also provides the checkpoints for all trained models (`*.ckpt`) and the TAPE features (`*.emb`) used in the project. Please download them here.
The code for constructing and processing the arxiv-2023 dataset is provided here.
Run the test suite with:

```bash
PYTHONPATH=. pytest tests/
```
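During development you can narrow the run with pytest's standard selection flags; the keyword below is hypothetical:

```bash
# hypothetical keyword filter; match it to actual test names under tests/
PYTHONPATH=. pytest tests/ -k embeddings -v
```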