Code for longevity: put in the longevity project
Code for depmap: put in the depmap project
Fetching embeddings from sequence: get_embeddings.py
Preview the first rows of the filtered ortholog dataset:
curl -X GET \
"https://datasets-server.huggingface.co/first-rows?dataset=seq-to-pheno%2Ffiltered_orthologs&config=default&split=train"
Fetch the first 100 rows of the mapped ortholog dataset (requires HF_TOKEN to be set):
curl -X GET \
-H "Authorization: Bearer $HF_TOKEN" \
"https://datasets-server.huggingface.co/rows?dataset=seq-to-pheno%2Fmapped_orthologs&config=default&split=train&offset=0&length=100"
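The same request can be issued from Python. The sketch below is a minimal illustration and not part of the project's scripts: `build_rows_url` and `fetch_rows` are hypothetical helper names, and it assumes the token is read from the `HF_TOKEN` environment variable, as in the curl call above.

```python
import json
import os
import urllib.parse
import urllib.request

API_ROOT = "https://datasets-server.huggingface.co"


def build_rows_url(dataset, offset=0, length=100, config="default", split="train"):
    """Build a /rows URL with the same query parameters as the curl call."""
    query = urllib.parse.urlencode(
        {
            "dataset": dataset,
            "config": config,
            "split": split,
            "offset": offset,
            "length": length,
        }
    )
    return f"{API_ROOT}/rows?{query}"


def fetch_rows(dataset, offset=0, length=100, token=None):
    """Fetch one page of rows as parsed JSON; this performs a network request."""
    request = urllib.request.Request(build_rows_url(dataset, offset, length))
    token = token or os.environ.get("HF_TOKEN")  # assumed source of the token
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

To page through the split, raise `offset` by `length` on each call, for example `fetch_rows("seq-to-pheno/mapped_orthologs", offset=100)`.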
Load the TCGA cancer variant and clinical dataset with the datasets library:
from datasets import load_dataset
ds = load_dataset("seq-to-pheno/TCGA-Cancer-Variant-and-Clinical-Data")
Load the same dataset through its Croissant metadata:
from mlcroissant import Dataset
ds = Dataset(jsonld="https://huggingface.co/api/datasets/seq-to-pheno/TCGA-Cancer-Variant-and-Clinical-Data/croissant")
records = ds.records("default")
Read the protein sequence metadata directly with pandas:
import pandas as pd
df = pd.read_csv("hf://datasets/seq-to-pheno/TCGA-Cancer-Variant-and-Clinical-Data/protein_sequences_metadata.tsv", sep="\t")
Load the mapped ortholog dataset:
from datasets import load_dataset
mapped = load_dataset("seq-to-pheno/mapped_orthologs")
Load the filtered ortholog dataset:
from datasets import load_dataset
filtered = load_dataset("seq-to-pheno/filtered_orthologs")
To build the filtered dataset and publish:
python ./scripts/filtered_dataset.py \
--folder /downloads \
--template_path /seq_to_pheno/hug/zoonomia_dataset_repo_template/README.md \
--token hf_xxx \
--max_length 1000 \
--max_orthologs 20 \
--publish
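The internals of filtered_dataset.py are not shown in these notes, so the following is only a guess at what `--max_length` and `--max_orthologs` could mean: drop sequences longer than `max_length` residues and keep at most `max_orthologs` sequences per gene. The function name `filter_orthologs` and the record keys `gene` and `sequence` are hypothetical.

```python
from collections import defaultdict


def filter_orthologs(records, max_length=1000, max_orthologs=20):
    """Keep short sequences, capped at max_orthologs per gene (assumed semantics)."""
    kept = []
    per_gene = defaultdict(int)  # number of sequences kept so far for each gene
    for record in records:
        if len(record["sequence"]) > max_length:
            continue  # sequence is too long
        if per_gene[record["gene"]] >= max_orthologs:
            continue  # this gene has already reached its cap
        per_gene[record["gene"]] += 1
        kept.append(record)
    return kept
```

If the real script instead discards whole genes that exceed the ortholog limit, the second check would need to change accordingly.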
To extract sequences for a specific gene and publish:
python extract_and_publish_protein_sequences.py \
--input_folder data/zoonomia/ \
--input_file protein_sequence_df.tsv \
--output_folder data/zoonomia/ \
--output_file TP53_protein_sequences.fasta \
--gene TP53 \
--publish \
--repo_name filtered-zoonomia-tp53 \
--hf_token hf_your_token
To extract all sequences and publish:
python extract_and_publish_protein_sequences.py \
--input_folder data/zoonomia/ \
--input_file protein_sequence_df.tsv \
--output_folder data/zoonomia/ \
--output_file all_protein_sequences.fasta \
--publish \
--repo_name filtered-zoonomia-all \
--hf_token hf_your_token
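As a rough illustration of the extraction step, the sketch below filters a tab-separated table by gene and formats the matches as FASTA. It is not the code of extract_and_publish_protein_sequences.py: the column names `gene`, `species`, and `sequence` and the function name `rows_to_fasta` are assumptions, and publishing is left out.

```python
import csv
import io


def rows_to_fasta(tsv_text, gene=None, width=60):
    """Return FASTA text for rows matching `gene` (all rows when gene is None)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    entries = []
    for row in reader:
        if gene is not None and row["gene"] != gene:
            continue
        header = f">{row['gene']}|{row['species']}"
        sequence = row["sequence"]
        # Wrap the sequence at a fixed width, as is conventional for FASTA.
        lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
        entries.append("\n".join([header, *lines]))
    return "\n".join(entries) + ("\n" if entries else "")
```

Calling `rows_to_fasta(text, gene="TP53")` corresponds to the `--gene TP53` case, and omitting `gene` corresponds to extracting all sequences.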