gdolsten/seq-to-pheno

seq-to-pheno

Code for longevity: put it in the longevity project.

Code for DepMap: put it in the depmap project.

Fetching embeddings from sequences: get_embeddings.py

Datasets

Get

Get the first rows of the filtered ortholog dataset:

curl -X GET \
     "https://datasets-server.huggingface.co/first-rows?dataset=seq-to-pheno%2Ffiltered_orthologs&config=default&split=train"

Get the first 100 rows of the mapped ortholog dataset (expects a Hugging Face token in HF_TOKEN):

curl -X GET \
     -H "Authorization: Bearer $HF_TOKEN" \
     "https://datasets-server.huggingface.co/rows?dataset=seq-to-pheno%2Fmapped_orthologs&config=default&split=train&offset=0&length=100"
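The two curl calls above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_url` and `fetch_json` are illustrative and are not part of this repository. It rebuilds the same two URLs and leaves the actual request to the caller.

```python
import json
import urllib.request
from urllib.parse import urlencode

BASE_URL = "https://datasets-server.huggingface.co"


def build_url(endpoint, dataset, config="default", split="train", **extra):
    """Build a datasets-server URL (hypothetical helper, not from this repo)."""
    params = {"dataset": dataset, "config": config, "split": split, **extra}
    # urlencode percent-encodes the "/" in the dataset name as %2F.
    return f"{BASE_URL}/{endpoint}?{urlencode(params)}"


def fetch_json(url, token=None):
    """GET the URL and decode the JSON body; sends a bearer token if given."""
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Same URLs as the two curl commands above.
first_rows_url = build_url("first-rows", "seq-to-pheno/filtered_orthologs")
rows_url = build_url("rows", "seq-to-pheno/mapped_orthologs", offset=0, length=100)
```

Calling `fetch_json(rows_url, token=os.environ.get("HF_TOKEN"))` (after `import os`) would then perform the authenticated request and return the decoded JSON response.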

Use

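# TCGA cancer variant and clinical data, loaded with the datasets library
from datasets import load_dataset

ds = load_dataset("seq-to-pheno/TCGA-Cancer-Variant-and-Clinical-Data")

# The same dataset, loaded through its Croissant metadata
from mlcroissant import Dataset

ds = Dataset(jsonld="https://huggingface.co/api/datasets/seq-to-pheno/TCGA-Cancer-Variant-and-Clinical-Data/croissant")
records = ds.records("default")

# The protein sequence metadata table, read directly with pandas
# (the hf:// protocol requires huggingface_hub to be installed)
import pandas as pd

df = pd.read_csv("hf://datasets/seq-to-pheno/TCGA-Cancer-Variant-and-Clinical-Data/protein_sequences_metadata.tsv", sep="\t")

# The mapped ortholog dataset
from datasets import load_dataset

mapped = load_dataset("seq-to-pheno/mapped_orthologs")

# The filtered ortholog dataset
from datasets import load_dataset

filtered = load_dataset("seq-to-pheno/filtered_orthologs")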

Re-create the filtered ortholog dataset:

python ./scripts/filtered_dataset.py --folder /downloads --template_path /seq_to_pheno/hug/zoonomia_dataset_repo_template/README.md --token hf_xxx --max_length 1000 --max_orthologs 20 --publish

Re-create the FASTA Zoonotica dataset:

To extract sequences for a specific gene and publish:

python extract_and_publish_protein_sequences.py --input_folder data/zoonomia/ --input_file protein_sequence_df.tsv --output_folder data/zoonomia/ --output_file TP53_protein_sequences.fasta --gene TP53 --publish --repo_name filtered-zoonomia-tp53 --hf_token hf_your_token

To extract all sequences and publish:

python extract_and_publish_protein_sequences.py --input_folder data/zoonomia/ --input_file protein_sequence_df.tsv --output_folder data/zoonomia/ --output_file all_protein_sequences.fasta --publish --repo_name filtered-zoonomia-all --hf_token hf_your_token
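For orientation, the extraction step (TSV in, FASTA out, optionally restricted to one gene) can be sketched as below. This is not the code of extract_and_publish_protein_sequences.py: the column names `gene`, `species`, and `sequence`, the header layout, and the helper `tsv_to_fasta` are all assumptions made for illustration, and the sequences are made-up placeholders.

```python
import csv
import io


def tsv_to_fasta(tsv_text, gene=None, width=60):
    """Convert TSV rows to FASTA text, optionally keeping a single gene.

    Assumes hypothetical columns named gene, species, and sequence.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    records = []
    for row in reader:
        if gene is not None and row["gene"] != gene:
            continue
        seq = row["sequence"]
        # Wrap the sequence at a fixed width, as is conventional for FASTA.
        wrapped = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
        records.append(f">{row['gene']}|{row['species']}\n{wrapped}")
    return "\n".join(records) + ("\n" if records else "")


# Placeholder input: toy sequences and species labels, not real data.
example_tsv = (
    "gene\tspecies\tsequence\n"
    "TP53\tspecies_a\tMKTAYIAK\n"
    "BRCA1\tspecies_a\tMDLSALRV\n"
    "TP53\tspecies_b\tMKTWYIAK\n"
)
tp53_fasta = tsv_to_fasta(example_tsv, gene="TP53")
all_fasta = tsv_to_fasta(example_tsv)
```

Passing `gene="TP53"` mirrors the `--gene TP53` call above, and omitting it mirrors the "all sequences" call. Publishing to the Hugging Face Hub is left out of the sketch.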
