3. Download files

Abdurrahman Abul-Basher edited this page Jul 8, 2021 · 34 revisions

Overview

triUMPF requires a set of object files to run the core commands along with test samples to train and predict pathways. The test samples can either be used to train or test the triUMPF model. Please download these files from Zenodo. Once you have downloaded the triUMPF_materials.zip file, unzip it and make sure you obtain the three folders: objectset/, model/, and dataset/, as depicted below:

Note: This tree structure for the directory was generated using the tree command in the terminal (on Linux) and in the command prompt (on Windows).

triUMPF_materials/
	├── objectset/
        │       ├── biocyc.pkl
        │       ├── pathway2ec.pkl
        │       ├── pathway2ec_idx.pkl
        │       ├── hin.pkl
        │       ├── pathway2vec_embeddings.npz
        │       └── ...
	├── model/
        │       ├── triUMPF.pkl, triUMPF_C.pkl, triUMPF_K.pkl
        │       └── ...
	└── dataset/
                ├── biocyc205_tier23_9255_[X, Xe, y, species].pkl	
                ├── three_ecoli/
                │        ├── MG1655
                │        │      └── 0.pf
                │        ├── EDL933
                │        │      └── 0.pf
                │        └── CFT073
                │               └── 0.pf
                ├── three_ecoli_[X, Xe, triumpf_y, taxprune_pathologic_y, notaxprune_pathologic_y].pkl
                ├── golden_X.pkl, golden_Xe.pkl, golden_y.pkl
                ├── cami_X.pkl, cami_Xe.pkl, cami_y.pkl
                ├── symbionts_X.pkl, symbionts_Xe.pkl
                ├── hots_4_X.pkl, hots_4_Xe.pkl
                ├── M.pkl, P.pkl, E.pkl, A.pkl, B.pkl
                └── ...
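
The unzip-and-check step described above can be sketched in Python as follows. This is a minimal sketch, not part of triUMPF: the function names are hypothetical, and it assumes the archive unpacks into a top-level triUMPF_materials/ folder as shown in the tree.

```python
import zipfile
from pathlib import Path

# Folder names taken from the directory tree above.
EXPECTED_FOLDERS = ("objectset", "model", "dataset")


def extract_materials(zip_path, dest_dir="."):
    """Unzip the archive into dest_dir and return the assumed top-level folder."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    # Assumption: the archive contains a top-level triUMPF_materials/ folder.
    return Path(dest_dir) / "triUMPF_materials"


def missing_folders(materials_dir):
    """Return the expected sub-folders that are absent from materials_dir."""
    root = Path(materials_dir)
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]


if __name__ == "__main__":
    zip_file = Path("triUMPF_materials.zip")  # assumed download location
    if zip_file.exists():
        root = extract_materials(zip_file)
        print("missing folders:", missing_folders(root) or "none")
    else:
        print(f"{zip_file} not found; download it from Zenodo first")
```

An empty list from missing_folders means all three folders are in place.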

A short description of the contents of the above folders is given below.

objectset/

In this folder, 8 core object files are provided that contain various pathway and enzyme information. These files are important for preprocessing, predicting, and training triUMPF. We will use the following five object files in this wiki:

| File | Description | Size |
| --- | --- | --- |
| biocyc.pkl | An object containing the preprocessed MetaCyc database in the form of pathway ids, EC numbers, reaction ids, gene names, gene ids, etc. | 91.8MB |
| pathway2ec.pkl | A matrix file representing the pathway-enzyme association. It contains 2526 pathway indices shown in the first column and 3650 enzymes (represented as EC number indices) in the remaining columns. | 81.0kB |
| pathway2ec_idx.pkl | A matrix of pathway2ec association indices. | 29.4kB |
| hin.pkl | A sample heterogeneous information network. | 10.5MB |
| pathway2vec_embeddings.npz | A matrix file containing a sample of embeddings using RUST-norm. The rows (22593) correspond to the pathway, enzyme, and compound embeddings and the columns (128) represent the features. These features can be generated using pathway2vec. | 11.6MB |

Here, we show you a visual depiction of some of the object files to help deepen your understanding.

biocyc.pkl

The biocyc.pkl file contains the preprocessed MetaCyc database. Genes, proteins, enzymes, reactions, pathways, and compounds are all represented as dictionaries containing the individual IDs for each of the 6 categories. This file can be obtained by following the steps highlighted in prepBioCyc.

biocyc.pkl
	├── list_kb_paths       
        │     ├── metacyc       # The MetaCyc database
        │     └── ...           # Remaining databases (e.g. EcoCyc)
	├── processed_kb
        │     ├── metacyc
        │     │      ├── 0      # Protein related info
        │     │      ├── 1      # Compound related info
        │     │      ├── 2      # Gene related info
        │     │      ├── 3      # Enzyme related info
        │     │      ├── 4      # Reaction related info
        │     │      └── 5      # Pathway related info
        │     └── ...
	├── protein_id          # A 2-tuple (dictionary) representing protein ids and their indices 
	├── gene_id             # A 2-tuple (dictionary) representing gene ids and their indices 
	├── enzyme_id           # A 2-tuple (dictionary) representing enzymatic reaction ids and their indices 
	├── compound_id         # A 2-tuple (dictionary) representing compound ids and their indices 
	├── reaction_id         # A 2-tuple (dictionary) representing reaction ids and their indices 
	├── pathway_id          # A 2-tuple (dictionary) representing pathway frame ids and their indices 
	├── ec_id               # A 2-tuple (dictionary) representing EC numbers and their indices 
	├── gene_name_id        # A 2-tuple (dictionary) representing gene names and their indices 
	├── go_id               # A 2-tuple (dictionary) representing GO (gene ontology) ids and their indices 
	└── ...
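
To inspect this structure yourself, the file can be opened with Python's standard pickle module. The sketch below rests on two assumptions: that the .pkl files are ordinary pickles, and that the top-level object of biocyc.pkl is a dictionary keyed by the names in the tree above. The helper names are hypothetical. Unpickling may also require the libraries that were installed when the files were written, and pickle files should only be loaded from sources you trust.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load one .pkl file with Python's standard pickle module."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def describe(obj):
    """Return the sorted keys of a dict, or the type name of anything else."""
    if isinstance(obj, dict):
        return sorted(str(key) for key in obj)
    return type(obj).__name__


if __name__ == "__main__":
    # Path assumed from the directory tree on this page.
    biocyc_path = Path("triUMPF_materials/objectset/biocyc.pkl")
    if biocyc_path.exists():
        print(describe(load_pickle(biocyc_path)))
    else:
        print(f"{biocyc_path} not found")
```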

pathway2ec.pkl

The pathway2ec.pkl file contains the pathway-enzyme associations with the values in the enzyme columns depicting the number of times an enzyme contributes to the pathways shown. For example, in the table below (after including pathway and EC numbers from "biocyc.pkl"), the enzyme ketol-acid reductoisomerase (NADP+) (EC-1.1.1.86) contributes 1 time to the L-valine biosynthesis pathway but does not contribute to any of the other pathways shown in the table.

| Pathway | EC-1.1.1.86 | EC-1.3.1.9 | EC-2.1.1.79 | EC-2.2.1.6 | EC-2.6.1.42 | EC-2.6.1.13 | EC-3.5.3.1 | EC-4.2.1.59 | EC-6.2.1.3 | EC-6.3.2.M5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| L-valine biosynthesis | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| L-arginine degradation VI (arginase 2 pathway) | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| cyclopropane fatty acid (CFA) biosynthesis | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| palmitate biosynthesis II (bacteria and plants) | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 7 | 2 | 0 |
| jasmonoyl-amino acid conjugates biosynthesis I | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
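
The way this table is read can be sketched with a toy copy of the rows above. The variable and function names are illustrative only, and the real pathway2ec.pkl object may be stored differently (for example as an array with separate index files).

```python
# Toy copy of the table above: EC column labels and per-pathway counts.
EC_COLUMNS = [
    "EC-1.1.1.86", "EC-1.3.1.9", "EC-2.1.1.79", "EC-2.2.1.6", "EC-2.6.1.42",
    "EC-2.6.1.13", "EC-3.5.3.1", "EC-4.2.1.59", "EC-6.2.1.3", "EC-6.3.2.M5",
]
PATHWAY2EC = {
    "L-valine biosynthesis": [1, 0, 0, 1, 1, 0, 0, 0, 0, 0],
    "L-arginine degradation VI (arginase 2 pathway)": [0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
    "cyclopropane fatty acid (CFA) biosynthesis": [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    "palmitate biosynthesis II (bacteria and plants)": [0, 7, 0, 0, 0, 0, 0, 7, 2, 0],
    "jasmonoyl-amino acid conjugates biosynthesis I": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
}


def enzymes_for(pathway):
    """Return {EC number: count} for the non-zero entries of one pathway row."""
    return {ec: n for ec, n in zip(EC_COLUMNS, PATHWAY2EC[pathway]) if n > 0}


def pathways_using(ec):
    """Return the pathways whose row has a non-zero count for the given EC."""
    col = EC_COLUMNS.index(ec)
    return [name for name, row in PATHWAY2EC.items() if row[col] > 0]


print(enzymes_for("palmitate biosynthesis II (bacteria and plants)"))
# {'EC-1.3.1.9': 7, 'EC-4.2.1.59': 7, 'EC-6.2.1.3': 2}
print(pathways_using("EC-1.1.1.86"))
# ['L-valine biosynthesis']
```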

pathway2vec_embeddings.npz

The pathway2vec_embeddings.npz is a matrix file corresponding to the embeddings of pathways, EC numbers, and compounds. These features are generated using pathway2vec. For example, after including pathway and EC numbers from "biocyc.pkl" in the first column and excluding compounds, the table can be seen as:

| Pathway and EC | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| L-valine biosynthesis | 0.089106 | 0.092924 | 0.089035 | 0.101823 | 0.072792 | 0.083173 | 0.096259 | 0.064823 | 0.071481 | 0.094392 |
| methylquercetin biosynthesis | 0.112329 | 0.075717 | 0.087717 | 0.094391 | 0.081035 | 0.074514 | 0.095572 | 0.072581 | 0.068458 | 0.096449 |
| cyanide degradation | 0.073566 | 0.094817 | 0.087664 | 0.099661 | 0.089182 | 0.103727 | 0.093147 | 0.093047 | 0.083330 | 0.095017 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| EC-1.1.1.10 | 0.095318 | 0.094138 | 0.097567 | 0.087115 | 0.084483 | 0.098668 | 0.078173 | 0.091465 | 0.086675 | 0.086497 |
| EC-1.1.1.100 | 0.047987 | 0.096748 | 0.092529 | 0.092395 | 0.116745 | 0.092556 | 0.106274 | 0.107414 | 0.079025 | 0.098948 |
| EC-1.1.1.101 | 0.090137 | 0.085566 | 0.087589 | 0.089496 | 0.082936 | 0.088855 | 0.083835 | 0.091411 | 0.085721 | 0.090588 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

model/

In this folder, a pre-trained model is provided to predict metabolic pathways using the datasets described in the dataset/ section.

| File | Description | Size |
| --- | --- | --- |
| triUMPF.pkl | A pretrained model generated using biocyc205_tier23_9255_Xe.pkl and biocyc205_tier23_9255_y.pkl data with 90 pathway communities and 100 enzyme communities. | 105MB |
| triUMPF_C.pkl | A pathway community indicator matrix. It contains 2526 pathway indices shown in the first column and 90 community indices in the remaining columns. | 1.22MB |
| triUMPF_K.pkl | An enzyme (represented by EC numbers) community indicator matrix. It contains 3650 EC number indices shown in the first column and 100 community indices in the remaining columns. | 1.96MB |

triUMPF_C.pkl

This is a matrix file corresponding to the communities of pathways. Rows correspond to pathway indices and columns represent community indices. For example, after including pathways from "biocyc.pkl" in the first column, the table can be seen as:

| Pathway | 0 | 2 | 6 | 10 | 14 | 19 | 25 | 26 | 81 | 82 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| L-valine biosynthesis | 0.000000e+00 | 0.000000 | 0.014566 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.381181 | 0.000000e+00 | 0.000000 |
| L-arginine degradation VI (arginase 2 pathway) | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000e+00 | 0.268884 |
| cyclopropane fatty acid (CFA) biosynthesis | 0.000000e+00 | 1.945544 | 0.000000 | 1.159008e-36 | 5.421540e-71 | 1.427154e-200 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 |
| palmitate biosynthesis II (bacteria and plants) | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.476554 | 5.870750e-308 | 0.000000 |
| jasmonoyl-amino acid conjugates biosynthesis I | 1.485445e-316 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.700195 | 0.000000 | 0.000000e+00 | 0.000000 |

It can be seen that the L-valine biosynthesis pathway is most likely to be grouped under the community indexed by 26 (the highest value).

triUMPF_K.pkl

This is a matrix file corresponding to the communities of enzymes (represented by EC number indices). Rows correspond to EC number indices and columns represent community indices. For example, after including EC numbers from "biocyc.pkl" in the first column, the table can be seen as:

| EC | 8 | 23 | 27 | 34 | 54 | 58 | 62 | 72 | 82 | 98 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EC-1.1.1.100 | 0.000000 | 0.00000 | 1.725058e-110 | 0.000000 | 0.000000 | 0.000000 | 0.213211 | 0.000000 | 0.000000 | 0.129857 |
| EC-1.1.1.101 | 0.000000 | 0.00000 | 0.000000e+00 | 0.000000 | 0.095322 | 0.179075 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| EC-1.1.1.102 | 0.010917 | 0.08915 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| EC-1.1.1.103 | 0.000000 | 0.00000 | 3.676552e-02 | 0.017895 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| EC-1.1.1.105 | 0.000000 | 0.00000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.401225 | 0.004426 | 0.000000 |

It can be seen that the 3-oxoacyl-[acyl-carrier-protein] reductase (EC-1.1.1.100) enzyme is most likely to be grouped under the community indexed by 62 (the highest value).
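
Reading off the most likely community amounts to taking the column with the largest value in a row. The sketch below uses only the ten columns shown in the two tables above, so it illustrates the lookup rather than a result over all 90 or 100 communities. The function name is hypothetical.

```python
def top_community(community_ids, weights):
    """Return (community index, weight) for the largest weight in one row."""
    best = max(range(len(weights)), key=lambda i: weights[i])
    return community_ids[best], weights[best]


# Row for L-valine biosynthesis from the triUMPF_C.pkl excerpt.
pathway_communities = [0, 2, 6, 10, 14, 19, 25, 26, 81, 82]
l_valine = [0.0, 0.0, 0.014566, 0.0, 0.0, 0.0, 0.0, 0.381181, 0.0, 0.0]

# Row for EC-1.1.1.100 from the triUMPF_K.pkl excerpt.
enzyme_communities = [8, 23, 27, 34, 54, 58, 62, 72, 82, 98]
ec_1_1_1_100 = [0.0, 0.0, 1.725058e-110, 0.0, 0.0, 0.0, 0.213211, 0.0, 0.0, 0.129857]

print(top_community(pathway_communities, l_valine))     # (26, 0.381181)
print(top_community(enzyme_communities, ec_1_1_1_100))  # (62, 0.213211)
```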

dataset/

In this folder, 26 data files are provided to train, predict, and evaluate metabolic pathways using the pre-trained triUMPF model (e.g., "triUMPF.pkl") or to train a new model. The data are categorized into the following three types: 1) pathway training data, 2) pathway test data, and 3) other necessary data items.

1. Pathway training data

The following four files can be used to train triUMPF. Biocyc (v20.5) tier 2 and 3 PGDBs were processed using prepBioCyc.

| File | Description | Size |
| --- | --- | --- |
| biocyc205_tier23_9255_X.pkl | A matrix file of 9255 organisms, whose information is extracted from Biocyc (v20.5) tier 2 and 3 PGDBs. The columns (3650) for each organism represent EC number indices filled with integer values indicating the abundance of ECs for that organism. | 25.4MB |
| biocyc205_tier23_9255_Xe.pkl | A matrix file of 9255 organisms, whose information is extracted from Biocyc (v20.5) tier 2 and 3 PGDBs. The columns for each organism represent the EC number abundances together with embeddings for that organism (3778 columns, matching the "Xe.pkl" test files described under pathway test data). | 74.7MB |
| biocyc205_tier23_9255_y.pkl | A binary matrix indicating the presence/absence of pathway indices (2526 entries) for each of the 9255 organisms. | 63.3MB |
| biocyc205_tier23_9255_species.pkl | File metadata in tuple format (folder id, taxa id, species), extracted from Biocyc (v20.5) tier 2 and 3 PGDBs. | 6.35MB |

The following table depicts biocyc205_tier23_9255_X.pkl (after including taxa and species information from "biocyc205_tier23_9255_species.pkl"), where the first column represents the taxonomic identifiers (see NCBI Taxonomy) and the second column represents species name. The remaining columns represent EC numbers.

| Taxa | Species | EC-1.1.1.10 | EC-1.1.1.101 | EC-1.1.1.102 | EC-6.4.1.4 | EC-6.4.1.5 | EC-6.4.1.6 | EC-6.4.1.7 | EC-6.4.1.8 | EC-6.4.1.b | EC-6.5.1.8 | EC-6.6.1.1 | EC-6.6.1.2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TAX-887700 | Acetobacter aceti | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| TAX-1048834 | Alicyclobacillus acidocaldarius | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| TAX-521098 | Alicyclobacillus acidocaldarius | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| TAX-1035194 | Aggregatibacter actinomycetemcomitans | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| TAX-1089447 | Aggregatibacter actinomycetemcomitans | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

The following table depicts biocyc205_tier23_9255_y.pkl (after including taxa and species information from "biocyc205_tier23_9255_species.pkl"), where the first column represents the taxonomic identifiers (see NCBI Taxonomy) and the second column represents organism information. The remaining columns represent the pathways.

| Taxa | Species | L-valine biosynthesis | L-arginine degradation VI (arginase 2 pathway) | cyclopropane fatty acid (CFA) biosynthesis | palmitate biosynthesis II (bacteria and plants) | jasmonoyl-amino acid conjugates biosynthesis I | pyridoxal 5'-phosphate salvage I | adenosine deoxyribonucleotides de novo biosynthesis |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TAX-887700 | Acetobacter aceti | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| TAX-1048834 | Alicyclobacillus acidocaldarius | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| TAX-521098 | Alicyclobacillus acidocaldarius | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| TAX-1035194 | Aggregatibacter actinomycetemcomitans | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| TAX-1089447 | Aggregatibacter actinomycetemcomitans | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 |
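
As a small illustration of how the binary y matrix is read, the sketch below copies three rows of the table above and lists the pathways marked present for an organism, then compares two organisms. The names are illustrative, and rows are keyed by taxonomic identifier because species names repeat.

```python
# Pathway columns and three rows copied from the y table above.
PATHWAYS = [
    "L-valine biosynthesis",
    "L-arginine degradation VI (arginase 2 pathway)",
    "cyclopropane fatty acid (CFA) biosynthesis",
    "palmitate biosynthesis II (bacteria and plants)",
    "jasmonoyl-amino acid conjugates biosynthesis I",
    "pyridoxal 5'-phosphate salvage I",
    "adenosine deoxyribonucleotides de novo biosynthesis",
]
Y_ROWS = {
    "TAX-887700": [1, 0, 1, 1, 0, 1, 1],
    "TAX-1035194": [0, 0, 0, 1, 0, 1, 1],
    "TAX-1089447": [1, 0, 0, 1, 0, 1, 1],
}


def present_pathways(taxon):
    """Return the pathways flagged 1 (present) for one organism."""
    return [name for name, flag in zip(PATHWAYS, Y_ROWS[taxon]) if flag == 1]


def differing_pathways(taxon_a, taxon_b):
    """Return the pathways whose presence flag differs between two organisms."""
    pairs = zip(PATHWAYS, Y_ROWS[taxon_a], Y_ROWS[taxon_b])
    return [name for name, a, b in pairs if a != b]


print(len(present_pathways("TAX-887700")))                # 5
print(differing_pathways("TAX-1035194", "TAX-1089447"))   # ['L-valine biosynthesis']
```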

2. Pathway test data

The following data can be used to perform pathway prediction and evaluation of the pre-trained triUMPF model. Please see the mlLGPR repository on how to obtain and preprocess the data below.

| Files | Description | Size |
| --- | --- | --- |
| three_ecoli/ | This directory contains "0.pf" files for E. coli K-12 substr. MG1655 (TAX-511145), E. coli str. CFT073 (TAX-199310), and E. coli O157:H7 str. EDL933 (TAX-155864). A tutorial on how to use this data type is provided in Tutorial on pathway prediction (Example 1). | 767KB |
| three_ecoli_X.pkl, three_ecoli_Xe.pkl, three_ecoli_triumpf_y.pkl, three_ecoli_taxprune_pathologic_y.pkl, three_ecoli_notaxprune_pathologic_y.pkl | These are the preprocessed three_ecoli data in a matrix format where rows correspond to E. coli str. CFT073, E. coli O157:H7 str. EDL933, and E. coli K-12 substr. MG1655, respectively. Columns for "X.pkl", "Xe.pkl", and "_y.pkl" correspond to 3650 EC number indices, 3778 EC number indices and embeddings, and 2526 pathway indices, respectively. The "_triumpf_y.pkl", "_taxprune_pathologic_y.pkl", and "_notaxprune_pathologic_y.pkl" data indicate the pathway indices predicted by triUMPF (triUMPF.pkl), Pathologic with taxonomic pruning, and Pathologic without taxonomic pruning, respectively. | 116KB |
| golden_X.pkl, golden_Xe.pkl, golden_y.pkl | This is the Golden dataset in a matrix format where rows correspond to AraCyc, EcoCyc, HumanCyc, LeishCyc, TrypanoCyc, and YeastCyc, respectively. Columns for "*X.pkl", "*Xe.pkl", and "*y.pkl" correspond to 3650 EC number indices, 3778 EC number indices and embeddings, and 2526 pathway indices, respectively. | 154KB |
| cami_X.pkl, cami_Xe.pkl, cami_y.pkl | These files correspond to the CAMI low complexity data with the rows representing 40 species. Columns for "*X.pkl", "*Xe.pkl", and "*y.pkl" correspond to 3650 EC number indices, 3778 EC number indices and embeddings, and 2526 pathway indices, respectively. | 396KB |
| symbionts_X.pkl and symbionts_Xe.pkl | These files correspond to the symbiont dataset with the rows representing Moranella, Tremblaya, and a composition of both genomes. Columns for "*X.pkl" and "*Xe.pkl" correspond to 3650 EC number indices and 3778 EC number indices and embeddings, respectively. | 13.1KB |
| hots_4_X.pkl and hots_4_Xe.pkl | These files correspond to the Hawaii Ocean Time Series (HOTS) data at 10m (rows 0 and 1), 75m (rows 2 and 3), 110m (row 4), and 500m (rows 5 and 6) ocean depth intervals. Columns for "*X.pkl" and "*Xe.pkl" correspond to 3650 EC number indices and 3778 EC number indices and embeddings, respectively. | 172KB |
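
The column counts above are consistent with 3650 EC number columns plus 128 embedding features (3650 + 128 = 3778). The sketch below splits one "Xe" row on that basis. The column order (EC abundances first, embedding features last) is an assumption made for illustration and is not stated on this page, so check it against the triUMPF code before relying on it.

```python
NUM_ECS = 3650       # EC number columns, as listed in the tables on this page
NUM_FEATURES = 128   # embedding features, as listed for pathway2vec_embeddings.npz


def split_xe_row(row):
    """Split one Xe row into (EC part, embedding part).

    Assumes, hypothetically, that EC columns come first and embeddings last.
    """
    if len(row) != NUM_ECS + NUM_FEATURES:
        raise ValueError(f"expected {NUM_ECS + NUM_FEATURES} columns, got {len(row)}")
    return row[:NUM_ECS], row[NUM_ECS:]


demo_row = [0.0] * (NUM_ECS + NUM_FEATURES)  # placeholder values, not real data
ec_part, embedding_part = split_xe_row(demo_row)
print(len(ec_part), len(embedding_part))  # 3650 128
```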

The three_ecoli data correspond to three E. coli strains, E. coli K-12 substr. MG1655 (TAX-511145), E. coli str. CFT073 (TAX-199310), and E. coli O157:H7 str. EDL933 (TAX-155864), which have been used in several benchmarking analyses. The 0.pf file shown below is the main input file to Pathway Tools that PathoLogic uses to make pathway predictions. It contains the annotated enzymes that result from MetaPathways v2 comparing the open reading frames (ORFs) to the MetaCyc database.

Below is an example of a "0.pf" file from E. coli K-12 substr. MG1655.

ID	ecoli-COLI-K12_0_2
NAME	ecoli-COLI-K12_0_2
STARTBASE	3734
ENDBASE	5020
PRODUCT	Threonine synthase # THRESYN-RXN 4.2.3.1
PRODUCT-TYPE	P
EC	4.2.3.1
//
ID	ecoli-COLI-K12_0_6
NAME	ecoli-COLI-K12_0_6
STARTBASE	8238
ENDBASE	9191
PRODUCT	Transaldolase # TRANSALDOL-RXN 2.2.1.2
PRODUCT-TYPE	P
EC	2.2.1.2
//
ID	ecoli-COLI-K12_0_7
NAME	ecoli-COLI-K12_0_7
STARTBASE	9306
ENDBASE	9893
PRODUCT	Molybdopterin adenylyltransferase # RXN-8344 2.7.7.75
PRODUCT-TYPE	P
EC	2.7.7.75
//
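
The layout above (one attribute per line, attribute name and value separated by whitespace, records terminated by "//") can be read with a few lines of Python. This is a minimal sketch based only on the excerpt shown here and is not the parser triUMPF itself uses. The function name is hypothetical, and values are stored in lists so that an attribute appearing more than once in a record is not lost.

```python
def parse_pf(text):
    """Parse .pf-style text into a list of {attribute: [values]} records."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":  # record terminator
            if current:
                records.append(current)
            current = {}
            continue
        parts = line.split(None, 1)  # attribute name, then the rest of the line
        value = parts[1] if len(parts) > 1 else ""
        current.setdefault(parts[0], []).append(value)
    if current:  # tolerate a final record without a trailing "//"
        records.append(current)
    return records


# First two records of the excerpt above.
SAMPLE = (
    "ID\tecoli-COLI-K12_0_2\n"
    "NAME\tecoli-COLI-K12_0_2\n"
    "STARTBASE\t3734\n"
    "ENDBASE\t5020\n"
    "PRODUCT\tThreonine synthase # THRESYN-RXN 4.2.3.1\n"
    "PRODUCT-TYPE\tP\n"
    "EC\t4.2.3.1\n"
    "//\n"
    "ID\tecoli-COLI-K12_0_6\n"
    "NAME\tecoli-COLI-K12_0_6\n"
    "STARTBASE\t8238\n"
    "ENDBASE\t9191\n"
    "PRODUCT\tTransaldolase # TRANSALDOL-RXN 2.2.1.2\n"
    "PRODUCT-TYPE\tP\n"
    "EC\t2.2.1.2\n"
    "//\n"
)

for record in parse_pf(SAMPLE):
    print(record["ID"][0], record["EC"][0])
```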

3. Other necessary data items

triUMPF requires additional data items for training and evaluation.

| Files | Description | Size |
| --- | --- | --- |
| M.pkl | A matrix file representing the pathway-enzyme association with possible missing links due to white noise. It contains 2526 pathway indices shown in the first column and 3650 enzymes (represented as EC number indices) in the remaining columns. The file representation is similar to pathway2ec.pkl. | 79.1KB |
| A.pkl | A binary matrix file representing the pathway-pathway interaction. It contains 2526 pathway indices shown in the first column with their interactions (2526 pathway indices) in the remaining columns. | 76.8KB |
| B.pkl | A binary matrix file representing the enzyme-enzyme interaction. It contains 3650 enzymes (represented as EC number indices) shown in the first column with their interactions (3650 EC number indices) in the remaining columns. | 152KB |
| P.pkl | A matrix file representing the pathway features. It contains 2526 pathway indices shown in the first column and their 128 features in the remaining columns. The file representation is similar to pathway2vec_embeddings.npz. | 3.42MB |
| E.pkl | A matrix file representing the enzyme features (enzymes represented as EC number indices). It contains 3650 EC number indices shown in the first column and their 128 features in the remaining columns. The file representation is similar to pathway2vec_embeddings.npz. | 4.95MB |

A.pkl is a binary matrix of pathway-pathway interactions, where each entry indicates whether a pair of pathways is adjacent. For example, in the table below, the palmitate biosynthesis II (bacteria and plants) pathway is adjacent to the palmitoleate biosynthesis II (plants and bacteria) pathway but is not linked to any of the other pathways shown in the table.

| Pathway | jasmonic acid biosynthesis | L-proline degradation | palmitoleate biosynthesis II (plants and bacteria) | adenosine ribonucleotides de novo biosynthesis | L-leucine biosynthesis | phosphopantothenate biosynthesis I | L-alanine biosynthesis I | 3-methylbutanol biosynthesis (engineered) | peramine biosynthesis |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| L-valine biosynthesis | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 |
| L-arginine degradation VI (arginase 2 pathway) | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| palmitate biosynthesis II (bacteria and plants) | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| jasmonoyl-amino acid conjugates biosynthesis I | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| adenosine deoxyribonucleotides de novo biosynthesis | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |

B.pkl is a binary matrix of enzyme-enzyme interactions, where each entry indicates whether a pair of enzymes (represented by EC number indices) is adjacent. For example, in the table below, the all-trans-retinol dehydrogenase (NAD+) (EC-1.1.1.105) is adjacent to the retinal dehydrogenase (EC-1.2.1.36) but is not linked to any of the other enzymes shown in the table.

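| EC | EC-1.2.1.36 | EC-2.3.1.179 | EC-2.3.1.86 | EC-4.2.1.59 | EC-2.3.1.15 | EC-2.3.1.29 | EC-2.3.1.42 | EC-2.3.1.50 | EC-2.3.1.51 | EC-2.5.1.26 | EC-2.7.1.91 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EC-1.1.1.100 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| EC-1.1.1.101 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| EC-1.1.1.102 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| EC-1.1.1.103 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| EC-1.1.1.105 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |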
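
A binary interaction matrix such as the excerpt above can equivalently be viewed as a list of adjacent pairs. The sketch below converts the rows shown in the table into such an edge list. The names are illustrative, and the stored B.pkl object may use a different container (for example a sparse matrix).

```python
# Column labels and rows copied from the B.pkl excerpt above.
COLUMN_ECS = [
    "EC-1.2.1.36", "EC-2.3.1.179", "EC-2.3.1.86", "EC-4.2.1.59", "EC-2.3.1.15",
    "EC-2.3.1.29", "EC-2.3.1.42", "EC-2.3.1.50", "EC-2.3.1.51", "EC-2.5.1.26",
    "EC-2.7.1.91",
]
ROW_ECS = ["EC-1.1.1.100", "EC-1.1.1.101", "EC-1.1.1.102", "EC-1.1.1.103", "EC-1.1.1.105"]
B_EXCERPT = [
    [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]


def edge_list(row_labels, col_labels, matrix):
    """Return (row label, column label) pairs for every entry equal to 1."""
    return [
        (r, c)
        for r, row in zip(row_labels, matrix)
        for c, flag in zip(col_labels, row)
        if flag == 1
    ]


edges = edge_list(ROW_ECS, COLUMN_ECS, B_EXCERPT)
print(len(edges))  # 11
print([c for r, c in edges if r == "EC-1.1.1.105"])  # ['EC-1.2.1.36']
```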