Signature 3 Detection

Dataset Preprocessing

This section describes the Python scripts used to ingest and process public datasets from the International Cancer Genome Consortium (ICGC) Data Portal for mutational signature analysis, with an emphasis on adherence to the FAIR (Findable, Accessible, Interoperable, and Reusable) principles.

Consuming ICGC Data

The Jupyter notebook notebooks/Consuming ICGC Data.ipynb shows how to download simple somatic mutation data from the ICGC Data Portal via its API.

Processing ICGC simple somatic mutation data

The Jupyter notebook Scripts/Dataset Preprocessing/Processing ICGC Data.ipynb shows how to process the simple somatic mutation (SSM) data downloaded from the ICGC Data Portal. From the processed SSM dataset, we generate a mutational spectra matrix, which tallies, for each sample, the number of single base substitutions in each context (defined by the bases immediately 5' and 3' of the mutated base). Determining the context of a substitution requires the GRCh37 reference genome, which we obtain via the UCSC Genome Browser API. The mutational spectra matrix for the BRCA datasets is stored in Data/MSK-Impact/WGS/ and can be used directly for mutational signature analysis.
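The tallying step can be illustrated with a minimal sketch. It assumes the reference trinucleotide around each substitution has already been looked up, and it uses the common 96-channel convention in which every substitution is reported with a pyrimidine (C or T) reference base. The function names and the input tuple layout are illustrative assumptions, not taken from the notebook.

```python
from collections import Counter
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# Six substitution classes, written with a pyrimidine (C or T) reference base.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
# 96 channels in the "5'[ref>alt]3'" notation, for example "A[C>T]G".
CHANNELS = [
    f"{five}[{sub}]{three}"
    for sub in SUBSTITUTIONS
    for five, three in product("ACGT", repeat=2)
]


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def channel_for(trinucleotide, alt):
    """Map a reference trinucleotide and an alternate base to a channel label."""
    trinucleotide, alt = trinucleotide.upper(), alt.upper()
    if trinucleotide[1] in "AG":
        # Purine reference base: report the mutation on the opposite strand.
        trinucleotide = reverse_complement(trinucleotide)
        alt = COMPLEMENT[alt]
    five, ref, three = trinucleotide
    return f"{five}[{ref}>{alt}]{three}"


def spectrum_matrix(mutations):
    """Tally (sample_id, trinucleotide, alt) tuples into per-sample 96-count rows.

    Labels that are not valid channels (for example ref == alt) are not counted.
    """
    counts = Counter(
        (sample, channel_for(tri, alt)) for sample, tri, alt in mutations
    )
    samples = sorted({sample for sample, _ in counts})
    return {s: [counts[(s, ch)] for ch in CHANNELS] for s in samples}


# Hypothetical input: G>A in the context CGT is the same event as C>T in ACG.
matrix = spectrum_matrix([("s1", "ACG", "T"), ("s1", "CGT", "A")])
```

Collapsing to pyrimidine references is what reduces the 192 strand-specific possibilities to 96 channels; contexts containing bases other than A, C, G, or T would need to be filtered out before calling this sketch.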

Simulating MSK-IMPACT 410 gene panel data from whole genome MAF files

The Jupyter notebook Scripts/Dataset Preprocessing/Simulating MSK-IMPACT 410.ipynb shows how to simulate MSK-IMPACT 410 gene panel mutational spectra from the MAF files generated from the ICGC whole genome SSM dataset.
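One simple way to picture the simulation is to restrict a whole genome MAF to the mutations that fall in panel genes. The sketch below filters on the standard MAF column Hugo_Symbol; the function name and the two-gene set are placeholders, and the notebook may instead filter on the panel's targeted genomic regions, which would be stricter than a gene-level filter.

```python
import csv


def filter_maf_to_panel(maf_text, panel_genes):
    """Keep only the MAF rows whose Hugo_Symbol is in panel_genes.

    maf_text: contents of a tab-separated MAF file.
    panel_genes: set of gene symbols covered by the panel.
    """
    # MAF files may start with comment lines such as "#version 2.4".
    lines = (line for line in maf_text.splitlines() if not line.startswith("#"))
    reader = csv.DictReader(lines, delimiter="\t")
    return [row for row in reader if row["Hugo_Symbol"] in panel_genes]


# Hypothetical example with a placeholder gene set (not the real 410-gene list).
example_maf = (
    "#version 2.4\n"
    "Hugo_Symbol\tChromosome\tStart_Position\tTumor_Sample_Barcode\n"
    "BRCA1\t17\t41200000\tS1\n"
    "TTN\t2\t179400000\tS1\n"
)
panel_rows = filter_maf_to_panel(example_maf, {"BRCA1", "TP53"})
```

The retained rows can then be tallied into a spectra matrix in the same way as the whole genome data.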

Consuming MSK-IMPACT data and preparing its mutational spectra matrix

The Jupyter notebook Scripts/Dataset Preprocessing/Consuming and Preparing MSK-IMPACT Data.ipynb shows how to use the cBioPortal web API to download the MSK-IMPACT data from the 2017 paper "Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients". The downloaded data is then converted into a mutational spectra matrix and stored in Data/MSK-Impact/Panel/.
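As a rough sketch of the download step, the cBioPortal REST API exposes mutations per molecular profile. The identifiers used below (msk_impact_2017_mutations and msk_impact_2017_all) follow cBioPortal's usual naming for the 2017 study but are assumptions here, as are the function names; check them against the notebook or the cBioPortal API documentation before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

CBIOPORTAL_API = "https://www.cbioportal.org/api"


def mutations_url(molecular_profile_id, sample_list_id, projection="DETAILED"):
    """Build the URL that lists mutations for one molecular profile."""
    query = urlencode({"sampleListId": sample_list_id, "projection": projection})
    return (
        f"{CBIOPORTAL_API}/molecular-profiles/"
        f"{molecular_profile_id}/mutations?{query}"
    )


def fetch_mutations(molecular_profile_id, sample_list_id):
    """Download the mutation records as a list of dicts (needs network access)."""
    request = Request(
        mutations_url(molecular_profile_id, sample_list_id),
        headers={"Accept": "application/json"},
    )
    with urlopen(request) as response:
        return json.load(response)


# Assumed identifiers for the MSK-IMPACT 2017 study; verify before use.
url = mutations_url("msk_impact_2017_mutations", "msk_impact_2017_all")
```

Calling fetch_mutations with the same two identifiers would return the records to convert into a spectra matrix; the response can be large, so caching it to disk is advisable.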

Dataset Labeling

Bootstrap NNLS with Sigminer

Follow the instructions in the Sigminer repository to set up an R environment that can run Sigminer.
Place the R script Scripts/RScripts/NNLS_Bootstrapping.R into the repository where the Sigminer environment is set up.
Follow the instructions in Scripts/RScripts/NNLS_Bootstrapping.R to obtain the labels from the WGS dataset.
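For readers unfamiliar with the technique, the idea behind bootstrapped non-negative least squares (NNLS) can be sketched in Python. This is not the Sigminer implementation: it resamples a sample's mutations with replacement and refits the exposures each time, using a plain projected gradient loop as a stand-in NNLS solver (a library routine such as scipy.optimize.nnls would be the usual choice). All function names and defaults are illustrative assumptions.

```python
import random


def nnls_projected_gradient(signatures, spectrum, iterations=500):
    """Approximately minimise 0.5 * ||S x - b||^2 subject to x >= 0.

    signatures: list of k signature vectors of length n (the columns of S).
    spectrum: list of n observed counts (b).
    """
    k, n = len(signatures), len(spectrum)
    # Step size 1/L, where L is bounded above by the squared Frobenius norm of S.
    lipschitz = sum(v * v for sig in signatures for v in sig) or 1.0
    x = [0.0] * k
    for _ in range(iterations):
        fitted = [sum(x[j] * signatures[j][i] for j in range(k)) for i in range(n)]
        residual = [fitted[i] - spectrum[i] for i in range(n)]
        gradient = [
            sum(signatures[j][i] * residual[i] for i in range(n)) for j in range(k)
        ]
        # Gradient step, then projection onto the non-negative orthant.
        x = [max(0.0, x[j] - gradient[j] / lipschitz) for j in range(k)]
    return x


def bootstrap_exposures(signatures, spectrum, n_bootstraps=100, seed=0):
    """Refit exposures on resampled spectra; returns one exposure list per draw."""
    rng = random.Random(seed)
    total = int(sum(spectrum))
    channels = range(len(spectrum))
    fits = []
    for _ in range(n_bootstraps):
        # Resample the mutations with replacement (a multinomial draw).
        resampled = [0] * len(spectrum)
        for channel in rng.choices(channels, weights=spectrum, k=total):
            resampled[channel] += 1
        fits.append(nnls_projected_gradient(signatures, resampled))
    return fits


def fraction_above(fits, signature_index, threshold):
    """Share of bootstrap fits in which one signature's exposure exceeds threshold."""
    return sum(fit[signature_index] > threshold for fit in fits) / len(fits)
```

A label could then be derived by requiring that the Signature 3 exposure exceed some threshold in most bootstrap fits; the threshold and the required fraction are choices that the R script, not this sketch, defines. Real 96-channel signatures are correlated, so the simple solver above may need many more iterations than the default to converge.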

Model Evaluation and Deployment

Hyperparameter tuning

Run the scripts in Scripts/Hyperparameter Tuning/ to retune each machine learning model's hyperparameters. To set up these scripts in a Google Colab environment, follow these steps:

Open and make a copy of the corresponding notebook for the model that you want to tune: Neural Network, Nearest-neighbors, XGBoost.
(Optional) Speed up the training process by enabling GPU usage.
Run the cells in sequential order.

Model Evaluation

The Jupyter notebook Scripts/Main.ipynb runs 10-fold cross-validation on each of the four machine-learning models tested: XGBoost, Neural Network, Nearest-neighbors, and Logistic Regression. This is done to select the optimal hyperparameters for the deployed models. To reproduce the model evaluation in Google Colab, follow these steps:

Open and make a copy of this Google Colab Notebook.
(Optional) Speed up the training process by enabling GPU usage.
Run the cells in sequential order.
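The evaluation scheme itself, k-fold cross-validation, can be sketched with the standard library alone. The helper names, the shuffling, and the fit/score callables below are assumptions made for illustration; the notebook's own splitting may differ, for example by stratifying folds on the label.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(fit, score, features, labels, k=10, seed=0):
    """Return the mean held-out score across k folds.

    fit(X_train, y_train) returns a model.
    score(model, X_test, y_test) returns a float.
    """
    scores = []
    for train, test in k_fold_indices(len(labels), k, seed):
        model = fit([features[i] for i in train], [labels[i] for i in train])
        scores.append(
            score(model, [features[i] for i in test], [labels[i] for i in test])
        )
    return sum(scores) / len(scores)
```

To compare hyperparameter settings, call cross_validate once per candidate setting with the same seed, so every candidate sees identical folds, and keep the setting with the best mean score.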

Model Deployment

Each of the four machine-learning models tested (XGBoost, Neural Network, Nearest-neighbors, and Logistic Regression) is available for use on the GitHub Pages website: https://aaronge-2020.github.io/Sig3-Detection/

The workflow has three steps:

Upload a MAF file or a mutational spectrum matrix. For the required file format, see the examples in the Data/Example Files/ folder.
Compare the uploaded data with the training dataset. If the tested sample is an outlier whose distribution falls outside that of the training dataset, the model's prediction should be treated with less confidence.
Select a machine-learning model and obtain a prediction.
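This README does not specify how the website compares an upload with the training data. As one possible illustration only, the sketch below scores an uploaded spectrum by its highest cosine similarity to any training spectrum and flags it when that score is low. The function names and the 0.8 threshold are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length count vectors (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_training_similarity(sample, training_spectra):
    """Highest cosine similarity between the sample and any training spectrum."""
    return max(cosine_similarity(sample, row) for row in training_spectra)


def looks_like_outlier(sample, training_spectra, threshold=0.8):
    """Flag samples that resemble no training spectrum (threshold is hypothetical)."""
    return nearest_training_similarity(sample, training_spectra) < threshold
```

Cosine similarity ignores the total mutation count, so a complementary check on mutation burden may also be worthwhile for panel data, where counts per sample are small.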
