This section describes the Python scripts used to ingest and process public datasets from the International Cancer Genome Consortium (ICGC) Data Portal for mutational signature analysis, with an emphasis on adherence to FAIR (Findable, Accessible, Interoperable, and Reusable) principles.
The Jupyter notebook notebooks/Consuming ICGC Data.ipynb shows how to consume simple somatic mutation data from the ICGC Data Portal via its APIs.
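For illustration, here is a minimal Python sketch of this kind of query. It assumes the ICGC DCC REST API root at https://dcc.icgc.org/api/v1; the project code BRCA-EU, the endpoint, and the filter structure follow the public API's conventions but may need adjusting for your use case.

```python
# Minimal sketch: query simple somatic mutations from the ICGC DCC REST API.
# The base URL, endpoint, and filter structure are assumptions based on the
# public API documentation and may need adjusting.
import json
import requests

BASE_URL = "https://dcc.icgc.org/api/v1"  # assumed ICGC DCC API root

def fetch_ssm_page(project_code="BRCA-EU", size=100, start=1):
    """Fetch one page of simple somatic mutations for an ICGC project."""
    filters = {"donor": {"projectId": {"is": [project_code]}}}
    params = {
        "filters": json.dumps(filters),  # ICGC expects a JSON-encoded filter
        "size": size,
        "from": start,
    }
    resp = requests.get(f"{BASE_URL}/mutations", params=params, timeout=60)
    resp.raise_for_status()
    return resp.json()

page = fetch_ssm_page()
for hit in page.get("hits", []):
    print(hit.get("id"), hit.get("mutation"))
```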
The Jupyter notebook Scripts/Dataset Preprocessing/Processing ICGC Data.ipynb shows how to process the simple somatic mutation (SSM) data downloaded from the ICGC Data Portal. After processing the SSM dataset, we generate a mutational spectra matrix, which tallies the count of each single-base substitution by its trinucleotide context (determined by the bases immediately 5' and 3' of the mutated allele). Obtaining the context of each substitution requires the GRCh37 reference genome, which we retrieve via the UCSC Genome Browser API. The mutational spectra matrix for the BRCA datasets is stored in Data/MSK-Impact/WGS/ and can be readily used for mutational signature analysis.
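As a sketch of how flanking bases can be fetched and tallied, the example below uses the UCSC Genome Browser REST API (https://api.genome.ucsc.edu) with the hg19 (GRCh37) assembly; the toy mutation record and helper names are illustrative, not the notebook's exact code.

```python
# Minimal sketch of building a 96-context SBS spectrum, assuming the UCSC
# Genome Browser REST API and hg19 (GRCh37). Mutation records here are
# illustrative tuples of (chrom, 1-based position, ref, alt).
from collections import Counter
import requests

API = "https://api.genome.ucsc.edu/getData/sequence"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def trinucleotide_context(chrom, pos):
    """Fetch the base at `pos` (1-based) plus its 5' and 3' neighbors."""
    # The UCSC API uses 0-based, half-open coordinates.
    params = {"genome": "hg19", "chrom": chrom, "start": pos - 2, "end": pos + 1}
    resp = requests.get(API, params=params, timeout=60)
    resp.raise_for_status()
    return resp.json()["dna"].upper()

def classify_sbs(chrom, pos, ref, alt):
    """Map a substitution to one of the 96 pyrimidine-centered contexts."""
    tri = trinucleotide_context(chrom, pos)
    if ref in "GA":  # report relative to the pyrimidine strand
        tri = tri.translate(COMPLEMENT)[::-1]
        ref, alt = ref.translate(COMPLEMENT), alt.translate(COMPLEMENT)
    return f"{tri[0]}[{ref}>{alt}]{tri[2]}"

mutations = [("chr1", 1234567, "C", "T")]  # toy example
spectrum = Counter(classify_sbs(*m) for m in mutations)
print(spectrum)
```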
The Jupyter notebook Scripts/Dataset Preprocessing/Simulating MSK-IMPACT 410.ipynb shows how to simulate mutational spectra for the MSK-IMPACT 410-gene panel from the MAF files generated from the ICGC SSM whole-genome dataset.
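The core of this simulation is restricting whole-genome calls to the panel's target regions. A hedged sketch of that idea, filtering by gene symbol: the file names (wgs_mutations.maf, impact410_genes.txt) are placeholders, and a faithful simulation would filter by the panel's genomic intervals rather than gene names alone.

```python
# Sketch: simulate panel-restricted spectra by keeping only mutations in
# genes covered by the MSK-IMPACT 410 panel. File names are placeholders;
# Hugo_Symbol and Variant_Type are standard MAF columns.
import pandas as pd

maf = pd.read_csv("wgs_mutations.maf", sep="\t", comment="#", low_memory=False)
panel_genes = set(pd.read_csv("impact410_genes.txt", header=None)[0])

# Restrict to single-nucleotide variants falling in panel genes.
panel_maf = maf[
    (maf["Variant_Type"] == "SNP") & (maf["Hugo_Symbol"].isin(panel_genes))
]
print(f"Kept {len(panel_maf)} of {len(maf)} mutations")
```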
The Jupyter notebook Scripts/Dataset Preprocessing/Consuming and Preparing MSK-IMPACT Data.ipynb shows how to use the cBioPortal web API to download MSK-IMPACT data from the 2017 paper "Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients". The downloaded data is then converted into a mutational spectra matrix and stored at Data/MSK-Impact/Panel/.
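A minimal sketch of such a download, assuming the public cBioPortal REST API at https://www.cbioportal.org/api; the molecular-profile and sample-list IDs below follow cBioPortal's usual naming conventions for this study but should be verified against the live portal.

```python
# Sketch: pull MSK-IMPACT (Zehir et al. 2017) mutations from the cBioPortal
# REST API. The profile and sample-list IDs are assumptions based on
# cBioPortal's naming conventions.
import requests

API = "https://www.cbioportal.org/api"
profile = "msk_impact_2017_mutations"  # assumed molecular profile ID
sample_list = "msk_impact_2017_all"    # assumed sample list ID

resp = requests.get(
    f"{API}/molecular-profiles/{profile}/mutations",
    params={"sampleListId": sample_list, "projection": "DETAILED"},
    timeout=300,
)
resp.raise_for_status()
mutations = resp.json()
print(f"Downloaded {len(mutations)} mutation records")
```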
Bootstrap NNLS with Sigminer
- Follow the instructions in the Sigminer repository to set up an R environment capable of running Sigminer.
- Place the R script Scripts/RScripts/NNLS_Bootstrapping.R into the repository where the Sigminer environment is set up.
- Follow the instructions in Scripts/RScripts/NNLS_Bootstrapping.R to obtain the labels from the WGS dataset (a Python sketch of the underlying bootstrap NNLS procedure follows this list).
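The R script is the authoritative implementation; purely to clarify the technique, here is a Python sketch of the bootstrapped NNLS idea: resample a sample's mutations with replacement, refit signature exposures by non-negative least squares, and use the resulting exposure distribution to assign labels. The signature matrix W (96 contexts by K signatures, e.g. COSMIC) and all inputs below are toy assumptions.

```python
# Python sketch of bootstrapped NNLS signature fitting. Not the R script's
# exact logic; inputs here are randomly generated stand-ins.
import numpy as np
from scipy.optimize import nnls

def bootstrap_exposures(counts, W, n_boot=1000, rng=None):
    """Return an (n_boot, K) array of NNLS exposures from resampled spectra."""
    rng = np.random.default_rng(rng)
    n = int(counts.sum())
    probs = counts / counts.sum()
    exposures = np.empty((n_boot, W.shape[1]))
    for b in range(n_boot):
        # Resample the sample's mutations with replacement, then refit.
        resampled = rng.multinomial(n, probs).astype(float)
        exposures[b], _ = nnls(W, resampled)
    return exposures

rng = np.random.default_rng(0)
W = rng.random((96, 5))                                # toy signature matrix
counts = rng.integers(0, 50, size=96).astype(float)    # toy 96-bin spectrum
exp = bootstrap_exposures(counts, W, n_boot=200, rng=1)
print("Mean exposure per signature:", exp.mean(axis=0).round(2))
```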
Run through the scripts in Scripts/Hyperparameter Tuning/ to retune each machine-learning model's hyperparameters. To set up these scripts in a Google Colab environment, follow these steps (an illustrative search sketch appears after the list):
- Open and make a copy of the corresponding notebook for the model that you want to tune: Neural Network, Nearest-neighbors, XGBoost.
- (Optional) Speed up the training process by enabling GPU usage.
- Run the cells in sequential order.
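For readers unfamiliar with the general procedure, the sketch below shows the kind of hyperparameter search these notebooks perform, using scikit-learn's GridSearchCV on a nearest-neighbors classifier; the parameter grid and synthetic data are placeholders, not the notebooks' actual settings.

```python
# Illustrative hyperparameter search with GridSearchCV. The grid and the
# synthetic dataset are placeholders for the repository's real setup.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=96, random_state=0)

grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 11, 21], "weights": ["uniform", "distance"]},
    cv=5,
    scoring="roc_auc",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```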
The Jupyter notebook Scripts/Main.ipynb runs 10-fold cross-validation on each of the four machine-learning models tested: XGBoost, Neural Network, Nearest-neighbors, and Logistic Regression. This is done to pick out the optimal hyperparameters to use in the deployed models. You can reproduce the model evaluation in Google Colab by following these steps (a minimal evaluation sketch appears after the list):
- Open and make a copy of this Google Colab Notebook
- (Optional) Speed up the training process by enabling GPU usage.
- Run the cells in sequential order.
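The sketch below illustrates 10-fold cross-validation across the four model families; the hyperparameters and synthetic data are placeholders (the real notebook uses the tuned values and the mutational spectra matrices), and it assumes the xgboost package is installed.

```python
# Sketch of 10-fold cross-validation over the four model families. Models
# use default or placeholder hyperparameters, not the repository's tuned ones.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=300, n_features=96, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

models = {
    "XGBoost": XGBClassifier(eval_metric="logloss"),
    "Neural Network": MLPClassifier(max_iter=1000),
    "Nearest-neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```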
Each of the four machine-learning models tested (XGBoost, Neural Network, Nearest-neighbors, and Logistic Regression) is deployed and ready for use at the GitHub Pages website: https://aaronge-2020.github.io/Sig3-Detection/
There are three steps in this workflow pipeline:
- Upload a MAF file or a mutational spectrum matrix. To see the required file format for the upload, please examine the examples provided in the Data/Example Files/ folder.
- Compare the uploaded data against the training dataset. If the tested sample is an outlier whose distribution falls outside the training dataset's, then one should have less confidence in the model's predictions (one simple outlier check is sketched after this list).
- Select a machine-learning model and obtain a prediction.
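As one simple, illustrative way to gauge whether an uploaded spectrum falls outside the training distribution, the sketch below compares its best cosine similarity against all training spectra to a threshold; both the similarity measure and the threshold are assumptions, not the website's exact method.

```python
# Illustrative outlier check: flag a sample if no training spectrum is
# sufficiently similar. Cosine similarity and the 0.8 threshold are
# assumptions for demonstration only.
import numpy as np

def is_outlier(sample, training, threshold=0.8):
    """Flag `sample` if its best cosine similarity to training is low."""
    sample = sample / np.linalg.norm(sample)
    training = training / np.linalg.norm(training, axis=1, keepdims=True)
    return float((training @ sample).max()) < threshold

rng = np.random.default_rng(0)
training = rng.random((100, 96))  # toy stand-in for training spectra
sample = rng.random(96)           # toy stand-in for an uploaded spectrum
print("Outlier:", is_outlier(sample, training))
```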