Platinum is a tool to run the HMF cancer analysis pipeline for any number of tumor samples in one easy command.
The HMF cancer analysis pipeline is a comprehensive pipeline specifically designed for analysing WGS tumor data with the following properties and constraints:
- FASTQ is assumed to be available for the sample(s) and is the starting point for the pipeline.
- The pipeline assumes that the tumor DNA sample is analysed against a single reference DNA sample.
- Note that some individual algorithms support zero or multiple reference samples (for example, a "tumor-only" mode), but these modes are not yet supported throughout the entire Platinum process.
- Both GRCh37 and GRCh38 reference genomes are supported throughout the pipeline and lead to comparable analysis quality.
The HMF pipeline primarily uses algorithms developed by HMF, which are open source and available via hmftools. In addition to the HMF algorithms, Platinum depends on a number of resources (databases) and some external tools and algorithms.
- Platinum aims to be based exclusively on open source code and databases that are free from any restrictions. However, the databases that are distributed as part of Platinum are not owned by HMF and their licenses could change. The table below lists all institutions from which we distribute data, along with a link to their disclaimer, about page and/or publication.
- HMF offers Platinum and the HMF cancer analysis pipeline on an ‘as-is’ basis.
- HMF assumes no liability whatsoever for general, special, incidental, consequential or any other type of damages arising out of the use or inability to use Platinum or a failure of Platinum to operate with any other programs. In as far as not yet sufficiently stipulated above, HMF expressly assumes no liability or responsibility whatsoever for the quality of the data that is being used in running Platinum or for the final configuration of the GCP project used by Platinum including anything related to security and encryption of data and any damages that may arise as a result thereof.
- HMF furthermore is not responsible for and assumes no liability whatsoever for damages resulting from the interpretation of the output of the HMF cancer analysis pipeline and the medical and/or scientific conclusions that are drawn on the basis of such interpretation.
Resource | Purpose | References |
---|---|---|
GRC | GRC makes available the human reference genome (GRCh37 or GRCh38), which is used in nearly every step of the pipeline. | N/A |
Ensembl | The Ensembl database is used extensively throughout our algorithms as the source for all gene and transcript annotations. | disclaimer |
CIViC | CIViC is a knowledgebase containing (pathogenic) mutations and linking them to treatments. In terms of the pipeline, CIViC's list of pathogenic variants contributes to the hotspot list, which is used as our most sensitive calling tier and for downstream interpretation of drivers. | about, reference |
DoCM | DoCM is a knowledgebase containing (pathogenic) mutations. In terms of the pipeline, DoCM's list of pathogenic variants contributes to the hotspot list, which is used as our most sensitive calling tier and for downstream interpretation of drivers. | about, reference |
CGI | CGI is a knowledgebase containing (pathogenic) mutations and linking them to treatments. In terms of the pipeline, CGI's list of pathogenic variants contributes to the hotspot list, which is used as our most sensitive calling tier and for downstream interpretation of drivers. | about, reference |
NCBI viral reference database | The NCBI viral reference database is used by the pipeline when it looks for viral presence and its integration into the analysed tumor DNA. | about, reference |
ClinVar | ClinVar contains evidence on the pathogenicity of variants and is used for determining the pathogenicity of germline variants (in case a reference sample is supplied). | intro |
RepeatMasker | The repeat masker database is largely derived from the GRC reference genome. It is used to interpret single-end breaks in the DNA by mapping the other side of a single-end break against a number of repeat masks. | N/A |
GIAB | The GIAB consortium's NA12878 high confidence regions are used by the pipeline. Thresholds are lowered when calling variants in a high confidence region versus a low confidence region. | N/A |
PharmGKB | PharmGKB maintains a database of pharmacogenetic evidence and is used to annotate potentially relevant haplotypes found by the pipeline. | license, reference |
IPD-IMGT/HLA | The IPD-IMGT/HLA database is used as the reference source for HLA types previously found in humans. | about, reference |
SnpEff | SnpEff maintains a database largely derived from Ensembl and GRC, which the pipeline uses to annotate variants in terms of coding impact. | license |
ENCODE | The ENCODE database is used for blacklisting regions for structural variant calling. HeLa replication timing is also used to annotate the replication timing of structural variant breakends. | about, reference |
Different inputs can lead to variation in cost and runtime, but to give some indication of what to expect, we have benchmarked Platinum against COLO829:
- Reference DNA 30x depth and 4 lanes
- Tumor DNA 100x depth and 4 lanes
- The following minimum quotas (see Scaling Up for more info on Quotas)
Quota | Value |
---|---|
CPU | 768 |
CPU_ALL_REGIONS | 768 |
PREEMPTIBLE_LOCAL_SSD_TOTAL_GB | 9TB |
PERSISTENT_DISK_SSD_GB | 1TB |
With these settings we get a cost of approximately €20 and runtime of 15 hours.
When evaluating your own performance, a few things to keep in mind:
- We map every FASTQ lane to the reference genome in parallel, so consolidating into fewer lanes (for instance, after converting back from BAM) will increase runtime.
- We use pre-emptible VMs to save cost. These can be pre-empted (stopped and reclaimed) by Google, adding to the total runtime. The pipeline handles pre-emptions, and it's well worth it for the cost impact.
- New projects and GCP accounts are constrained by small quotas. You can request to raise them through the console.
Platinum runs on the Google Cloud Platform. To start you'll need:
- A GCP account. You can get started with the credit they offer and a credit card (for verification). See Google's docs.
- A GCP project and a user within that project with the Owner role.
- A region where you plan to store your data and run your workload (hint: pick the region closest to where your data currently resides)
You'll also need a machine to check out this repository and run Platinum. You should have the following installed, and some basic familiarity with how to use them:
- git
- Java 11
- gcloud SDK (configured to access your new project) and the connector module, which you can install with `gcloud components install kubectl`
- kubectl
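As a quick sanity check, you can verify the tools are available and point gcloud at your new project. This is just a minimal sketch using standard commands; `your-project-id` is a placeholder:

```shell
# Confirm the local tooling is installed
git --version
java -version          # should report a Java 11 runtime
gcloud --version

# Point gcloud at the project you created above and install kubectl
gcloud config set project your-project-id
gcloud components install kubectl
kubectl version --client
```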
The basic user-facing component of Platinum is a shell script (`platinum`) that attempts to simplify interaction with the cluster and the Platinum software itself. In the simplest cases the following should help you get your job running.

Run the following from the root of this repo, where `examples/quickstart/colomini.yaml` is your input file (make sure to adjust the `export` lines):
```shell
export PROJECT=$(gcloud projects list | grep 'your project name from above' | awk '{print $1}')
export REGION='your region'
# Experiment name is just a unique id we'll use to name resources. Call it anything for now.
export EXPERIMENT_NAME='experiment_name'

./platinum configure -p $PROJECT -r $REGION
./platinum login
./platinum run -n $EXPERIMENT_NAME -p $PROJECT -r $REGION -i examples/quickstart/colomini.yaml
./platinum status
# Keep checking this until you see the pod is complete. Then clean up.
./platinum cleanup -n $EXPERIMENT_NAME -p $PROJECT -r $REGION
# Results are waiting in Google Cloud Storage
gsutil ls gs://platinum-output-$EXPERIMENT_NAME
```
See below for advanced usage.
There are a couple of things requiring one-time configuration in your project:
- Enabling private access such that your VMs will not be exposed to the public internet
- Enabling the compute and kubernetes apis
Check out this repository on your local machine and run the following from the repo root:

```shell
./platinum configure -p your_project -r your_region
```
You only need to run this once for each project and region where you want to run Platinum.
You must "login" to GCP locally to configure the credentials platinum needs to work with GCP.
```shell
./platinum login
```

This command performs two logins: once as a user and once as the application default credentials. This ensures all subsequent operations use the correct credentials.
You should run this command at least once, and whenever you use different credentials to interact with GCP.
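If you prefer to manage credentials yourself, the equivalent can most likely be achieved with the underlying gcloud commands directly (a sketch, assuming `platinum login` wraps the standard gcloud authentication flow):

```shell
# Interactive user login for gcloud
gcloud auth login
# Application default credentials, used by the client libraries Platinum builds on
gcloud auth application-default login
```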
The HMF pipeline takes paired-end FASTQ as input. This input should be uploaded to a bucket in Google Cloud Storage (GCS) before running platinum. Once the input FASTQ is in GCS you define a YAML or JSON configuration in the following format.
In the example below we have one sample. A sample in this context is close to synonymous with a patient or donor, and is a grouping of tumor sequencing data with blood/normal sequencing data.
Each sample can have multiple tumors to a single normal. Note: when Platinum runs it will actually run a pipeline for each tumor/normal pair, so in this example 2 pipelines will run.
Each FASTQ entry consists of two paths, one for each end of the pair.
```yaml
samples:
  - name: SAMPLE_NAME
    tumors:
      - name: TUMOR1
        fastq:
          - read1: "gs://hmf-public/fastq/COLO829Mini/COLO829v003T_AHHKYHDSXX_S12_L001_R1_001.fastq.gz"
            read2: "gs://hmf-public/fastq/COLO829Mini/COLO829v003T_AHHKYHDSXX_S12_L001_R2_001.fastq.gz"
          - read1: "gs://hmf-public/fastq/COLO829Mini/COLO829v003T_AHHKYHDSXX_S12_L002_R1_001.fastq.gz"
            read2: "gs://hmf-public/fastq/COLO829Mini/COLO829v003T_AHHKYHDSXX_S12_L002_R2_001.fastq.gz"
          - read1: "gs://hmf-public/fastq/COLO829Mini/COLO829v003T_AHHKYHDSXX_S12_L003_R1_001.fastq.gz"
            read2: "gs://hmf-public/fastq/COLO829Mini/COLO829v003T_AHHKYHDSXX_S12_L003_R2_001.fastq.gz"
          - read1: "gs://hmf-public/fastq/COLO829Mini/COLO829v003T_AHHKYHDSXX_S12_L004_R1_001.fastq.gz"
            read2: "gs://hmf-public/fastq/COLO829Mini/COLO829v003T_AHHKYHDSXX_S12_L004_R2_001.fastq.gz"
      - name: TUMOR2 #Optional
        fastq:
          - read1: "gs://hmf-public/fastq/COLO829Mini/COLO829v003T_AHHKYHDSXX_S12_L001_R1_001.fastq.gz"
            read2: "gs://hmf-public/fastq/COLO829Mini/COLO829v003T_AHHKYHDSXX_S12_L001_R2_001.fastq.gz"
          - read1: "gs://hmf-public/fastq/COLO829Mini/COLO829v003T_AHHKYHDSXX_S12_L002_R1_001.fastq.gz"
            read2: "gs://hmf-public/fastq/COLO829Mini/COLO829v003T_AHHKYHDSXX_S12_L002_R2_001.fastq.gz"
          - read1: "gs://hmf-public/fastq/COLO829Mini/COLO829v003T_AHHKYHDSXX_S12_L003_R1_001.fastq.gz"
            read2: "gs://hmf-public/fastq/COLO829Mini/COLO829v003T_AHHKYHDSXX_S12_L003_R2_001.fastq.gz"
          - read1: "gs://hmf-public/fastq/COLO829Mini/COLO829v003T_AHHKYHDSXX_S12_L004_R1_001.fastq.gz"
            read2: "gs://hmf-public/fastq/COLO829Mini/COLO829v003T_AHHKYHDSXX_S12_L004_R2_001.fastq.gz"
    normal:
      name: NORMAL
      fastq:
        - read1: "gs://hmf-public/fastq/COLO829Mini/COLO829v003R_AHHKYHDSXX_S13_L001_R1_001.fastq.gz"
          read2: "gs://hmf-public/fastq/COLO829Mini/COLO829v003R_AHHKYHDSXX_S13_L001_R2_001.fastq.gz"
        - read1: "gs://hmf-public/fastq/COLO829Mini/COLO829v003R_AHHKYHDSXX_S13_L002_R1_001.fastq.gz"
          read2: "gs://hmf-public/fastq/COLO829Mini/COLO829v003R_AHHKYHDSXX_S13_L002_R2_001.fastq.gz"
        - read1: "gs://hmf-public/fastq/COLO829Mini/COLO829v003R_AHHKYHDSXX_S13_L003_R1_001.fastq.gz"
          read2: "gs://hmf-public/fastq/COLO829Mini/COLO829v003R_AHHKYHDSXX_S13_L003_R2_001.fastq.gz"
        - read1: "gs://hmf-public/fastq/COLO829Mini/COLO829v003R_AHHKYHDSXX_S13_L004_R1_001.fastq.gz"
          read2: "gs://hmf-public/fastq/COLO829Mini/COLO829v003R_AHHKYHDSXX_S13_L004_R2_001.fastq.gz"
```
Platinum can also be run from BAM with the following format:
```yaml
samples:
  - name: SAMPLE_NAME
    tumors:
      - name: TUMOR1
        bam: "gs://some/path/to/tumor.bam"
      - name: TUMOR2 #Optional
        bam: "gs://some/path/to/other_tumor.bam"
    normal:
      name: NORMAL
      bam: "gs://some/path/to/normal.bam"
```
Alignment is skipped automatically when running from BAM.
Platinum can be run with either a 37 or 38 reference genome release. The default is 38, but to use 37 instead, include these lines at the top of your input file, above the samples object:
```yaml
argumentOverrides:
  ref_genome_version: "37"
```
Internally, Platinum uses the GRCh37 assembly `Homo_sapiens.GRCh37.GATK.illumina.fasta` and the GRCh38 no-alt assembly `GCA_000001405.15_GRCh38_no_alt_analysis_set.fna`. There is no support for use of other assemblies or versions.
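To make the placement concrete, here is a minimal sketch of an input file combining the override with a samples block (names and paths are placeholders taken from the BAM example above):

```yaml
argumentOverrides:
  ref_genome_version: "37"
samples:
  - name: SAMPLE_NAME
    tumors:
      - name: TUMOR1
        bam: "gs://some/path/to/tumor.bam"
    normal:
      name: NORMAL
      bam: "gs://some/path/to/normal.bam"
```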
Many use cases will be fine interacting with just the `platinum` script, but its limitations start to show in some scenarios:
- The `status` subcommand is quite naive and just queries for jobs using the active Kubernetes cluster configuration. If you are interacting with multiple clusters and switching between them, or if your cluster is a multi-tenant arrangement, you'll find the output is not really that useful. In that case you can use the `kubectl` command directly to isolate your jobs more effectively (see the kubectl commands below).
- When using `update`, in the background the script just calls out to `git` and attempts to pull the latest changes over the top of what you have locally. If you've made any modifications this will be immediately obvious. Also, there are multiple branches containing Platinum versions that are compatible with different underlying Pipeline5 releases. To run with different `pipeline5` versions you may have to switch to a different branch, which requires basic knowledge of `git`. This approach allows us to keep shipping updates without worrying about keeping compatibility with old versions forever, while also not marooning users without a working Platinum.
For the purposes of Platinum:
- `kubectl` is used to interface with Kubernetes from the command line
- Platinum submits "jobs" to the Kubernetes cluster
- Each job will spawn a pod to run the associated pipeline and, if there is a failure, successive pods to finish
Some useful commands:
```shell
kubectl get jobs
kubectl get pods
kubectl logs (pod name)
```
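For example, to isolate the pods belonging to a single run on a shared cluster, something like the following should work (a sketch, assuming pod names contain the sample name, as in the `cpct12345678-5qb2s` example further below):

```shell
# List only the pods whose names match a given sample
kubectl get pods | grep cpct12345678

# Follow the logs of the first matching pod
kubectl logs -f $(kubectl get pods | grep cpct12345678 | head -n 1 | awk '{print $1}')
```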
Some more advanced usages are detailed below.
Platinum uses ephemeral infrastructure to ease any maintenance burden, hide complexity and reduce resource contention. That said, you may want to use existing shared infrastructure for your platinum runs.
To set this up you can pass Platinum an existing service account name, a cluster name, and the name of a secret within that cluster containing the private key for the service account, like so:
```yaml
serviceAccount:
  name: "your-service-account@your-service-account.iam.gserviceaccount.com"
  existingSecret: "your-secret"
  cluster: "your-cluster"
```
Platinum offers some additional configuration options to suit more complex GCP project setups. These extra settings are mainly geared towards setups requiring additional levels of security, in particular around the network. These settings are configured by adding a `gcp` section to your input file:
```yaml
gcp:
  project: "hmf-crunch"
  region: "europe-west4"
  network: "kubernetes"
  subnet: "kubernetes"
  networkTags:
    - "tag1"
  zones:
    - "europe-west4-a"
  privateCluster: true
  secondaryRangeNamePods: "pods"
  secondaryRangeNameServices: "services"
  masterIpv4CidrBlock: "172.17.0.32/28"
```
Parameter | Description |
---|---|
project | Same as the `-p` CLI argument. If you specify it here you don't need to put it on the command line. |
region | Same as the `-r` CLI argument. If you specify it here you don't need to put it on the command line. |
outputBucket | Override the output bucket location with an existing bucket. |
network | An identifier of the VPC network to be used for all compute resources. If the network is in a different project from the run, use the "projects/network-project/global/networks/network-name" format. |
subnet | An identifier of the VPC subnet to be used for all compute resources. If the subnet is in a different project or region from the run, use the "projects/subnet-project/regions/subnet-region/subnetworks/subnet-name" format. |
networkTags | Network tags to apply to all compute resources. |
zones | A list of zones to use for Kubernetes nodes to avoid capacity issues. The pipeline may run outside these zones, but will automatically select a zone with capacity. |
privateCluster | Makes the Kubernetes cluster private, i.e. no nodes or master have a public IP. Note that if this option is used, you will not be able to run Platinum from a computer outside the VPC; you should create a VM within the VPC to run Platinum. |
secondaryRangeNamePods | A secondary IP range for pods in the cluster. This setting is only required if you use a shared VPC network. |
secondaryRangeNameServices | A secondary IP range for services in the cluster. This setting is only required if you use a shared VPC network. |
masterIpv4CidrBlock | Passed to the master when private cluster is enabled. Defaults to "172.16.0.32/28", so it is only required if you have multiple private clusters in the same VPC. |
preemptibleCluster | Use pre-emptible nodes in the cluster to save cost. Default is true. |
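For instance, here is a sketch of a `gcp` section for a shared-VPC setup using the fully qualified network and subnet formats described above (project, network and range names are placeholders):

```yaml
gcp:
  project: "my-run-project"
  region: "europe-west4"
  # Network and subnet live in a separate host project, hence the fully qualified formats
  network: "projects/my-host-project/global/networks/shared-vpc"
  subnet: "projects/my-host-project/regions/europe-west4/subnetworks/shared-subnet"
  secondaryRangeNamePods: "pods"
  secondaryRangeNameServices: "services"
```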
Use the following command to run platinum:
```shell
./platinum run -n EXPERIMENT_NAME -p PROJECT -r REGION -i examples/quickstart/input.json
```

This command will read your input file and create a Platinum run in the project and region you've specified. EXPERIMENT_NAME should be a unique and meaningful name (no spaces or special characters) which will be used to identify all the cloud resources used for your run.
Platinum is asynchronous; you can keep an eye on progress using the following command:

```shell
./platinum status
NAME                 READY   STATUS    RESTARTS   AGE
cpct12345678-5qb2s   1/1     Running   0          172m
```
If your cluster is a shared one there may be pods from other jobs listed; in that case use `kubectl get pods | grep ...`.
To check the logs of an individual pipeline use the `platinum logs` command.

```shell
Pauls-MacBook-Pro:platinum pwolfe$ ./platinum logs cpct12345678-5qb2s
2020-08-20 18:00:10 INFO - Version of pipeline5 is [5.14.1742]
...
```
Once the run is complete, all results will end up in a bucket in your project, named in the format `gs://platinum-output-EXPERIMENT_NAME`.
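To pull the results down locally, standard gsutil commands work; for example (a sketch, the destination directory is arbitrary):

```shell
# List the run's output, then copy it down in parallel
gsutil ls gs://platinum-output-EXPERIMENT_NAME
gsutil -m cp -r gs://platinum-output-EXPERIMENT_NAME ./platinum-results/
```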
Make sure you clean up when the run is complete, as you now have a small Kubernetes cluster:
```shell
./platinum cleanup -n EXPERIMENT_NAME -p PROJECT -r REGION
```
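If you want to double-check that nothing is left running afterwards, you can list the remaining clusters and instances in the project with standard gcloud commands (a sketch):

```shell
gcloud container clusters list
gcloud compute instances list
```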
Using GCP infrastructure, Platinum can run all your pipelines in parallel, giving you the same total runtime with 1000 samples as with a single sample. That said, to take advantage of this your GCP project must have been granted enough quota to support your workload. Here we review the quota limits frequently reached by Platinum and appropriate values to request from Google.
First, please review GCP's documentation on Raising Quotas and the request process.
An overview of the key quota limits is given below. All the peaks occur during the alignment process, which uses many VMs with large core counts. In our 4 ref + 4 tumor lane benchmark, the peak lasts approx. 45 minutes.
Quota | Peak | Description |
---|---|---|
CPU | 96 x # of lanes | Each lane is aligned individually on a 96 core VM. While we use preemptible VMs, CPU count in the selected region is still constrained by this quota. |
CPU_ALL_REGIONS | 96 x # of lanes | This quota is another limit on CPUs, but also includes any CPUs used in other regions |
PREEMPTIBLE_LOCAL_SSD_TOTAL_GB | 1.125 TB x # of lanes | Local SSDs can be attached to a VM in 375GB increments. Attaching 3 local SSDs to each VM ensures we have enough space for the input, output and temporary files involved in alignment and somatic calling. |
PERSISTENT_DISK_SSD_GB | 200GB x # of lanes | Used for the O/S of each VM, along with HMF resources and tools |
Getting large quota increases can be difficult if you have a new GCP account without a billing track record. Also, quotas are generally allocated for sustained use, and not the bursty requirements of running a large pipeline. You may need to contact Google in order to explain your requirements. If you are having trouble getting the quotas you need for a large experiment, please reach out to us and we can help put you in touch with the right people.
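To see where your project currently stands against these limits, the regional and project-wide quotas can be inspected with gcloud (a sketch; the grep patterns just pick out the metrics discussed above):

```shell
# Regional quotas (CPUs, preemptible local SSD, persistent disk SSD)
gcloud compute regions describe europe-west4 | grep -B1 -A1 -E "CPUS|SSD"

# Project-wide quotas such as CPUS_ALL_REGIONS
gcloud compute project-info describe | grep -B1 -A1 "CPUS_ALL_REGIONS"
```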
When running large workloads you may find that the Kubernetes cluster has a hard time keeping up, and you end up overwhelming it at startup. This would manifest itself in many "Pending" or "Evicted" jobs. Batching is available to alleviate this issue, and there are two forms:
- Time-based, which is configured with a size and a delay. It submits "size" number of jobs every "delay" minutes. It is still supported but has been deprecated.
- Constant-size, which only takes a size. It tries to keep the number of running jobs at the "size" number, but also has logic to rate-limit job submission to avoid overwhelming the cluster.
The constant-size scheduler is preferred. It works very well with modern autoscale-enabled clusters and has the pleasant advantage of allowing runtime throughput management by tuning autoscale parameters while the job is already running. For large workloads it is recommended to set a large batch size in Platinum and adjust the actual cluster size after Platinum is running to achieve the desired throughput.
```yaml
batch:
  size: 50
  delay: 10
```
The delay is in minutes; if it is provided the time-based scheduler will be used, otherwise the constant-size scheduler is configured. Time-based scheduling has been deprecated and will be removed in a later release.
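For the preferred constant-size scheduler, simply omit the delay:

```yaml
batch:
  size: 50
```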
Platinum can also re-use the output of a complete run to facilitate running a new version on old data. When re-run, Platinum will leave input data in place but replace existing data when new output is available. You may want to make a backup of your initial data before running again.
To configure this, add the following to your input file and be sure to use the same project and experiment name when running Platinum.
```yaml
argumentOverrides:
  starting_point: "calling_complete"
```
See the HMF cancer analysis pipeline for a current list of available starting points.