Curation Scripts

The scripts in this portion of the repository were used to ingest, reshape, and store variant annotations in a cloud-based, analysis-ready format. You can use them to bring in a fresh copy of an annotation resource or as a starting point for curating a new annotation resource.

Status of this sub-project

This code currently works with annotation resources such as dbSNP and ClinVar, along with variant allele frequencies from the NHLBI GO Exome Sequencing Project (ESP), 1000 Genomes, ExAC, and the Genome Aggregation Database (gnomAD). Similar techniques could be applied to other annotation resources.

All steps are run in the cloud, but each individual step is launched manually.

Overview

Curate Individual Annotation Sources

Many variant annotation sources are encoded as VCF files. Therefore, we can use Google Genomics to import each resource and then export it to BigQuery.

Follow the tutorial to run a dsub script to create individual tables holding dbSNP, ClinVar, ESP, etc.

Create an "All-Possible SNPs" Table

A table with annotations for all possible SNPs of a particular genome reference is useful for:

  • Examining SNP variation across different regions of the genome.
  • Quickly annotating the SNPs for a cohort using a simple JOIN.
  • Generating synthetic sequence variant datasets using the SNP allele frequencies from this table.

Follow the tutorial to create an all-possible-SNPs table for build 38 of the human genome reference.
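As a rough illustration of what "all possible SNPs" means, the sketch below enumerates every single-base substitution for a short reference sequence and then annotates a small cohort with an in-memory equivalent of a JOIN. It is not the tutorial's implementation. The field names (`reference_name`, `start`, `reference_bases`, `alternate_bases`), the 0-based coordinates, and the helper names are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple

BASES = "ACGT"


class Snp(NamedTuple):
    # Field names are illustrative assumptions, not the tutorial's schema.
    reference_name: str
    start: int  # 0-based position (assumed convention)
    reference_bases: str
    alternate_bases: str


def all_possible_snps(reference_name: str, sequence: str,
                      offset: int = 0) -> Iterator[Snp]:
    """Yield the three possible substitutions at each unambiguous base."""
    for i, ref in enumerate(sequence.upper()):
        if ref not in BASES:
            continue  # skip N and other ambiguity codes
        for alt in BASES:
            if alt != ref:
                yield Snp(reference_name, offset + i, ref, alt)


def annotate_cohort(cohort: Iterable[Snp],
                    annotations: Dict[Snp, dict]) -> List[Tuple[Snp, dict]]:
    """Inner-join cohort SNPs to annotations on the full SNP key."""
    return [(snp, annotations[snp]) for snp in cohort if snp in annotations]


if __name__ == "__main__":
    table = {snp: {"AF": None} for snp in all_possible_snps("chr1", "ACNG")}
    cohort = [Snp("chr1", 0, "A", "G"), Snp("chr1", 2, "N", "A")]
    print(len(table), annotate_cohort(cohort, table))
```

Each unambiguous reference base contributes exactly three rows, so the table size is about three times the number of called bases in the reference. In BigQuery, the dictionary lookup would correspond to a JOIN on the same four key columns.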

Add Column Descriptions to a BigQuery Table

The variants table generated by an export from Google Genomics does not include descriptions for its fields.

See add BigQuery descriptions for instructions on how to automatically populate the BigQuery schema description with the information from the VCF header.
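The core idea can be sketched as follows: read the `Description` text from the `##INFO` and `##FORMAT` lines of the VCF header and copy it onto the matching schema fields. This is a minimal sketch and not the script referenced above. It assumes the schema is a list of dictionaries with `name`, `type`, and `description` keys, as in a BigQuery JSON schema file, and that BigQuery field names match the VCF IDs exactly. The function names are hypothetical.

```python
import re
from typing import Dict, Iterable, List

# Matches lines such as:
#   ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
HEADER_RE = re.compile(
    r'^##(INFO|FORMAT)=<ID=([^,>]+),.*?Description="([^"]*)"')


def header_descriptions(header_lines: Iterable[str]) -> Dict[str, str]:
    """Map each INFO/FORMAT ID to the Description text in its header line."""
    descriptions = {}
    for line in header_lines:
        match = HEADER_RE.match(line.strip())
        if match:
            _kind, field_id, text = match.groups()
            # If an INFO and a FORMAT field share an ID, the later line wins.
            descriptions[field_id] = text
    return descriptions


def describe_schema(schema: List[dict],
                    descriptions: Dict[str, str]) -> List[dict]:
    """Return a copy of the schema with descriptions filled in where known."""
    return [
        {**field,
         "description": descriptions.get(field["name"],
                                         field.get("description", ""))}
        for field in schema
    ]


if __name__ == "__main__":
    header = [
        '##fileformat=VCFv4.2',
        '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">',
    ]
    schema = [{"name": "AF", "type": "FLOAT"}]
    print(describe_schema(schema, header_descriptions(header)))
```

Fields without a matching header line keep any description they already had. The updated schema would then still need to be applied to the table with whatever BigQuery tooling you use.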