DeepVariant training data

WGS training data v0.7

We used 16 BAM files for training: 10 HG001 PCR-free, 2 HG005 PCR-free, and 4 HG001 PCR+.

Of these 16 BAM files, 6 are from public sources:

| BAM file (`--reads`) | PCR-free? | FASTA file (`--ref`) | Truth VCF (`--truth_variants`) | BED file (`--confident_regions`) |
| --- | --- | --- | --- | --- |
| HG001-NA12878-pFDA.merged.sorted.bam(1) | Yes | GRCh38_Verily_v1.genome.fa | NISTv3.3.2/GRCh38 | NISTv3.3.2/GRCh38 |
| NA12878D_HiSeqX_R1.deduplicated.bam(2) | No | hs37d5.fa | NISTv3.3.2/GRCh37 | NISTv3.3.2/GRCh37 |
| NA12878J_HiSeqX_R1.deduplicated.bam(2) | No | hs37d5.fa | NISTv3.3.2/GRCh37 | NISTv3.3.2/GRCh37 |
| NA12878-Rep01_S1_L001_001_markdup.bam(2) | No | hs37d5.fa | NISTv3.3.2/GRCh37 | NISTv3.3.2/GRCh37 |
| N3C9-2plex1-L1-171212B-NA12878-1_S1_L001_001_markdup.bam(3) | Yes | hs37d5.fa | NISTv3.3.2/GRCh37 | NISTv3.3.2/GRCh37 |
| NexteraFlex-2plex1-L1-NA12878-1_S1_L001_001_markdup.bam(4) | No | hs37d5.fa | NISTv3.3.2/GRCh37 | NISTv3.3.2/GRCh37 |

(1): FASTQ files from Precision FDA Truth Challenge.

(2): BAM files provided by DNAnexus.

(3): FASTQ files from BaseSpace public data: NovaSeq S1 Xp: TruSeq Nano 350 (Replicates of NA12878)/Samples/N3C9_2plex1_L1_171212B_NA12878-1/Files/N3C9-2plex1-L1-171212B-NA12878-1_S1_L001_R1_001.fastq.gz and N3C9-2plex1-L1-171212B-NA12878-1_S1_L001_R2_001.fastq.gz

(4): FASTQ files from BaseSpace public data: NovaSeq S1 Xp: Nextera DNA Flex (Replicates of NA12878)/Samples/NexteraFlex_2plex1_L1_NA12878-1/Files/NexteraFlex-2plex1-L1-NA12878-1_S1_L001_R1_001.fastq.gz and NexteraFlex-2plex1-L1-NA12878-1_S1_L001_R2_001.fastq.gz
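The table headers name the flag each file is passed through (`--reads`, `--ref`, `--truth_variants`, `--confident_regions`). As a minimal sketch of how one table row could be turned into those flags, the Python below uses a hypothetical `TrainingInput` record and a hypothetical `flags_for` helper; neither is part of DeepVariant. The truth VCF and BED paths are placeholders, since the table gives only the truth set version (`NISTv3.3.2/GRCh38`), not file names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrainingInput:
    """One row of the table above (hypothetical record, for illustration only)."""
    reads: str              # BAM file
    ref: str                # FASTA file
    truth_variants: str     # truth VCF
    confident_regions: str  # BED file


def flags_for(row: TrainingInput) -> List[str]:
    """Render the four flags named in the table headers for one row."""
    return [
        f"--reads={row.reads}",
        f"--ref={row.ref}",
        f"--truth_variants={row.truth_variants}",
        f"--confident_regions={row.confident_regions}",
    ]


# Placeholder paths: only the BAM and FASTA names come from the table.
row = TrainingInput(
    reads="HG001-NA12878-pFDA.merged.sorted.bam",
    ref="GRCh38_Verily_v1.genome.fa",
    truth_variants="path/to/NISTv3.3.2/GRCh38/truth.vcf.gz",
    confident_regions="path/to/NISTv3.3.2/GRCh38/confident.bed",
)
print(" ".join(flags_for(row)))
```

Keeping each row as a single record makes it harder to pair a BAM with a truth set from the wrong reference build, for example a GRCh37 BAM with GRCh38 truth files.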

We generated our own BAM files by using BWA-MEM to map the reads to the reference and then sorting the output. We also marked duplicate reads.
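The mapping and sorting steps can be sketched as a shell pipeline. The Python below only builds the command string and does not run it. It assumes `samtools sort` for the sorting step, which this document does not name, and it uses a hypothetical helper `alignment_pipeline` with placeholder file names. Duplicate marking is left out because the tool used for it is not stated here.

```python
import shlex


def alignment_pipeline(ref_fasta: str, fastq_r1: str, fastq_r2: str,
                       out_bam: str, threads: int = 4) -> str:
    """Build a 'bwa mem | samtools sort' pipeline string (illustrative only)."""
    # Map paired-end reads to the reference with BWA-MEM; SAM goes to stdout.
    bwa = ["bwa", "mem", "-t", str(threads), ref_fasta, fastq_r1, fastq_r2]
    # Assumption: samtools performs the coordinate sort, reading SAM from stdin ("-").
    sort = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    # Duplicate marking is not shown; the document does not say which tool was used.
    return " | ".join(shlex.join(cmd) for cmd in (bwa, sort))


# Placeholder file names, not the actual training inputs.
print(alignment_pipeline("hs37d5.fa", "sample_R1.fastq.gz",
                         "sample_R2.fastq.gz", "sample.sorted.bam"))
```

Building the commands as argument lists and quoting them with `shlex.join` keeps paths with spaces or special characters safe if the string is later passed to a shell.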