The code base is adapted from the benchmarking code of PanGenie, published in Nature Genetics (doi: 10.1038/s41588-022-01043-w). It relies on the following tools and data:
- PanGenie (https://github.com/eblerjana/pangenie)
- GraphTyper (https://github.com/DecodeGenetics/graphtyper)
- Paragraph (https://github.com/Illumina/paragraph)
- GATK (https://gatk.broadinstitute.org/hc/en-us)
- Reference genome (GRCh38/hg38): https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
- Download the haplotype-resolved assemblies:
# Download HG002
wget ftp://ftp.dfci.harvard.edu/pub/hli/whdenovo/asm/NA24385-denovo-H1.fa.gz
wget ftp://ftp.dfci.harvard.edu/pub/hli/whdenovo/asm/NA24385-denovo-H2.fa.gz
# Download HG00731
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGSVC2/working/20200417_Marschall-Eichler_NBT_hap-assm/HG00731_hgsvc_pbsq2-ccs_1000-pereg.h1-un.racon-p2.fasta
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGSVC2/working/20200417_Marschall-Eichler_NBT_hap-assm/HG00731_hgsvc_pbsq2-ccs_1000-pereg.h2-un.racon-p2.fasta
- Call phased variants from the haplotype-resolved assemblies using this Snakemake workflow: https://bitbucket.org/jana_ebler/vcf-merging/src/master/
- Download the short-read sample for HG00731:
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR324/004/ERR3241754/ERR3241754_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR324/004/ERR3241754/ERR3241754_2.fastq.gz
- Subsample the short reads to 5x, 10x, 20x, and 30x coverage, and map all subsamples to the reference genome using bwa.
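The subsampling and mapping step above could be sketched as follows. This is a dry run that only prints the commands. The use of seqtk and samtools, the seed, the output file names, and the FULL_COV value are assumptions that do not come from the original workflow; the coverage of the full read set should be measured and filled in before use.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the commands instead of running them.
# Assumptions (not from the original): seqtk for subsampling, samtools for
# sorting, and the values of FULL_COV, SEED, REF and all output file names.

FULL_COV=40          # placeholder: measured coverage of the full read set
SEED=100             # same seed for both mates keeps the read pairs in sync
REF=hg38.fa          # assumed local name of the reference (run 'bwa index' on it first)
R1=ERR3241754_1.fastq.gz
R2=ERR3241754_2.fastq.gz

# Fraction of reads to keep to go from full coverage ($2) down to the target ($1).
subsample_fraction() {
    awk -v target="$1" -v full="$2" 'BEGIN { printf "%.4f\n", target / full }'
}

for cov in 5 10 20 30; do
    frac=$(subsample_fraction "$cov" "$FULL_COV")
    out="HG00731_${cov}x"
    echo "seqtk sample -s$SEED $R1 $frac | gzip > ${out}_1.fastq.gz"
    echo "seqtk sample -s$SEED $R2 $frac | gzip > ${out}_2.fastq.gz"
    echo "bwa mem -t 8 $REF ${out}_1.fastq.gz ${out}_2.fastq.gz | samtools sort -o ${out}.bam -"
done
```

Once the printed commands look right, the output can be piped to bash to execute them.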
- Build the CCDG from all the subsamples using the Snakemake workflow in build_test_CCDG/:
  - Edit config.json to specify the output folder and the k-mer size (we used 31).
  - Enter the paths of the subsampled fastq files in subsample_table.csv.
  - Run 'snakemake -j8 --use-conda'.
The benchmarking Snakemake workflow is in the benchmark subfolder. First edit config.json to set the paths of the input and output data, then run the workflow using 'snakemake -j8 --use-conda'.
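Because both workflows are started the same way, it can help to do a dry run before the real one, so that mistakes in config.json show up before any jobs start. The sketch below assumes a helper named run_workflow, which is hypothetical and not part of the repository; it uses Snakemake's -n (dry-run) flag.

```shell
# Hypothetical helper (not part of the repository): dry-run first, then execute.
# $1 = workflow folder (e.g. build_test_CCDG or benchmark), $2 = number of jobs (default 8).
run_workflow() {
    local dir="$1" cores="${2:-8}"
    (
        cd "$dir" || exit 1
        # -n only lists the jobs that would run; the real run starts if it succeeds.
        snakemake -n -j"$cores" --use-conda &&
        snakemake -j"$cores" --use-conda
    )
}

# Example: run_workflow benchmark 8
```

The subshell keeps the caller's working directory unchanged, and a missing folder makes the helper fail before Snakemake is invoked.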