Skip to content

Latest commit

 

History

History
196 lines (162 loc) · 6.71 KB

investigate-edits.md

File metadata and controls

196 lines (162 loc) · 6.71 KB

Invesigate-edits

Commands used to investigate crispr edits in lines of strawberry.

Data transfer

Samples sequenced were:

  RawDatDir=/data/seq_data/miseq/2018/RAW/180629_M04465_0081_000000000-D4LBH/Data/Intensities/BaseCalls/
  ls $RawDatDir

Sample names referred to the following:

Loci: PDS74

Raw sequencing data was symbolicly linked to the working directory:

  RawDatDir=/data/seq_data/miseq/2018/RAW/180629_M04465_0081_000000000-D4LBH/Data/Intensities/BaseCalls
  ProjectDir=/home/groups/harrisonlab/project_files/strawberry_crispr
  PlateDesign=$(ls raw_data/plate_design.txt)
  for ReadF in $(ls $RawDatDir/*_L001_R1_001.fastq.gz | grep -v 'Undetermined'); do
    ReadR=$(echo $ReadF | sed 's/_L001_R1_001.fastq.gz/_L001_R2_001.fastq.gz/g')
    Name=$(basename $ReadF _L001_R1_001.fastq.gz)
    Well=$(echo $Name | cut -f1 -d '_')
    Row=$(echo $Well | head -c 1)
    Column=$(echo $Well | tail -c +2)
    # Set Locus
    Locus=PDS74
    # Set cultivar
    Cultivar=$(cat $PlateDesign | grep -P "^${Row}\t${Column}" | cut -f3 | sed 's/ /_/g')
    # Set Line
    Line=$(cat $PlateDesign | grep -P "^${Row}\t${Column}" | cut -f4)
    # Set SampleID
    SampleID=$(cat $PlateDesign | grep -P "^${Row}\t${Column}" | cut -f5)
    printf "$Row\t$Column\t$Cultivar\t$Line\t$SampleID\n"
    OutDir=$ProjectDir/raw_dna/paired/$Locus/$Cultivar/$Line/$SampleID
    mkdir -p $OutDir/F
    cp -s $ReadF $OutDir/F/.
    mkdir -p $OutDir/R
    cp -s $ReadR $OutDir/R/.
  done

Metabarcoding analysis

These commands are based upon the workflow described at: https://github.com/eastmallingresearch/Metabarcoding_pipeline

Set pipeline variables

# set MBPL variable to pipeline folder
MBPL=/home/deakig/metabarcoding_pipeline

Analysis of individual loci

Demultiplexing

This script demultiplexs mixed (e.g. ITS and 16S) libraries based on the primer sequence. Any sequence which has mismatches is written to ambiguous.fq (f & r seperately). Primer sequences are detailed in the submission wrapper. Note Regex are used to describe degenerate bases in the primer.

Run below to demultiplex:

  for DataDir in $(ls -d raw_dna/paired/*/*/*/*); do
    Jobs=$(qstat | grep 'submit_dem' | grep 'qw'| wc -l)
    while [ $Jobs -gt 5 ]; do
    sleep 1m
    printf "."
    Jobs=$(qstat | grep 'submit_dem' | grep 'qw'| wc -l)
    done
    printf "\n"
    WorkDir=/data/scratch/armita/fusarium_ampseq
    R1=$(ls $DataDir/F/*.fastq.gz)
    R2=$(ls $DataDir/R/*.fastq.gz)
    echo $DataDir
    echo $(basename $R1)
    echo $(basename $R2)
    ProgDir=/home/armita/git_repos/emr_repos/scripts/strawberry_crispr/scripts
    OutDir=demulti_dna/$(echo $R1 | rev | cut -f3,4,5,6 -d '/' | rev)
    qsub $ProgDir/submit_demulti.sh $R1 $R2 ${OutDir}
  done

Summise reads demultiplexed:

printf "RunName\tLocus/Cultivar/Line/SampleID\tPDS74\tAmbiguous\n" > demulti_dna/demultiplex_summary.tsv
for RunDir in $(ls -d demulti_dna/*/*/*/*); do
  Locus=$(echo $RunDir | rev | cut -f4 -d '/' | rev)
  Cultivar=$(echo $RunDir | rev | cut -f3 -d '/' | rev)
  Line=$(echo $RunDir | rev | cut -f2 -d '/' | rev)
  SampleID=$(echo $RunDir | rev | cut -f1 -d '/' | rev)
  RunName=$(basename $RunDir/*_ambiguous.fq | sed 's/_ambiguous.fq//g')
  Pds74=$(ls $RunDir/*PDS74.fq)
  Pds74Lines=$(cat $Pds74 | wc -l)
  Ambiguous=$(ls $RunDir/*ambiguous.fq)
  AmbLines=$(cat $Ambiguous | wc -l)
  printf "$RunName\t$Locus\t$Pool\t$Dilution\t$Rep\t$Pds74Lines\t$AmbLines\n"
done >> demulti_dna/demultiplex_summary.tsv

Pre-processing

Script will join PE reads (with a maximum % difference in overlap) remove adapter contamination and filter on minimum size and quality threshold. Unfiltered joined reads are saved to unfiltered folder, filtered reads are saved to filtered folder.

for DataDir in $(ls -d demulti_dna/*/*/*/*); do
  Jobs=$(qstat | grep 'sub_pro' | grep 'qw'| wc -l)
  while [ $Jobs -gt 5 ]; do
  sleep 20s
  printf "."
  Jobs=$(qstat | grep 'sub_pro' | grep 'qw'| wc -l)
  done
  printf "\n"
  Locus=$(echo $DataDir | cut -f2 -d '/')
  Prefix=$(echo $DataDir | cut -f2,3,4,5 -d '/' | sed 's&/&_&g')
  WorkDir=/data/scratch/armita/fusarium_ampseq
  R1=$(ls $DataDir/*_${Locus}.fq | head -n1 | tail -n1)
  R2=$(ls $DataDir/*_${Locus}.fq | head -n2 | tail -n1)
  echo $DataDir
  echo $(basename $R1)
  echo $(basename $R2)
  OutDir="processed_dna/"$(echo $R1 | rev | cut -f2,3,4,5 -d '/' | rev)
  ProgDir=/home/armita/git_repos/emr_repos/scripts/strawberry_crispr/scripts
  qsub $ProgDir/sub_process_reads.sh $R1 $R2 $Locus $Prefix $OutDir
done

OTU assignment

This is mostly a UPARSE pipeline, but usearch (free version) runs out of memory for dereplication and subsequent steps. I've written my own scripts to do the dereplication and sorting

  • All read files associated with the each locus in the project are concatenated
  • Clustering run on all the data for this locus within the project
  • Quantification can then be performed for each treatment against the total set
  # Concatenate files
  for Locus in "PDS74"; do
    OutDir=clustering/$Locus
    mkdir -p $OutDir
    cat processed_dna/$Locus/*/*/*/filtered/*.filtered.fa > $OutDir/${Locus}_concatenated.temp.fa
    ls -lh $OutDir/${Locus}_concatenated.temp.fa
    # Submit clustering
    ProgDir=/home/armita/git_repos/emr_repos/scripts/fusarium_ampseq/scripts
    qsub $ProgDir/submit_clustering.sh $OutDir/${Locus}_concatenated.temp.fa $OutDir $Locus
  done

Quantify variants

Create OTU tables

for Locus in "PDS74"; do
  for Cultivar in Hawaii_4 Calypso; do
  # for Pool in Fusarium_spp; do
    OutDir=quantified/$Locus/$Cultivar
    mkdir -p $OutDir
    cat processed_dna/$Locus/$Cultivar/*/*/merged/*.fa | cut -f1 -d '.' > $OutDir/${Locus}_reads_appended.fa
    QueryReads=$(ls $OutDir/${Locus}_reads_appended.fa)
    for OtuType in zOTUs; do
      RefDb=$(ls clustering/PDS74/PDS74_zOTUs.fa)
      Prefix=$Locus
      ProgDir=/home/armita/git_repos/emr_repos/scripts/fusarium_ampseq/scripts
      qsub $ProgDir/submit_quantification.sh $QueryReads $RefDb $OtuType $Prefix $OutDir
    done
  done
done