From 0289bc1944cf7ff3f8566ccbe173365de5306b0f Mon Sep 17 00:00:00 2001 From: Geert van Geest Date: Tue, 10 Oct 2023 09:48:27 +0200 Subject: [PATCH] Deployed 3b141bd with MkDocs version: 1.4.1 --- 404.html | 14 +++ course_material/day1/dockerfiles/index.html | 14 +++ .../day1/introduction_containers/index.html | 14 +++ .../day1/managing_docker/index.html | 14 +++ course_material/day1/singularity/index.html | 14 +++ course_material/day2/1_guidelines/index.html | 14 +++ .../day2/2_introduction_snakemake/index.html | 68 +++++++------ .../day2/3_generalising_snakemake/index.html | 84 ++++++++-------- .../day2/4_decorating_workflow/index.html | 91 ++++++++++++------ course_schedule/index.html | 16 ++- index.html | 14 +++ precourse/index.html | 14 +++ search/search_index.json | 2 +- sitemap.xml | 22 ++--- sitemap.xml.gz | Bin 204 -> 203 bytes 15 files changed, 283 insertions(+), 112 deletions(-) diff --git a/404.html b/404.html index bef9d18..3b9c3ab 100644 --- a/404.html +++ b/404.html @@ -427,6 +427,20 @@ + + + + + +
  • + + Running containers with singularity + +
  • + + + + diff --git a/course_material/day1/dockerfiles/index.html b/course_material/day1/dockerfiles/index.html index 2d46fd7..7f8325f 100644 --- a/course_material/day1/dockerfiles/index.html +++ b/course_material/day1/dockerfiles/index.html @@ -533,6 +533,20 @@ + + + + + +
  • + + Running containers with singularity + +
  • + + + + diff --git a/course_material/day1/introduction_containers/index.html b/course_material/day1/introduction_containers/index.html index 4329f9d..66fd328 100644 --- a/course_material/day1/introduction_containers/index.html +++ b/course_material/day1/introduction_containers/index.html @@ -485,6 +485,20 @@ + + + + + +
  • + + Running containers with singularity + +
  • + + + + diff --git a/course_material/day1/managing_docker/index.html b/course_material/day1/managing_docker/index.html index ff75cf6..fd3b0b1 100644 --- a/course_material/day1/managing_docker/index.html +++ b/course_material/day1/managing_docker/index.html @@ -540,6 +540,20 @@ + + + + + +
  • + + Running containers with singularity + +
  • + + + + diff --git a/course_material/day1/singularity/index.html b/course_material/day1/singularity/index.html index 70514df..f829433 100644 --- a/course_material/day1/singularity/index.html +++ b/course_material/day1/singularity/index.html @@ -533,6 +533,20 @@ + + + + + +
  • + + Running containers with singularity + +
  • + + + + diff --git a/course_material/day2/1_guidelines/index.html b/course_material/day2/1_guidelines/index.html index 349cee2..9518802 100644 --- a/course_material/day2/1_guidelines/index.html +++ b/course_material/day2/1_guidelines/index.html @@ -485,6 +485,20 @@ + + + + + +
  • + + Running containers with singularity + +
  • + + + + diff --git a/course_material/day2/2_introduction_snakemake/index.html b/course_material/day2/2_introduction_snakemake/index.html index 1c8873d..0019ce3 100644 --- a/course_material/day2/2_introduction_snakemake/index.html +++ b/course_material/day2/2_introduction_snakemake/index.html @@ -553,6 +553,20 @@ + + + + + +
  • + + Running containers with singularity + +
  • + + + + @@ -742,27 +756,20 @@

    Executing a workflow with a
  • Check the output content: cat results/first_step.txt
  • -

    Note that during the execution of the workflow, Snakemake automatically created the missing folder (results/) in the output path. If several folders are missing (for example, here, test1/test2/test3/first_step.txt), Snakemake will create all of them.

    +

    Note that during the execution of the workflow, Snakemake automatically created the missing folder (results/) in the output path. If several folders are missing (for example, test1/test2/test3/first_step.txt), Snakemake will create all of them.
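As a minimal illustration (the rule name and command here are placeholders, not part of the course material), a rule like the sketch below would make Snakemake create test1/test2/test3/ automatically before the job runs:

```
rule nested_output:
    output:
        'test1/test2/test3/first_step.txt'
    shell:
        'echo "Hello world" > {output}'   # the missing folders are created by Snakemake
```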

    Exercise: Re-run the exact same command. What happens?

    Answer -
    - -
    Nothing! We get a message saying that Snakemake did not run anything:
    -
    -```
    -Building DAG of jobs...
    +

    Nothing! We get a message saying that Snakemake did not run anything:

    +
    Building DAG of jobs...
     Nothing to be done (all requested files are present and up to date).
    -```
    -
    -By default, Snakemake only runs a job if:
    -* A target file explicitly requested in the `snakemake` command is missing
    -* An intermediate file is missing and is required produce a target file
    -* It notices input files newer than output files, based on file modification dates. In this case, Snakemake will generate again the existing outputs.
    -
    -We can change this behaviour and force the re-run of a specific target by using the `-f` option: `snakemake --cores 1 -f results/first_step.txt` or force recreate ALL the outputs of the workflow using the `-F` option: `snakemake --cores 1 -F`. In practice, we can also alter Snakemake (re-)run policy, but we will not cover this topic in the course (see [--rerun-triggers option](https://snakemake.readthedocs.io/en/stable/executing/cli.html) in Snakemake's CLI help and [this git issue](https://github.com/snakemake/snakemake/issues/1694) for more information).
     
    - +

By default, Snakemake only runs a job if: +* A target file explicitly requested in the snakemake command is missing +* An intermediate file is missing and is required to produce a target file +* It notices input files that are newer than output files, based on file modification dates. In this case, Snakemake will regenerate the existing outputs.

    +

We can change this behaviour and force the re-run of a specific target by using the -f option: snakemake --cores 1 -f results/first_step.txt, or force the re-creation of ALL the outputs of the workflow using the -F option: snakemake --cores 1 -F. In practice, we can also alter Snakemake's (re-)run policy, but we will not cover this topic in the course (see the –rerun-triggers option in Snakemake's CLI help and this git issue for more information).

    +

In the previous example, the values of the two rule directives are strings. For the shell directive (we will see other types of directive values later in the course), long strings can be written across multiple lines for clarity, simply by using a set of quotes for each line:
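A sketch of this multi-line quoting style (the message is a placeholder; note the trailing space at the end of each fragment, since adjacent Python string literals are concatenated as-is):

```
rule first_step:
    output:
        'results/first_step.txt'
    shell:
        'echo "This long message is written '
        'on several quoted lines for readability" '
        '> {output}'
```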

    rule first_step:
         output:
    @@ -798,27 +805,24 @@ 

    Creating a workflow with several

    Exercise: Delete the results/ folder, copy the two previous rules (first_step and second_step) in the same Snakefile (place the first_step rule first) and try to run the workflow without specifying an output. What happens?

    Answer +
      +
    • Delete the results folder: using the graphic interface or rm -rf results/
    • +
    • Execute the workflow without output: snakemake --cores 1
    • +
    +

Only the first output, results/first_step.txt, is created. During its execution, Snakemake tries to generate a specific output called the target and resolves all dependencies based on this target. A target can be any output that can be generated by any rule in the workflow. When you do not specify a target, the default one is the output of the first rule in the Snakefile, here results/first_step.txt of rule first_step.

    - -
    * Delete the `results` folder: using the graphic interface or `rm -rf results/`
    -* Execute the workflow without output: `snakemake --cores 1`
    -
    -Only the first output, `results/first_step.txt`, is created. During its execution, Snakemake tries to generate a specific output called **target** and resolve all dependencies based on this target. A target can be any output that can be generated by any rule in the workflow. When you do not specify a target, the default one is the output of the first rule in the Snakefile, here `results/first_step.txt` of `rule first_step`.
    -
    -

Exercise: With this in mind, instead of one target, use a space-separated list of targets in your command to generate multiple targets. Use the -F flag to force the re-run of the whole workflow, or delete your results/ folder beforehand.

    Answer +
      +
    • Delete the results folder: using the graphic interface or rm -rf results/
    • +
    • Execute the workflow with multiple targets: snakemake --cores 1 results/first_step.txt results/second_step.txt
    • +
    +

    We should now see Snakemake execute the 2 rules and produce both targets/outputs.

    - -
    * Delete the `results` folder: using the graphic interface or `rm -rf results/`
    -* Execute the workflow with multiple targets: `snakemake --cores 1 results/first_step.txt results/second_step.txt`
    -
    -We should now see Snakemake execute the 2 rules and produce both targets/outputs.
    -
    -

    Chaining rules

    -

    Once again, writing all the outputs in the snakemake command does not look like a good solution: it is very time-consuming, error-prone (and annoying)! Imagine what happens when your workflow generate tens of outputs?! Fortunately, there is a way to simplify this, which relies on rules dependency. The core principle of Snakemake’s execution is to compute a Directed Acyclic Graph (DAG) that summarizes dependencies between all the inputs and outputs required to generate the final desired outputs. For each job, starting from the jobs generating the final outputs, Snakemake checks if the required inputs exist. If they do not, the software looks for a rule that generates these inputs. This process is repeated until all dependencies are resolved. This is why Snakemake is said to have a ‘bottom-up’ approach: it starts from the last outputs and go back to the first inputs.

    +

Once again, writing all the outputs in the snakemake command does not look like a good solution: it is very time-consuming, error-prone (and annoying)! Imagine what happens when your workflow generates tens of outputs?! Fortunately, there is a way to simplify this, which relies on rule dependencies.

    +

The core principle of Snakemake's execution is to compute a Directed Acyclic Graph (DAG) that summarises the dependencies between all the inputs and outputs required to generate the final desired outputs. For each job, starting from the jobs generating the final outputs, Snakemake checks whether the required inputs exist. If they do not, the software looks for a rule that generates these inputs. This process is repeated until all dependencies are resolved. This is why Snakemake is said to have a ‘bottom-up’ approach: it starts from the last outputs and goes back to the first inputs.
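As a rough sketch of such a chain (the shell commands are placeholders; the hint below shows the course's own version), asking for results/second_step.txt is enough to trigger first_step as well:

```
rule first_step:
    output:
        'results/first_step.txt'
    shell:
        'echo "first step done" > {output}'

rule second_step:
    input:
        'results/first_step.txt'
    output:
        'results/second_step.txt'
    shell:
        'cp {input} {output}'
```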

    Hint

    Your Snakefile should look like this:

    @@ -845,7 +849,7 @@

    Chaining rules

  • Execute the workflow: snakemake --cores 1 results/second_step.txt
  • Visualise the content of the results folder: ls -alh results/
  • -

    We should now see Snakemake executing the 2 rules and producing both outputs. To generate the output results/second_step.txt, Snakemake requires the input results/first_step.txt. Before the workflow is executed, this file does not exist, therefore, Snakemake looks for a rule that generates results/first_step.txt, in this case the rule first_step. The process is then repeated for first_step. In this case, the rule does not require any input, so all dependencies are resolved, and Snakemake can generate the DAG.

    +

You should now see Snakemake executing the two rules and producing both outputs. To generate the output results/second_step.txt, Snakemake requires the input results/first_step.txt. Before the workflow is executed, this file does not exist; therefore, Snakemake looks for a rule that generates results/first_step.txt, in this case the rule first_step. The process is then repeated for first_step. In this case, the rule does not require any input, so all dependencies are resolved, and Snakemake can generate the DAG.

    Important notes on rules dependency

    Rules must produce unique outputs

    diff --git a/course_material/day2/3_generalising_snakemake/index.html b/course_material/day2/3_generalising_snakemake/index.html index 1918e7f..92fc350 100644 --- a/course_material/day2/3_generalising_snakemake/index.html +++ b/course_material/day2/3_generalising_snakemake/index.html @@ -626,6 +626,20 @@ + + + + + +
  • + + Running containers with singularity + +
  • + + + + @@ -849,8 +863,7 @@

    Learning outcomes

  • Visualise a workflow DAG
  • Data origin

    - -

    The data we will use during the exercises was produced in this work. Briefly, the team studied the transcriptional response of a strain of baker’s yeast, Saccharomyces cerevisiae, facing environments with different amount of CO2. To this end, they performed 150 bp paired-end sequencing of mRNA-enriched samples. Detailed information on all the samples are available here, but just know that for the purpose of the course, we selected 6 samples (3 replicates per condition, low and high CO2) and down-sampled them to 1 million read-pairs each to reduce computation times.

    +

The data we will use during the exercises was produced in this work. Briefly, the team studied the transcriptional response of a strain of baker’s yeast, Saccharomyces cerevisiae, facing environments with different amounts of CO2. To this end, they performed 150 bp paired-end sequencing of mRNA-enriched samples. Detailed information on all the samples is available here, but just know that for the purpose of the course, we selected 6 samples (3 replicates per condition, low and high CO2) and down-sampled them to 1 million read-pairs each to reduce computation times.

    Exercises

    One of the aims of today’s course is to develop a basic, yet efficient, workflow to analyse RNAseq data. This workflow takes reads coming from RNA sequencing as inputs and outputs a list of genes that are differentially expressed between two conditions. The files containing the reads are in FASTQ format and the output will be a tab-separated file containing a list of genes with expression changes, results of statistical tests…

    In this series of exercises, we will create the ‘backbone’ of the workflow, i.e. the rules that are the most computationally expensive, namely:

    @@ -861,11 +874,11 @@

    Exercises

  • A rule to count the reads mapping on each gene
  • At the end of this series of exercises, the DAG of your workflow should look like this:

    -
    - +
    + ![backbone_rulegraph](../../../assets/images/backbone_rulegraph.png) +
    Rulegraph of the workflow at
    the end of the session
    -

    Designing and debugging a workflow

    If you have problems designing your Snakemake workflow or debugging it, you can find some help here.

    @@ -933,7 +946,7 @@

    Downloading

    For now, the main thing to remember is that the workflow code goes into a subfolder called workflow and the rest is mostly input/output files, except for the config subfolder, which will be explained later. All output files generated in the workflow should be stored under results/.
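Roughly, and only as an orientation sketch (the data/ entry and exact contents are assumptions, not taken from the course), the directory structure looks like this:

```
snakemake_rnaseq/
├── config/          # workflow configuration (explained later)
├── data/            # input FASTQ files (assumed location)
├── resources/       # genome sequence, indices and annotation
├── results/         # everything generated by the workflow
└── workflow/
    └── Snakefile    # the workflow code itself
```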

    Now, let’s download the data, uncompress it and build the first part of the directory structure.

    -
    wget https://apollo.vital-it.ch/trackvis/snakemake_rnaseq.tar.gz  # Download the data # AT. Check URL to download file
    +
    wget https://apollo.vital-it.ch/trackvis/snakemake_rnaseq.tar.gz  # Download the data
     tar -xvf snakemake_rnaseq.tar.gz  # Uncompress the archive
     rm snakemake_rnaseq.tar.gz  # Delete the archive
     cd snakemake_rnaseq/  # Start developing in a new folder
    @@ -961,7 +974,6 @@ 

    Creating a rule to trim reads

    In theory, trimming also removes sequencing adapters, but we will not do it here to keep computation time low and avoid having to parse other files to extract the adapter sequences.

    Exercise: Implement a rule to trim the reads contained in .fastq files using atropos.

    -

    Hint

      @@ -1002,15 +1014,18 @@

      Creating a rule to trim reads

      shell: ''' echo "Trimming reads in <{input.reads1}> and <{input.reads2}>" > {log} - atropos trim -q 20,20 --minimum-length 25 --trim-n --preserve-order --max-n 10 --no-cache-adapters -a "A{{20}}" -A "A{{20}}" -pe1 {input.reads1} -pe2 {input.reads2} -o {output.trim1} -p {output.trim2} &>> {log} + atropos trim -q 20,20 --minimum-length 25 --trim-n --preserve-order --max-n 10 \ + --no-cache-adapters -a "A{{20}}" -A "A{{20}}" \ + -pe1 {input.reads1} -pe2 {input.reads2} -o {output.trim1} -p {output.trim2} &>> {log} echo "Trimmed files saved in <{output.trim1}> and <{output.trim2}> respectively" >> {log} echo "Trimming report saved in <{log}>" >> {log} '''
    -

    Note 2 things that are happening here:

    +

    Note the three things that are happening here:

      -
1. We used the {sample} wildcards twice in the output paths. This is because we prefer to have all the files linked to a sample in the same directory
2. We added a memory limit for this job: 500 MB. Because we have limited resources on this server compared to a High Performance Computing (HPC) cluster, this will help Snakemake better allocate resources and parallelise jobs. You can determine the maximum amount of memory used by a rule thanks to the max_rss column in a benchmark result (results are shown in MB). More information here
3. We used a backslash \ to split a very long line into smaller lines. This is purely ‘cosmetic’, to avoid very long lines that are painful to read, copy…
    @@ -1024,8 +1039,8 @@

    Creating a rule to trim reads

    Exercise: If you had to run the workflow by specifying only one output, what command would you use?

    Answer -

    snakemake --cores 1 -r -p results/highCO2_sample1/highCO2_sample1_atropos_trimmed_1.fastq -If you run it now, don’t forget to have a look at the log and benchmark files!

    +

    snakemake --cores 1 -r -p results/highCO2_sample1/highCO2_sample1_atropos_trimmed_1.fastq

    +

    If you run it now, don’t forget to have a look at the log and benchmark files!
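If it helps to picture where those files come from, here is a hypothetical stripped-down rule using the log and benchmark directives (the paths and command are placeholders, not the course's actual trimming rule):

```
rule example_with_log_and_benchmark:
    output:
        'results/{sample}/{sample}_example.txt'
    log:
        'results/{sample}/logs/example.log'
    benchmark:
        'results/{sample}/benchmarks/example.tsv'   # contains runtime and max_rss, among other columns
    shell:
        'echo "Processing {wildcards.sample}" > {log} 2>&1 && touch {output}'
```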

    atropos options

      @@ -1048,7 +1063,6 @@

      Creating a r

      To align reads on a genome, HISAT2 relies on a graph-based index. We built the genome index for you, using the command hisat2-build -p 24 -f Scerevisiae.fasta resources/genome_indices/Scerevisiae_index.
      -p is the number of threads to use, -f is the genomic sequence in FASTA format and Scerevisiae_genome_index is the global name shared by all the index files.

    -

    Hint

      @@ -1066,7 +1080,6 @@

      Creating a r

    Please give it a try before looking at the answer!

    -
    Answer
    rule read_mapping:
    @@ -1088,7 +1101,9 @@ 

    Creating a r shell: ''' echo "Mapping the reads" > {log} - hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal -x resources/genome_indices/Scerevisiae_genome_index -1 {input.trim1} -2 {input.trim2} -S {output.sam} --summary-file {output.report} 2>> {log} + hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal \ + -x resources/genome_indices/Scerevisiae_genome_index \ + -1 {input.trim1} -2 {input.trim2} -S {output.sam} --summary-file {output.report} 2>> {log} echo "Mapped reads saved in <{output.sam}>" >> {log} echo "Mapping report saved in <{output.report}>" >> {log} ''' @@ -1097,8 +1112,8 @@

    Creating a r

    Exercise: If you had to run the workflow by specifying only one output, what command would you use?

    Answer -

    snakemake --cores 1 -r -p results/highCO2_sample1/highCO2_sample1_mapped_reads.sam -If you run it now, don’t forget to have a look at the log and benchmark files!

    +

    snakemake --cores 1 -r -p results/highCO2_sample1/highCO2_sample1_mapped_reads.sam

    +

    If you run it now, don’t forget to have a look at the log and benchmark files!

    HISAT2 options

      @@ -1122,7 +1137,6 @@

      Creating a rule to
    • Sort the BAM files using Samtools
    • Index the sorted BAM files using Samtools
    • -

      Hint

        @@ -1142,7 +1156,6 @@

        Creating a rule to

      Please give it a try before looking at the answer!

      -
      Answer
      rule sam_to_bam:
      @@ -1176,8 +1189,8 @@ 

      Creating a rule to

      Exercise: If you had to run the workflow by specifying only one output, what command would you use?

      Answer -

      snakemake --cores 1 -r -p results/highCO2_sample1/highCO2_sample1_mapped_reads_sorted.bam -If you run it now, don’t forget to have a look at the log and benchmark files!

      +

      snakemake --cores 1 -r -p results/highCO2_sample1/highCO2_sample1_mapped_reads_sorted.bam

      +

      If you run it now, don’t forget to have a look at the log and benchmark files!

      Samtools options

        @@ -1207,7 +1220,6 @@

        Creating a rule to count mapped r

If you are working with genome sequences and annotations from different sources, remember that they must use the same chromosome names, otherwise counting will not work.

      Exercise: Implement a rule to count the reads mapping on each gene of the S. cerevisiae genome using featureCounts.

      -

      Hint

        @@ -1229,7 +1241,6 @@

        Creating a rule to count mapped r

      Please give it a try before looking at the answer!

      -
      Answer
      rule reads_quantification_genes:
      @@ -1251,7 +1262,8 @@ 

      Creating a rule to count mapped r shell: ''' echo "Counting reads mapping on genes in <{input.bam_once_sorted}>" > {log} - featureCounts -t exon -g gene_id -s 2 -p -B -C --largestOverlap --verbose -F GTF -a resources/Scerevisiae.gtf -o {output.gene_level} {input.bam_once_sorted} &>> {log} + featureCounts -t exon -g gene_id -s 2 -p -B -C --largestOverlap --verbose -F GTF \ + -a resources/Scerevisiae.gtf -o {output.gene_level} {input.bam_once_sorted} &>> {log} echo "Renaming output files" >> {log} mv {output.gene_level}.summary {output.gene_summary} echo "Results saved in <{output.gene_level}>" >> {log} @@ -1281,8 +1293,9 @@

      Running the workflow

      Exercise: If you have not done it after each step, it is now time to run the entire workflow on your sample of choice. What command will you use to run it?

      Answer -

      Because all the rules are chained together, you only need to specify one of the final outputs to trigger the execution of all the previous rules: snakemake --cores 1 -F -r -p results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv. -You can add the -F flag to force an entire re-run. The entire run should take about ~10 min to complete.

      +

      Because all the rules are chained together, you only need to specify one of the final outputs to trigger the execution of all the previous rules:

      +

snakemake --cores 1 -F -r -p results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv

      +

      You can add the -F flag to force an entire re-run. The entire run should take about ~10 min to complete.

      Exercise: Check Snakemake’s log in .snakemake/log/. Is everything as you expected, especially the wildcards values, input and output names etc…?

      @@ -1292,7 +1305,6 @@

      Running the workflow

      Visualising the DAG of the workflow

We have now implemented and run the main steps of our workflow. It is always a good idea to visualise the whole process to check for errors and inconsistencies. Snakemake has a built-in workflow visualisation feature to do this.

      Exercise: Visualise the entire workflow’s Directed Acyclic Graph using the --dag flag. Do you need to specify a target?

      -

      Hint

        @@ -1301,10 +1313,10 @@

        Visualising the DAG of the workflow
      • Save the result as a PNG picture
      -
      Answer -

      If we run the command without target: snakemake --cores 1 --dag -F | dot -Tpng > images/dag.png, we will get a Target rules may not contain wildcards. error, which means we need to add a target. Same as before, it makes sense to use one of the final outputs to get the entire workflow: snakemake --cores 1 --dag -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv | dot -Tpng > images/dag.png. But once again, we will get an error: BrokenPipeError: [Errno 32] Broken pipe. This is because we are piping the command output to a folder (images/) that does not exist yet The folder is not created by Snakemake because it isn’t handled as part of an actual run. So we have to create the folder before generating the DAG:

      +

      If we run the command without target: snakemake --cores 1 --dag -F | dot -Tpng > images/dag.png, we will get a Target rules may not contain wildcards. error, which means we need to add a target. Same as before, it makes sense to use one of the final outputs to get the entire workflow: snakemake --cores 1 --dag -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv | dot -Tpng > images/dag.png.

      +

But once again, we will get an error: BrokenPipeError: [Errno 32] Broken pipe. This is because we are piping the command output to a folder (images/) that does not exist yet. The folder is not created by Snakemake because it isn’t handled as part of an actual run. So we have to create the folder before generating the DAG:

      mkdir images
       snakemake --cores 1 --dag -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv | dot -Tpng > images/dag.png
       
      @@ -1337,16 +1349,13 @@

      Visualising the DAG of the workflow
    • Generate the filegraph: snakemake --cores 1 --filegraph -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv | dot -Tjpg > images/filegraph.jpg
    -

    You should obtain the 3 following figures, respectively DAG, rulegraph and filegraph:

    -
    - -    - -    - +

    You should obtain the 3 following figures:

    +
    + ![backbone_dag](../../../assets/images/backbone_dag.png){ width="30%" height="450" } ![backbone_rulegraph](../../../assets/images/backbone_rulegraph.png){ width="30%" height="450" } ![backbone_filegraph](../../../assets/images/backbone_filegraph.png){ width="30%" height="450" } +
    DAG, rulegraph and filegraph (respectively) of the workflow
    at the end of the session
    -

    The differences are:

    +

    The differences between these plots are:

    • --dag: dependency graph of all the jobs
    • --filegraph: dependency graph of rules with inputs and outputs (rule appears once, with wildcards)
    • @@ -1372,7 +1381,6 @@

      Debugging a workflow

      It is very likely you will see bugs and errors the first time you try to run a new Snakefile: don’t be discouraged, this is normal!

      Order of operations in Snakemake

      The topic was tackled when DAGs were mentioned, but to efficiently debug a workflow, it is worth taking a deeper look at what Snakemake does when you execute the command snakemake --cores 1 <target>. There are 3 main phases:

      -
      1. Prepare to run:
        1. Read all the rule definitions from the Snakefile
        2. diff --git a/course_material/day2/4_decorating_workflow/index.html b/course_material/day2/4_decorating_workflow/index.html index c50caa9..f0e42a1 100644 --- a/course_material/day2/4_decorating_workflow/index.html +++ b/course_material/day2/4_decorating_workflow/index.html @@ -586,6 +586,20 @@ + + + + + +
        3. + + Running containers with singularity + +
        4. + + + +
    @@ -783,17 +797,15 @@

    Optimising a workflow by multi
  • Identify in each software the parameter that controls multi-threading
  • Implement the multi-threading
  • -

    Hint

    • Check the software documentation and parameters with the -h/--help flags
    • -
    • Remember that multi-threading only applies to software that can make use of a threads parameters, Snakemake itself cannot parallelize a software automatically
    • +
• Remember that multi-threading only applies to software that can make use of a threads parameter; Snakemake itself cannot parallelise a software automatically
    • Remember that you need to add threads to the Snakemake rule but also to the commands! Just increasing the number of threads in Snakemake will not magically run a command with multiple threads
    • -
    • Remember that you have 4 threads in total, so even if you ask for more in a rule, Snakemake will cap this value at 4. And if you use 4 threads in a rule, that means that no other job can run parallel!
    • +
• Remember that you have 4 threads in total, so even if you ask for more in a rule, Snakemake will cap this value at 4. And if you use 4 threads in a rule, that means that no other job can run in parallel!
    -
    Answer

    It turns out that all the software except samtools index can handle multi-threading:

    @@ -825,7 +837,9 @@

    Optimising a workflow by multi shell: ''' echo "Trimming reads in <{input.reads1}> and <{input.reads2}>" > {log} - atropos trim -q 20,20 --minimum-length 25 --trim-n --preserve-order --max-n 10 --no-cache-adapters -a "A{{20}}" -A "A{{20}}" --threads {threads} -pe1 {input.reads1} -pe2 {input.reads2} -o {output.trim1} -p {output.trim2} &>> {log} + atropos trim -q 20,20 --minimum-length 25 --trim-n --preserve-order --max-n 10 \ + --no-cache-adapters -a "A{{20}}" -A "A{{20}}" --threads {threads} \ + -pe1 {input.reads1} -pe2 {input.reads2} -o {output.trim1} -p {output.trim2} &>> {log} echo "Trimmed files saved in <{output.trim1}> and <{output.trim2}> respectively" >> {log} echo "Trimming report saved in <{log}>" >> {log} ''' @@ -850,7 +864,9 @@

    Optimising a workflow by multi shell: ''' echo "Mapping the reads" > {log} - hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal -x resources/genome_indices/Scerevisiae_index --threads {threads} -1 {input.trim1} -2 {input.trim2} -S {output.sam} --summary-file {output.report} 2>> {log} + hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal \ + -x resources/genome_indices/Scerevisiae_index --threads {threads} \ + -1 {input.trim1} -2 {input.trim2} -S {output.sam} --summary-file {output.report} 2>> {log} echo "Mapped reads saved in <{output.sam}>" >> {log} echo "Mapping report saved in <{output.report}>" >> {log} ''' @@ -904,7 +920,8 @@

    Optimising a workflow by multi shell: ''' echo "Counting reads mapping on genes in <{input.bam_once_sorted}>" > {log} - featureCounts -t exon -g gene_id -s 2 -p -B -C --largestOverlap --verbose -F GTF -a resources/Scerevisiae.gtf -T {threads} -o {output.gene_level} {input.bam_once_sorted} &>> {log} + featureCounts -t exon -g gene_id -s 2 -p -B -C --largestOverlap --verbose -F GTF \ + -a resources/Scerevisiae.gtf -T {threads} -o {output.gene_level} {input.bam_once_sorted} &>> {log} echo "Renaming output files" >> {log} mv {output.gene_level}.summary {output.gene_summary} echo "Results saved in <{output.gene_level}>" >> {log} @@ -915,14 +932,16 @@

    Optimising a workflow by multi

    Exercise: Finally, test the effect of the number of threads on the workflow’s runtime. What command will you use to run the workflow? Does the workflow run faster?

    Answer -

    The command to use is snakemake --cores 4 -F -r -p results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv. Do not forget to provide additional cores to Snakemake in the execution command with --cores 4. Note that the number of threads allocated to all jobs running at a given time cannot exceed the value specified with --cores. Therefore, if you leave this number at 1, Snakemake will not be able to use multiple threads. Also note that increasing --cores allows Snakemake to run multiple jobs in parallel (for example, running 2 jobs using 2 threads each). The workflow now takes ~6 min to run, compared to ~10 min before (i.e. a 40% decrease!). This gives you an idea of how powerful multi-threading is when the datasets and computing power get bigger!

    +

    The command to use is:

    +

    snakemake --cores 4 -F -r -p results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv

    +

    Do not forget to provide additional cores to Snakemake in the execution command with --cores 4. Note that the number of threads allocated to all jobs running at a given time cannot exceed the value specified with --cores. Therefore, if you leave this number at 1, Snakemake will not be able to use multiple threads. Also note that increasing --cores allows Snakemake to run multiple jobs in parallel (for example, running 2 jobs using 2 threads each). The workflow now takes ~6 min to run, compared to ~10 min before (i.e. a 40% decrease!). This gives you an idea of how powerful multi-threading is when the datasets and computing power get bigger!

    Explicit is better than implicit

Even if a software cannot multi-thread, it is useful to add threads: 1 to the rule to keep rules consistent and clearly state that the software works with a single thread.
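A hypothetical stripped-down example of that convention (the rule name and paths are placeholders; the course's actual sam_to_bam rule does more than this):

```
rule index_bam:
    input:
        'results/{sample}/{sample}_mapped_reads_sorted.bam'
    output:
        'results/{sample}/{sample}_mapped_reads_sorted.bam.bai'
    threads: 1   # samtools index is run single-threaded here; stating it keeps rules consistent
    shell:
        'samtools index {input}'
```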

    -

    Keep in mind when using parallel execution

    +

    Things to keep in mind when using parallel execution

    • Parallel jobs will use more RAM. If you run out then either your OS will swap data to disk, or a process will crash
    • The on-screen output from parallel jobs will be mixed, so save any output to log files instead
    • @@ -935,7 +954,10 @@

      Non-file parameters

    • In the rule read_mapping, the index parameter -x resources/genome_indices/Scerevisiae_index
    • In the rule reads_quantification_genes, the annotation parameter -a resources/Scerevisiae.gtf
    -

    This reduces readability and also makes it very hard to change the value of these parameters. The params directive was designed for this purpose: it allows to specify additional parameters that can also depend on the wildcards values and use input functions (see Session 4 for more information on this). params values can be of any type (integer, string, list etc…) and similarly to the {input} and {output} placeholders, they can also be accessed from the shell command with the placeholder arams}. Just like for the input and output directives, you can define multiple parameters (in this case, do not forget the comma between each entry!) and they can be named (in practice, unknown parameters are unexplicit and easily confusing, so parameters should always be named!). It also helps readability and clarity to use the params section to name and assign parameters and variables for your shell command. Here is an example on how to use params:

    +

    This reduces readability and also makes it very hard to change the value of these parameters.

    +

The params directive was designed for this purpose: it allows you to specify additional parameters that can also depend on the wildcards values and use input functions (see Session 4 for more information on this). params values can be of any type (integer, string, list etc…) and, similarly to the {input} and {output} placeholders, they can also be accessed from the shell command with the placeholder {params}. Just like for the input and output directives, you can define multiple parameters (in this case, do not forget the comma between each entry!) and they can be named (in practice, unnamed parameters are not explicit and easily confusing, so parameters should always be named!).

    +

    It also helps readability and clarity to use the params section to name and assign parameters and variables for your shell command.

    +

    Here is an example on how to use params:

    rule example:
         input:
             'data/example.tsv'
    @@ -948,7 +970,7 @@ 

    Non-file parameters

    Parameters arguments

    -

    In contrast to the input directive, the params directive can optionally take more arguments than only wildcards, namely input, output, threads, and resources.

    +

    In contrast to the input directive, the params directive can optionally take more arguments than only wildcards, namely input, output, threads, and resources.
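For example, a hypothetical rule (names and paths are placeholders, not part of the course workflow) where one parameter is a plain value and another is computed from wildcards and output via a function:

```
rule compress_counts:
    input:
        'results/{sample}/{sample}_genes_read_quantification.tsv'
    output:
        'results/{sample}/{sample}_genes_read_quantification.tsv.gz'
    params:
        compression_level = '-6',
        # the first argument is always wildcards; input, output, threads and
        # resources can additionally be requested by naming them
        label = lambda wildcards, output: f'{wildcards.sample} -> {output}'
    shell:
        'echo "compressing {params.label}" && gzip -c {params.compression_level} {input} > {output}'
```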

    Exercise: Replace the two hard-coded paths mentioned earlier by params.

    @@ -964,7 +986,9 @@

    Non-file parameters

    params:
         index = 'resources/genome_indices/Scerevisiae_index'
     shell:
    -    'hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal -x {params.index} --threads {threads} -1 {input.trim1} -2 {input.trim2} -S {output.sam} --summary-file {output.report} 2>> {log}'
    +    'hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal \
    +    -x {params.index} --threads {threads} \
    +    -1 {input.trim1} -2 {input.trim2} -S {output.sam} --summary-file {output.report} 2>> {log}'
     
    • rule reads_quantification_genes
    • @@ -972,12 +996,14 @@

      Non-file parameters

      params:
           annotations = 'resources/Scerevisiae.gtf'
       shell:
      -    'featureCounts -t exon -g gene_id -s 2 -p -B -C --largestOverlap --verbose -F GTF -a {params.annotations} -T {threads} -o {output.gene_level} {input.bam_once_sorted} &>> {log}'
      +    'featureCounts -t exon -g gene_id -s 2 -p -B -C --largestOverlap --verbose -F GTF \
      +    -a {params.annotations} -T {threads} -o {output.gene_level} {input.bam_once_sorted} &>> {log}'
       

    Snakemake re-run behaviour

    -

    If you try to re-run only the last rule with snakemake --cores 4 -r -p -f results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv, Snakemake will actually try to re-run 3 rules in total. This is because the code changed in 2 rules (see reason field in Snakemake’s log), which triggered an update of the inputs in the 3rd rule (sam_to_bam). To avoid this, first touch the files with snakemake --cores 1 --touch -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv then re-run the last rule.

    +

    If you try to re-run only the last rule with snakemake --cores 4 -r -p -f results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv, Snakemake will actually try to re-run 3 rules in total.

    +

    This is because the code changed in 2 rules (see reason field in Snakemake’s log), which triggered an update of the inputs in the 3rd rule (sam_to_bam). To avoid this, first touch the files with snakemake --cores 1 --touch -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv then re-run the last rule.

    Config files

That being said, there is an even better way to handle parameters like the ones we just modified: instead of hard-coding parameter values in the Snakefile, Snakemake allows you to define parameters and their values in config files. The config files will be parsed by Snakemake when executing the workflow, and parameters and their values will be stored in a Python dictionary named config. The path to the config file can be specified either in the Snakefile with the line configfile: <path/to/file.yaml> at the top of the file, or it can be specified at runtime with the execution parameter --configfile <path/to/file.yaml>.
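A minimal sketch, assuming the samples are listed under a samples key in config/config.yaml (the file name, the annotations entry and the exact keys are assumptions; the course's actual config may differ):

```
# config/config.yaml (hypothetical content)
samples:
  - highCO2_sample1
  - lowCO2_sample1
annotations: 'resources/Scerevisiae.gtf'
```

```
# At the top of the Snakefile
configfile: 'config/config.yaml'

# 'config' is now a plain Python dictionary
print(config['samples'])        # ['highCO2_sample1', 'lowCO2_sample1']
print(config['annotations'])    # 'resources/Scerevisiae.gtf'
```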

    @@ -1152,7 +1178,6 @@

    Use-case of the We will create a single rule to run FastQC on both the original and the trimmed FASTQ files

    Choose only one solution to implement:

    -
    @@ -1185,12 +1210,14 @@

    Use-case of the echo "Creating output directory <{output.before_trim}>" > {log} mkdir -p {output.before_trim} 2>> {log} echo "Performing QC of reads before trimming in <{input.reads1}> and <{input.reads2}>" >> {log} - fastqc --format fastq --threads {threads} --outdir {output.before_trim} --dir {output.before_trim} {input.reads1} {input.reads2} &>> {log} + fastqc --format fastq --threads {threads} --outdir {output.before_trim} \ + --dir {output.before_trim} {input.reads1} {input.reads2} &>> {log} echo "Results saved in <{output.before_trim}>" >> {log} echo "Creating output directory <{output.after_trim}>" >> {log} mkdir -p {output.after_trim} 2>> {log} echo "Performing QC of reads after trimming in <{input.trim1}> and <{input.trim2}>" >> {log} - fastqc --format fastq --threads {threads} --outdir {output.after_trim} --dir {output.after_trim} {input.trim1} {input.trim2} &>> {log} + fastqc --format fastq --threads {threads} --outdir {output.after_trim} \ + --dir {output.after_trim} {input.trim1} {input.trim2} &>> {log} echo "Results saved in <{output.after_trim}>" >> {log} '''

    @@ -1249,14 +1276,16 @@

    Use-case of the shell: ''' echo "Performing QC of reads before trimming in <{input.reads1}> and <{input.reads2}>" >> {log} - fastqc --format fastq --threads {threads} --outdir {params.wd} --dir {params.wd} {input.reads1} {input.reads2} &>> {log} + fastqc --format fastq --threads {threads} --outdir {params.wd} \ + --dir {params.wd} {input.reads1} {input.reads2} &>> {log} echo "Renaming results from original fastq analysis" >> {log} # Renames files because we can't choose fastqc output mv {params.html1_before} {output.html1_before} 2>> {log} mv {params.zipfile1_before} {output.zipfile1_before} 2>> {log} mv {params.html2_before} {output.html2_before} 2>> {log} mv {params.zipfile2_before} {output.zipfile2_before} 2>> {log} echo "Performing QC of reads after trimming in <{input.trim1}> and <{input.trim2}>" >> {log} - fastqc --format fastq --threads {threads} --outdir {params.wd} --dir {params.wd} {input.trim1} {input.trim2} &>> {log} + fastqc --format fastq --threads {threads} --outdir {params.wd} \ + --dir {params.wd} {input.trim1} {input.trim2} &>> {log} echo "Renaming results from trimmed fastq analysis" >> {log} # Renames files because we can't choose fastqc output mv {params.html1_after} {output.html1_after} 2>> {log} mv {params.zipfile1_after} {output.zipfile1_after} 2>> {log} @@ -1306,7 +1335,9 @@

    Use-case of the Aggregating outputs

    Exercise: Write an expand() syntax to generate a list of outputs from rule reads_quantification_genes with all the RNAseq samples. What do you need to write this?

    Answer -

    The output of rule reads_quantification_genes has the following syntax: 'results/{sample}/{sample}_genes_read_quantification.tsv'. First, we need to create a Python list containing all the values that the {sample} wildcards can take: -SAMPLES = ['highCO2_sample1', 'highCO2_sample2', 'highCO2_sample3', 'lowCO2_sample1', 'lowCO2_sample2', 'lowCO2_sample3']

    -

    Then, we can transform the output syntax with expand(): -expand('results/{sample}/{sample}_genes_read_quantification.tsv', sample=SAMPLES)

    +

    The output of rule reads_quantification_genes has the following syntax: 'results/{sample}/{sample}_genes_read_quantification.tsv'.

    +

    First, we need to create a Python list containing all the values that the {sample} wildcards can take:

    +

    SAMPLES = ['highCO2_sample1', 'highCO2_sample2', 'highCO2_sample3', 'lowCO2_sample1', 'lowCO2_sample2', 'lowCO2_sample3']

    +

    Then, we can transform the output syntax with expand():

    +

    expand('results/{sample}/{sample}_genes_read_quantification.tsv', sample=SAMPLES)
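For orientation, a target rule built from these two elements might look like the following sketch (placed first in the Snakefile so it becomes the default target; the course's own answer may differ):

```
SAMPLES = ['highCO2_sample1', 'highCO2_sample2', 'highCO2_sample3',
           'lowCO2_sample1', 'lowCO2_sample2', 'lowCO2_sample3']

rule all:
    input:
        expand('results/{sample}/{sample}_genes_read_quantification.tsv', sample=SAMPLES)
```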

    Exercise: Use these two elements (the list of samples and the expand() syntax) in the target rule to ask Snakemake to generate all the outputs.

    @@ -1485,15 +1517,17 @@

    Aggregating outputs

    An even more Snakemake-idiomatic solution

    -

    There is an even better and more Snakemake-idiomatic version of the expand() syntax: expand(rules.reads_quantification_genes.output.gene_level, sample=config['samples']). While it may not seem easy to use and understand, this entirely removes the need to write the output paths!

    +

    There is an even better and more Snakemake-idiomatic version of the expand() syntax:

    +

expand(rules.reads_quantification_genes.output.gene_level, sample=config['samples'])

    +

    While it may not seem easy to use and understand, this entirely removes the need to write the output paths!

    Running the other samples of the workflow

    Exercise: Touch the files already present in your workflow to avoid re-creating them and then run your workflow on the 5 other samples.

    Answer
      -
    • To touch the existing files, you can use: snakemake --cores 1 --touch
    • -
    • To run the workflow, you can use snakemake --cores 4 -r -p
    • +
    • Touch the existing files: snakemake --cores 1 --touch
    • +
• Run the workflow: snakemake --cores 4 -r -p

    Thanks to the parallelisation, the workflow execution should take less than 10 min in total to process all the samples!

    @@ -1505,13 +1539,12 @@

    Running the other samples of
  • Generate the filegraph: snakemake --cores 1 -F -r -p --filegraph | dot -Tpng > images/all_samples_filegraph.png
  • -

    Your DAG should resemble this:

    -

    And your filegraph, this:

    +

    And this should be your filegraph:

    diff --git a/course_schedule/index.html b/course_schedule/index.html index 7b46366..e9fb81a 100644 --- a/course_schedule/index.html +++ b/course_schedule/index.html @@ -476,6 +476,20 @@ + + + + + +
  • + + Running containers with singularity + +
  • + + + + @@ -648,7 +662,7 @@

    Day 2 - Snakemake

    Block 4 3:30 PM 4:30 PM -Whole workflow +Snakemake, package managers and containers diff --git a/index.html b/index.html index 262d6da..c33ba53 100644 --- a/index.html +++ b/index.html @@ -555,6 +555,20 @@ + + + + + +
  • + + Running containers with singularity + +
  • + + + + diff --git a/precourse/index.html b/precourse/index.html index 46b5f0e..6327692 100644 --- a/precourse/index.html +++ b/precourse/index.html @@ -476,6 +476,20 @@ + + + + + +
  • + + Running containers with singularity + +
  • + + + + diff --git a/search/search_index.json b/search/search_index.json index 07d83cc..c835d4d 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Course website Teachers Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Antonin Thi\u00e9baut .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Damir Zhakparov Authors Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Antonin Thi\u00e9baut .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Patricia Palagi .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Attribution This course is partly inspired by the Carpentries Docker course and the official Snakemake tutorial . License & copyright License: CC BY-SA 4.0 Copyright: SIB Swiss Institute of Bioinformatics Material This website Google doc (through mail) Learning outcomes General learning outcomes After this course, you will be able to: Understand the basic concepts and terminology associated with virtualization with containers Customize, store, manage and share containerized environments with Docker Use Singularity to run containers on a shared computer environment (e.g. a HPC cluster) Understand the basic concepts and terminology associated with workflow management systems Create a computational workflow that uses containers and package managers with Snakemake Learning outcomes explained To reach the general learning outcomes above, we have set a number of smaller learning outcomes. Each chapter (found at Course material ) starts with these smaller learning outcomes. Use these at the start of a chapter to get an idea what you will learn. Use them also at the end of a chapter to evaluate whether you have learned what you were expected to learn. Learning experiences To reach the learning outcomes we will use lectures, exercises, polls and group work. During exercises, you are free to discuss with other participants. During lectures, focus on the lecture only. Exercises Each block has practical work involved. Some more than others. The practicals are subdivided into chapters, and we\u2019ll have a (short) discussion after each chapter. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different. Asking questions During lectures, you are encouraged to raise your hand if you have questions. A main source of communication will be our slack channel . Ask background questions that interest you personally at #background . During the exercises, e.g. if you are stuck or don\u2019t understand what is going on, use the slack channel #q-and-a . This channel is not only meant for asking questions but also for answering questions of other participants. If you are replying to a question, use the \u201creply in thread\u201d option: The teachers will review the answers, and add/modify if necessary. If you\u2019re really stuck and need specific tutor support, write the teachers or helpers personally. 
To summarise: During lectures: raise hand Personal interest questions: #background on slack During exercises: raise hand/ #q-and-a on slack","title":"Home"},{"location":"#course-website","text":"","title":"Course website"},{"location":"#teachers","text":"Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Antonin Thi\u00e9baut .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Damir Zhakparov","title":"Teachers"},{"location":"#authors","text":"Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Antonin Thi\u00e9baut .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Patricia Palagi .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;}","title":"Authors"},{"location":"#attribution","text":"This course is partly inspired by the Carpentries Docker course and the official Snakemake tutorial .","title":"Attribution"},{"location":"#license-copyright","text":"License: CC BY-SA 4.0 Copyright: SIB Swiss Institute of Bioinformatics","title":"License & copyright"},{"location":"#material","text":"This website Google doc (through mail)","title":"Material"},{"location":"#learning-outcomes","text":"","title":"Learning outcomes"},{"location":"#general-learning-outcomes","text":"After this course, you will be able to: Understand the basic concepts and terminology associated with virtualization with containers Customize, store, manage and share containerized environments with Docker Use Singularity to run containers on a shared computer environment (e.g. a HPC cluster) Understand the basic concepts and terminology associated with workflow management systems Create a computational workflow that uses containers and package managers with Snakemake","title":"General learning outcomes"},{"location":"#learning-outcomes-explained","text":"To reach the general learning outcomes above, we have set a number of smaller learning outcomes. Each chapter (found at Course material ) starts with these smaller learning outcomes. Use these at the start of a chapter to get an idea what you will learn. Use them also at the end of a chapter to evaluate whether you have learned what you were expected to learn.","title":"Learning outcomes explained"},{"location":"#learning-experiences","text":"To reach the learning outcomes we will use lectures, exercises, polls and group work. During exercises, you are free to discuss with other participants. During lectures, focus on the lecture only.","title":"Learning experiences"},{"location":"#exercises","text":"Each block has practical work involved. Some more than others. The practicals are subdivided into chapters, and we\u2019ll have a (short) discussion after each chapter. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different.","title":"Exercises"},{"location":"#asking-questions","text":"During lectures, you are encouraged to raise your hand if you have questions. A main source of communication will be our slack channel . Ask background questions that interest you personally at #background . During the exercises, e.g. if you are stuck or don\u2019t understand what is going on, use the slack channel #q-and-a . This channel is not only meant for asking questions but also for answering questions of other participants. If you are replying to a question, use the \u201creply in thread\u201d option: The teachers will review the answers, and add/modify if necessary. 
If you\u2019re really stuck and need specific tutor support, write the teachers or helpers personally. To summarise: During lectures: raise hand Personal interest questions: #background on slack During exercises: raise hand/ #q-and-a on slack","title":"Asking questions"},{"location":"course_schedule/","text":"Day 1 - Containers Block Start End subject Block 1 9:00 AM 10:30 AM Introduction to containers 10:30 AM 11:00 AM BREAK Block 2 11:00 AM 12:30 PM Managing containers and images 12:30 PM 1:30 PM BREAK Block 3 1:30 PM 3:00 PM Working with dockerfiles 3:00 PM 3:30 PM BREAK Block 4 3:30 PM 5:00 PM Running containers with singularity Day 2 - Snakemake Block Start End subject Block 1 9:00 AM 10:30 AM Introduction to Snakemake 10:30 AM 11:00 AM BREAK Block 2 11:00 AM 12:30 PM Generalising a Snakemake workflow 12:30 PM 1:30 PM BREAK Block 3 1:30 PM 3:00 PM Decorating a Snakemake workflow 3:00 PM 3:30 PM BREAK Block 4 3:30 PM 4:30 PM Whole workflow 4:30 PM 5:00 PM Wrap-up & Open Q&A","title":"Course schedule"},{"location":"course_schedule/#day-1-containers","text":"Block Start End subject Block 1 9:00 AM 10:30 AM Introduction to containers 10:30 AM 11:00 AM BREAK Block 2 11:00 AM 12:30 PM Managing containers and images 12:30 PM 1:30 PM BREAK Block 3 1:30 PM 3:00 PM Working with dockerfiles 3:00 PM 3:30 PM BREAK Block 4 3:30 PM 5:00 PM Running containers with singularity","title":"Day 1 - Containers"},{"location":"course_schedule/#day-2-snakemake","text":"Block Start End subject Block 1 9:00 AM 10:30 AM Introduction to Snakemake 10:30 AM 11:00 AM BREAK Block 2 11:00 AM 12:30 PM Generalising a Snakemake workflow 12:30 PM 1:30 PM BREAK Block 3 1:30 PM 3:00 PM Decorating a Snakemake workflow 3:00 PM 3:30 PM BREAK Block 4 3:30 PM 4:30 PM Whole workflow 4:30 PM 5:00 PM Wrap-up & Open Q&A","title":"Day 2 - Snakemake"},{"location":"precourse/","text":"UNIX As is stated in the course prerequisites at the announcement web page . We expect participants to have a basic understanding of working with the command line on UNIX-based systems. You can test your UNIX skills with a quiz here . If you don\u2019t have experience with UNIX command line, or if you\u2019re unsure whether you meet the prerequisites, follow our online UNIX tutorial . Software Install Docker on your local computer and create an account on dockerhub . You can find instructions here . Note that you need admin rights to install and use Docker, and if you are installing Docker on Windows, you need a recent Windows version. You should also have a modern code editor installed, like Sublime Text or VScode . If working with Windows During the course exercises you will be mainly interacting with docker through the command line. Although windows powershell is suitable for that, it is easier to follow the exercises if you have UNIX or \u2018UNIX-like\u2019 terminal. You can get this by using WSL2 . Make sure you install the latest versions before installing docker. If installing Docker is a problem During the course, we can give only limited support for installation issues. If you do not manage to install Docker before the course, you can still do almost all exercises on Play with Docker . A Docker login is required. In addition to your local computer, we will be working on an Amazon Web Services ( AWS ) Elastic Cloud (EC2) server. Our Ubuntu server behaves like a \u2018normal\u2019 remote server, and can be approached through ssh with a username, key and IP address. 
All participants will be granted access to a personal home directory.","title":"Precourse preparations"},{"location":"precourse/#unix","text":"As is stated in the course prerequisites at the announcement web page . We expect participants to have a basic understanding of working with the command line on UNIX-based systems. You can test your UNIX skills with a quiz here . If you don\u2019t have experience with UNIX command line, or if you\u2019re unsure whether you meet the prerequisites, follow our online UNIX tutorial .","title":"UNIX"},{"location":"precourse/#software","text":"Install Docker on your local computer and create an account on dockerhub . You can find instructions here . Note that you need admin rights to install and use Docker, and if you are installing Docker on Windows, you need a recent Windows version. You should also have a modern code editor installed, like Sublime Text or VScode . If working with Windows During the course exercises you will be mainly interacting with docker through the command line. Although windows powershell is suitable for that, it is easier to follow the exercises if you have UNIX or \u2018UNIX-like\u2019 terminal. You can get this by using WSL2 . Make sure you install the latest versions before installing docker. If installing Docker is a problem During the course, we can give only limited support for installation issues. If you do not manage to install Docker before the course, you can still do almost all exercises on Play with Docker . A Docker login is required. In addition to your local computer, we will be working on an Amazon Web Services ( AWS ) Elastic Cloud (EC2) server. Our Ubuntu server behaves like a \u2018normal\u2019 remote server, and can be approached through ssh with a username, key and IP address. All participants will be granted access to a personal home directory.","title":"Software"},{"location":"course_material/day1/dockerfiles/","text":"Learning outcomes After having completed this chapter you will be able to: Build an image based on a dockerfile Use the basic dockerfile syntax Change the default command of an image and validate the change Map ports to a container to display interactive content through a browser Material Official Dockerfile reference Ten simple rules for writing dockerfiles Exercises To make your images shareable and adjustable, it\u2019s good practice to work with a Dockerfile . This is a script with a set of instructions to build your image from an existing image. Basic Dockerfile You can generate an image from a Dockerfile using the command docker build . A Dockerfile has its own syntax for giving instructions. Luckily, they are rather simple. The script always contains a line starting with FROM that takes the image name from which the new image will be built. After that you usually want to run some commands to e.g. configure and/or install software. The instruction to run these commands during building starts with RUN . In our figlet example that would be: FROM ubuntu:jammy-20230308 RUN apt-get update RUN apt-get install figlet On writing reproducible Dockerfiles At the FROM statement in the above Dockerfile you see that we have added a specific tag to the image (i.e. jammy-20230308 ). We could also have written: FROM ubuntu RUN apt-get update RUN apt-get install figlet This will automatically pull the image with the tag latest . However, if the maintainer of the ubuntu images decides to tag another ubuntu version as latest , rebuilding with the above Dockerfile will not give you the same result. 
Therefore it\u2019s always good practice to add the (stable) tag to the image in a Dockerfile . More rules on making your Dockerfiles more reproducible here . Exercise: Create a file on your computer called Dockerfile , and paste the above instruction lines in that file. Make the directory containing the Dockerfile your current directory. Build a new image based on that Dockerfile with: x86_64 / AMD64 ARM64 (MacOS M1 chip) docker build . docker build --platform amd64 . If using an Apple M1 chip (newer Macs) If you are using a computer with an Apple M1 chip, you have the less common ARM system architecture, which can limit transferability of images to (more common) x86_64/AMD64 machines. When building images on a Mac with an M1 chip (especially if you have sharing in mind), it\u2019s best to specify the --platform amd64 flag. The argument of docker build The command docker build takes a directory as input (providing . means the current directory). This directory should contain the Dockerfile , but it can also contain more of the build context, e.g. (python, R, shell) scripts that are required to build the image. What has happened? What is the name of the build image? Answer A new image was created based on the Dockerfile . You can check it with: docker image ls , which gives something like: REPOSITORY TAG IMAGE ID CREATED SIZE 92c980b09aad 7 seconds ago 101MB ubuntu-figlet latest e08b999c7978 About an hour ago 101MB ubuntu latest f63181f19b2f 30 hours ago 72.9MB It has created an image without a name or tag. That\u2019s a bit inconvenient. Exercise: Build a new image with a specific name. You can do that with adding the option -t to docker build . Before that, remove the nameless image. Hint An image without a name is usually a \u201cdangling image\u201d. You can remove those with docker image prune . Answer Remove the nameless image with docker image prune . After that, rebuild an image with a name: x86_64 / AMD64 ARM (MacOS M1 chip) docker build -t ubuntu-figlet:v2 . docker build --platform amd64 -t ubuntu-figlet:v2 . Using CMD As you might remember the second positional argument of docker run is a command (i.e. docker run IMAGE [CMD] ). If you leave it empty, it uses the default command. You can change the default command in the Dockerfile with an instruction starting with CMD . For example: FROM ubuntu:jammy-20230308 RUN apt-get update RUN apt-get install figlet CMD figlet My image works! Exercise: Build a new image based on the above Dockerfile . Can you validate the change using docker image inspect ? Can you overwrite this default with docker run ? Answer Copy the new line to your Dockerfile , and build the new image like this: x86_64 / AMD64 ARM64 (MacOS M1 chip) docker build -t ubuntu-figlet:v3 . docker build --platform amd64 -t ubuntu-figlet:v3 . The command docker inspect ubuntu-figlet:v3 will give: \"Cmd\": [ \"/bin/sh\", \"-c\", \"figlet My image works!\" ] So the default command ( /bin/bash ) has changed to figlet My image works! 
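If you are only interested in a single field of the output, docker inspect also accepts a Go template via the --format option. For example (a quick sketch using the image we just built): docker inspect --format '{{.Config.Cmd}}' ubuntu-figlet:v3 prints only the default command.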
Running the image (with clean-up ( --rm )): docker run --rm ubuntu-figlet:v3 Will result in: __ __ _ _ _ | \\/ |_ _ (_)_ __ ___ __ _ __ _ ___ __ _____ _ __| | _____| | | |\\/| | | | | | | '_ ` _ \\ / _` |/ _` |/ _ \\ \\ \\ /\\ / / _ \\| '__| |/ / __| | | | | | |_| | | | | | | | | (_| | (_| | __/ \\ V V / (_) | | | <\\__ \\_| |_| |_|\\__, | |_|_| |_| |_|\\__,_|\\__, |\\___| \\_/\\_/ \\___/|_| |_|\\_\\___(_) |___/ |___/ And of course you can overwrite the default command: docker run --rm ubuntu-figlet:v3 figlet another text Resulting in: _ _ _ _ __ _ _ __ ___ | |_| |__ ___ _ __ | |_ _____ _| |_ / _` | '_ \\ / _ \\| __| '_ \\ / _ \\ '__| | __/ _ \\ \\/ / __| | (_| | | | | (_) | |_| | | | __/ | | || __/> <| |_ \\__,_|_| |_|\\___/ \\__|_| |_|\\___|_| \\__\\___/_/\\_\\\\__| Two flavours of CMD You have seen in the output of docker inspect that docker translates the command (i.e. figlet \"my image works!\" ) into this: [\"/bin/sh\", \"-c\", \"figlet 'My image works!'\"] . The notation we used in the Dockerfile is the shell notation while the notation with the square brackets ( [] ) is the exec-notation . You can use both notations in your Dockerfile . Altough the shell notation is more readable, the exec notation is directly used by the image, and therefore less ambiguous. A Dockerfile with shell notation: FROM ubuntu:jammy-20230308 RUN apt-get update RUN apt-get install figlet CMD figlet My image works! A Dockerfile with exec notation: FROM ubuntu:jammy-20230308 RUN apt-get update RUN apt-get install figlet CMD [ \"/bin/sh\" , \"-c\" , \"figlet My image works!\" ] Exercise: Now push our created image (with a version tag) to docker hub. We will use it later for the singularity exercises . Answer docker tag ubuntu-figlet:v3 [ USER NAME ] /ubuntu-figlet:v3 docker push [ USER NAME ] /ubuntu-figlet:v3 Build an image for your own script Often containers are built for a specific purpose. For example, you can use a container to ship all dependencies together with your developed set of scripts/programs. For that you will need to add your scripts to the container. That is quite easily done with the instruction COPY . However, in order to make your container more user-friendly, there are several additional instructions that can come in useful. We will treat the most frequently used ones below. Depending on your preference, either choose R or Python below. In the exercises will use a script called test_deseq2.R . This script will: Load the DESeq2 and optparse packages Load some additional packages to test their installations. We will use those packages later on in the course. 
Create and parse an option called --rows with optparse Create a dummy count matrix Run DESeq2 on the dummy count matrix Print the results to stdout You can download it here , or copy-paste it: test_deseq2.R #!/usr/bin/env Rscript # load packages required for this script write ( \"Loading packages required for this script\" , stderr ()) suppressPackageStartupMessages ({ library ( DESeq2 ) library ( optparse ) }) # load dependency packages for testing installations write ( \"Loading dependency packages for testing installations\" , stderr ()) suppressPackageStartupMessages ({ library ( apeglm ) library ( IHW ) library ( limma ) library ( data.table ) library ( ggplot2 ) library ( ggrepel ) library ( pheatmap ) library ( RColorBrewer ) library ( scales ) library ( stringr ) }) # parse options with optparse option_list <- list ( make_option ( c ( \"--rows\" ), type = \"integer\" , help = \"Number of rows in dummy matrix [default = %default]\" , default = 100 ) ) opt_parser <- OptionParser ( option_list = option_list , description = \"Runs DESeq2 on dummy data\" ) opt <- parse_args ( opt_parser ) # create a random dummy count matrix cnts <- matrix ( rnbinom ( n = opt $ row * 10 , mu = 100 , size = 1 / 0.5 ), ncol = 10 ) cond <- factor ( rep ( 1 : 2 , each = 5 )) # object construction dds <- DESeqDataSetFromMatrix ( cnts , DataFrame ( cond ), ~ cond ) # standard analysis dds <- DESeq ( dds ) res <- results ( dds ) # print results to stdout print ( res ) After you have downloaded it, make sure to set the permissions to executable: chmod +x test_deseq2.R It is a relatively simple script that runs DESeq2 on a dummy dataset. An example for execution would be: ./test_deseq2.R --rows 100 Here, --rows is a optional arguments that specifies the number of rows generated in the input count matrix. When running the script, it will return a bunch of messages and at the end an overview of differential gene expression analysis results: baseMean log2FoldChange lfcSE stat pvalue padj 1 66.1249 0.281757 0.727668 0.387206 0.698604 0.989804 2 76.9682 0.305763 0.619209 0.493796 0.621451 0.989804 3 64.7843 -0.694525 0.479445 -1.448603 0.147448 0.931561 4 123.0252 0.631247 0.688564 0.916758 0.359269 0.931561 5 93.2002 -0.453430 0.686043 -0.660936 0.508653 0.941951 ... ... ... ... ... ... ... 96 64.0177 0.757585137 0.682683 1.109718054 0.267121 0.931561 97 114.3689 -0.580010850 0.640313 -0.905823841 0.365029 0.931561 98 79.9620 0.000100617 0.612442 0.000164288 0.999869 0.999869 99 92.6614 0.563514308 0.716109 0.786910869 0.431334 0.939106 100 96.4410 -0.155268696 0.534400 -0.290547708 0.771397 0.989804 From the script you can see it has DESeq2 and optparse as dependencies. If we want to run the script inside a container, we would have to install them. We do this in the Dockerfile below. We give it the following instructions: use the r2u base image version jammy install the package DESeq2 , optparse and some additional packages we will need later on. We perform the installations with install2.r , which is a helper command that is present inside most rocker images. More info here . copy the script test_deseq2.R to /opt inside the container: FROM rocker/r2u:jammy RUN install2.r \\ DESeq2 \\ optparse \\ apeglm \\ IHW \\ limma \\ data.table \\ ggrepel \\ pheatmap \\ stringr COPY test_deseq2.R /opt Note In order to use COPY , the file that needs to be copied needs to be in the same directory as the Dockerfile or one of its subdirectories. R image stack The most used R image stack is from the rocker project . 
It contains many different base images (e.g. with shiny, Rstudio, tidyverse etc.). It depends on the type of image whether installations with apt-get or install2.r are possible. To understand more about how to install R packages in different containers, check out this cheat sheet , or visit rocker-project.org . Exercise: Download test_deseq2.R and build the image with docker build . Name the image deseq2 . After that, start an interactive session and execute the script inside the container. Hint Start an interactive session with the options -i and -t and use /bin/bash as the command. Answer Build the container: x86_64 / AMD64 ARM64 (MacOS M1 chip) docker build -t deseq2 . docker build --platform amd64 -t deseq2 . Run the container: docker run -it --rm deseq2 /bin/bash Inside the container we look up the script: cd /opt ls This should return test_deseq2.R . Now you can execute it from inside the container: ./test_deseq2.R --rows 100 That\u2019s kind of nice. We can ship our R script inside our container. However, we don\u2019t want to run it interactively every time. So let\u2019s make some changes to make it easy to run it as an executable. For example, we can add /opt to the global $PATH variable with ENV . The $PATH variable The path variable is a special variable that consists of a list of paths separated by colons ( : ). These paths are searched if you are trying to run an executable. More info on this topic at e.g. wikipedia . FROM rocker/r2u:jammy RUN install2.r \\ DESeq2 \\ optparse \\ apeglm \\ IHW \\ limma \\ data.table \\ ggrepel \\ pheatmap \\ stringr COPY test_deseq2.R /opt ENV PATH = /opt: $PATH Note The ENV instruction can be used to set any variable. Exercise : Rebuild the image and start an interactive bash session inside the new image. Is the path variable updated? (i.e. can we execute test_deseq2.R from anywhere?) Answer After re-building we start an interactive session: docker run -it --rm deseq2 /bin/bash The path is updated; /opt is prepended to the variable: echo $PATH returns: /opt:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin Now you can try to execute it from the root directory (or any other): test_deseq2.R Instead of starting an interactive session with /bin/bash we can now more easily run the script non-interactively: docker run --rm deseq2 test_deseq2.R --rows 100 Now it will directly print the output of test_deseq2.R to stdout. If you want to pack your script inside a container, you are building a container specifically for your script, meaning you essentially want the container to behave as the program itself. In order to do that, you can use ENTRYPOINT . ENTRYPOINT is similar to CMD , but has two important differences: ENTRYPOINT cannot be overwritten by the positional arguments (i.e. docker run image [CMD] ), but has to be overwritten by --entrypoint . The positional arguments (or CMD ) are appended to the ENTRYPOINT command. This means that you can use ENTRYPOINT as the executable and the positional arguments (or CMD ) as the options.
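As a short sketch of how the two combine (using the image and script names from this chapter): with ENTRYPOINT [ \"test_deseq2.R\" ] and CMD [ \"--rows\" , \"100\" ], docker run --rm deseq2 executes test_deseq2.R --rows 100, whereas docker run --rm deseq2 --rows 200 replaces only the CMD part and executes test_deseq2.R --rows 200.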
Let\u2019s try it out: FROM rocker/r2u:jammy RUN install2.r \\ DESeq2 \\ optparse \\ apeglm \\ IHW \\ limma \\ data.table \\ ggrepel \\ pheatmap \\ stringr COPY test_deseq2.R /opt ENV PATH = /opt: $PATH # note that if you want to be able to combine the two # both ENTRYPOINT and CMD need to written in the exec form ENTRYPOINT [ \"test_deseq2.R\" ] # default option (if positional arguments are not specified) CMD [ \"--rows\" , \"100\" ] Exercise : Re-build, and run the container non-interactively without any positional arguments. After that, try to pass a different number of rows to --rows . How do the commands look? Answer Just running the container non-interactively would be: docker run --rm deseq2 Passing a different argument (i.e. overwriting CMD ) would be: docker run --rm deseq2 --rows 200 Here, the container behaves as the executable itself to which you can pass arguments. Most containerized applications need multiple build steps. Often, you want to perform these steps and executions in a specific directory. Therefore, it can be in convenient to specify a working directory. You can do that with WORKDIR . This instruction will set the default directory for all other instructions (like RUN , COPY etc.). It will also change the directory in which you will land if you run the container interactively. FROM rocker/r2u:jammy RUN install2.r \\ DESeq2 \\ optparse \\ apeglm \\ IHW \\ limma \\ data.table \\ ggrepel \\ pheatmap \\ stringr WORKDIR /opt COPY test_deseq2.R . ENV PATH = /opt: $PATH # note that if you want to be able to combine the two # both ENTRYPOINT and CMD need to written in the exec form ENTRYPOINT [ \"test_deseq2.R\" ] # default option (if positional arguments are not specified) CMD [ \"--rows\" , \"100\" ] Exercise : build the image, and start the container interactively. Has the default directory changed? After that, push the image to dockerhub, so we can use it later with the singularity exercises. Note You can overwrite ENTRYPOINT with --entrypoint as an argument to docker run . Answer Running the container interactively would be: docker run -it --rm --entrypoint /bin/bash deseq2 Which should result in a terminal looking something like this: root@9a27da455fb1:/opt# Meaning that indeed the default directory has changed to /opt Pushing it to dockerhub: docker tag deseq2 [ USER NAME ] /deseq2:v1 docker push [ USER NAME ] /deseq2:v1 Get information on your image with docker inspect We have used docker inspect already in the previous chapter to find the default Cmd of the ubuntu image. However we can get more info on the image: e.g. the entrypoint, environmental variables, cmd, workingdir etc., you can use the Config record from the output of docker inspect . 
For our image this looks like: \"Config\" : { \"Hostname\" : \"\" , \"Domainname\" : \"\" , \"User\" : \"\" , \"AttachStdin\" : false , \"AttachStdout\" : false , \"AttachStderr\" : false , \"Tty\" : false , \"OpenStdin\" : false , \"StdinOnce\" : false , \"Env\" : [ \"PATH=/opt:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\" , \"LC_ALL=en_US.UTF-8\" , \"LANG=en_US.UTF-8\" , \"DEBIAN_FRONTEND=noninteractive\" , \"TZ=UTC\" ], \"Cmd\" : [ \"--rows\" , \"100\" ], \"ArgsEscaped\" : true , \"Image\" : \"\" , \"Volumes\" : null , \"WorkingDir\" : \"/opt\" , \"Entrypoint\" : [ \"test_deseq2.R\" ], \"OnBuild\" : null , \"Labels\" : { \"maintainer\" : \"Dirk Eddelbuettel \" , \"org.label-schema.license\" : \"GPL-2.0\" , \"org.label-schema.vcs-url\" : \"https://github.com/rocker-org/\" , \"org.label-schema.vendor\" : \"Rocker Project\" } } Adding metadata to your image You can annotate your Dockerfile and the image by using the instruction LABEL . You can give it any key and value with = . However, it is recommended to use the Open Container Initiative (OCI) keys . Exercise : Annotate our Dockerfile with the OCI keys on the creation date, author and description. After that, check whether this has been passed to the actual image with docker inspect . Note You can type LABEL for each key-value pair, but you can also have it on one line by seperating the key-value pairs by a space, e.g.: LABEL keyx = \"valuex\" keyy = \"valuey\" Answer The Dockerfile would look like: FROM rocker/r2u:jammy LABEL org.opencontainers.image.created = \"2023-04-12\" \\ org.opencontainers.image.authors = \"Geert van Geest\" \\ org.opencontainers.image.description = \"Container with DESeq2 and friends\" RUN install2.r \\ DESeq2 \\ optparse \\ apeglm \\ IHW \\ limma \\ data.table \\ ggrepel \\ pheatmap \\ stringr WORKDIR /opt COPY test_deseq2.R . ENV PATH = /opt: $PATH # note that if you want to be able to combine the two # both ENTRYPOINT and CMD need to written in the exec form ENTRYPOINT [ \"test_deseq2.R\" ] # default option (if positional arguments are not specified) CMD [ \"--rows\" , \"100\" ] The Config record in the output of docker inspect was updated with: \"Labels\" : { \"org.opencontainers.image.authors\" : \"Geert van Geest\" , \"org.opencontainers.image.created\" : \"2023-04-12\" , \"org.opencontainers.image.description\" : \"Container with DESeq2 and friends\" , \"org.opencontainers.image.licenses\" : \"GPL-2.0-or-later\" , \"org.opencontainers.image.source\" : \"https://github.com/rocker-org/rocker\" , \"org.opencontainers.image.vendor\" : \"Rocker Project\" } Building an image with a browser interface In this exercise, we will use a different base image ( rocker/rstudio:4 ), and we\u2019ll install the same packages. Rstudio server is a nice browser interface that you can use for a.o. programming in R. With the image we are creating we will be able to run Rstudio server inside a container. Check out the Dockerfile : FROM rocker/rstudio:4 RUN apt-get update && \\ apt-get install -y libz-dev RUN install2.r \\ optparse \\ BiocManager RUN R -q -e 'BiocManager::install(\"biomaRt\")' This will create an image from the existing rstudio image. It will also install libz-dev with apt-get , BiocManager with install2.r and DESeq2 with an R command. Despite we\u2019re installing the same packages, the installation steps need to be different from the r-base image. This is because in the rocker/rstudio images R is installed from source, and therefore you can\u2019t install packages with apt-get . 
More information on how to install R packages in R containers in this cheat sheet , or visit rocker-project.org . Installation will take a while The installation of CRAN packages will go relatively quickly, because can use the binary packages supplied by Posit Public Package Manager . However, the installation of Bioconductor packages will take a while, because they need to be installed from source. If you don\u2019t have time, you can skip the DESeq2 installation by removing the last line of the Dockerfile . Exercise: Build an image based on this Dockerfile and give it a meaningful name. Answer x86_64 / AMD64 ARM64 (MacOS M1 chip) docker build -t rstudio-server . docker build --platform amd64 -t rstudio-server . You can now run a container from the image. However, you will have to tell docker where to publish port 8787 from the docker container with -p [HOSTPORT:CONTAINERPORT] . We choose to publish it to the same port number: docker run --rm -it -p 8787 :8787 rstudio-server Networking More info on docker container networking here By running the above command, a container will be started exposing rstudio server at port 8787 at localhost. You can approach the instance of Rstudio server by typing localhost:8787 in your browser. You will be asked for a password. You can find this password in the terminal from which you have started the container. We can make this even more interesting by mounting a local directory to the container running the Rstudio image: docker run \\ -it \\ --rm \\ -p 8787 :8787 \\ --mount type = bind,source = /Users/myusername/working_dir,target = /home/rstudio/working_dir \\ rstudio-server By doing this you have a completely isolated and shareable R environment running Rstudio server, but with your local files available to it. Pretty neat right?","title":"Working with dockerfiles"},{"location":"course_material/day1/dockerfiles/#learning-outcomes","text":"After having completed this chapter you will be able to: Build an image based on a dockerfile Use the basic dockerfile syntax Change the default command of an image and validate the change Map ports to a container to display interactive content through a browser","title":"Learning outcomes"},{"location":"course_material/day1/dockerfiles/#material","text":"Official Dockerfile reference Ten simple rules for writing dockerfiles","title":"Material"},{"location":"course_material/day1/dockerfiles/#exercises","text":"To make your images shareable and adjustable, it\u2019s good practice to work with a Dockerfile . This is a script with a set of instructions to build your image from an existing image.","title":"Exercises"},{"location":"course_material/day1/dockerfiles/#basic-dockerfile","text":"You can generate an image from a Dockerfile using the command docker build . A Dockerfile has its own syntax for giving instructions. Luckily, they are rather simple. The script always contains a line starting with FROM that takes the image name from which the new image will be built. After that you usually want to run some commands to e.g. configure and/or install software. The instruction to run these commands during building starts with RUN . In our figlet example that would be: FROM ubuntu:jammy-20230308 RUN apt-get update RUN apt-get install figlet On writing reproducible Dockerfiles At the FROM statement in the above Dockerfile you see that we have added a specific tag to the image (i.e. jammy-20230308 ). 
We could also have written: FROM ubuntu RUN apt-get update RUN apt-get install figlet This will automatically pull the image with the tag latest . However, if the maintainer of the ubuntu images decides to tag another ubuntu version as latest , rebuilding with the above Dockerfile will not give you the same result. Therefore it\u2019s always good practice to add the (stable) tag to the image in a Dockerfile . More rules on making your Dockerfiles more reproducible here . Exercise: Create a file on your computer called Dockerfile , and paste the above instruction lines in that file. Make the directory containing the Dockerfile your current directory. Build a new image based on that Dockerfile with: x86_64 / AMD64 ARM64 (MacOS M1 chip) docker build . docker build --platform amd64 . If using an Apple M1 chip (newer Macs) If you are using a computer with an Apple M1 chip, you have the less common ARM system architecture, which can limit transferability of images to (more common) x86_64/AMD64 machines. When building images on a Mac with an M1 chip (especially if you have sharing in mind), it\u2019s best to specify the --platform amd64 flag. The argument of docker build The command docker build takes a directory as input (providing . means the current directory). This directory should contain the Dockerfile , but it can also contain more of the build context, e.g. (python, R, shell) scripts that are required to build the image. What has happened? What is the name of the build image? Answer A new image was created based on the Dockerfile . You can check it with: docker image ls , which gives something like: REPOSITORY TAG IMAGE ID CREATED SIZE 92c980b09aad 7 seconds ago 101MB ubuntu-figlet latest e08b999c7978 About an hour ago 101MB ubuntu latest f63181f19b2f 30 hours ago 72.9MB It has created an image without a name or tag. That\u2019s a bit inconvenient. Exercise: Build a new image with a specific name. You can do that with adding the option -t to docker build . Before that, remove the nameless image. Hint An image without a name is usually a \u201cdangling image\u201d. You can remove those with docker image prune . Answer Remove the nameless image with docker image prune . After that, rebuild an image with a name: x86_64 / AMD64 ARM (MacOS M1 chip) docker build -t ubuntu-figlet:v2 . docker build --platform amd64 -t ubuntu-figlet:v2 .","title":"Basic Dockerfile"},{"location":"course_material/day1/dockerfiles/#using-cmd","text":"As you might remember the second positional argument of docker run is a command (i.e. docker run IMAGE [CMD] ). If you leave it empty, it uses the default command. You can change the default command in the Dockerfile with an instruction starting with CMD . For example: FROM ubuntu:jammy-20230308 RUN apt-get update RUN apt-get install figlet CMD figlet My image works! Exercise: Build a new image based on the above Dockerfile . Can you validate the change using docker image inspect ? Can you overwrite this default with docker run ? Answer Copy the new line to your Dockerfile , and build the new image like this: x86_64 / AMD64 ARM64 (MacOS M1 chip) docker build -t ubuntu-figlet:v3 . docker build --platform amd64 -t ubuntu-figlet:v3 . The command docker inspect ubuntu-figlet:v3 will give: \"Cmd\": [ \"/bin/sh\", \"-c\", \"figlet My image works!\" ] So the default command ( /bin/bash ) has changed to figlet My image works! 
Running the image (with clean-up ( --rm )): docker run --rm ubuntu-figlet:v3 Will result in: __ __ _ _ _ | \\/ |_ _ (_)_ __ ___ __ _ __ _ ___ __ _____ _ __| | _____| | | |\\/| | | | | | | '_ ` _ \\ / _` |/ _` |/ _ \\ \\ \\ /\\ / / _ \\| '__| |/ / __| | | | | | |_| | | | | | | | | (_| | (_| | __/ \\ V V / (_) | | | <\\__ \\_| |_| |_|\\__, | |_|_| |_| |_|\\__,_|\\__, |\\___| \\_/\\_/ \\___/|_| |_|\\_\\___(_) |___/ |___/ And of course you can overwrite the default command: docker run --rm ubuntu-figlet:v3 figlet another text Resulting in: _ _ _ _ __ _ _ __ ___ | |_| |__ ___ _ __ | |_ _____ _| |_ / _` | '_ \\ / _ \\| __| '_ \\ / _ \\ '__| | __/ _ \\ \\/ / __| | (_| | | | | (_) | |_| | | | __/ | | || __/> <| |_ \\__,_|_| |_|\\___/ \\__|_| |_|\\___|_| \\__\\___/_/\\_\\\\__| Two flavours of CMD You have seen in the output of docker inspect that docker translates the command (i.e. figlet \"my image works!\" ) into this: [\"/bin/sh\", \"-c\", \"figlet 'My image works!'\"] . The notation we used in the Dockerfile is the shell notation while the notation with the square brackets ( [] ) is the exec-notation . You can use both notations in your Dockerfile . Altough the shell notation is more readable, the exec notation is directly used by the image, and therefore less ambiguous. A Dockerfile with shell notation: FROM ubuntu:jammy-20230308 RUN apt-get update RUN apt-get install figlet CMD figlet My image works! A Dockerfile with exec notation: FROM ubuntu:jammy-20230308 RUN apt-get update RUN apt-get install figlet CMD [ \"/bin/sh\" , \"-c\" , \"figlet My image works!\" ] Exercise: Now push our created image (with a version tag) to docker hub. We will use it later for the singularity exercises . Answer docker tag ubuntu-figlet:v3 [ USER NAME ] /ubuntu-figlet:v3 docker push [ USER NAME ] /ubuntu-figlet:v3","title":"Using CMD"},{"location":"course_material/day1/dockerfiles/#build-an-image-for-your-own-script","text":"Often containers are built for a specific purpose. For example, you can use a container to ship all dependencies together with your developed set of scripts/programs. For that you will need to add your scripts to the container. That is quite easily done with the instruction COPY . However, in order to make your container more user-friendly, there are several additional instructions that can come in useful. We will treat the most frequently used ones below. Depending on your preference, either choose R or Python below. In the exercises will use a script called test_deseq2.R . This script will: Load the DESeq2 and optparse packages Load some additional packages to test their installations. We will use those packages later on in the course. 
Create and parse an option called --rows with optparse Create a dummy count matrix Run DESeq2 on the dummy count matrix Print the results to stdout You can download it here , or copy-paste it: test_deseq2.R #!/usr/bin/env Rscript # load packages required for this script write ( \"Loading packages required for this script\" , stderr ()) suppressPackageStartupMessages ({ library ( DESeq2 ) library ( optparse ) }) # load dependency packages for testing installations write ( \"Loading dependency packages for testing installations\" , stderr ()) suppressPackageStartupMessages ({ library ( apeglm ) library ( IHW ) library ( limma ) library ( data.table ) library ( ggplot2 ) library ( ggrepel ) library ( pheatmap ) library ( RColorBrewer ) library ( scales ) library ( stringr ) }) # parse options with optparse option_list <- list ( make_option ( c ( \"--rows\" ), type = \"integer\" , help = \"Number of rows in dummy matrix [default = %default]\" , default = 100 ) ) opt_parser <- OptionParser ( option_list = option_list , description = \"Runs DESeq2 on dummy data\" ) opt <- parse_args ( opt_parser ) # create a random dummy count matrix cnts <- matrix ( rnbinom ( n = opt $ row * 10 , mu = 100 , size = 1 / 0.5 ), ncol = 10 ) cond <- factor ( rep ( 1 : 2 , each = 5 )) # object construction dds <- DESeqDataSetFromMatrix ( cnts , DataFrame ( cond ), ~ cond ) # standard analysis dds <- DESeq ( dds ) res <- results ( dds ) # print results to stdout print ( res ) After you have downloaded it, make sure to set the permissions to executable: chmod +x test_deseq2.R It is a relatively simple script that runs DESeq2 on a dummy dataset. An example for execution would be: ./test_deseq2.R --rows 100 Here, --rows is a optional arguments that specifies the number of rows generated in the input count matrix. When running the script, it will return a bunch of messages and at the end an overview of differential gene expression analysis results: baseMean log2FoldChange lfcSE stat pvalue padj 1 66.1249 0.281757 0.727668 0.387206 0.698604 0.989804 2 76.9682 0.305763 0.619209 0.493796 0.621451 0.989804 3 64.7843 -0.694525 0.479445 -1.448603 0.147448 0.931561 4 123.0252 0.631247 0.688564 0.916758 0.359269 0.931561 5 93.2002 -0.453430 0.686043 -0.660936 0.508653 0.941951 ... ... ... ... ... ... ... 96 64.0177 0.757585137 0.682683 1.109718054 0.267121 0.931561 97 114.3689 -0.580010850 0.640313 -0.905823841 0.365029 0.931561 98 79.9620 0.000100617 0.612442 0.000164288 0.999869 0.999869 99 92.6614 0.563514308 0.716109 0.786910869 0.431334 0.939106 100 96.4410 -0.155268696 0.534400 -0.290547708 0.771397 0.989804 From the script you can see it has DESeq2 and optparse as dependencies. If we want to run the script inside a container, we would have to install them. We do this in the Dockerfile below. We give it the following instructions: use the r2u base image version jammy install the package DESeq2 , optparse and some additional packages we will need later on. We perform the installations with install2.r , which is a helper command that is present inside most rocker images. More info here . copy the script test_deseq2.R to /opt inside the container: FROM rocker/r2u:jammy RUN install2.r \\ DESeq2 \\ optparse \\ apeglm \\ IHW \\ limma \\ data.table \\ ggrepel \\ pheatmap \\ stringr COPY test_deseq2.R /opt Note In order to use COPY , the file that needs to be copied needs to be in the same directory as the Dockerfile or one of its subdirectories. R image stack The most used R image stack is from the rocker project . 
It contains many different base images (e.g. with shiny, Rstudio, tidyverse etc.). It depends on the type of image whether installations with apt-get or install2.r are possible. To understand more about how to install R packages in different containers, check it this cheat sheet , or visit rocker-project.org . Exercise: Download the test_deseq2.R and build the image with docker build . Name the image deseq2 . After that, start an interactive session and execute the script inside the container. Hint Make an interactive session with the options -i and -t and use /bin/bash as the command. Answer Build the container: x86_64 / AMD64 ARM64 (MacOS M1 chip) docker build -t deseq2 . docker build --platform amd64 -t deseq2 . Run the container: docker run -it --rm deseq2 /bin/bash Inside the container we look up the script: cd /opt ls This should return test_deseq2.R . Now you can execute it from inside the container: ./test_deseq2.R --rows 100 That\u2019s kind of nice. We can ship our R script inside our container. However, we don\u2019t want to run it interactively every time. So let\u2019s make some changes to make it easy to run it as an executable. For example, we can add /opt to the global $PATH variable with ENV . The $PATH variable The path variable is a special variable that consists of a list of path seperated by colons ( : ). These paths are searched if you are trying to run an executable. More info this topic at e.g. wikipedia . FROM rocker/r2u:jammy RUN install2.r \\ DESeq2 \\ optparse \\ apeglm \\ IHW \\ limma \\ data.table \\ ggrepel \\ pheatmap \\ stringr COPY test_deseq2.R /opt ENV PATH = /opt: $PATH Note The ENV instruction can be used to set any variable. Exercise : Rebuild the image and start an interactive bash session inside the new image. Is the path variable updated? (i.e. can we execute test_deseq2.R from anywhere?) Answer After re-building we start an interactive session: docker run -it --rm deseq2 /bin/bash The path is upated, /opt is appended to the beginning of the variable: echo $PATH returns: /opt:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin Now you can try to execute it from the root directory (or any other): test_deseq2.R Instead of starting an interactive session with /bin/bash we can now more easily run the script non-interactively: docker run --rm deseq2 test_deseq2.R --rows 100 Now it will directly print the output of test_deseq2.R to stdout. In the case you want to pack your script inside a container, you are building a container specifically for your script, meaning you almost want the container to behave as the program itself. In order to do that, you can use ENTRYPOINT . ENTRYPOINT is similar to CMD , but has two important differences: ENTRYPOINT can not be overwritten by the positional arguments (i.e. docker run image [CMD] ), but has to be overwritten by --entrypoint . The positional arguments (or CMD ) are pasted to the ENTRYPOINT command. This means that you can use ENTRYPOINT as the executable and the positional arguments (or CMD ) as the options. 
Let\u2019s try it out: FROM rocker/r2u:jammy RUN install2.r \\ DESeq2 \\ optparse \\ apeglm \\ IHW \\ limma \\ data.table \\ ggrepel \\ pheatmap \\ stringr COPY test_deseq2.R /opt ENV PATH = /opt: $PATH # note that if you want to be able to combine the two # both ENTRYPOINT and CMD need to written in the exec form ENTRYPOINT [ \"test_deseq2.R\" ] # default option (if positional arguments are not specified) CMD [ \"--rows\" , \"100\" ] Exercise : Re-build, and run the container non-interactively without any positional arguments. After that, try to pass a different number of rows to --rows . How do the commands look? Answer Just running the container non-interactively would be: docker run --rm deseq2 Passing a different argument (i.e. overwriting CMD ) would be: docker run --rm deseq2 --rows 200 Here, the container behaves as the executable itself to which you can pass arguments. Most containerized applications need multiple build steps. Often, you want to perform these steps and executions in a specific directory. Therefore, it can be in convenient to specify a working directory. You can do that with WORKDIR . This instruction will set the default directory for all other instructions (like RUN , COPY etc.). It will also change the directory in which you will land if you run the container interactively. FROM rocker/r2u:jammy RUN install2.r \\ DESeq2 \\ optparse \\ apeglm \\ IHW \\ limma \\ data.table \\ ggrepel \\ pheatmap \\ stringr WORKDIR /opt COPY test_deseq2.R . ENV PATH = /opt: $PATH # note that if you want to be able to combine the two # both ENTRYPOINT and CMD need to written in the exec form ENTRYPOINT [ \"test_deseq2.R\" ] # default option (if positional arguments are not specified) CMD [ \"--rows\" , \"100\" ] Exercise : build the image, and start the container interactively. Has the default directory changed? After that, push the image to dockerhub, so we can use it later with the singularity exercises. Note You can overwrite ENTRYPOINT with --entrypoint as an argument to docker run . Answer Running the container interactively would be: docker run -it --rm --entrypoint /bin/bash deseq2 Which should result in a terminal looking something like this: root@9a27da455fb1:/opt# Meaning that indeed the default directory has changed to /opt Pushing it to dockerhub: docker tag deseq2 [ USER NAME ] /deseq2:v1 docker push [ USER NAME ] /deseq2:v1","title":"Build an image for your own script"},{"location":"course_material/day1/dockerfiles/#get-information-on-your-image-with-docker-inspect","text":"We have used docker inspect already in the previous chapter to find the default Cmd of the ubuntu image. However we can get more info on the image: e.g. the entrypoint, environmental variables, cmd, workingdir etc., you can use the Config record from the output of docker inspect . 
For our image this looks like: \"Config\" : { \"Hostname\" : \"\" , \"Domainname\" : \"\" , \"User\" : \"\" , \"AttachStdin\" : false , \"AttachStdout\" : false , \"AttachStderr\" : false , \"Tty\" : false , \"OpenStdin\" : false , \"StdinOnce\" : false , \"Env\" : [ \"PATH=/opt:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\" , \"LC_ALL=en_US.UTF-8\" , \"LANG=en_US.UTF-8\" , \"DEBIAN_FRONTEND=noninteractive\" , \"TZ=UTC\" ], \"Cmd\" : [ \"--rows\" , \"100\" ], \"ArgsEscaped\" : true , \"Image\" : \"\" , \"Volumes\" : null , \"WorkingDir\" : \"/opt\" , \"Entrypoint\" : [ \"test_deseq2.R\" ], \"OnBuild\" : null , \"Labels\" : { \"maintainer\" : \"Dirk Eddelbuettel \" , \"org.label-schema.license\" : \"GPL-2.0\" , \"org.label-schema.vcs-url\" : \"https://github.com/rocker-org/\" , \"org.label-schema.vendor\" : \"Rocker Project\" } }","title":"Get information on your image with docker inspect"},{"location":"course_material/day1/dockerfiles/#adding-metadata-to-your-image","text":"You can annotate your Dockerfile and the image by using the instruction LABEL . You can give it any key and value with = . However, it is recommended to use the Open Container Initiative (OCI) keys . Exercise : Annotate our Dockerfile with the OCI keys on the creation date, author and description. After that, check whether this has been passed to the actual image with docker inspect . Note You can type LABEL for each key-value pair, but you can also have it on one line by seperating the key-value pairs by a space, e.g.: LABEL keyx = \"valuex\" keyy = \"valuey\" Answer The Dockerfile would look like: FROM rocker/r2u:jammy LABEL org.opencontainers.image.created = \"2023-04-12\" \\ org.opencontainers.image.authors = \"Geert van Geest\" \\ org.opencontainers.image.description = \"Container with DESeq2 and friends\" RUN install2.r \\ DESeq2 \\ optparse \\ apeglm \\ IHW \\ limma \\ data.table \\ ggrepel \\ pheatmap \\ stringr WORKDIR /opt COPY test_deseq2.R . ENV PATH = /opt: $PATH # note that if you want to be able to combine the two # both ENTRYPOINT and CMD need to written in the exec form ENTRYPOINT [ \"test_deseq2.R\" ] # default option (if positional arguments are not specified) CMD [ \"--rows\" , \"100\" ] The Config record in the output of docker inspect was updated with: \"Labels\" : { \"org.opencontainers.image.authors\" : \"Geert van Geest\" , \"org.opencontainers.image.created\" : \"2023-04-12\" , \"org.opencontainers.image.description\" : \"Container with DESeq2 and friends\" , \"org.opencontainers.image.licenses\" : \"GPL-2.0-or-later\" , \"org.opencontainers.image.source\" : \"https://github.com/rocker-org/rocker\" , \"org.opencontainers.image.vendor\" : \"Rocker Project\" }","title":"Adding metadata to your image"},{"location":"course_material/day1/dockerfiles/#building-an-image-with-a-browser-interface","text":"In this exercise, we will use a different base image ( rocker/rstudio:4 ), and we\u2019ll install the same packages. Rstudio server is a nice browser interface that you can use for a.o. programming in R. With the image we are creating we will be able to run Rstudio server inside a container. Check out the Dockerfile : FROM rocker/rstudio:4 RUN apt-get update && \\ apt-get install -y libz-dev RUN install2.r \\ optparse \\ BiocManager RUN R -q -e 'BiocManager::install(\"biomaRt\")' This will create an image from the existing rstudio image. It will also install libz-dev with apt-get , BiocManager with install2.r and DESeq2 with an R command. 
Although we\u2019re installing the same packages, the installation steps need to be different from the ones for the r2u image. This is because in the rocker/rstudio images R is installed from source, and therefore you can\u2019t install packages with apt-get . More information on how to install R packages in R containers can be found in this cheat sheet , or visit rocker-project.org . Installation will take a while The installation of CRAN packages will go relatively quickly, because they can use the binary packages supplied by Posit Public Package Manager . However, the installation of Bioconductor packages will take a while, because they need to be installed from source. If you don\u2019t have time, you can skip the biomaRt installation by removing the last line of the Dockerfile . Exercise: Build an image based on this Dockerfile and give it a meaningful name. Answer x86_64 / AMD64 ARM64 (MacOS M1 chip) docker build -t rstudio-server . docker build --platform amd64 -t rstudio-server . You can now run a container from the image. However, you will have to tell docker where to publish port 8787 from the docker container with -p [HOSTPORT:CONTAINERPORT] . We choose to publish it to the same port number: docker run --rm -it -p 8787 :8787 rstudio-server Networking More info on docker container networking here By running the above command, a container will be started exposing Rstudio server at port 8787 on localhost. You can access the Rstudio server instance by typing localhost:8787 in your browser. You will be asked for a password. You can find this password in the terminal from which you have started the container. We can make this even more interesting by mounting a local directory to the container running the Rstudio image: docker run \\ -it \\ --rm \\ -p 8787 :8787 \\ --mount type = bind,source = /Users/myusername/working_dir,target = /home/rstudio/working_dir \\ rstudio-server By doing this you have a completely isolated and shareable R environment running Rstudio server, but with your local files available to it. Pretty neat, right?","title":"Building an image with a browser interface"},{"location":"course_material/day1/introduction_containers/","text":"Learning outcomes After having completed this chapter you will be able to: Discriminate between an image and a container Run a docker container from dockerhub interactively Validate the available containers and their status Material General introduction: Download the presentation Introduction to containers: Download the presentation Exercises We recommend using a code editor like VScode or Sublime Text. If you don\u2019t know which one to choose, take VScode as we can provide most support for this editor. If working on Windows If you are working on Windows, it is easiest to work with WSL2 . With VScode use the WSL extension . Make sure you install the latest versions before you install docker. In principle, you can also use a native shell like PowerShell, but this might result in some issues with bind mounting directories. Work in projects We recommend working in a project folder. This will make it easier to find your files and to share them with others. You can create a project folder anywhere on your computer. For example, you can create a folder projects in your home directory and then create a subfolder docker-snakemake-course in it. You can then open this folder in VScode. Let\u2019s create our first container from an existing image. We do this with the image ubuntu , generating an environment with a minimal installation of ubuntu.
docker run -it ubuntu This will give you an interactive shell into the created container (this interactivity was invoked by the options -i and -t ) . Exercise: Check out the operating system of the container by typing cat /etc/os-release in the container\u2019s shell. Are we really in an ubuntu environment? Answer Yes: root@27f7d11608de:/# cat /etc/os-release NAME=\"Ubuntu\" VERSION=\"20.04.1 LTS (Focal Fossa)\" ID=ubuntu ID_LIKE=debian PRETTY_NAME=\"Ubuntu 20.04.1 LTS\" VERSION_ID=\"20.04\" HOME_URL=\"https://www.ubuntu.com/\" SUPPORT_URL=\"https://help.ubuntu.com/\" BUG_REPORT_URL=\"https://bugs.launchpad.net/ubuntu/\" PRIVACY_POLICY_URL=\"https://www.ubuntu.com/legal/terms-and-policies/privacy-policy\" VERSION_CODENAME=focal UBUNTU_CODENAME=focal Where does the image come from? If the image ubuntu was not on your computer yet, docker will search and try to get them from dockerhub , and download it. Exercise: Run the command whoami in the docker container. Who are you? Answer The command whoami returns the current user. In the container whoami will return root . This means you are the root user i.e. within the container you are admin and can basically change anything. Check out the container panel at the Docker dashboard (the Docker gui) or open another host terminal and type: docker container ls -a Exercise: What is the container status? Answer In Docker dashboard you can see that the shell is running: The output of docker container ls -a is: CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 27f7d11608de ubuntu \"/bin/bash\" 7 minutes ago Up 6 minutes great_moser Also showing you that the STATUS is Up . Now let\u2019s install some software in our ubuntu environment. We\u2019ll install some simple software called figlet . Type into the container shell: apt-get update apt-get install figlet This will give some warnings This installation will give some warnings. It\u2019s safe to ignore them. Now let\u2019s try it out. Type into the container shell: figlet 'SIB courses are great!' Now you have installed and used software figlet in an ubuntu environment (almost) completely separated from your host computer. This already gives you an idea of the power of containerization. Exit the shell by typing exit . Check out the container panel of Docker dashboard or type: docker container ls -a Exercise: What is the container status? Answer docker container ls -a gives: CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 27f7d11608de ubuntu \"/bin/bash\" 15 minutes ago Exited (0) 8 seconds ago great_moser Showing that the container has exited, meaning it\u2019s not running.","title":"Introduction to containers"},{"location":"course_material/day1/introduction_containers/#learning-outcomes","text":"After having completed this chapter you will be able to: Discriminate between an image and a container Run a docker container from dockerhub interactively Validate the available containers and their status","title":"Learning outcomes"},{"location":"course_material/day1/introduction_containers/#material","text":"General introduction: Download the presentation Introduction to containers: Download the presentation","title":"Material"},{"location":"course_material/day1/introduction_containers/#exercises","text":"We recommend using a code editor like VScode or Sublime text. If you don\u2019t know which one to chose, take VScode as we can provide most support for this editor. If working on Windows If you are working on Windows, it is easiest to work with WSL2 . With VScode use the WSL extension . 
Make sure you install the latest versions before you install docker. In principle, you can also use a native shell like PowerShell, but this might result into some issues with bind mounting directories. Work in projects We recommend to work in a project folder. This will make it easier to find your files and to share them with others. You can create a project folder anywhere on your computer. For example, you can create a folder projects in your home directory and then create a subfolder docker-snakemake-course in it. You can then open this folder in VScode. Let\u2019s create our first container from an existing image. We do this with the image ubuntu , generating an environment with a minimal installation of ubuntu. docker run -it ubuntu This will give you an interactive shell into the created container (this interactivity was invoked by the options -i and -t ) . Exercise: Check out the operating system of the container by typing cat /etc/os-release in the container\u2019s shell. Are we really in an ubuntu environment? Answer Yes: root@27f7d11608de:/# cat /etc/os-release NAME=\"Ubuntu\" VERSION=\"20.04.1 LTS (Focal Fossa)\" ID=ubuntu ID_LIKE=debian PRETTY_NAME=\"Ubuntu 20.04.1 LTS\" VERSION_ID=\"20.04\" HOME_URL=\"https://www.ubuntu.com/\" SUPPORT_URL=\"https://help.ubuntu.com/\" BUG_REPORT_URL=\"https://bugs.launchpad.net/ubuntu/\" PRIVACY_POLICY_URL=\"https://www.ubuntu.com/legal/terms-and-policies/privacy-policy\" VERSION_CODENAME=focal UBUNTU_CODENAME=focal Where does the image come from? If the image ubuntu was not on your computer yet, docker will search and try to get them from dockerhub , and download it. Exercise: Run the command whoami in the docker container. Who are you? Answer The command whoami returns the current user. In the container whoami will return root . This means you are the root user i.e. within the container you are admin and can basically change anything. Check out the container panel at the Docker dashboard (the Docker gui) or open another host terminal and type: docker container ls -a Exercise: What is the container status? Answer In Docker dashboard you can see that the shell is running: The output of docker container ls -a is: CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 27f7d11608de ubuntu \"/bin/bash\" 7 minutes ago Up 6 minutes great_moser Also showing you that the STATUS is Up . Now let\u2019s install some software in our ubuntu environment. We\u2019ll install some simple software called figlet . Type into the container shell: apt-get update apt-get install figlet This will give some warnings This installation will give some warnings. It\u2019s safe to ignore them. Now let\u2019s try it out. Type into the container shell: figlet 'SIB courses are great!' Now you have installed and used software figlet in an ubuntu environment (almost) completely separated from your host computer. This already gives you an idea of the power of containerization. Exit the shell by typing exit . Check out the container panel of Docker dashboard or type: docker container ls -a Exercise: What is the container status? 
Answer docker container ls -a gives: CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 27f7d11608de ubuntu \"/bin/bash\" 15 minutes ago Exited (0) 8 seconds ago great_moser Showing that the container has exited, meaning it\u2019s not running.","title":"Exercises"},{"location":"course_material/day1/managing_docker/","text":"Learning outcomes After having completed this chapter you will be able to: Explain the concept of layers in the context of docker containers and images Use the command line to restart and re-attach to an exited container Create a new image with docker commit List locally available images with docker image ls Run a command inside a container non-interactively Use docker image inspect to get more information on an image Use the command line to prune dangling images and stopped containers Rename and tag a docker image Push a newly created image to dockerhub Use the option --mount to bind mount a host directory to a container Material Download the presentation Overview of how docker works More on bind mounts Docker volumes in general Exercises Restarting an exited container If you would like to go back to your container with the figlet installation, you could try to run again: docker run -it ubuntu Exercise: Run the above command. Is your figlet installation still there? Why? Hint Check the status of your containers: docker container ls -a Answer No, the installation is gone. Another container was created from the same ubuntu image, without the figlet installation. Running the command docker container ls -a results in: CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 8d7c4c611b70 ubuntu \"/bin/bash\" About a minute ago Up About a minute kind_mendel 27f7d11608de ubuntu \"/bin/bash\" 27 minutes ago Exited (0) 2 minutes ago great_moser In this case the container great_moser contains the figlet installation. But we have exited that container. We created a new container ( kind_mendel in this case) with a fresh environment created from the original ubuntu image. To restart your first created container, you\u2019ll have to look up its name. You can find it in the Docker dashboard, or with docker container ls -a . Container names The container name is the funny combination of two words separated by _ , e.g.: nifty_sinoussi . Alternatively you can use the container ID (the first column of the output of docker container ls ) To restart a container you can use: docker start [ CONTAINER NAME ] And after that to re-attach to the shell: docker attach [ CONTAINER NAME ] And you\u2019re back in the container shell. Exercise: Run the docker start and docker attach commands for the container that is supposed to contain the figlet installation. Is the installation of figlet still there? Answer yes: figlet 'try some more text!' Should give you output. docker attach and docker exec In addition to docker attach , you can also \u201cre-attach\u201d a container with docker exec . However, these two are quite different. While docker attach gets you back to your stopped shell process, docker exec creates a new one (more information on stackoverflow ). The command docker exec enables you therefore to have multiple shells open in the same container. That can be convenient if you have one shell open with a program running in the foreground, and another one for e.g. monitoring. An example for using docker exec on a running container: docker exec -it [ CONTAINER NAME ] /bin/bash Note that docker exec requires a CMD, it doesn\u2019t use the default. 
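For example (a sketch; substitute a container name from your own docker container ls output), you can run a quick one-off command in a running container from a second terminal without attaching to it: docker exec [CONTAINER NAME] cat /etc/os-release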
Creating a new image You can store your changes and create a new image based on the ubuntu image like this: docker commit [ CONTAINER NAME ] ubuntu-figlet Exercise: Run the above command with the name of the container containing the figlet installation. Check out docker image ls . What have we just created? Answer A new image called ubuntu-figlet based on the status of the container. The output of docker image ls should look like: REPOSITORY TAG IMAGE ID CREATED SIZE ubuntu-figlet latest e08b999c7978 4 seconds ago 101MB ubuntu latest f63181f19b2f 29 hours ago 72.9MB Now you can generate a new container based on the new image: docker run -it ubuntu-figlet Exercise: Run the above command. Is the figlet installation in the created container? Answer yes Commands The second positional argument of docker run can be a command followed by its arguments. So, we could run a container non-interactively (without -it ), and just let it run a single command: docker run ubuntu-figlet figlet 'non-interactive run' Resulting in just the output of the figlet command. In the previous exercises we have run containers without a command as positional argument. This doesn\u2019t mean that no command has been run, because the container would do nothing without a command. The default command is stored in the image, and you can find it by docker image inspect [IMAGE NAME] . Exercise: Have a look at the output of docker image inspect , particularly at \"Config\" (ignore \"ContainerConfig\" for now). What is the default command ( CMD ) of the ubuntu image? Answer Running docker image inspect ubuntu gives (amongst other information): \"Cmd\" : [ \"/bin/bash\" ] , In the case of the ubuntu the default command is bash , returning a shell in bash (i.e. Bourne again shell ). Adding the options -i and -t ( -it ) to your docker run command will therefore result in an interactive bash shell. You can modify this default behaviour. More on that later, when we will work on Dockerfiles . The difference between Config and ContainerConfig The configuration at Config represents the image, the configuration at ContainerConfig the last step during the build of the image, i.e. the last layer. More info e.g. at this post at stackoverflow . Removing containers In the meantime, with every call of docker run we have created a new container (check your containers with docker container ls -a ). You probably don\u2019t want to remove those one-by-one. These two commands are very useful to clean up your Docker cache: docker container prune : removes stopped containers docker image prune : removes dangling images (i.e. images without a name) So, remove your stopped containers with: docker container prune Unless you\u2019re developing further on a container, or you\u2019re using it for an analysis, you probably want to get rid of it once you have exited the container. You can do this with adding --rm to your docker run command, e.g.: docker run --rm ubuntu-figlet figlet 'non-interactive run' Pushing to dockerhub Now that we have created our first own docker image, we can store it and share it with the world on docker hub. Before we get there, we first have to (re)name and tag it. Before pushing an image to dockerhub, docker has to know to which user and which repository the image should be added. That information should be in the name of the image, like this: user/imagename . We can rename an image with docker tag (which is a bit of misleading name for the command). 
So we could push to dockerhub like this: docker tag ubuntu-figlet [USER NAME]/ubuntu-figlet docker push [USER NAME]/ubuntu-figlet If on Linux If you are on Linux and haven\u2019t connected to docker hub before, you will have login first. To do that, run: docker login How docker makes money All images pushed to dockerhub are open to the world. With a free account you can have one image on dockerhub that is private. Paid accounts can have more private images, and are therefore popular for commercial organisations. As an alternative to dockerhub, you can store images locally with docker save . We didn\u2019t specify the tag for our new image. That\u2019s why docker tag gave it the default tag called latest . Pushing an image without a tag will overwrite the current image with the tag latest (more on (not) using latest here ). If you want to maintain multiple versions of your image, you will have to add a tag, and push the image with that tag to dockerhub: docker tag ubuntu-figlet [USER NAME]/ubuntu-figlet:v1 docker push [USER NAME]/ubuntu-figlet:v1 Mounting a directory For many analyses you do calculations with files or scripts that are on your host (local) computer. But how do you make them available to a docker container? You can do that in several ways, but here we will use bind-mount. You can bind-mount a directory with -v ( --volume ) or --mount . Most old-school docker users will use -v , but --mount syntax is easier to understand and now recommended, so we will use the latter here: docker run \\ --mount type = bind,source = /host/source/path,target = /path/in/container \\ [ IMAGE ] The target directory will be created if it does not yet exist. The source directory should exist. MobaXterm users You can specify your local path with the Windows syntax (e.g. C:\\Users\\myusername ). However, you will have to use forward slashes ( / ) instead of backward slashes ( \\ ). Therefore, mounting a directory would look like: docker run \\ --mount type = bind,source = C:/Users/myusername,target = /path/in/container \\ [ IMAGE ] Do not use autocompletion or variable substitution (e.g. $PWD ) in MobaXterm, since these point to \u2018emulated\u2019 paths, and are not passed properly to the docker command. Using docker from Windows PowerShell Most of the syntax for docker is the same for both PowerShell and UNIX-based systems. However, there are some differences, e.g. in Windows, directories in file paths are separated by \\ instead of / . Also, line breaks are not escaped by \\ but by `. Exercise: Mount a host (local) directory to a target directory /working_dir in a container created from the ubuntu-figlet image and run it interactively. Check whether the target directory has been created. Answer e.g. on Mac OS this would be: docker run \\ -it \\ --mount type = bind,source = /Users/myusername/working_dir,target = /working_dir/ \\ ubuntu-figlet This creates a directory called working_dir in the root directory ( / ): root@8d80a8698865:/# ls bin dev home lib32 libx32 mnt proc run srv tmp var boot etc lib lib64 media opt root sbin sys usr working_dir This mounted directory is both available for the host (locally) and for the container. You can therefore e.g. copy files in there, and write output generated by the container. Exercise: Write the output of figlet \"testing mounted dir\" to a file in /working_dir . Check whether it is available on the host (locally) in the source directory. 
Hint You can write the output of figlet to a file like this: figlet 'some string' > file.txt Answer root@8d80a8698865:/# figlet 'testing mounted dir' > /working_dir/figlet_output.txt This should create a file in both your host (local) source directory and the target directory in the container called figlet_output.txt . Using files on the host This of course also works the other way around. If you would have a file on the host with e.g. a text, you can copy it into your mounted directory, and it will be available to the container. Managing permissions (extra) Depending on your system, the user ID and group ID will be taken over from the user inside the container. If the user inside the container is root, this will be root. That\u2019s a bit inconvenient if you just want to run the container as a regular user (for example in certain circumstances your container could write in / ). To do that, use the -u option, and specify the group ID and user ID like this: docker run -u [ uid ] : [ gid ] So, e.g.: docker run \\ -it \\ -u 1000 :1000 \\ --mount type = bind,source = /Users/myusername/working_dir,target = /working_dir/ \\ ubuntu-figlet If you want docker to take over your current uid and gid, you can use: docker run -u \"$(id -u):$(id -g)\" This behaviour is different on MacOS and MobaXterm On MacOS and in the local shell of MobaXterm the uid and gid are taken over from the user running the container (even if you set -u as 0:0), i.e. your current ID. More info on stackoverflow . Exercise: Start an interactive container based on the ubuntu-figlet image, bind-mount a local directory and take over your current uid and gid . Write the output of a figlet command to a file in the mounted directory. Who and which group owns the file inside the container? And outside the container? Answer the same question but now run the container without setting -u . Answer Linux MacOS MobaXterm Running ubuntu-figlet interactively while taking over uid and gid and mounting my current directory: docker run -it --mount type = bind,source = $PWD ,target = /data -u \" $( id -u ) : $( id -g ) \" ubuntu-figlet Inside container: I have no name!@e808d7c36e7c:/$ id uid=1000 gid=1000 groups=1000 So, I have taken over uid 1000 and gid 1000. I have no name!@e808d7c36e7c:/$ cd /data I have no name!@e808d7c36e7c:/data$ figlet 'uid set' > uid_set.txt I have no name!@e808d7c36e7c:/data$ ls -lh -rw-r--r-- 1 1000 1000 0 Mar 400 13:37 uid_set.txt So the file belongs to user 1000, and group 1000. Outside container: ubuntu@ip-172-31-33-21:~$ ls -lh -rw-r--r-- 1 ubuntu ubuntu 400 Mar 5 13:37 uid_set.txt Which makes sense: ubuntu@ip-172-31-33-21:~$ id uid=1000(ubuntu) gid=1000(ubuntu) groups=1000(ubuntu) Running ubuntu-figlet interactively without taking over uid and gid : docker run -it --mount type = bind,source = $PWD ,target = /data ubuntu-figlet Inside container: root@fface8afb220:/# id uid=0(root) gid=0(root) groups=0(root) So, uid and gid are root . root@fface8afb220:/# cd /data root@fface8afb220:/data# figlet 'uid unset' > uid_unset.txt root@fface8afb220:/data# ls -lh -rw-r--r-- 1 1000 1000 400 Mar 5 13:37 uid_set.txt -rw-r--r-- 1 root root 400 Mar 5 13:40 uid_unset.txt Outside container: ubuntu@ip-172-31-33-21:~$ ls -lh -rw-r--r-- 1 ubuntu ubuntu 0 Mar 5 13:37 uid_set.txt -rw-r--r-- 1 root root 0 Mar 5 13:40 uid_unset.txt So, the uid and gid 0 (root:root) are taken over. 
Running ubuntu-figlet interactively while taking over uid and gid and mounting my current directory: docker run -it --mount type = bind,source = $PWD ,target = /data -u \" $( id -u ) : $( id -g ) \" ubuntu-figlet Inside container: I have no name!@e808d7c36e7c:/$ id uid=503 gid=20(dialout) groups=20(dialout) So, the container has taken over uid 503 and group 20 I have no name!@e808d7c36e7c:/$ cd /data I have no name!@e808d7c36e7c:/data$ figlet 'uid set' > uid_set.txt I have no name!@e808d7c36e7c:/data$ ls -lh -rw-r--r-- 1 503 dialout 400 Mar 5 13:11 uid_set.txt So the file belongs to user 503, and the group dialout . Outside container: mac-34392:~ geertvangeest$ ls -lh -rw-r--r-- 1 geertvangeest staff 400B Mar 5 14:11 uid_set.txt Which are the same as inside the container: mac-34392:~ geertvangeest$ echo \"$(id -u):$(id -g)\" 503:20 The uid 503 was nameless in the docker container. However the group 20 already existed in the ubuntu container, and was named dialout . Running ubuntu-figlet interactively without taking over uid and gid : docker run -it --mount type = bind,source = $PWD ,target = /data ubuntu-figlet Inside container: root@fface8afb220:/# id uid=0(root) gid=0(root) groups=0(root) So, inside the container I am root . Creating new files will lead to ownership of root inside the container: root@fface8afb220:/# cd /data root@fface8afb220:/data# figlet 'uid unset' > uid_unset.txt root@fface8afb220:/data# ls -lh -rw-r--r-- 1 503 dialout 400 Mar 5 13:11 uid_set.txt -rw-r--r-- 1 root root 400 Mar 5 13:25 uid_unset.txt Outside container: mac-34392:~ geertvangeest$ ls -lh -rw-r--r-- 1 geertvangeest staff 400B Mar 5 14:11 uid_set.txt -rw-r--r-- 1 geertvangeest staff 400B Mar 5 14:15 uid_unset.txt So, the uid and gid 0 (root:root) are not taken over. Instead, the uid and gid of the user running docker were used. Running ubuntu-figlet interactively while taking over uid and gid and mounting to a specfied directory: docker run -it --mount type = bind,source = C:/Users/geert/data,target = /data -u \" $( id -u ) : $( id -g ) \" ubuntu-figlet Inside container: I have no name!@e808d7c36e7c:/$ id uid=1003 gid=513 groups=513 So, the container has taken over uid 1003 and group 513 I have no name!@e808d7c36e7c:/$ cd /data I have no name!@e808d7c36e7c:/data$ figlet 'uid set' > uid_set.txt I have no name!@e808d7c36e7c:/data$ ls -lh -rw-r--r-- 1 1003 513 400 Mar 5 13:11 uid_set.txt So the file belongs to user 1003, and the group 513. Outside container: /home/mobaxterm/data$ ls -lh -rwx------ 1 geert UserGrp 400 Mar 5 14:11 uid_set.txt Which are the same as inside the container: /home/mobaxterm/data$ echo \"$(id -u):$(id -g)\" 1003:513 Running ubuntu-figlet interactively without taking over uid and gid : docker run -it --mount type = bind,source = C:/Users/geert/data,target = /data ubuntu-figlet Inside container: root@fface8afb220:/# id uid=0(root) gid=0(root) groups=0(root) So, inside the container I am root . Creating new files will lead to ownership of root inside the container: root@fface8afb220:/# cd /data root@fface8afb220:/data# figlet 'uid unset' > uid_unset.txt root@fface8afb220:/data# ls -lh -rw-r--r-- 1 1003 503 400 Mar 5 13:11 uid_set.txt -rw-r--r-- 1 root root 400 Mar 5 13:25 uid_unset.txt Outside container: /home/mobaxterm/data$ ls -lh -rwx------ 1 geert UserGrp 400 Mar 5 14:11 uid_set.txt -rwx------ 1 geert UserGrp 400 Mar 5 14:15 uid_unset.txt So, the uid and gid 0 (root:root) are not taken over. 
Instead, the uid and gid of the user running docker were used.","title":"Managing containers and images"},{"location":"course_material/day1/managing_docker/#learning-outcomes","text":"After having completed this chapter you will be able to: Explain the concept of layers in the context of docker containers and images Use the command line to restart and re-attach to an exited container Create a new image with docker commit List locally available images with docker image ls Run a command inside a container non-interactively Use docker image inspect to get more information on an image Use the command line to prune dangling images and stopped containers Rename and tag a docker image Push a newly created image to dockerhub Use the option --mount to bind mount a host directory to a container","title":"Learning outcomes"},{"location":"course_material/day1/managing_docker/#material","text":"Download the presentation Overview of how docker works More on bind mounts Docker volumes in general","title":"Material"},{"location":"course_material/day1/managing_docker/#exercises","text":"","title":"Exercises"},{"location":"course_material/day1/managing_docker/#restarting-an-exited-container","text":"If you would like to go back to your container with the figlet installation, you could try to run again: docker run -it ubuntu Exercise: Run the above command. Is your figlet installation still there? Why? Hint Check the status of your containers: docker container ls -a Answer No, the installation is gone. Another container was created from the same ubuntu image, without the figlet installation. Running the command docker container ls -a results in: CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 8d7c4c611b70 ubuntu \"/bin/bash\" About a minute ago Up About a minute kind_mendel 27f7d11608de ubuntu \"/bin/bash\" 27 minutes ago Exited (0) 2 minutes ago great_moser In this case the container great_moser contains the figlet installation. But we have exited that container. We created a new container ( kind_mendel in this case) with a fresh environment created from the original ubuntu image. To restart your first created container, you\u2019ll have to look up its name. You can find it in the Docker dashboard, or with docker container ls -a . Container names The container name is the funny combination of two words separated by _ , e.g.: nifty_sinoussi . Alternatively you can use the container ID (the first column of the output of docker container ls ) To restart a container you can use: docker start [ CONTAINER NAME ] And after that to re-attach to the shell: docker attach [ CONTAINER NAME ] And you\u2019re back in the container shell. Exercise: Run the docker start and docker attach commands for the container that is supposed to contain the figlet installation. Is the installation of figlet still there? Answer yes: figlet 'try some more text!' Should give you output. docker attach and docker exec In addition to docker attach , you can also \u201cre-attach\u201d a container with docker exec . However, these two are quite different. While docker attach gets you back to your stopped shell process, docker exec creates a new one (more information on stackoverflow ). The command docker exec enables you therefore to have multiple shells open in the same container. That can be convenient if you have one shell open with a program running in the foreground, and another one for e.g. monitoring. 
An example for using docker exec on a running container: docker exec -it [ CONTAINER NAME ] /bin/bash Note that docker exec requires a CMD, it doesn\u2019t use the default.","title":"Restarting an exited container"},{"location":"course_material/day1/managing_docker/#creating-a-new-image","text":"You can store your changes and create a new image based on the ubuntu image like this: docker commit [ CONTAINER NAME ] ubuntu-figlet Exercise: Run the above command with the name of the container containing the figlet installation. Check out docker image ls . What have we just created? Answer A new image called ubuntu-figlet based on the status of the container. The output of docker image ls should look like: REPOSITORY TAG IMAGE ID CREATED SIZE ubuntu-figlet latest e08b999c7978 4 seconds ago 101MB ubuntu latest f63181f19b2f 29 hours ago 72.9MB Now you can generate a new container based on the new image: docker run -it ubuntu-figlet Exercise: Run the above command. Is the figlet installation in the created container? Answer yes","title":"Creating a new image"},{"location":"course_material/day1/managing_docker/#commands","text":"The second positional argument of docker run can be a command followed by its arguments. So, we could run a container non-interactively (without -it ), and just let it run a single command: docker run ubuntu-figlet figlet 'non-interactive run' Resulting in just the output of the figlet command. In the previous exercises we have run containers without a command as positional argument. This doesn\u2019t mean that no command has been run, because the container would do nothing without a command. The default command is stored in the image, and you can find it by docker image inspect [IMAGE NAME] . Exercise: Have a look at the output of docker image inspect , particularly at \"Config\" (ignore \"ContainerConfig\" for now). What is the default command ( CMD ) of the ubuntu image? Answer Running docker image inspect ubuntu gives (amongst other information): \"Cmd\" : [ \"/bin/bash\" ] , In the case of the ubuntu the default command is bash , returning a shell in bash (i.e. Bourne again shell ). Adding the options -i and -t ( -it ) to your docker run command will therefore result in an interactive bash shell. You can modify this default behaviour. More on that later, when we will work on Dockerfiles . The difference between Config and ContainerConfig The configuration at Config represents the image, the configuration at ContainerConfig the last step during the build of the image, i.e. the last layer. More info e.g. at this post at stackoverflow .","title":"Commands"},{"location":"course_material/day1/managing_docker/#removing-containers","text":"In the meantime, with every call of docker run we have created a new container (check your containers with docker container ls -a ). You probably don\u2019t want to remove those one-by-one. These two commands are very useful to clean up your Docker cache: docker container prune : removes stopped containers docker image prune : removes dangling images (i.e. images without a name) So, remove your stopped containers with: docker container prune Unless you\u2019re developing further on a container, or you\u2019re using it for an analysis, you probably want to get rid of it once you have exited the container. 
You can do this with adding --rm to your docker run command, e.g.: docker run --rm ubuntu-figlet figlet 'non-interactive run'","title":"Removing containers"},{"location":"course_material/day1/managing_docker/#pushing-to-dockerhub","text":"Now that we have created our first own docker image, we can store it and share it with the world on docker hub. Before we get there, we first have to (re)name and tag it. Before pushing an image to dockerhub, docker has to know to which user and which repository the image should be added. That information should be in the name of the image, like this: user/imagename . We can rename an image with docker tag (which is a bit of misleading name for the command). So we could push to dockerhub like this: docker tag ubuntu-figlet [USER NAME]/ubuntu-figlet docker push [USER NAME]/ubuntu-figlet If on Linux If you are on Linux and haven\u2019t connected to docker hub before, you will have login first. To do that, run: docker login How docker makes money All images pushed to dockerhub are open to the world. With a free account you can have one image on dockerhub that is private. Paid accounts can have more private images, and are therefore popular for commercial organisations. As an alternative to dockerhub, you can store images locally with docker save . We didn\u2019t specify the tag for our new image. That\u2019s why docker tag gave it the default tag called latest . Pushing an image without a tag will overwrite the current image with the tag latest (more on (not) using latest here ). If you want to maintain multiple versions of your image, you will have to add a tag, and push the image with that tag to dockerhub: docker tag ubuntu-figlet [USER NAME]/ubuntu-figlet:v1 docker push [USER NAME]/ubuntu-figlet:v1","title":"Pushing to dockerhub"},{"location":"course_material/day1/managing_docker/#mounting-a-directory","text":"For many analyses you do calculations with files or scripts that are on your host (local) computer. But how do you make them available to a docker container? You can do that in several ways, but here we will use bind-mount. You can bind-mount a directory with -v ( --volume ) or --mount . Most old-school docker users will use -v , but --mount syntax is easier to understand and now recommended, so we will use the latter here: docker run \\ --mount type = bind,source = /host/source/path,target = /path/in/container \\ [ IMAGE ] The target directory will be created if it does not yet exist. The source directory should exist. MobaXterm users You can specify your local path with the Windows syntax (e.g. C:\\Users\\myusername ). However, you will have to use forward slashes ( / ) instead of backward slashes ( \\ ). Therefore, mounting a directory would look like: docker run \\ --mount type = bind,source = C:/Users/myusername,target = /path/in/container \\ [ IMAGE ] Do not use autocompletion or variable substitution (e.g. $PWD ) in MobaXterm, since these point to \u2018emulated\u2019 paths, and are not passed properly to the docker command. Using docker from Windows PowerShell Most of the syntax for docker is the same for both PowerShell and UNIX-based systems. However, there are some differences, e.g. in Windows, directories in file paths are separated by \\ instead of / . Also, line breaks are not escaped by \\ but by `. Exercise: Mount a host (local) directory to a target directory /working_dir in a container created from the ubuntu-figlet image and run it interactively. Check whether the target directory has been created. Answer e.g. 
on Mac OS this would be: docker run \\ -it \\ --mount type = bind,source = /Users/myusername/working_dir,target = /working_dir/ \\ ubuntu-figlet This creates a directory called working_dir in the root directory ( / ): root@8d80a8698865:/# ls bin dev home lib32 libx32 mnt proc run srv tmp var boot etc lib lib64 media opt root sbin sys usr working_dir This mounted directory is both available for the host (locally) and for the container. You can therefore e.g. copy files in there, and write output generated by the container. Exercise: Write the output of figlet \"testing mounted dir\" to a file in /working_dir . Check whether it is available on the host (locally) in the source directory. Hint You can write the output of figlet to a file like this: figlet 'some string' > file.txt Answer root@8d80a8698865:/# figlet 'testing mounted dir' > /working_dir/figlet_output.txt This should create a file in both your host (local) source directory and the target directory in the container called figlet_output.txt . Using files on the host This of course also works the other way around. If you would have a file on the host with e.g. a text, you can copy it into your mounted directory, and it will be available to the container.","title":"Mounting a directory"},{"location":"course_material/day1/managing_docker/#managing-permissions-extra","text":"Depending on your system, the user ID and group ID will be taken over from the user inside the container. If the user inside the container is root, this will be root. That\u2019s a bit inconvenient if you just want to run the container as a regular user (for example in certain circumstances your container could write in / ). To do that, use the -u option, and specify the group ID and user ID like this: docker run -u [ uid ] : [ gid ] So, e.g.: docker run \\ -it \\ -u 1000 :1000 \\ --mount type = bind,source = /Users/myusername/working_dir,target = /working_dir/ \\ ubuntu-figlet If you want docker to take over your current uid and gid, you can use: docker run -u \"$(id -u):$(id -g)\" This behaviour is different on MacOS and MobaXterm On MacOS and in the local shell of MobaXterm the uid and gid are taken over from the user running the container (even if you set -u as 0:0), i.e. your current ID. More info on stackoverflow . Exercise: Start an interactive container based on the ubuntu-figlet image, bind-mount a local directory and take over your current uid and gid . Write the output of a figlet command to a file in the mounted directory. Who and which group owns the file inside the container? And outside the container? Answer the same question but now run the container without setting -u . Answer Linux MacOS MobaXterm Running ubuntu-figlet interactively while taking over uid and gid and mounting my current directory: docker run -it --mount type = bind,source = $PWD ,target = /data -u \" $( id -u ) : $( id -g ) \" ubuntu-figlet Inside container: I have no name!@e808d7c36e7c:/$ id uid=1000 gid=1000 groups=1000 So, I have taken over uid 1000 and gid 1000. I have no name!@e808d7c36e7c:/$ cd /data I have no name!@e808d7c36e7c:/data$ figlet 'uid set' > uid_set.txt I have no name!@e808d7c36e7c:/data$ ls -lh -rw-r--r-- 1 1000 1000 0 Mar 400 13:37 uid_set.txt So the file belongs to user 1000, and group 1000. 
Outside container: ubuntu@ip-172-31-33-21:~$ ls -lh -rw-r--r-- 1 ubuntu ubuntu 400 Mar 5 13:37 uid_set.txt Which makes sense: ubuntu@ip-172-31-33-21:~$ id uid=1000(ubuntu) gid=1000(ubuntu) groups=1000(ubuntu) Running ubuntu-figlet interactively without taking over uid and gid : docker run -it --mount type = bind,source = $PWD ,target = /data ubuntu-figlet Inside container: root@fface8afb220:/# id uid=0(root) gid=0(root) groups=0(root) So, uid and gid are root . root@fface8afb220:/# cd /data root@fface8afb220:/data# figlet 'uid unset' > uid_unset.txt root@fface8afb220:/data# ls -lh -rw-r--r-- 1 1000 1000 400 Mar 5 13:37 uid_set.txt -rw-r--r-- 1 root root 400 Mar 5 13:40 uid_unset.txt Outside container: ubuntu@ip-172-31-33-21:~$ ls -lh -rw-r--r-- 1 ubuntu ubuntu 0 Mar 5 13:37 uid_set.txt -rw-r--r-- 1 root root 0 Mar 5 13:40 uid_unset.txt So, the uid and gid 0 (root:root) are taken over. Running ubuntu-figlet interactively while taking over uid and gid and mounting my current directory: docker run -it --mount type = bind,source = $PWD ,target = /data -u \" $( id -u ) : $( id -g ) \" ubuntu-figlet Inside container: I have no name!@e808d7c36e7c:/$ id uid=503 gid=20(dialout) groups=20(dialout) So, the container has taken over uid 503 and group 20 I have no name!@e808d7c36e7c:/$ cd /data I have no name!@e808d7c36e7c:/data$ figlet 'uid set' > uid_set.txt I have no name!@e808d7c36e7c:/data$ ls -lh -rw-r--r-- 1 503 dialout 400 Mar 5 13:11 uid_set.txt So the file belongs to user 503, and the group dialout . Outside container: mac-34392:~ geertvangeest$ ls -lh -rw-r--r-- 1 geertvangeest staff 400B Mar 5 14:11 uid_set.txt Which are the same as inside the container: mac-34392:~ geertvangeest$ echo \"$(id -u):$(id -g)\" 503:20 The uid 503 was nameless in the docker container. However the group 20 already existed in the ubuntu container, and was named dialout . Running ubuntu-figlet interactively without taking over uid and gid : docker run -it --mount type = bind,source = $PWD ,target = /data ubuntu-figlet Inside container: root@fface8afb220:/# id uid=0(root) gid=0(root) groups=0(root) So, inside the container I am root . Creating new files will lead to ownership of root inside the container: root@fface8afb220:/# cd /data root@fface8afb220:/data# figlet 'uid unset' > uid_unset.txt root@fface8afb220:/data# ls -lh -rw-r--r-- 1 503 dialout 400 Mar 5 13:11 uid_set.txt -rw-r--r-- 1 root root 400 Mar 5 13:25 uid_unset.txt Outside container: mac-34392:~ geertvangeest$ ls -lh -rw-r--r-- 1 geertvangeest staff 400B Mar 5 14:11 uid_set.txt -rw-r--r-- 1 geertvangeest staff 400B Mar 5 14:15 uid_unset.txt So, the uid and gid 0 (root:root) are not taken over. Instead, the uid and gid of the user running docker were used. Running ubuntu-figlet interactively while taking over uid and gid and mounting to a specfied directory: docker run -it --mount type = bind,source = C:/Users/geert/data,target = /data -u \" $( id -u ) : $( id -g ) \" ubuntu-figlet Inside container: I have no name!@e808d7c36e7c:/$ id uid=1003 gid=513 groups=513 So, the container has taken over uid 1003 and group 513 I have no name!@e808d7c36e7c:/$ cd /data I have no name!@e808d7c36e7c:/data$ figlet 'uid set' > uid_set.txt I have no name!@e808d7c36e7c:/data$ ls -lh -rw-r--r-- 1 1003 513 400 Mar 5 13:11 uid_set.txt So the file belongs to user 1003, and the group 513. 
Outside container: /home/mobaxterm/data$ ls -lh -rwx------ 1 geert UserGrp 400 Mar 5 14:11 uid_set.txt Which are the same as inside the container: /home/mobaxterm/data$ echo \"$(id -u):$(id -g)\" 1003:513 Running ubuntu-figlet interactively without taking over uid and gid : docker run -it --mount type = bind,source = C:/Users/geert/data,target = /data ubuntu-figlet Inside container: root@fface8afb220:/# id uid=0(root) gid=0(root) groups=0(root) So, inside the container I am root . Creating new files will lead to ownership of root inside the container: root@fface8afb220:/# cd /data root@fface8afb220:/data# figlet 'uid unset' > uid_unset.txt root@fface8afb220:/data# ls -lh -rw-r--r-- 1 1003 503 400 Mar 5 13:11 uid_set.txt -rw-r--r-- 1 root root 400 Mar 5 13:25 uid_unset.txt Outside container: /home/mobaxterm/data$ ls -lh -rwx------ 1 geert UserGrp 400 Mar 5 14:11 uid_set.txt -rwx------ 1 geert UserGrp 400 Mar 5 14:15 uid_unset.txt So, the uid and gid 0 (root:root) are not taken over. Instead, the uid and gid of the user running docker were used.","title":"Managing permissions (extra)"},{"location":"course_material/day1/singularity/","text":"Learning outcomes After having completed this chapter you will be able to: Login to a remote machine with ssh Use apptainer pull to convert an image from dockerhub to the \u2018apptainer image format\u2019 ( .sif ) Execute a apptainer container Explain the difference in default mounting behaviour between docker and apptainer Use apptainer shell to generate an interactive shell inside a .sif image Search and use images with both docker and apptainer from bioconda Material Download the presentation Apptainer documentation Apptainer hub An article on Docker vs Apptainer Using conda and containers with snakemake Exercises Login to remote If you are enrolled in the course, you have received an e-mail with an IP, username, private key and password. To do the Apptainer exercises we will login to a remote server. Below you can find instructions on how to login. VScode is a code editor that can be used to edit files and run commands locally, but also on a remote server. In this subchapter we will set up VScode to work remotely. If not working with VScode If you are not working with VScode, you can login to the remote server with the following command: ssh -i key_username.pem If you want to edit files directly on the server, you can mount a directory with sshfs . Required installations For this exercise it is easiest if you use VScode . In addition you would need to have followed the instructions to set up remote-ssh: OpenSSH compatible client . This is usually pre-installed on your OS. You can check whether the command ssh exists. The Remote-SSH extension. To install, open VSCode and click on the extensions icon (four squares) on the left side of the window. Search for Remote-SSH and click on Install . Windows mac OS/Linux Open a PowerShell and cd to the directory where you have stored your private key. After that, move it to ~\\.ssh : mv .\\ key_username . pem ~\\. ssh Open a terminal, and cd to the directory where you have stored your private key. After that, change the file permissions of the key and move it to ~/.ssh : chmod 400 key_username.pem mv key_username.pem ~/.ssh Open VScode and click on the green or blue button in the bottom left corner. Select Connect to Host... , and then on Configure SSH Host... . Specify a the location for the config file. Use the same directory as where your keys are stored (so ~/.ssh ). A skeleton config file will be provided. 
Edit it, so it looks like this (replace username with your username, and specify the correct IP at HostName ): Windows MacOS/Linux Host sib_course_remote User username HostName 123.456.789.123 IdentityFile ~\\.ssh\\key_username.pem Host sib_course_remote User username HostName 123.456.789.123 IdentityFile ~/.ssh/key_username.pem Save and close the config file. Now click again the green or blue button in the bottom left corner. Select Connect to Host... , and then on sib_course_remote . You will be asked which operating system is used on the remote. Specify \u2018Linux\u2019. Pulling an image Apptainer can take several image formats (e.g. a docker image), and convert them into it\u2019s own .sif format. Unlike docker this image doesn\u2019t live in a local image cache, but it\u2019s stored as an actual file. Exercise: On the remote server, pull the docker image that has the adjusted default CMD that we have pushed to dockerhub in this exercise ( ubuntu-figlet-df:v3 ) with apptainer pull . The syntax is: apptainer pull docker:// [ USER NAME ] / [ IMAGE NAME ] : [ TAG ] Answer apptainer pull docker:// [ USER NAME ] /ubuntu-figlet:v3 This will result in a file called ubuntu-figlet_v3.sif Note If you weren\u2019t able to push the image in the previous exercises to your docker hub, you can use geertvangeest as username to pull the image. Executing an image These .sif files can be run as standalone executables: ./ubuntu-figlet_v3.sif Note This is shorthand for: apptainer run ubuntu-figlet_v3.sif And you can overwrite the default command like this: apptainer run [ IMAGE NAME ] .sif [ COMMAND ] Note In this case, you can also use ./ [ IMAGE NAME ] .sif [ COMMAND ] However, most applications require apptainer run . Especially if you want to provide options like --bind (for mounting directories). Exercise: Run the .sif file without a command, and with a command that runs figlet . Do you get expected output? Do the same for the R image you\u2019ve created in the previous chapter. Entrypoint and apptainer The daterange image has an entrypoint set, and apptainer run does not overwrite it. In order to ignore both the entrypoint and cmd use apptainer exec . Answer Running it without a command ( ./ubuntu-figlet_v3.sif ) should give: __ __ _ _ _ | \\/ |_ _ (_)_ __ ___ __ _ __ _ ___ __ _____ _ __| | _____| | | |\\/| | | | | | | '_ ` _ \\ / _` |/ _` |/ _ \\ \\ \\ /\\ / / _ \\| '__| |/ / __| | | | | | |_| | | | | | | | | (_| | (_| | __/ \\ V V / (_) | | | <\\__ \\_| |_| |_|\\__, | |_|_| |_| |_|\\__,_|\\__, |\\___| \\_/\\_/ \\___/|_| |_|\\_\\___(_) |___/ |___/ Which is the default command that we changed in the Dockerfile . Running with a another figlet command: ./ubuntu-figlet_v3.sif figlet 'Something else' Should give: ____ _ _ _ _ / ___| ___ _ __ ___ ___| |_| |__ (_)_ __ __ _ ___| |___ ___ \\___ \\ / _ \\| '_ ` _ \\ / _ \\ __| '_ \\| | '_ \\ / _` | / _ \\ / __|/ _ \\ ___) | (_) | | | | | | __/ |_| | | | | | | | (_| | | __/ \\__ \\ __/ |____/ \\___/|_| |_| |_|\\___|\\__|_| |_|_|_| |_|\\__, | \\___|_|___/\\___| |___/ Pulling the deseq2 image: apptainer pull docker:// [ USER NAME ] /deseq2:v1 Running it without command: ./deseq2.sif Running with a command: ./deseq2.sif --rows 100 To overwrite both entrypoint and the command: apptainer exec deseq2.sif test_deseq2.R --rows 200 Mounting with Apptainer Apptainer is also different from Docker in the way it handles mounting. By default, Apptainer binds your home directory and a number of paths in the root directory to the container. 
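A quick way to see this default binding in action is to list your home directory from inside the container (a sketch, assuming the ubuntu-figlet_v3.sif file pulled earlier):
apptainer exec ubuntu-figlet_v3.sif ls -lh "$HOME"   # with the default home bind, this lists the files of your host home directory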
This results in behaviour that is almost as if you were working on the directory structure of the host. If your directory is not mounted by default It depends on the apptainer settings whether most directories are mounted by default to the container. If your directory is not mounted, you can mount it with the --bind option of apptainer exec : apptainer exec --bind /my/dir/to/mount/ [ IMAGE NAME ] .sif [ COMMAND ] Running the command pwd (full name of current working directory) will therefore result in a path on the host machine: ./ubuntu-figlet_v3.sif pwd Exercise: Run the above command. What is the output? What would the output look like if you ran a similar command with Docker? Hint A similar Docker command would look like (run this on your local computer): docker run --rm ubuntu-figlet:v3 pwd Answer The output of ./ubuntu-figlet_v3.sif pwd is the current directory on the host: i.e. /home/username if you run it from your home directory. The output of docker run --rm ubuntu-figlet:v3 pwd (on the local host) would be / , which is the default workdir (root directory) of the container. As we did not mount any host directory, this directory exists only within the container (i.e. separated from the host). Interactive shell If you want to debug or inspect an image, it can be helpful to have a shell inside the container. You can do that with apptainer shell : apptainer shell ubuntu-figlet_v3.sif Note To exit the shell type exit . Exercise: Can you run figlet inside this shell? Answer Yes: Apptainer> figlet test _ _ | |_ ___ ___| |_ | __/ _ \\/ __| __| | || __/\\__ \\ |_ \\__\\___||___/\\__| During the lecture you have learned that apptainer takes over the user privileges of the user on the host. You can get user information with commands like whoami , id , groups , etc. Exercise: Run the figlet container interactively. Do you have the same user privileges as if you were on the host? How is that with docker ? Answer A command like whoami will result in your username being printed to stdout: Apptainer> whoami myusername Apptainer> id uid=1030(myusername) gid=1031(myusername) groups=1031(myusername),1001(condausers) Apptainer> groups myusername condausers With apptainer, you have the same privileges inside the apptainer container as on the host. If you do this in the docker container (based on the same image), you\u2019ll get output like this: root@a3d6e59dc19d:/# whoami root root@a3d6e59dc19d:/# groups root root@a3d6e59dc19d:/# id uid=0(root) gid=0(root) groups=0(root) A bioinformatics example (extra) All bioconda packages also have a pre-built container. Have a look at the bioconda website , and search for fastqc . In the search results, click on the appropriate record (i.e. package \u2018fastqc\u2019). Now, scroll down and find the namespace and tag for the latest fastqc image. Now we can pull it with apptainer like this: apptainer pull docker://quay.io/biocontainers/fastqc:0.11.9--hdfd78af_1 Let\u2019s test the image. Download some sample reads first: mkdir reads cd reads wget https://introduction-containers.s3.eu-central-1.amazonaws.com/ecoli_reads.tar.gz tar -xzvf ecoli_reads.tar.gz rm ecoli_reads.tar.gz Now you can simply run the image as an executable, preceding the commands you would like to run within the container. E.g. running fastqc would look like: cd ./fastqc_0.11.9--hdfd78af_1.sif fastqc ./reads/ecoli_*.fastq.gz This will result in HTML files in the directory ./reads . These are quality reports for the sequence reads. If you\u2019d like to view them, you can download them with scp or e.g.
FileZilla , and view them with your local browser.","title":"Running containers with singularity"},{"location":"course_material/day1/singularity/#learning-outcomes","text":"After having completed this chapter you will be able to: Login to a remote machine with ssh Use apptainer pull to convert an image from dockerhub to the \u2018apptainer image format\u2019 ( .sif ) Execute a apptainer container Explain the difference in default mounting behaviour between docker and apptainer Use apptainer shell to generate an interactive shell inside a .sif image Search and use images with both docker and apptainer from bioconda","title":"Learning outcomes"},{"location":"course_material/day1/singularity/#material","text":"Download the presentation Apptainer documentation Apptainer hub An article on Docker vs Apptainer Using conda and containers with snakemake","title":"Material"},{"location":"course_material/day1/singularity/#exercises","text":"","title":"Exercises"},{"location":"course_material/day1/singularity/#login-to-remote","text":"If you are enrolled in the course, you have received an e-mail with an IP, username, private key and password. To do the Apptainer exercises we will login to a remote server. Below you can find instructions on how to login. VScode is a code editor that can be used to edit files and run commands locally, but also on a remote server. In this subchapter we will set up VScode to work remotely. If not working with VScode If you are not working with VScode, you can login to the remote server with the following command: ssh -i key_username.pem If you want to edit files directly on the server, you can mount a directory with sshfs . Required installations For this exercise it is easiest if you use VScode . In addition you would need to have followed the instructions to set up remote-ssh: OpenSSH compatible client . This is usually pre-installed on your OS. You can check whether the command ssh exists. The Remote-SSH extension. To install, open VSCode and click on the extensions icon (four squares) on the left side of the window. Search for Remote-SSH and click on Install . Windows mac OS/Linux Open a PowerShell and cd to the directory where you have stored your private key. After that, move it to ~\\.ssh : mv .\\ key_username . pem ~\\. ssh Open a terminal, and cd to the directory where you have stored your private key. After that, change the file permissions of the key and move it to ~/.ssh : chmod 400 key_username.pem mv key_username.pem ~/.ssh Open VScode and click on the green or blue button in the bottom left corner. Select Connect to Host... , and then on Configure SSH Host... . Specify a the location for the config file. Use the same directory as where your keys are stored (so ~/.ssh ). A skeleton config file will be provided. Edit it, so it looks like this (replace username with your username, and specify the correct IP at HostName ): Windows MacOS/Linux Host sib_course_remote User username HostName 123.456.789.123 IdentityFile ~\\.ssh\\key_username.pem Host sib_course_remote User username HostName 123.456.789.123 IdentityFile ~/.ssh/key_username.pem Save and close the config file. Now click again the green or blue button in the bottom left corner. Select Connect to Host... , and then on sib_course_remote . You will be asked which operating system is used on the remote. Specify \u2018Linux\u2019.","title":"Login to remote"},{"location":"course_material/day1/singularity/#pulling-an-image","text":"Apptainer can take several image formats (e.g. 
a docker image), and convert them into it\u2019s own .sif format. Unlike docker this image doesn\u2019t live in a local image cache, but it\u2019s stored as an actual file. Exercise: On the remote server, pull the docker image that has the adjusted default CMD that we have pushed to dockerhub in this exercise ( ubuntu-figlet-df:v3 ) with apptainer pull . The syntax is: apptainer pull docker:// [ USER NAME ] / [ IMAGE NAME ] : [ TAG ] Answer apptainer pull docker:// [ USER NAME ] /ubuntu-figlet:v3 This will result in a file called ubuntu-figlet_v3.sif Note If you weren\u2019t able to push the image in the previous exercises to your docker hub, you can use geertvangeest as username to pull the image.","title":"Pulling an image"},{"location":"course_material/day1/singularity/#executing-an-image","text":"These .sif files can be run as standalone executables: ./ubuntu-figlet_v3.sif Note This is shorthand for: apptainer run ubuntu-figlet_v3.sif And you can overwrite the default command like this: apptainer run [ IMAGE NAME ] .sif [ COMMAND ] Note In this case, you can also use ./ [ IMAGE NAME ] .sif [ COMMAND ] However, most applications require apptainer run . Especially if you want to provide options like --bind (for mounting directories). Exercise: Run the .sif file without a command, and with a command that runs figlet . Do you get expected output? Do the same for the R image you\u2019ve created in the previous chapter. Entrypoint and apptainer The daterange image has an entrypoint set, and apptainer run does not overwrite it. In order to ignore both the entrypoint and cmd use apptainer exec . Answer Running it without a command ( ./ubuntu-figlet_v3.sif ) should give: __ __ _ _ _ | \\/ |_ _ (_)_ __ ___ __ _ __ _ ___ __ _____ _ __| | _____| | | |\\/| | | | | | | '_ ` _ \\ / _` |/ _` |/ _ \\ \\ \\ /\\ / / _ \\| '__| |/ / __| | | | | | |_| | | | | | | | | (_| | (_| | __/ \\ V V / (_) | | | <\\__ \\_| |_| |_|\\__, | |_|_| |_| |_|\\__,_|\\__, |\\___| \\_/\\_/ \\___/|_| |_|\\_\\___(_) |___/ |___/ Which is the default command that we changed in the Dockerfile . Running with a another figlet command: ./ubuntu-figlet_v3.sif figlet 'Something else' Should give: ____ _ _ _ _ / ___| ___ _ __ ___ ___| |_| |__ (_)_ __ __ _ ___| |___ ___ \\___ \\ / _ \\| '_ ` _ \\ / _ \\ __| '_ \\| | '_ \\ / _` | / _ \\ / __|/ _ \\ ___) | (_) | | | | | | __/ |_| | | | | | | | (_| | | __/ \\__ \\ __/ |____/ \\___/|_| |_| |_|\\___|\\__|_| |_|_|_| |_|\\__, | \\___|_|___/\\___| |___/ Pulling the deseq2 image: apptainer pull docker:// [ USER NAME ] /deseq2:v1 Running it without command: ./deseq2.sif Running with a command: ./deseq2.sif --rows 100 To overwrite both entrypoint and the command: apptainer exec deseq2.sif test_deseq2.R --rows 200","title":"Executing an image"},{"location":"course_material/day1/singularity/#mounting-with-apptainer","text":"Apptainer is also different from Docker in the way it handles mounting. By default, Apptainer binds your home directory and a number of paths in the root directory to the container. This results in behaviour that is almost like if you are working on the directory structure of the host. If your directory is not mounted by default It depends on the apptainer settings whether most directories are mounted by default to the container. 
If your directory is not mounted, you can do that with the --bind option of apptainer exec : apptainer exec --bind /my/dir/to/mount/ [ IMAGE NAME ] .sif [ COMMAND ] Running the command pwd (full name of current working directory) will therefore result in a path on the host machine: ./ubuntu-figlet_v3.sif pwd Exercise: Run the above command. What is the output? How would the output look like if you would run a similar command with Docker? Hint A similar Docker command would look like (run this on your local computer): docker run --rm ubuntu-figlet:v3 pwd Answer The output of ./ubuntu-figlet_v3.sif pwd is the current directory on the host: i.e. /home/username if you have it in your home directory. The output of docker run --rm ubuntu-figlet:v3 pwd (on the local host) would be / , which is the default workdir (root directory) of the container. As we did not mount any host directory, this directory exists only within the container (i.e. separated from the host).","title":"Mounting with Apptainer"},{"location":"course_material/day1/singularity/#interactive-shell","text":"If you want to debug or inspect an image, it can be helpful to have a shell inside the container. You can do that with apptainer shell : apptainer shell ubuntu-figlet_v3.sif Note To exit the shell type exit . Exercise: Can you run figlet inside this shell? Answer Yes: Apptainer> figlet test _ _ | |_ ___ ___| |_ | __/ _ \\/ __| __| | || __/\\__ \\ |_ \\__\\___||___/\\__| During the lecture you have learned that apptainer takes over the user privileges of the user on the host. You can get user information with command like whoami , id , groups etc. Exercise: Run the figlet container interactively. Do you have the same user privileges as if you were on the host? How is that with docker ? Answer A command like whoami will result in your username printed at stdout: Apptainer> whoami myusername Apptainer> id uid=1030(myusername) gid=1031(myusername) groups=1031(myusername),1001(condausers) Apptainer> groups myusername condausers With apptainer, you have the same privileges inside the apptainer container as on the host. If you do this in the docker container (based on the same image), you\u2019ll get output like this: root@a3d6e59dc19d:/# whoami root root@a3d6e59dc19d:/# groups root root@a3d6e59dc19d:/# id uid=0(root) gid=0(root) groups=0(root)","title":"Interactive shell"},{"location":"course_material/day1/singularity/#a-bioinformatics-example-extra","text":"All bioconda packages also have a pre-built container. Have a look at the bioconda website , and search for fastqc . In the search results, click on the appropriate record (i.e. package \u2018fastqc\u2019). Now, scroll down and find the namespace and tag for the latest fastqc image. Now we can pull it with apptainer like this: apptainer pull docker://quay.io/biocontainers/fastqc:0.11.9--hdfd78af_1 Let\u2019s test the image. Download some sample reads first: mkdir reads cd reads wget https://introduction-containers.s3.eu-central-1.amazonaws.com/ecoli_reads.tar.gz tar -xzvf ecoli_reads.tar.gz rm ecoli_reads.tar.gz Now you can simply run the image as an executable preceding the commands you would like to run within the container. E.g. running fastqc would look like: cd ./fastqc_0.11.9--hdfd78af_1.sif fastqc ./reads/ecoli_*.fastq.gz This will result in html files in the directory ./reads . These are quality reports for the sequence reads. If you\u2019d like to view them, you can download them with scp or e.g. 
FileZilla , and view them with your local browser.","title":"A bioinformatics example (extra)"},{"location":"course_material/day2/1_guidelines/","text":"Workshop goal Over the course of the workshop, you will implement and improve a workflow to trim bulk RNAseq reads, align them on a genome, perform some quality checks (QC), count mapped reads, and identify Differentially Expressed Genes (DEG). The goal of the workshop is that after the last series of exercises, you will have implemented a simple workflow with commonly used Snakemake features. You will be able to use this workflow as a reference to implement your own workflows in the future. Software All the software needed in this workflow is either: Already installed in the snake_course conda environment Already installed in a Docker container Will be installed via a conda environment during today\u2019s exercises Exercises Each series of exercises is divided in multiple questions. We first provide a general explanation on the context behind each question; we then explicitly describe the task and provide details when they are required. We also provide hints that should help you with the most challenging parts of some questions. You should first try to solve the problems without using these hints! Do not hesitate to modify and overwrite your code from previous questions when specified in an exercise, as the solutions for each series of exercises are provided. If something is not clear at any point, please call us and we will do our best to answer your questions. You can also check the official Snakemake documentation for more information.","title":"General guidelines"},{"location":"course_material/day2/1_guidelines/#workshop-goal","text":"Over the course of the workshop, you will implement and improve a workflow to trim bulk RNAseq reads, align them on a genome, perform some quality checks (QC), count mapped reads, and identify Differentially Expressed Genes (DEG). The goal of the workshop is that after the last series of exercises, you will have implemented a simple workflow with commonly used Snakemake features. You will be able to use this workflow as a reference to implement your own workflows in the future.","title":"Workshop goal"},{"location":"course_material/day2/1_guidelines/#software","text":"All the software needed in this workflow is either: Already installed in the snake_course conda environment Already installed in a Docker container Will be installed via a conda environment during today\u2019s exercises","title":"Software"},{"location":"course_material/day2/1_guidelines/#exercises","text":"Each series of exercises is divided in multiple questions. We first provide a general explanation on the context behind each question; we then explicitly describe the task and provide details when they are required. We also provide hints that should help you with the most challenging parts of some questions. You should first try to solve the problems without using these hints! Do not hesitate to modify and overwrite your code from previous questions when specified in an exercise, as the solutions for each series of exercises are provided. If something is not clear at any point, please call us and we will do our best to answer your questions. 
You can also check the official Snakemake documentation for more information.","title":"Exercises"},{"location":"course_material/day2/2_introduction_snakemake/","text":"Learning outcomes After having completed this chapter you will be able to: Understand the structure of a Snakemake workflow Write rules and Snakefiles to produce the desired outputs Chain rules together Run a Snakemake workflow Structuring a workflow It is advised to implement your code in a directory called workflow (more information about this in the next series of exercises). You are free to choose the names and location of files for the different steps of your workflow, but, for now, we recommend that you at least group all outputs from the workflow in a results folder within the workflow directory. A small reminder about conda environment If you try to run a command and get an error such as Command 'snakemake' not found , you are probably not in the right environment. To list them, use conda env list . Then activate the right environment with conda activate . You can deactivate an environment with conda deactivate . To list the packages installed in an environment, activate it and use conda list . Exercises This series of exercises will bear no biological meaning, on purpose: it is designed to show you the fundamentals of Snakemake. Creating a basic rule Rules are the basic blocks of a Snakemake workflow. A rule is like a recipe indicating how to produce a specific output . The actual application of a rule to create an output is called a job . A rule is defined in a Snakefile with the keyword rule and contains directives which indicate the rule\u2019s properties. To create the simplest rule possible, we need at least two directives : output : path of the output file for this rule shell : shell commands to execute in order to generate the output We will see other directives later in the course. Exercise: The following example shows the minimal syntax to implement a rule. What do you think it does? Does it create a file? If so, how is it called? rule first_step : output : 'results/first_step.txt' shell : 'echo \u201csnakemake\u201d > results/first_step.txt' Answer This rule uses the echo shell command to print the line \u201csnakemake\u201d in an output file called first_step.txt , located in the results folder. Rules are defined and written in a file called Snakefile (note the capital S and the absence of extension in the filename). This file should be located at the root of the workflow directory (here, workflow/Snakefile ). Exercise: Create a Snakefile and copy the previous rule in it. Because the Snakemake language is built on top of Python, spaces and indents are essential, so do not forget to keep the indentation as is and use space characters in the indents instead of tabs. Executing a workflow with a precise output It is now time to execute your first worklow! To do this, you need to tell Snakemake what is your target, i.e. what is the output that you want to generate. Exercise: Execute the workflow with snakemake --cores 1 . What value should you use for ? Once Snakemake execution is finished, can you locate the output file? Answer Execute the workflow: snakemake --cores 1 results/first_step.txt Visualise the content of the results folder: ls -alh results/ Check the output content: cat results/first_step.txt Note that during the execution of the workflow, Snakemake automatically created the missing folder ( results/ ) in the output path. 
If several folders are missing (for example, test1/test2/test3/first_step.txt ), Snakemake will create all of them . Exercise: Re-run the exact same command. What happens? Answer Nothing! We get a message saying that Snakemake did not run anything: ``` Building DAG of jobs... Nothing to be done (all requested files are present and up to date). ``` By default, Snakemake only runs a job if: * A target file explicitly requested in the `snakemake` command is missing * An intermediate file is missing and is required to produce a target file * It notices input files newer than output files, based on file modification dates. In this case, Snakemake will regenerate the existing outputs. We can change this behaviour and force the re-run of a specific target by using the `-f` option: `snakemake --cores 1 -f results/first_step.txt` or force the recreation of ALL the outputs of the workflow using the `-F` option: `snakemake --cores 1 -F`. In practice, we can also alter Snakemake's (re-)run policy, but we will not cover this topic in the course (see [--rerun-triggers option](https://snakemake.readthedocs.io/en/stable/executing/cli.html) in Snakemake's CLI help and [this git issue](https://github.com/snakemake/snakemake/issues/1694) for more information). In the previous example, the values of the two rule directives are strings . For the shell directive (we will see other types of directive values later in the course), long strings can be written on multiple lines for clarity, simply using a set of quotes for each line: rule first_step : output : 'results/first_step.txt' shell : 'echo \"I want to print a very very very very very very ' 'very very very very long string in my output\" > results/first_step.txt' Here, Snakemake will simply concatenate the two lines (paste each line one after the other) and execute the resulting command. Adding an input directive The next directive used by most rules is input . It indicates the path to a file that is required by the rule to generate the output. In the following example, we modified the previous rule to use the previously created file results/first_step.txt as an input, and copy this file to results/second_step.txt : rule second_step : input : 'results/first_step.txt' output : 'results/second_step.txt' shell : 'cp results/first_step.txt results/second_step.txt' Note that with this rule definition, Snakemake will not run if results/first_step.txt does not exist! Exercise: Modify your first rule to add an input directive and execute the workflow. Check that the output was created and that the files are identical. If you get a Missing input files for rule error, that means that the input file is missing and cannot be created. How can you solve this problem? Answer Execute the workflow: snakemake --cores 1 results/second_step.txt Visualise the content of the results folder: ls -alh results/ Check that the files are identical: diff results/first_step.txt results/second_step.txt If the input file is missing, you can create it with echo \u201csnakemake\u201d > results/first_step.txt and then execute the workflow. We will see later why this happened and how to avoid it! Creating a workflow with several rules Creating one Snakefile per rule does not seem like a good solution, so let\u2019s try to improve this. Exercise: Delete the results/ folder, copy the two previous rules ( first_step and second_step ) into the same Snakefile (place the first_step rule first) and try to run the workflow without specifying an output . What happens?
Answer * Delete the `results` folder: using the graphic interface or `rm -rf results/` * Execute the workflow without output: `snakemake --cores 1` Only the first output, `results/first_step.txt`, is created. During its execution, Snakemake tries to generate a specific output called **target** and resolve all dependencies based on this target. A target can be any output that can be generated by any rule in the workflow. When you do not specify a target, the default one is the output of the first rule in the Snakefile, here `results/first_step.txt` of `rule first_step`. Exercise: With this in mind, instead of one target, use a space-separated list of targets in your command, to generate multiple targets. Use the -F to force the re-run of the whole workflow or delete your results/ folder beforehand. Answer * Delete the `results` folder: using the graphic interface or `rm -rf results/` * Execute the workflow with multiple targets: `snakemake --cores 1 results/first_step.txt results/second_step.txt` We should now see Snakemake execute the 2 rules and produce both targets/outputs. Chaining rules Once again, writing all the outputs in the snakemake command does not look like a good solution: it is very time-consuming, error-prone (and annoying)! Imagine what happens when your workflow generate tens of outputs?! Fortunately, there is a way to simplify this, which relies on rules dependency. The core principle of Snakemake\u2019s execution is to compute a Directed Acyclic Graph (DAG) that summarizes dependencies between all the inputs and outputs required to generate the final desired outputs. For each job, starting from the jobs generating the final outputs, Snakemake checks if the required inputs exist. If they do not, the software looks for a rule that generates these inputs. This process is repeated until all dependencies are resolved. This is why Snakemake is said to have a \u2018bottom-up\u2019 approach: it starts from the last outputs and go back to the first inputs. Hint Your Snakefile should look like this: rule first_step : output : 'results/first_step.txt' shell : 'echo \u201csnakemake\u201d > results/first_step.txt' rule second_step : input : 'results/first_step.txt' output : 'results/second_step.txt' shell : 'cp results/first_step.txt results/second_step.txt' Exercise: Delete the results/ folder, identify your final output(s) and execute the workflow specifying only this(these) output(s) in the command. Answer Delete the results folder: using the graphic interface or rm -rf results/ Execute the workflow: snakemake --cores 1 results/second_step.txt Visualise the content of the results folder: ls -alh results/ We should now see Snakemake executing the 2 rules and producing both outputs. To generate the output results/second_step.txt , Snakemake requires the input results/first_step.txt . Before the workflow is executed, this file does not exist, therefore, Snakemake looks for a rule that generates results/first_step.txt , in this case the rule first_step . The process is then repeated for first_step . In this case, the rule does not require any input, so all dependencies are resolved, and Snakemake can generate the DAG. Important notes on rules dependency Rules must produce unique outputs Because of the rules dependency process, by default, an output can only be generated by a single rule. Otherwise, Snakemake cannot decide which rule to use to generate this output, and the rules are considered ambiguous . 
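As an illustration (the two rule names below are invented for this sketch and are not part of the exercises), these rules are ambiguous because they declare the same output file; requesting results/first_step.txt would make Snakemake stop with an error about ambiguous rules while building the DAG, because it cannot decide which recipe to use:

```
rule first_step_echo:
    output: 'results/first_step.txt'
    shell: 'echo snakemake > results/first_step.txt'

rule first_step_touch:
    output: 'results/first_step.txt'
    shell: 'touch results/first_step.txt'
```

Keeping a single rule per output file, or renaming one of the outputs, avoids the problem.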
In practice, there are ways to deal with ambiguous rules, but they should be avoided as much as possible and we will not cover them in this course (see the relevant section in the official documentation for more information). Rules dependency can be written more easily It is possible to refer to the output of a rule directly in another rule with the syntax rules..output . Note that you don\u2019t need quotes around this statement, because it is a Snakemake object. The following example implements this syntax for the two rule defined above: rule first_step : output : 'results/first_step.txt' shell : 'echo \u201csnakemake\u201d > results/first_step.txt' rule second_step : input : rules . first_step . output output : 'results/second_step.txt' shell : 'cp results/first_step.txt results/second_step.txt' This method has several advantages, among which: It limits the risk of error because we do not have to write the same filename at several locations A change in output name will be automatically propagated to rules that depend on it, i.e. the name only has to be changed once This makes the code much clearer and easier to understand: with this syntax, we instantly know the object type ( rule ), how/where it is created ( first_step ), and what it is ( output )","title":"Introduction to Snakemake"},{"location":"course_material/day2/2_introduction_snakemake/#learning-outcomes","text":"After having completed this chapter you will be able to: Understand the structure of a Snakemake workflow Write rules and Snakefiles to produce the desired outputs Chain rules together Run a Snakemake workflow","title":"Learning outcomes"},{"location":"course_material/day2/2_introduction_snakemake/#structuring-a-workflow","text":"It is advised to implement your code in a directory called workflow (more information about this in the next series of exercises). You are free to choose the names and location of files for the different steps of your workflow, but, for now, we recommend that you at least group all outputs from the workflow in a results folder within the workflow directory. A small reminder about conda environment If you try to run a command and get an error such as Command 'snakemake' not found , you are probably not in the right environment. To list them, use conda env list . Then activate the right environment with conda activate . You can deactivate an environment with conda deactivate . To list the packages installed in an environment, activate it and use conda list .","title":"Structuring a workflow"},{"location":"course_material/day2/2_introduction_snakemake/#exercises","text":"This series of exercises will bear no biological meaning, on purpose: it is designed to show you the fundamentals of Snakemake.","title":"Exercises"},{"location":"course_material/day2/2_introduction_snakemake/#creating-a-basic-rule","text":"Rules are the basic blocks of a Snakemake workflow. A rule is like a recipe indicating how to produce a specific output . The actual application of a rule to create an output is called a job . A rule is defined in a Snakefile with the keyword rule and contains directives which indicate the rule\u2019s properties. To create the simplest rule possible, we need at least two directives : output : path of the output file for this rule shell : shell commands to execute in order to generate the output We will see other directives later in the course. Exercise: The following example shows the minimal syntax to implement a rule. What do you think it does? Does it create a file? If so, how is it called? 
rule first_step : output : 'results/first_step.txt' shell : 'echo \u201csnakemake\u201d > results/first_step.txt' Answer This rule uses the echo shell command to print the line \u201csnakemake\u201d in an output file called first_step.txt , located in the results folder. Rules are defined and written in a file called Snakefile (note the capital S and the absence of extension in the filename). This file should be located at the root of the workflow directory (here, workflow/Snakefile ). Exercise: Create a Snakefile and copy the previous rule in it. Because the Snakemake language is built on top of Python, spaces and indents are essential, so do not forget to keep the indentation as is and use space characters in the indents instead of tabs.","title":"Creating a basic rule"},{"location":"course_material/day2/2_introduction_snakemake/#executing-a-workflow-with-a-precise-output","text":"It is now time to execute your first worklow! To do this, you need to tell Snakemake what is your target, i.e. what is the output that you want to generate. Exercise: Execute the workflow with snakemake --cores 1 . What value should you use for ? Once Snakemake execution is finished, can you locate the output file? Answer Execute the workflow: snakemake --cores 1 results/first_step.txt Visualise the content of the results folder: ls -alh results/ Check the output content: cat results/first_step.txt Note that during the execution of the workflow, Snakemake automatically created the missing folder ( results/ ) in the output path. If several folders are missing (for example, here, test1/test2/test3/first_step.txt ), Snakemake will create all of them . Exercise: Re-run the exact same command. What happens? Answer Nothing! We get a message saying that Snakemake did not run anything: ``` Building DAG of jobs... Nothing to be done (all requested files are present and up to date). ``` By default, Snakemake only runs a job if: * A target file explicitly requested in the `snakemake` command is missing * An intermediate file is missing and is required produce a target file * It notices input files newer than output files, based on file modification dates. In this case, Snakemake will generate again the existing outputs. We can change this behaviour and force the re-run of a specific target by using the `-f` option: `snakemake --cores 1 -f results/first_step.txt` or force recreate ALL the outputs of the workflow using the `-F` option: `snakemake --cores 1 -F`. In practice, we can also alter Snakemake (re-)run policy, but we will not cover this topic in the course (see [--rerun-triggers option](https://snakemake.readthedocs.io/en/stable/executing/cli.html) in Snakemake's CLI help and [this git issue](https://github.com/snakemake/snakemake/issues/1694) for more information). In the previous example, the values of the two rule directives are strings . 
For the shell directive (we will see other types of directive values later in the course), long string can be written on multiple lines for clarity, simply using a set of quotes for each line: rule first_step : output : 'results/first_step.txt' shell : 'echo \"I want to print a very very very very very very ' 'very very very very long string in my output\" > results/first_step.txt' Here, Snakemake will simply concatenate the two lines (paste each line one after the other) and execute the resulting command.","title":"Executing a workflow with a precise output"},{"location":"course_material/day2/2_introduction_snakemake/#adding-an-input-directive","text":"The next directive used by most rules is input . It indicates the path to a file that is required by the rule to generate the output. In the following example, we modified the previous rule to use the file previously created results/first_step.tsv as an input, and copy this file to results/second_step.txt : rule second_step : input : 'results/first_step.txt' output : 'results/second_step.txt' shell : 'cp results/first_step.txt results/second_step.txt' Note that with this rule definition, Snakemake will not run if results/first_step.tsv does not exist! Exercise: Modify your first rule to add an input directive and execute the workflow. Check that the output was created and that the files are identical. If you get a Missing input files for rule error, that means that the input file is missing and cannot be created. How can you solve this problem? Answer Execute the workflow: snakemake --cores 1 results/second_step.txt Visualise the content of the results folder: ls -alh results/ Check that the files are identical: diff results/first_step.txt results/second_step.txt If the input file is missing, you can create it with echo \u201csnakemake\u201d > results/first_step.txt and then execute the workflow. We will see later why this happened and how to avoid it!","title":"Adding an input directive"},{"location":"course_material/day2/2_introduction_snakemake/#creating-a-workflow-with-several-rules","text":"Creating one Snakefile per rule does not seem like a good solution, so let\u2019s try to improve this. Exercise: Delete the results/ folder, copy the two previous rules ( first_step and second_step ) in the same Snakefile (place the first_step rule first) and try to run the workflow without specifying an output . What happens? Answer * Delete the `results` folder: using the graphic interface or `rm -rf results/` * Execute the workflow without output: `snakemake --cores 1` Only the first output, `results/first_step.txt`, is created. During its execution, Snakemake tries to generate a specific output called **target** and resolve all dependencies based on this target. A target can be any output that can be generated by any rule in the workflow. When you do not specify a target, the default one is the output of the first rule in the Snakefile, here `results/first_step.txt` of `rule first_step`. Exercise: With this in mind, instead of one target, use a space-separated list of targets in your command, to generate multiple targets. Use the -F to force the re-run of the whole workflow or delete your results/ folder beforehand. 
Answer * Delete the `results` folder: using the graphic interface or `rm -rf results/` * Execute the workflow with multiple targets: `snakemake --cores 1 results/first_step.txt results/second_step.txt` We should now see Snakemake execute the 2 rules and produce both targets/outputs.","title":"Creating a workflow with several rules"},{"location":"course_material/day2/2_introduction_snakemake/#chaining-rules","text":"Once again, writing all the outputs in the snakemake command does not look like a good solution: it is very time-consuming, error-prone (and annoying)! Imagine what happens when your workflow generate tens of outputs?! Fortunately, there is a way to simplify this, which relies on rules dependency. The core principle of Snakemake\u2019s execution is to compute a Directed Acyclic Graph (DAG) that summarizes dependencies between all the inputs and outputs required to generate the final desired outputs. For each job, starting from the jobs generating the final outputs, Snakemake checks if the required inputs exist. If they do not, the software looks for a rule that generates these inputs. This process is repeated until all dependencies are resolved. This is why Snakemake is said to have a \u2018bottom-up\u2019 approach: it starts from the last outputs and go back to the first inputs. Hint Your Snakefile should look like this: rule first_step : output : 'results/first_step.txt' shell : 'echo \u201csnakemake\u201d > results/first_step.txt' rule second_step : input : 'results/first_step.txt' output : 'results/second_step.txt' shell : 'cp results/first_step.txt results/second_step.txt' Exercise: Delete the results/ folder, identify your final output(s) and execute the workflow specifying only this(these) output(s) in the command. Answer Delete the results folder: using the graphic interface or rm -rf results/ Execute the workflow: snakemake --cores 1 results/second_step.txt Visualise the content of the results folder: ls -alh results/ We should now see Snakemake executing the 2 rules and producing both outputs. To generate the output results/second_step.txt , Snakemake requires the input results/first_step.txt . Before the workflow is executed, this file does not exist, therefore, Snakemake looks for a rule that generates results/first_step.txt , in this case the rule first_step . The process is then repeated for first_step . In this case, the rule does not require any input, so all dependencies are resolved, and Snakemake can generate the DAG.","title":"Chaining rules"},{"location":"course_material/day2/2_introduction_snakemake/#important-notes-on-rules-dependency","text":"","title":"Important notes on rules dependency"},{"location":"course_material/day2/2_introduction_snakemake/#rules-must-produce-unique-outputs","text":"Because of the rules dependency process, by default, an output can only be generated by a single rule. Otherwise, Snakemake cannot decide which rule to use to generate this output, and the rules are considered ambiguous . In practice, there are ways to deal with ambiguous rules, but they should be avoided as much as possible and we will not cover them in this course (see the relevant section in the official documentation for more information).","title":"Rules must produce unique outputs"},{"location":"course_material/day2/2_introduction_snakemake/#rules-dependency-can-be-written-more-easily","text":"It is possible to refer to the output of a rule directly in another rule with the syntax rules..output . 
Note that you don\u2019t need quotes around this statement, because it is a Snakemake object. The following example implements this syntax for the two rule defined above: rule first_step : output : 'results/first_step.txt' shell : 'echo \u201csnakemake\u201d > results/first_step.txt' rule second_step : input : rules . first_step . output output : 'results/second_step.txt' shell : 'cp results/first_step.txt results/second_step.txt' This method has several advantages, among which: It limits the risk of error because we do not have to write the same filename at several locations A change in output name will be automatically propagated to rules that depend on it, i.e. the name only has to be changed once This makes the code much clearer and easier to understand: with this syntax, we instantly know the object type ( rule ), how/where it is created ( first_step ), and what it is ( output )","title":"Rules dependency can be written more easily"},{"location":"course_material/day2/3_generalising_snakemake/","text":"Learning outcomes After having completed this chapter you will be able to: Create rules with multiple inputs and outputs Make the code shorter and more general by using placeholders and wildcards Optimise the memory usage of a workflow and checking its performances Visualise a workflow DAG Data origin The data we will use during the exercises was produced in this work . Briefly, the team studied the transcriptional response of a strain of baker\u2019s yeast, Saccharomyces cerevisiae , facing environments with different amount of CO 2 . To this end, they performed 150 bp paired-end sequencing of mRNA-enriched samples. Detailed information on all the samples are available here , but just know that for the purpose of the course, we selected 6 samples ( 3 replicates per condition , low and high CO 2 ) and down-sampled them to 1 million read-pairs each to reduce computation times. Exercises One of the aims of today\u2019s course is to develop a basic, yet efficient, workflow to analyse RNAseq data. This workflow takes reads coming from RNA sequencing as inputs and outputs a list of genes that are differentially expressed between two conditions. The files containing the reads are in FASTQ format and the output will be a tab-separated file containing a list of genes with expression changes, results of statistical tests\u2026 In this series of exercises, we will create the \u2018backbone\u2019 of the workflow, i.e. the rules that are the most computationally expensive, namely: A rule to trim poor-quality reads A rule to map the trimmed reads on a reference genome A rule to convert and sort files from the SAM format to the BAM format A rule to count the reads mapping on each gene At the end of this series of exercises, the DAG of your workflow should look like this: Designing and debugging a workflow If you have problems designing your Snakemake workflow or debugging it, you can find some help here . General instructions and reminders In each rule, you should try (as much as possible) to: Choose meaningful rule names Use rules dependency, with the syntax rules..output If you use numbered outputs, the syntax becomes rules..output[n] (with n starting at 0) If you use named outputs, the syntax becomes rules..output. Use placeholders Use wildcards Choose meaningful wildcard names The output , log , and benchmark directives must have the same wildcard names! 
You can use the same wildcard names in multiple rules for consistency and readability, but Snakemake will treat them as independent wildcards and their values will not be shared: rules are self-contained and wildcards are local to each rule ( see a very nice summary on wildcards ) Use multiple inputs/outputs (when needed/possible) Create a log file with the log directive Create a benchmark file with the benchmark directive If you have a doubt, do not hesitate to test your workflow logic with a dry-run (the -n flag): snakemake --cores 1 -n . Snakemake will then display all the jobs required to generate the target. To obtain additional information on why a specific job is necessary, run Snakemake with the -r flag (which can be -and usually is- combined with -n ): snakemake --cores 1 -n -r . For each job, Snakemake will print a reason field explaining why the job was required. To visualize the exact command executed by each job (with the placeholders and wildcards replaced by their values), run snakemake with the -p flag: snakemake --cores 1 -n -r -p . Downloading the data and setting up the directory structure In this part, we will download the data and start building the directory structure of our workflow according to the official recommendations . We already starting doing so in the previous series of exercises and ultimately, it should resemble this: \u2502\u2500\u2500 .gitignore \u2502\u2500\u2500 README.md \u2502\u2500\u2500 LICENSE.md \u2502\u2500\u2500 benchmarks \u2502 \u2502\u2500\u2500 sample1.fastq \u2502 \u2514\u2500\u2500 sample2.fastq \u2502\u2500\u2500 config \u2502 \u2502\u2500\u2500 config.yaml \u2502 \u2514\u2500\u2500 some-sheet.tsv \u2502\u2500\u2500 data \u2502 \u2502\u2500\u2500 sample1.fastq \u2502 \u2514\u2500\u2500 sample2.fastq \u2502\u2500\u2500 images \u2502 \u2514\u2500\u2500 rulegraph.svg \u2502\u2500\u2500 logs \u2502 \u2502\u2500\u2500 sample1.log \u2502 \u2514\u2500\u2500 sample2.log \u2502\u2500\u2500 results \u2502 \u2502\u2500\u2500 sample1 \u2502 \u2502 \u2514\u2500\u2500 sample1.bam \u2502 \u2502\u2500\u2500 sample2 \u2502 \u2502 \u2514\u2500\u2500 sample2.bam \u2502 \u2514\u2500\u2500 DEG_list.tsv \u2502\u2500\u2500 resources \u2502 \u2502\u2500\u2500 Scerevisiae.fasta \u2502 \u2514\u2500\u2500 Scerevisiae.gtf \u2514\u2500\u2500 workflow \u2502\u2500\u2500 envs \u2502 \u2502\u2500\u2500 tool1.yaml \u2502 \u2514\u2500\u2500 tool2.yaml \u2502\u2500\u2500 rules \u2502 \u2502\u2500\u2500 module1.smk \u2502 \u2514\u2500\u2500 module2.smk \u2502\u2500\u2500 scripts \u2502 \u2502\u2500\u2500 script1.py \u2502 \u2514\u2500\u2500 script2.R \u2514\u2500\u2500 Snakefile For now, the main thing to remember is that the workflow code goes into a subfolder called workflow and the rest is mostly input/output files, except for the config subfolder, which will be explained later. All output files generated in the workflow should be stored under results/ . Now, let\u2019s download the data, uncompress it and build the first part of the directory structure. wget https://apollo.vital-it.ch/trackvis/snakemake_rnaseq.tar.gz # Download the data # AT. Check URL to download file tar -xvf snakemake_rnaseq.tar.gz # Uncompress the archive rm snakemake_rnaseq.tar.gz # Delete the archive cd snakemake_rnaseq/ # Start developing in a new folder In this new folder, you should now see 2 subfolders: data/ , which contains the data to analyse resources/ , which contains retrieved resources, here the assembly, the genome indices and the annotation file of S. cerevisiae . 
It may also contain small resources delivered along with the workflow via git Let\u2019s create another subfolder, this time to host all the files containing the code, as well as the Snakefile: mkdir workflow # Create a new folder touch workflow/Snakefile # Create an empty Snakefile The Snakefile marks the entrypoint of the workflow. It will be automatically discovered when running Snakemake from the root of the structure, here snakemake_rnaseq/ . We can also tell Snakemake to use a specific Snakefile with the -s flag: snakemake --cores 1 -s , but it is highly discouraged as it hampers reproducibility. If you followed the general instructions , Snakemake should create all the other missing folders by itself (except one that you will discover at the end of this series of exercises), so it is now time to create the rules mentioned earlier . Have a look here for a few pieces of advice on workflow design. \u2018bottom-up\u2019 or \u2018top-down\u2019 development? Even if it is often easier to start from the final outputs and work backwards to the first inputs, the next exercises are presented in the opposite direction (first inputs to last outputs) to make the session easier to understand. That being said, feel free to work and develop your code in the order you prefer! Even if we asked you to use wildards, do not try to process all the samples yet. Choose and work with one sample (which means two .fastq files because reads are paired-end) in this series of exercises. We will see an efficient way to process list of files in the next series of exercises. Creating a rule to trim reads Usually, the first step in dealing with sequencing data is to improve the reads quality by removing low quality bases, stretches of As and Ns and reads that are too short. Adapters trimming In theory, trimming also removes sequencing adapters, but we will not do it here to keep computation time low and avoid having to parse other files to extract the adapter sequences. Exercise: Implement a rule to trim the reads contained in .fastq files using atropos . Hint You can find information on how to use atropos and its parameters with atropos trim -h The files to trim are located in data/ The base of the trimming command is atropos trim -q 20,20 --minimum-length 25 --trim-n --preserve-order --max-n 10 --no-cache-adapters -a \"A{{20}}\" -A \"A{{20}}\" If you are interested in what these options mean, see below for an explanation The paths of the files to trim ( i.e. input files, in FASTQ format) are specified with the options -pe1 (first read) and -pe2 (second read) The paths of the trimmed files ( i.e. output files, also in FASTQ format) are specified with the options -o (first read) and -p (second read) atropos outputs some information as well as its trimming report in the terminal (stdout to be exact); do not forget to redirect these information to the log file with >> {log} Please give it a try before looking at the answer! Answer This is one way of writing this rule, but definitely not the only way! This is true for all the rules presented here. rule fastq_trim : ''' This rule trims paired-end reads to improve their quality. 
Specifically, it removes: - Low quality bases - A stretches longer than 20 bases - N stretches ''' input : reads1 = 'data/ {sample} _1.fastq' , reads2 = 'data/ {sample} _2.fastq' , output : trim1 = 'results/ {sample} / {sample} _atropos_trimmed_1.fastq' , trim2 = 'results/ {sample} / {sample} _atropos_trimmed_2.fastq' log : 'logs/ {sample} / {sample} _atropos_trimming.log' benchmark : 'benchmarks/ {sample} / {sample} _atropos_trimming.txt' resources : mem_mb = 500 shell : ''' echo \"Trimming reads in <{input.reads1}> and <{input.reads2}>\" > {log} atropos trim -q 20,20 --minimum-length 25 --trim-n --preserve-order --max-n 10 --no-cache-adapters -a \"A{{20}}\" -A \"A{{20}}\" -pe1 {input.reads1} -pe2 {input.reads2} -o {output.trim1} -p {output.trim2} &>> {log} echo \"Trimmed files saved in <{output.trim1}> and <{output.trim2}> respectively\" >> {log} echo \"Trimming report saved in <{log}>\" >> {log} ''' Note 2 things that are happening here: We used the {sample} wildcards twice in the output paths. This is because we prefer to have all the files linked to a sample in the same directory We added a memory limit for this job: 500 MB. Because we have limited resources in this server compared to a High Performance Computing cluster (HPC), this will help Snakemake to better allocate resources and parallelise jobs. You can determine the maximum amount of memory used by a rule thanks to the max_rss column in a benchmark result (results are shown in MB). More information here Paths in Snakemake All the paths in the Snakefile are relative to the working directory in which the snakemake command is executed. If you execute Snakemake in snakemake_rnaseq/ , the relative path to the input files in the rule is data/.fastq If you execute Snakemake in snakemake_rnaseq/workflow/ , the relative path to the input files in the rule is ../data/.fastq Exercise: If you had to run the workflow by specifying only one output, what command would you use? Answer snakemake --cores 1 -r -p results/highCO2_sample1/highCO2_sample1_atropos_trimmed_1.fastq If you run it now, don\u2019t forget to have a look at the log and benchmark files! atropos options -q 20,20 : trim low-quality bases from 5\u2019, 3\u2019 ends of each read before adapter removal --minimum-length 25 : discard trimmed reads that are shorter than 25 bp --trim-n : trim N\u2019s on ends of reads --preserve-order : preserve order of reads in input files --max-n 10 : discard reads with more than 10 N --no-cache-adapters : do not cache adapters list as \u2018.adapters\u2019 in the working directory -a \"A{{20}}\" -A \"A{{20}}\" : remove series of 20 As in the adapter sequence ( -a for the first read of the pair, -A for the second one) The usual command-line syntax is -a \"A{20}\" . Here, brackets were doubled to prevent Snakemake from interpreting {20} as a wildcard Creating a rule to map trimmed reads on a reference genome Once we have trimmed reads, the next step is to map those reads onto a reference assembly, here S. cerevisiae strain S288C, to eventually obtain read counts. The assembly used in this exercise is RefSeq GCF_000146045.2 and was retrieved via the NCBI genome website. Exercise: Implement a rule to map the trimmed reads on the S. cerevisiae assembly using HISAT2 . HISAT2 genome index To align reads on a genome, HISAT2 relies on a graph-based index. We built the genome index for you, using the command hisat2-build -p 24 -f Scerevisiae.fasta resources/genome_indices/Scerevisiae_index . 
-p is the number of threads to use, -f is the genomic sequence in FASTA format and Scerevisiae_genome_index is the global name shared by all the index files. Hint You can find information on how to use HISAT2 and its parameters with hisat2 -h The base of the mapping command is hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal If you are interested in what these options mean, see below for an explanation The path of the genome indices ( i.e. input files, in binary format) is specified with the option -x . The files have a shared title of resources/genome_indices/Scerevisiae_genome_index , which is the value you need to use for -x The paths of the trimmed files ( i.e. input files) are specified with the options -1 (first read) and -2 (second read) The path of the mapped reads file ( i.e. output file, in SAM format) is specified with the option -S (do not forget the .sam extension to the filename) The path of the mapping report ( i.e. output file, in text format) is specified with the option --summary-file HISAT2 also outputs information in the terminal (stderr to be exact); do not forget to redirect these to the log file with 2>> {log} This step is the longest of the workflow. With the current settings, it should take ~6 min to complete. If you decide to run it now, you should launch it and start working on the next rules Please give it a try before looking at the answer! Answer rule read_mapping : ''' This rule maps trimmed reads of a fastq on a reference assembly. ''' input : trim1 = rules . fastq_trim . output . trim1 , trim2 = rules . fastq_trim . output . trim2 output : sam = 'results/ {sample} / {sample} _mapped_reads.sam' , report = 'results/ {sample} / {sample} _mapping_report.txt' log : 'logs/ {sample} / {sample} _mapping.log' benchmark : 'benchmarks/ {sample} / {sample} _mapping.txt' resources : mem_gb = 2 shell : ''' echo \"Mapping the reads\" > {log} hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal -x resources/genome_indices/Scerevisiae_genome_index -1 {input.trim1} -2 {input.trim2} -S {output.sam} --summary-file {output.report} 2>> {log} echo \"Mapped reads saved in <{output.sam}>\" >> {log} echo \"Mapping report saved in <{output.report}>\" >> {log} ''' Exercise: If you had to run the workflow by specifying only one output, what command would you use? Answer snakemake --cores 1 -r -p results/highCO2_sample1/highCO2_sample1_mapped_reads.sam If you run it now, don\u2019t forget to have a look at the log and benchmark files! HISAT2 options --dta : report alignments tailored for transcript assemblers --fr : set alignment of -1, -2 mates to forward/reverse (position of reads in a pair relatively to each other) --no-mixed : remove unpaired alignments for paired reads --no-discordant : remove discordant alignments for paired reads --time : print wall-clock time taken by search phases --new-summary : print alignment summary in a new style --no-unal : suppress SAM records for reads that failed to align Creating a rule to convert and sort SAM files to BAM HISAT2 only outputs mapped reads in the SAM format . However, most downstream analysis tools use the BAM format , which is the compressed binary version of the SAM format and, as such, is much smaller, easier to manipulate and transfer and allows a faster data retrieval. Additionally, many analyses require that BAM files are sorted by genomic coordinates and indexed, because sorted BAM files can be processed much more easily and quickly than unsorted ones. 
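As a quick illustration of why sorting and indexing matter, here is a minimal sketch run directly in the shell, outside of Snakemake (file names and the queried region are hypothetical); the region query in the last command only works on a coordinate-sorted and indexed BAM file:

```
samtools view -b -o mapped.bam mapped.sam                 # compress the SAM file into a BAM file
samtools sort -O bam -o mapped_sorted.bam mapped.bam      # sort alignments by genomic coordinates
samtools index -b mapped_sorted.bam                       # create the index, mapped_sorted.bam.bai
samtools view mapped_sorted.bam 'NC_001133.9:1000-2000'   # fast retrieval of a region (the name must match your reference)
```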
Alignment data files More information on alignment data files and other formats on the official github repository of the formats. Exercise: Implement a single rule to: Convert SAM files to BAM using Samtools Sort the BAM files using Samtools Index the sorted BAM files using Samtools Hint You can find information on how to use Samtools and its parameters with samtools --help You need to write 3 commands that will be executed sequentially: the output of command 1 will be the input of command 2 etc\u2026 No panic! These commands are pretty simple and do not use many options! To convert SAM format to the BAM format, use the command samtools view -b -o To sort a BAM file, use the command samtools sort -O bam -o To index a BAM file, use the command samtools index -b -o The index must have the exact same name than its associated BAM file, except it finishes with the extension .bam.bai instead of .bam If you are interested in what these options mean, see below for an explanation To catch potential information and errors, do not forget to redirect stderr to the log file with 2>> {log} Please give it a try before looking at the answer! Answer rule sam_to_bam : ''' This rule converts a sam file to bam format, sorts it and indexes it. ''' input : sam = rules . read_mapping . output . sam output : bam = 'results/ {sample} / {sample} _mapped_reads.bam' , bam_sorted = 'results/ {sample} / {sample} _mapped_reads_sorted.bam' , index = 'results/ {sample} / {sample} _mapped_reads_sorted.bam.bai' log : 'logs/ {sample} / {sample} _mapping_sam_to_bam.log' benchmark : 'benchmarks/ {sample} / {sample} _mapping_sam_to_bam.txt' resources : mem_mb = 250 shell : ''' echo \"Converting <{input.sam}> to BAM format\" > {log} samtools view {input.sam} -b -o {output.bam} 2>> {log} echo \"Sorting BAM file\" >> {log} samtools sort {output.bam} -O bam -o {output.bam_sorted} 2>> {log} echo \"Indexing the sorted BAM file\" >> {log} samtools index -b {output.bam_sorted} -o {output.index} 2>> {log} echo \"Sorted file saved in <{output.bam_sorted}>\" >> {log} ''' Exercise: If you had to run the workflow by specifying only one output, what command would you use? Answer snakemake --cores 1 -r -p results/highCO2_sample1/highCO2_sample1_mapped_reads_sorted.bam If you run it now, don\u2019t forget to have a look at the log and benchmark files! Samtools options samtools view -b : flag to tell Samtools to create an output in BAM format -o : path of the output file samtools view -O bam : flag to tell Samtools to create an output in BAM format -o : path of the output file samtools index -b : flag to tell Samtools to create an index in BAI format Creating a rule to count mapped reads Most of the analyses happening downstream the alignment step, including Differential Expression Analyses, are starting off read counts, either by exon or gene. However, we are still missing those counts! Counting reads on exons/genes To count reads mapping on genomic features, we first need a definition of those features. In this case, we picked one of the best-known model organism, S. cerevisiae , which has been annotated for a long time. These annotations are easily available on the NCBI or the Saccharomyces Genome Database . If your organism has not been annotated yet, there are ways to work around this problem, but this is an entirely different field that we won\u2019t discuss here! 
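To make the idea of feature definitions more concrete, a GTF file is a tab-separated table with nine columns; the two lines below are purely illustrative (coordinates and identifiers are placeholders, not real annotations), but they show that each exon line carries a gene_id attribute, which is what will later be used to gather exon-level counts into gene-level counts:

```
# seqname      source   feature   start   end     score   strand   frame   attributes
NC_001133.9    RefSeq   exon      1000    1500    .       +        .       gene_id \"YAL069W\";
NC_001133.9    RefSeq   exon      1600    1800    .       +        .       gene_id \"YAL069W\";
```

Also note that the first column contains the sequence (chromosome) name, which is the reason for the following warning.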
Chromosome names If you are working with genome sequences and annotations from different sources, remember that they must contain the chromosome names, otherwise counting will not work. Exercise: Implement a rule to count the reads mapping on each gene of the S. cerevisiae genome using featureCounts . Hint You can find information on how to use featureCounts and its parameters with featureCounts -h The base of the mapping command is featureCounts -t exon -g gene_id -s 2 -p -B -C --largestOverlap --verbose -F GTF If you are interested in what these options mean, see below for an explanation The path of the file containing the annotations ( i.e. input files, in GTF format) is specified with the -a option. This file is located at resources/Scerevisiae.gtf There are two main annotations format: GTF and GFF . The former is lighter and easier to work with, so that is the one we will use The paths of the sorted BAM file(s) ( i.e. input file(s)) are not specified with an option, they are simply added at the end of the command The path of the file containing the count results ( i.e. output file, in tsv format) is specified with the option -o featureCounts will also output a separate file (in tsv format) including summary statistics of counting results, with the name .summary. For example, if the output is test.tsv , the summary will be printed in test.tsv.summary . Do not forget this output in your rule featureCounts also outputs information in the terminal (stderr to be exact); do not forget to redirect these to the log file with 2>> {log} Please give it a try before looking at the answer! Answer rule reads_quantification_genes : ''' This rule quantifies the reads of a bam file mapping on genes and produces a count table for all genes of the assembly. ''' input : bam_once_sorted = rules . sam_to_bam . output . bam_sorted , output : gene_level = 'results/ {sample} / {sample} _genes_read_quantification.tsv' , gene_summary = 'results/ {sample} / {sample} _genes_read_quantification.summary' log : 'logs/ {sample} / {sample} _genes_read_quantification.log' benchmark : 'benchmarks/ {sample} / {sample} _genes_read_quantification.txt' resources : mem_mb = 500 shell : ''' echo \"Counting reads mapping on genes in <{input.bam_once_sorted}>\" > {log} featureCounts -t exon -g gene_id -s 2 -p -B -C --largestOverlap --verbose -F GTF -a resources/Scerevisiae.gtf -o {output.gene_level} {input.bam_once_sorted} &>> {log} echo \"Renaming output files\" >> {log} mv {output.gene_level}.summary {output.gene_summary} echo \"Results saved in <{output.gene_level}>\" >> {log} echo \"Report saved in <{output.gene_summary}>\" >> {log} ''' featureCounts options -t : specify on which feature type to count the reads -g : specify if and how to gather feature counts. Here, reads are counted by exon ( -t ) and the exon counts are gathered by genes \u2018meta-features\u2019 ( -g ) -s : perform strand-specific read counting Strandedness is determined by looking at the mRNA library preparation kit. It can also be determined a posteriori with scripts such as infer_experiment.py from the RSeQC package -p : count fragments instead of reads. 
If you don\u2019t use this option with paired-end reads, featureCounts won\u2019t be able to assign the read-pairs to features -B : only count read pairs that have both ends aligned -C : do not count read pairs that have their two ends mapping to different chromosomes or mapping on the same chromosome but on different strands --largestOverlap : assign reads to the meta-feature/feature that has the largest number of overlapping bases -F : specify format of the provided annotation file --verbose : output verbose information, such as unmatched chromosome/contig names Running the workflow Exercise: If you have not done it after each step, it is now time to run the entire workflow on your sample of choice. What command will you use to run it? Answer Because all the rules are chained together, you only need to specify one of the final outputs to trigger the execution of all the previous rules: snakemake --cores 1 -F -r -p results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv . You can add the -F flag to force an entire re-run. The entire run should take about ~10 min to complete. Exercise: Check Snakemake\u2019s log in .snakemake/log/ . Is everything as you expected, especially the wildcards values, input and output names etc\u2026? Answer cat .snakemake/log/ Visualising the DAG of the workflow We have now implemented and run the main steps of our workflow. It is always a good idea to visualise the whole process to check for errors and inconsistencies. Snakemake\u2019s has a built-in workflow visualisation feature to do this. Exercise: Visualise the entire workflow\u2019s Directed Acyclic Graph using the --dag flag. Do you need to specify a target? Hint Try to follow the official recommendations on workflow structure, which states that images are supposed to go in the images/ subfolder Snakemake prints a DAG in text format, so we need to use the dot command to transform it into a picture Save the result as a PNG picture Answer If we run the command without target: snakemake --cores 1 --dag -F | dot -Tpng > images/dag.png , we will get a Target rules may not contain wildcards. error, which means we need to add a target. Same as before, it makes sense to use one of the final outputs to get the entire workflow: snakemake --cores 1 --dag -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv | dot -Tpng > images/dag.png . But once again, we will get an error: BrokenPipeError: [Errno 32] Broken pipe . This is because we are piping the command output to a folder ( images/ ) that does not exist yet The folder is not created by Snakemake because it isn\u2019t handled as part of an actual run. So we have to create the folder before generating the DAG: mkdir images snakemake --cores 1 --dag -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv | dot -Tpng > images/dag.png Some explanations on the command: -F : force to show the entire worklow and ensures all jobs are shown. You can also use -f to show fewer jobs dot : tool that is a part of the graphviz package and is used to draw hierarchical or layered drawings of directed graphs, i.e. graphs in which edges (arrows) have a direction -T : control the image format. Available formats are listed here DAG aspect If you already computed all the outputs of the workflow, steps in the DAG will have dotted lines. To visualise the DAG before running the workflow, add -F/--forceall to the snakemake command to force the execution of all jobs. 
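For instance, reusing the same target as in the answer above, -F can be combined with an SVG output, which matches the images/rulegraph.svg file listed in the recommended directory structure (a sketch, assuming the -Tsvg format is available in your graphviz installation):

```
snakemake --cores 1 --dag -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv | dot -Tsvg > images/dag.svg
```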
DAG = dry-run The --dag flag implicitly activates the --dry-run/--dryrun/-n option, which means that no jobs are executed during the plot creation. There are actually 3 types of DAG: A DAG, created with the --dag option A filegraph, created with the --filegraph option A rulegraph, created with the --rulegraph option Exercise: Generate the filegraph and rulegraph of your workflow. Feel free to try different pictures format. What are the differences between the plots? Answer Generate the rulegraph: snakemake --cores 1 --rulegraph -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv | dot -Tpdf > images/rulegraph.pdf Generate the filegraph: snakemake --cores 1 --filegraph -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv | dot -Tjpg > images/filegraph.jpg You should obtain the 3 following figures, respectively DAG, rulegraph and filegraph: The differences are: --dag : dependency graph of all the jobs --filegraph : dependency graph of rules with inputs and outputs (rule appears once, with wildcards) --rulegraph : dependency graph of rules (rule appears once) Designing a Snakemake workflow\u2026 and debugging it! Designing a workflow There are many ways to design a new workflow, but these few pieces of advice will be useful in most cases: Start with a pen and paper: try to find out how many rules you will need and how they depend on each other. In other terms, start by sketching the DAG of your workflow! Remember that Snakemake has a bottom-up approach (it goes from the final outputs to the first input), so it may be easier for you to work in that order as well and write your last rule first Determine which rules (if any) aggregate or split inputs and create input functions accordingly (we will see how these functions work in session 4) Make sure your input and output directives are right before worrying about anything else, especially the shell sections. Remember that Snakemake builds the DAG before running the shell commands, so you can use the --dryrun option to test the workflow before running it. You can even do that without writing all the shell commands! List any parameters or settings that might need to be adjusted Choose meaningful and easy-to-understand names for your inputs, outputs, parameters, wildcards\u2026 to make your Snakefile as readable as possible. This is true for every script, piece of code, variable etc\u2026 and Snakemake is no exception! Have a look at The Zen of Python for more information Debugging a workflow It is very likely you will see bugs and errors the first time you try to run a new Snakefile: don\u2019t be discouraged, this is normal! Order of operations in Snakemake The topic was tackled when DAGs were mentioned, but to efficiently debug a workflow, it is worth taking a deeper look at what Snakemake does when you execute the command snakemake --cores 1 . 
There are 3 main phases: Prepare to run: Read all the rule definitions from the Snakefile Resolve the DAG (when Snakemake says \u2018Building DAG of jobs\u2019): Check what output(s) are required Look for a matching rule by looking at the outputs of all the rules Fill in the wildcards to determine the input of the matching rule Check whether this input is available; if not, repeat Step 2 until everything is resolved Run: If needed, create the folder for the output(s) If needed, remove the outdated output(s) Run the shell command with the placeholders replaced Check that the command ran without errors and produced the expected output(s) Debugging advice Sometimes, Snakemake will give you a precise error report, but other times less so. Try to identify which phase of execution failed (see previous paragraph on order of operations) and double-check the most common error causes for that phase: Parsing phase failures (phase 1): Syntax errors, among which (but not limited to): This errors can be easily solved using a text editor with Python/Snakemake text colouring Missing commas/colons/semicolons Unbalanced quotes/brackets/parenthesis Wrong indentation Failure to evaluate expressions Problems in functions ( expand() , input functions\u2026) in input/output directives Python logic added outside of rules Other problems with rule definition Invalid rule names/directives Invalid wildcard names Mismatched wildcards DAG building failures (phase 2, before Snakemake tries to run any job): Failure to determine the target Ambiguous rules making the same output(s) On the contrary, no rule making the required output(s) Circular dependency (violating the \u2018Acyclic\u2019 property of a D A G). Write-protected output(s) DAG running failures (phase 3, --dry-run works and builds the DAG, but the real execution fails): When a job fails, Snakemake reports an error, deletes all output file(s) for that job (potential corruption), and stops Shell command returning non-zero status Missing output file(s) after the commands have run Reference to a $shell_variable before it was set Use of a wrong/unknown placeholder inside { }","title":"Making a more general-purpose Snakemake workflow"},{"location":"course_material/day2/3_generalising_snakemake/#learning-outcomes","text":"After having completed this chapter you will be able to: Create rules with multiple inputs and outputs Make the code shorter and more general by using placeholders and wildcards Optimise the memory usage of a workflow and checking its performances Visualise a workflow DAG","title":"Learning outcomes"},{"location":"course_material/day2/3_generalising_snakemake/#data-origin","text":"The data we will use during the exercises was produced in this work . Briefly, the team studied the transcriptional response of a strain of baker\u2019s yeast, Saccharomyces cerevisiae , facing environments with different amount of CO 2 . To this end, they performed 150 bp paired-end sequencing of mRNA-enriched samples. Detailed information on all the samples are available here , but just know that for the purpose of the course, we selected 6 samples ( 3 replicates per condition , low and high CO 2 ) and down-sampled them to 1 million read-pairs each to reduce computation times.","title":"Data origin"},{"location":"course_material/day2/3_generalising_snakemake/#exercises","text":"One of the aims of today\u2019s course is to develop a basic, yet efficient, workflow to analyse RNAseq data. 
This workflow takes reads coming from RNA sequencing as inputs and outputs a list of genes that are differentially expressed between two conditions. The files containing the reads are in FASTQ format and the output will be a tab-separated file containing a list of genes with expression changes, results of statistical tests\u2026 In this series of exercises, we will create the \u2018backbone\u2019 of the workflow, i.e. the rules that are the most computationally expensive, namely: A rule to trim poor-quality reads A rule to map the trimmed reads on a reference genome A rule to convert and sort files from the SAM format to the BAM format A rule to count the reads mapping on each gene At the end of this series of exercises, the DAG of your workflow should look like this: Designing and debugging a workflow If you have problems designing your Snakemake workflow or debugging it, you can find some help here .","title":"Exercises"},{"location":"course_material/day2/3_generalising_snakemake/#general-instructions-and-reminders","text":"In each rule, you should try (as much as possible) to: Choose meaningful rule names Use rules dependency, with the syntax rules..output If you use numbered outputs, the syntax becomes rules..output[n] (with n starting at 0) If you use named outputs, the syntax becomes rules..output. Use placeholders Use wildcards Choose meaningful wildcard names The output , log , and benchmark directives must have the same wildcard names! You can use the same wildcard names in multiple rules for consistency and readability, but Snakemake will treat them as independent wildcards and their values will not be shared: rules are self-contained and wildcards are local to each rule ( see a very nice summary on wildcards ) Use multiple inputs/outputs (when needed/possible) Create a log file with the log directive Create a benchmark file with the benchmark directive If you have a doubt, do not hesitate to test your workflow logic with a dry-run (the -n flag): snakemake --cores 1 -n . Snakemake will then display all the jobs required to generate the target. To obtain additional information on why a specific job is necessary, run Snakemake with the -r flag (which can be -and usually is- combined with -n ): snakemake --cores 1 -n -r . For each job, Snakemake will print a reason field explaining why the job was required. To visualize the exact command executed by each job (with the placeholders and wildcards replaced by their values), run snakemake with the -p flag: snakemake --cores 1 -n -r -p .","title":"General instructions and reminders"},{"location":"course_material/day2/3_generalising_snakemake/#downloading-the-data-and-setting-up-the-directory-structure","text":"In this part, we will download the data and start building the directory structure of our workflow according to the official recommendations . 
We already starting doing so in the previous series of exercises and ultimately, it should resemble this: \u2502\u2500\u2500 .gitignore \u2502\u2500\u2500 README.md \u2502\u2500\u2500 LICENSE.md \u2502\u2500\u2500 benchmarks \u2502 \u2502\u2500\u2500 sample1.fastq \u2502 \u2514\u2500\u2500 sample2.fastq \u2502\u2500\u2500 config \u2502 \u2502\u2500\u2500 config.yaml \u2502 \u2514\u2500\u2500 some-sheet.tsv \u2502\u2500\u2500 data \u2502 \u2502\u2500\u2500 sample1.fastq \u2502 \u2514\u2500\u2500 sample2.fastq \u2502\u2500\u2500 images \u2502 \u2514\u2500\u2500 rulegraph.svg \u2502\u2500\u2500 logs \u2502 \u2502\u2500\u2500 sample1.log \u2502 \u2514\u2500\u2500 sample2.log \u2502\u2500\u2500 results \u2502 \u2502\u2500\u2500 sample1 \u2502 \u2502 \u2514\u2500\u2500 sample1.bam \u2502 \u2502\u2500\u2500 sample2 \u2502 \u2502 \u2514\u2500\u2500 sample2.bam \u2502 \u2514\u2500\u2500 DEG_list.tsv \u2502\u2500\u2500 resources \u2502 \u2502\u2500\u2500 Scerevisiae.fasta \u2502 \u2514\u2500\u2500 Scerevisiae.gtf \u2514\u2500\u2500 workflow \u2502\u2500\u2500 envs \u2502 \u2502\u2500\u2500 tool1.yaml \u2502 \u2514\u2500\u2500 tool2.yaml \u2502\u2500\u2500 rules \u2502 \u2502\u2500\u2500 module1.smk \u2502 \u2514\u2500\u2500 module2.smk \u2502\u2500\u2500 scripts \u2502 \u2502\u2500\u2500 script1.py \u2502 \u2514\u2500\u2500 script2.R \u2514\u2500\u2500 Snakefile For now, the main thing to remember is that the workflow code goes into a subfolder called workflow and the rest is mostly input/output files, except for the config subfolder, which will be explained later. All output files generated in the workflow should be stored under results/ . Now, let\u2019s download the data, uncompress it and build the first part of the directory structure. wget https://apollo.vital-it.ch/trackvis/snakemake_rnaseq.tar.gz # Download the data # AT. Check URL to download file tar -xvf snakemake_rnaseq.tar.gz # Uncompress the archive rm snakemake_rnaseq.tar.gz # Delete the archive cd snakemake_rnaseq/ # Start developing in a new folder In this new folder, you should now see 2 subfolders: data/ , which contains the data to analyse resources/ , which contains retrieved resources, here the assembly, the genome indices and the annotation file of S. cerevisiae . It may also contain small resources delivered along with the workflow via git Let\u2019s create another subfolder, this time to host all the files containing the code, as well as the Snakefile: mkdir workflow # Create a new folder touch workflow/Snakefile # Create an empty Snakefile The Snakefile marks the entrypoint of the workflow. It will be automatically discovered when running Snakemake from the root of the structure, here snakemake_rnaseq/ . We can also tell Snakemake to use a specific Snakefile with the -s flag: snakemake --cores 1 -s , but it is highly discouraged as it hampers reproducibility. If you followed the general instructions , Snakemake should create all the other missing folders by itself (except one that you will discover at the end of this series of exercises), so it is now time to create the rules mentioned earlier . Have a look here for a few pieces of advice on workflow design. \u2018bottom-up\u2019 or \u2018top-down\u2019 development? Even if it is often easier to start from the final outputs and work backwards to the first inputs, the next exercises are presented in the opposite direction (first inputs to last outputs) to make the session easier to understand. That being said, feel free to work and develop your code in the order you prefer! 
Even if we asked you to use wildards, do not try to process all the samples yet. Choose and work with one sample (which means two .fastq files because reads are paired-end) in this series of exercises. We will see an efficient way to process list of files in the next series of exercises.","title":"Downloading the data and setting up the directory structure"},{"location":"course_material/day2/3_generalising_snakemake/#creating-a-rule-to-trim-reads","text":"Usually, the first step in dealing with sequencing data is to improve the reads quality by removing low quality bases, stretches of As and Ns and reads that are too short. Adapters trimming In theory, trimming also removes sequencing adapters, but we will not do it here to keep computation time low and avoid having to parse other files to extract the adapter sequences. Exercise: Implement a rule to trim the reads contained in .fastq files using atropos . Hint You can find information on how to use atropos and its parameters with atropos trim -h The files to trim are located in data/ The base of the trimming command is atropos trim -q 20,20 --minimum-length 25 --trim-n --preserve-order --max-n 10 --no-cache-adapters -a \"A{{20}}\" -A \"A{{20}}\" If you are interested in what these options mean, see below for an explanation The paths of the files to trim ( i.e. input files, in FASTQ format) are specified with the options -pe1 (first read) and -pe2 (second read) The paths of the trimmed files ( i.e. output files, also in FASTQ format) are specified with the options -o (first read) and -p (second read) atropos outputs some information as well as its trimming report in the terminal (stdout to be exact); do not forget to redirect these information to the log file with >> {log} Please give it a try before looking at the answer! Answer This is one way of writing this rule, but definitely not the only way! This is true for all the rules presented here. rule fastq_trim : ''' This rule trims paired-end reads to improve their quality. Specifically, it removes: - Low quality bases - A stretches longer than 20 bases - N stretches ''' input : reads1 = 'data/ {sample} _1.fastq' , reads2 = 'data/ {sample} _2.fastq' , output : trim1 = 'results/ {sample} / {sample} _atropos_trimmed_1.fastq' , trim2 = 'results/ {sample} / {sample} _atropos_trimmed_2.fastq' log : 'logs/ {sample} / {sample} _atropos_trimming.log' benchmark : 'benchmarks/ {sample} / {sample} _atropos_trimming.txt' resources : mem_mb = 500 shell : ''' echo \"Trimming reads in <{input.reads1}> and <{input.reads2}>\" > {log} atropos trim -q 20,20 --minimum-length 25 --trim-n --preserve-order --max-n 10 --no-cache-adapters -a \"A{{20}}\" -A \"A{{20}}\" -pe1 {input.reads1} -pe2 {input.reads2} -o {output.trim1} -p {output.trim2} &>> {log} echo \"Trimmed files saved in <{output.trim1}> and <{output.trim2}> respectively\" >> {log} echo \"Trimming report saved in <{log}>\" >> {log} ''' Note 2 things that are happening here: We used the {sample} wildcards twice in the output paths. This is because we prefer to have all the files linked to a sample in the same directory We added a memory limit for this job: 500 MB. Because we have limited resources in this server compared to a High Performance Computing cluster (HPC), this will help Snakemake to better allocate resources and parallelise jobs. You can determine the maximum amount of memory used by a rule thanks to the max_rss column in a benchmark result (results are shown in MB). 
More information here Paths in Snakemake All the paths in the Snakefile are relative to the working directory in which the snakemake command is executed. If you execute Snakemake in snakemake_rnaseq/ , the relative path to the input files in the rule is data/.fastq If you execute Snakemake in snakemake_rnaseq/workflow/ , the relative path to the input files in the rule is ../data/.fastq Exercise: If you had to run the workflow by specifying only one output, what command would you use? Answer snakemake --cores 1 -r -p results/highCO2_sample1/highCO2_sample1_atropos_trimmed_1.fastq If you run it now, don\u2019t forget to have a look at the log and benchmark files!","title":"Creating a rule to trim reads"},{"location":"course_material/day2/3_generalising_snakemake/#atropos-options","text":"-q 20,20 : trim low-quality bases from 5\u2019, 3\u2019 ends of each read before adapter removal --minimum-length 25 : discard trimmed reads that are shorter than 25 bp --trim-n : trim N\u2019s on ends of reads --preserve-order : preserve order of reads in input files --max-n 10 : discard reads with more than 10 N --no-cache-adapters : do not cache adapters list as \u2018.adapters\u2019 in the working directory -a \"A{{20}}\" -A \"A{{20}}\" : remove series of 20 As in the adapter sequence ( -a for the first read of the pair, -A for the second one) The usual command-line syntax is -a \"A{20}\" . Here, brackets were doubled to prevent Snakemake from interpreting {20} as a wildcard","title":"atropos options"},{"location":"course_material/day2/3_generalising_snakemake/#creating-a-rule-to-map-trimmed-reads-on-a-reference-genome","text":"Once we have trimmed reads, the next step is to map those reads onto a reference assembly, here S. cerevisiae strain S288C, to eventually obtain read counts. The assembly used in this exercise is RefSeq GCF_000146045.2 and was retrieved via the NCBI genome website. Exercise: Implement a rule to map the trimmed reads on the S. cerevisiae assembly using HISAT2 . HISAT2 genome index To align reads on a genome, HISAT2 relies on a graph-based index. We built the genome index for you, using the command hisat2-build -p 24 -f Scerevisiae.fasta resources/genome_indices/Scerevisiae_index . -p is the number of threads to use, -f is the genomic sequence in FASTA format and Scerevisiae_genome_index is the global name shared by all the index files. Hint You can find information on how to use HISAT2 and its parameters with hisat2 -h The base of the mapping command is hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal If you are interested in what these options mean, see below for an explanation The path of the genome indices ( i.e. input files, in binary format) is specified with the option -x . The files have a shared title of resources/genome_indices/Scerevisiae_genome_index , which is the value you need to use for -x The paths of the trimmed files ( i.e. input files) are specified with the options -1 (first read) and -2 (second read) The path of the mapped reads file ( i.e. output file, in SAM format) is specified with the option -S (do not forget the .sam extension to the filename) The path of the mapping report ( i.e. output file, in text format) is specified with the option --summary-file HISAT2 also outputs information in the terminal (stderr to be exact); do not forget to redirect these to the log file with 2>> {log} This step is the longest of the workflow. With the current settings, it should take ~6 min to complete. 
If you decide to run it now, you should launch it and start working on the next rules Please give it a try before looking at the answer! Answer rule read_mapping : ''' This rule maps trimmed reads of a fastq on a reference assembly. ''' input : trim1 = rules . fastq_trim . output . trim1 , trim2 = rules . fastq_trim . output . trim2 output : sam = 'results/ {sample} / {sample} _mapped_reads.sam' , report = 'results/ {sample} / {sample} _mapping_report.txt' log : 'logs/ {sample} / {sample} _mapping.log' benchmark : 'benchmarks/ {sample} / {sample} _mapping.txt' resources : mem_gb = 2 shell : ''' echo \"Mapping the reads\" > {log} hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal -x resources/genome_indices/Scerevisiae_genome_index -1 {input.trim1} -2 {input.trim2} -S {output.sam} --summary-file {output.report} 2>> {log} echo \"Mapped reads saved in <{output.sam}>\" >> {log} echo \"Mapping report saved in <{output.report}>\" >> {log} ''' Exercise: If you had to run the workflow by specifying only one output, what command would you use? Answer snakemake --cores 1 -r -p results/highCO2_sample1/highCO2_sample1_mapped_reads.sam If you run it now, don\u2019t forget to have a look at the log and benchmark files!","title":"Creating a rule to map trimmed reads on a reference genome"},{"location":"course_material/day2/3_generalising_snakemake/#hisat2-options","text":"--dta : report alignments tailored for transcript assemblers --fr : set alignment of -1, -2 mates to forward/reverse (position of reads in a pair relatively to each other) --no-mixed : remove unpaired alignments for paired reads --no-discordant : remove discordant alignments for paired reads --time : print wall-clock time taken by search phases --new-summary : print alignment summary in a new style --no-unal : suppress SAM records for reads that failed to align","title":"HISAT2 options"},{"location":"course_material/day2/3_generalising_snakemake/#creating-a-rule-to-convert-and-sort-sam-files-to-bam","text":"HISAT2 only outputs mapped reads in the SAM format . However, most downstream analysis tools use the BAM format , which is the compressed binary version of the SAM format and, as such, is much smaller, easier to manipulate and transfer and allows a faster data retrieval. Additionally, many analyses require that BAM files are sorted by genomic coordinates and indexed, because sorted BAM files can be processed much more easily and quickly than unsorted ones. Alignment data files More information on alignment data files and other formats on the official github repository of the formats. Exercise: Implement a single rule to: Convert SAM files to BAM using Samtools Sort the BAM files using Samtools Index the sorted BAM files using Samtools Hint You can find information on how to use Samtools and its parameters with samtools --help You need to write 3 commands that will be executed sequentially: the output of command 1 will be the input of command 2 etc\u2026 No panic! These commands are pretty simple and do not use many options! 
To convert SAM format to the BAM format, use the command samtools view -b -o To sort a BAM file, use the command samtools sort -O bam -o To index a BAM file, use the command samtools index -b -o The index must have the exact same name than its associated BAM file, except it finishes with the extension .bam.bai instead of .bam If you are interested in what these options mean, see below for an explanation To catch potential information and errors, do not forget to redirect stderr to the log file with 2>> {log} Please give it a try before looking at the answer! Answer rule sam_to_bam : ''' This rule converts a sam file to bam format, sorts it and indexes it. ''' input : sam = rules . read_mapping . output . sam output : bam = 'results/ {sample} / {sample} _mapped_reads.bam' , bam_sorted = 'results/ {sample} / {sample} _mapped_reads_sorted.bam' , index = 'results/ {sample} / {sample} _mapped_reads_sorted.bam.bai' log : 'logs/ {sample} / {sample} _mapping_sam_to_bam.log' benchmark : 'benchmarks/ {sample} / {sample} _mapping_sam_to_bam.txt' resources : mem_mb = 250 shell : ''' echo \"Converting <{input.sam}> to BAM format\" > {log} samtools view {input.sam} -b -o {output.bam} 2>> {log} echo \"Sorting BAM file\" >> {log} samtools sort {output.bam} -O bam -o {output.bam_sorted} 2>> {log} echo \"Indexing the sorted BAM file\" >> {log} samtools index -b {output.bam_sorted} -o {output.index} 2>> {log} echo \"Sorted file saved in <{output.bam_sorted}>\" >> {log} ''' Exercise: If you had to run the workflow by specifying only one output, what command would you use? Answer snakemake --cores 1 -r -p results/highCO2_sample1/highCO2_sample1_mapped_reads_sorted.bam If you run it now, don\u2019t forget to have a look at the log and benchmark files!","title":"Creating a rule to convert and sort SAM files to BAM"},{"location":"course_material/day2/3_generalising_snakemake/#samtools-options","text":"samtools view -b : flag to tell Samtools to create an output in BAM format -o : path of the output file samtools view -O bam : flag to tell Samtools to create an output in BAM format -o : path of the output file samtools index -b : flag to tell Samtools to create an index in BAI format","title":"Samtools options"},{"location":"course_material/day2/3_generalising_snakemake/#creating-a-rule-to-count-mapped-reads","text":"Most of the analyses happening downstream the alignment step, including Differential Expression Analyses, are starting off read counts, either by exon or gene. However, we are still missing those counts! Counting reads on exons/genes To count reads mapping on genomic features, we first need a definition of those features. In this case, we picked one of the best-known model organism, S. cerevisiae , which has been annotated for a long time. These annotations are easily available on the NCBI or the Saccharomyces Genome Database . If your organism has not been annotated yet, there are ways to work around this problem, but this is an entirely different field that we won\u2019t discuss here! Chromosome names If you are working with genome sequences and annotations from different sources, remember that they must contain the chromosome names, otherwise counting will not work. Exercise: Implement a rule to count the reads mapping on each gene of the S. cerevisiae genome using featureCounts . 
Hint You can find information on how to use featureCounts and its parameters with featureCounts -h The base of the mapping command is featureCounts -t exon -g gene_id -s 2 -p -B -C --largestOverlap --verbose -F GTF If you are interested in what these options mean, see below for an explanation The path of the file containing the annotations ( i.e. input files, in GTF format) is specified with the -a option. This file is located at resources/Scerevisiae.gtf There are two main annotations format: GTF and GFF . The former is lighter and easier to work with, so that is the one we will use The paths of the sorted BAM file(s) ( i.e. input file(s)) are not specified with an option, they are simply added at the end of the command The path of the file containing the count results ( i.e. output file, in tsv format) is specified with the option -o featureCounts will also output a separate file (in tsv format) including summary statistics of counting results, with the name .summary. For example, if the output is test.tsv , the summary will be printed in test.tsv.summary . Do not forget this output in your rule featureCounts also outputs information in the terminal (stderr to be exact); do not forget to redirect these to the log file with 2>> {log} Please give it a try before looking at the answer! Answer rule reads_quantification_genes : ''' This rule quantifies the reads of a bam file mapping on genes and produces a count table for all genes of the assembly. ''' input : bam_once_sorted = rules . sam_to_bam . output . bam_sorted , output : gene_level = 'results/ {sample} / {sample} _genes_read_quantification.tsv' , gene_summary = 'results/ {sample} / {sample} _genes_read_quantification.summary' log : 'logs/ {sample} / {sample} _genes_read_quantification.log' benchmark : 'benchmarks/ {sample} / {sample} _genes_read_quantification.txt' resources : mem_mb = 500 shell : ''' echo \"Counting reads mapping on genes in <{input.bam_once_sorted}>\" > {log} featureCounts -t exon -g gene_id -s 2 -p -B -C --largestOverlap --verbose -F GTF -a resources/Scerevisiae.gtf -o {output.gene_level} {input.bam_once_sorted} &>> {log} echo \"Renaming output files\" >> {log} mv {output.gene_level}.summary {output.gene_summary} echo \"Results saved in <{output.gene_level}>\" >> {log} echo \"Report saved in <{output.gene_summary}>\" >> {log} '''","title":"Creating a rule to count mapped reads"},{"location":"course_material/day2/3_generalising_snakemake/#featurecounts-options","text":"-t : specify on which feature type to count the reads -g : specify if and how to gather feature counts. Here, reads are counted by exon ( -t ) and the exon counts are gathered by genes \u2018meta-features\u2019 ( -g ) -s : perform strand-specific read counting Strandedness is determined by looking at the mRNA library preparation kit. It can also be determined a posteriori with scripts such as infer_experiment.py from the RSeQC package -p : count fragments instead of reads. 
If you don\u2019t use this option with paired-end reads, featureCounts won\u2019t be able to assign the read-pairs to features -B : only count read pairs that have both ends aligned -C : do not count read pairs that have their two ends mapping to different chromosomes or mapping on the same chromosome but on different strands --largestOverlap : assign reads to the meta-feature/feature that has the largest number of overlapping bases -F : specify format of the provided annotation file --verbose : output verbose information, such as unmatched chromosome/contig names","title":"featureCounts options"},{"location":"course_material/day2/3_generalising_snakemake/#running-the-workflow","text":"Exercise: If you have not done it after each step, it is now time to run the entire workflow on your sample of choice. What command will you use to run it? Answer Because all the rules are chained together, you only need to specify one of the final outputs to trigger the execution of all the previous rules: snakemake --cores 1 -F -r -p results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv . You can add the -F flag to force an entire re-run. The entire run should take about ~10 min to complete. Exercise: Check Snakemake\u2019s log in .snakemake/log/ . Is everything as you expected, especially the wildcards values, input and output names etc\u2026? Answer cat .snakemake/log/","title":"Running the workflow"},{"location":"course_material/day2/3_generalising_snakemake/#visualising-the-dag-of-the-workflow","text":"We have now implemented and run the main steps of our workflow. It is always a good idea to visualise the whole process to check for errors and inconsistencies. Snakemake\u2019s has a built-in workflow visualisation feature to do this. Exercise: Visualise the entire workflow\u2019s Directed Acyclic Graph using the --dag flag. Do you need to specify a target? Hint Try to follow the official recommendations on workflow structure, which states that images are supposed to go in the images/ subfolder Snakemake prints a DAG in text format, so we need to use the dot command to transform it into a picture Save the result as a PNG picture Answer If we run the command without target: snakemake --cores 1 --dag -F | dot -Tpng > images/dag.png , we will get a Target rules may not contain wildcards. error, which means we need to add a target. Same as before, it makes sense to use one of the final outputs to get the entire workflow: snakemake --cores 1 --dag -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv | dot -Tpng > images/dag.png . But once again, we will get an error: BrokenPipeError: [Errno 32] Broken pipe . This is because we are piping the command output to a folder ( images/ ) that does not exist yet The folder is not created by Snakemake because it isn\u2019t handled as part of an actual run. So we have to create the folder before generating the DAG: mkdir images snakemake --cores 1 --dag -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv | dot -Tpng > images/dag.png Some explanations on the command: -F : force to show the entire worklow and ensures all jobs are shown. You can also use -f to show fewer jobs dot : tool that is a part of the graphviz package and is used to draw hierarchical or layered drawings of directed graphs, i.e. graphs in which edges (arrows) have a direction -T : control the image format. 
Available formats are listed here DAG aspect If you already computed all the outputs of the workflow, steps in the DAG will have dotted lines. To visualise the DAG before running the workflow, add -F/--forceall to the snakemake command to force the execution of all jobs. DAG = dry-run The --dag flag implicitly activates the --dry-run/--dryrun/-n option, which means that no jobs are executed during the plot creation. There are actually 3 types of DAG: A DAG, created with the --dag option A filegraph, created with the --filegraph option A rulegraph, created with the --rulegraph option Exercise: Generate the filegraph and rulegraph of your workflow. Feel free to try different pictures format. What are the differences between the plots? Answer Generate the rulegraph: snakemake --cores 1 --rulegraph -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv | dot -Tpdf > images/rulegraph.pdf Generate the filegraph: snakemake --cores 1 --filegraph -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv | dot -Tjpg > images/filegraph.jpg You should obtain the 3 following figures, respectively DAG, rulegraph and filegraph: The differences are: --dag : dependency graph of all the jobs --filegraph : dependency graph of rules with inputs and outputs (rule appears once, with wildcards) --rulegraph : dependency graph of rules (rule appears once)","title":"Visualising the DAG of the workflow"},{"location":"course_material/day2/3_generalising_snakemake/#designing-a-snakemake-workflow-and-debugging-it","text":"","title":"Designing a Snakemake workflow... and debugging it!"},{"location":"course_material/day2/3_generalising_snakemake/#designing-a-workflow","text":"There are many ways to design a new workflow, but these few pieces of advice will be useful in most cases: Start with a pen and paper: try to find out how many rules you will need and how they depend on each other. In other terms, start by sketching the DAG of your workflow! Remember that Snakemake has a bottom-up approach (it goes from the final outputs to the first input), so it may be easier for you to work in that order as well and write your last rule first Determine which rules (if any) aggregate or split inputs and create input functions accordingly (we will see how these functions work in session 4) Make sure your input and output directives are right before worrying about anything else, especially the shell sections. Remember that Snakemake builds the DAG before running the shell commands, so you can use the --dryrun option to test the workflow before running it. You can even do that without writing all the shell commands! List any parameters or settings that might need to be adjusted Choose meaningful and easy-to-understand names for your inputs, outputs, parameters, wildcards\u2026 to make your Snakefile as readable as possible. This is true for every script, piece of code, variable etc\u2026 and Snakemake is no exception! Have a look at The Zen of Python for more information","title":"Designing a workflow"},{"location":"course_material/day2/3_generalising_snakemake/#debugging-a-workflow","text":"It is very likely you will see bugs and errors the first time you try to run a new Snakefile: don\u2019t be discouraged, this is normal! Order of operations in Snakemake The topic was tackled when DAGs were mentioned, but to efficiently debug a workflow, it is worth taking a deeper look at what Snakemake does when you execute the command snakemake --cores 1 . 
There are 3 main phases: Prepare to run: Read all the rule definitions from the Snakefile Resolve the DAG (when Snakemake says \u2018Building DAG of jobs\u2019): Check what output(s) are required Look for a matching rule by looking at the outputs of all the rules Fill in the wildcards to determine the input of the matching rule Check whether this input is available; if not, repeat Step 2 until everything is resolved Run: If needed, create the folder for the output(s) If needed, remove the outdated output(s) Run the shell command with the placeholders replaced Check that the command ran without errors and produced the expected output(s) Debugging advice Sometimes, Snakemake will give you a precise error report, but other times less so. Try to identify which phase of execution failed (see previous paragraph on order of operations) and double-check the most common error causes for that phase: Parsing phase failures (phase 1): Syntax errors, among which (but not limited to): This errors can be easily solved using a text editor with Python/Snakemake text colouring Missing commas/colons/semicolons Unbalanced quotes/brackets/parenthesis Wrong indentation Failure to evaluate expressions Problems in functions ( expand() , input functions\u2026) in input/output directives Python logic added outside of rules Other problems with rule definition Invalid rule names/directives Invalid wildcard names Mismatched wildcards DAG building failures (phase 2, before Snakemake tries to run any job): Failure to determine the target Ambiguous rules making the same output(s) On the contrary, no rule making the required output(s) Circular dependency (violating the \u2018Acyclic\u2019 property of a D A G). Write-protected output(s) DAG running failures (phase 3, --dry-run works and builds the DAG, but the real execution fails): When a job fails, Snakemake reports an error, deletes all output file(s) for that job (potential corruption), and stops Shell command returning non-zero status Missing output file(s) after the commands have run Reference to a $shell_variable before it was set Use of a wrong/unknown placeholder inside { }","title":"Debugging a workflow"},{"location":"course_material/day2/4_decorating_workflow/","text":"Learning outcomes After having completed this chapter you will be able to: Optimise a workflow by multi-threading Use non-file parameters and config files in rules Create rules with non-conventional outputs Modularise a workflow Make a workflow process a list of files rather than one file at a time Exercises In this series of exercises, we will create only one new rule to add to our workflow, because this part aims mainly to show how to improve and \u2018decorate\u2019 the rules we previously wrote. Development and back-up During this session, we will modify our Snakefile quite heavily, so it may be a good idea to start by making a back-up: cp worklow/Snakefile worklow/Snakefile_backup . As a general rule, if you have a doubt on the code you are developing, do not hesitate to make a back-up. Optimising a workflow by multi-threading When working with real datasets, most processes are very long and computationally expensive. Fortunately, they can be parallelised very efficiently to decrease the computation time by using several threads for a single job. 
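As a syntax reminder before the exercise, here is a generic sketch (not one of the course rules) showing where the threads directive goes and how its value is reused inside the shell command through the {threads} placeholder; remember that the value effectively used is capped by the --cores value given on the command line:

rule example_multithreaded:
    input:
        'data/example.tsv'
    output:
        'results/example_sorted.tsv'
    threads: 4
    shell:
        'sort --parallel={threads} {input} > {output}'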
Exercise: Parallelise as much processes as possible using the threads directive and test its effect: Identify which software can make use of parallelisation Identify in each software the parameter that controls multi-threading Implement the multi-threading Hint Check the software documentation and parameters with the -h/--help flags Remember that multi-threading only applies to software that can make use of a threads parameters, Snakemake itself cannot parallelize a software automatically Remember that you need to add threads to the Snakemake rule but also to the commands! Just increasing the number of threads in Snakemake will not magically run a command with multiple threads Remember that you have 4 threads in total, so even if you ask for more in a rule, Snakemake will cap this value at 4. And if you use 4 threads in a rule, that means that no other job can run parallel! Answer It turns out that all the software except samtools index can handle multi-threading: atropos trim , hisat2 , samtools view , and samtools sort use the --threads option featureCounts uses the -T option Let\u2019s use 4 threads for the mapping step and 2 for the other steps. Your Snakefile should look like this: rule fastq_trim: ''' This rule trims paired-end reads to improve their quality. Specifically, it removes: - Low quality bases - A stretches longer than 20 bases - N stretches ''' input: reads1 = 'data/{sample}_1.fastq', reads2 = 'data/{sample}_2.fastq', output: trim1 = 'results/{sample}/{sample}_atropos_trimmed_1.fastq', trim2 = 'results/{sample}/{sample}_atropos_trimmed_2.fastq' log: 'logs/{sample}/{sample}_atropos_trimming.log' benchmark: 'benchmarks/{sample}/{sample}_atropos_trimming.txt' resources: mem_mb = 500 threads: 2 shell: ''' echo \"Trimming reads in <{input.reads1}> and <{input.reads2}>\" > {log} atropos trim -q 20,20 --minimum-length 25 --trim-n --preserve-order --max-n 10 --no-cache-adapters -a \"A{{20}}\" -A \"A{{20}}\" --threads {threads} -pe1 {input.reads1} -pe2 {input.reads2} -o {output.trim1} -p {output.trim2} &>> {log} echo \"Trimmed files saved in <{output.trim1}> and <{output.trim2}> respectively\" >> {log} echo \"Trimming report saved in <{log}>\" >> {log} ''' rule read_mapping: ''' This rule maps trimmed reads of a fastq on a reference assembly. ''' input: trim1 = rules.fastq_trim.output.trim1, trim2 = rules.fastq_trim.output.trim2 output: sam = 'results/{sample}/{sample}_mapped_reads.sam', report = 'results/{sample}/{sample}_mapping_report.txt' log: 'logs/{sample}/{sample}_mapping.log' benchmark: 'benchmarks/{sample}/{sample}_mapping.txt' resources: mem_gb = 2 threads: 4 shell: ''' echo \"Mapping the reads\" > {log} hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal -x resources/genome_indices/Scerevisiae_index --threads {threads} -1 {input.trim1} -2 {input.trim2} -S {output.sam} --summary-file {output.report} 2>> {log} echo \"Mapped reads saved in <{output.sam}>\" >> {log} echo \"Mapping report saved in <{output.report}>\" >> {log} ''' rule sam_to_bam: ''' This rule converts a sam file to bam format, sorts it and indexes it. 
''' input: sam = rules.read_mapping.output.sam output: bam = 'results/{sample}/{sample}_mapped_reads.bam', bam_sorted = 'results/{sample}/{sample}_mapped_reads_sorted.bam', index = 'results/{sample}/{sample}_mapped_reads_sorted.bam.bai' log: 'logs/{sample}/{sample}_mapping_sam_to_bam.log' benchmark: 'benchmarks/{sample}/{sample}_mapping_sam_to_bam.txt' resources: mem_mb = 250 threads: 2 shell: ''' echo \"Converting <{input.sam}> to BAM format\" > {log} samtools view {input.sam} --threads {threads} -b -o {output.bam} 2>> {log} echo \"Sorting BAM file\" >> {log} samtools sort {output.bam} --threads {threads} -O bam -o {output.bam_sorted} 2>> {log} echo \"Indexing the sorted BAM file\" >> {log} samtools index -b {output.bam_sorted} -o {output.index} 2>> {log} echo \"Sorted file saved in <{output.bam_sorted}>\" >> {log} ''' rule reads_quantification_genes: ''' This rule quantifies the reads of a bam file mapping on genes and produces a count table for all genes of the assembly. The strandedness parameter is determined by get_strandedness(). ''' input: bam_once_sorted = rules.sam_to_bam.output.bam_sorted, output: gene_level = 'results/{sample}/{sample}_genes_read_quantification.tsv', gene_summary = 'results/{sample}/{sample}_genes_read_quantification.summary' log: 'logs/{sample}/{sample}_genes_read_quantification.log' benchmark: 'benchmarks/{sample}/{sample}_genes_read_quantification.txt' resources: mem_mb = 500 threads: 2 shell: ''' echo \"Counting reads mapping on genes in <{input.bam_once_sorted}>\" > {log} featureCounts -t exon -g gene_id -s 2 -p -B -C --largestOverlap --verbose -F GTF -a resources/Scerevisiae.gtf -T {threads} -o {output.gene_level} {input.bam_once_sorted} &>> {log} echo \"Renaming output files\" >> {log} mv {output.gene_level}.summary {output.gene_summary} echo \"Results saved in <{output.gene_level}>\" >> {log} echo \"Report saved in <{output.gene_summary}>\" >> {log} ''' Exercise: Finally, test the effect of the number of threads on the workflow\u2019s runtime. What command will you use to run the workflow? Does the workflow run faster? Answer The command to use is snakemake --cores 4 -F -r -p results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv . Do not forget to provide additional cores to Snakemake in the execution command with --cores 4 . Note that the number of threads allocated to all jobs running at a given time cannot exceed the value specified with --cores . Therefore, if you leave this number at 1, Snakemake will not be able to use multiple threads. Also note that increasing --cores allows Snakemake to run multiple jobs in parallel (for example, running 2 jobs using 2 threads each). The workflow now takes ~6 min to run, compared to ~10 min before ( i.e. a 40% decrease!). This gives you an idea of how powerful multi-threading is when the datasets and computing power get bigger! Explicit is better than implicit Even if a software cannot multi-thread, it is useful to add threads: 1 in the rule to keep the rule consistency and clearly state that the software works with a single thread. Keep in mind when using parallel execution Parallel jobs will use more RAM. If you run out then either your OS will swap data to disk, or a process will crash The on-screen output from parallel jobs will be mixed, so save any output to log files instead Using non-file parameters and config files Non-file parameters As we have seen, Snakemake\u2019s execution is based around inputs and outputs of each step of the workflow. 
However, a lot of software rely on additional non-file parameters. In the previous presentation and series of exercises, we advocated (rightfully so!) against using hard-coded filepaths. Yet, if you look back at the rules we have implemented, you will find 2 occurrences of this behaviour in the shell command: In the rule read_mapping , the index parameter -x resources/genome_indices/Scerevisiae_index In the rule reads_quantification_genes , the annotation parameter -a resources/Scerevisiae.gtf This reduces readability and also makes it very hard to change the value of these parameters. The params directive was designed for this purpose: it allows to specify additional parameters that can also depend on the wildcards values and use input functions (see Session 4 for more information on this). params values can be of any type (integer, string, list etc\u2026) and similarly to the {input} and {output} placeholders, they can also be accessed from the shell command with the placeholder arams} . Just like for the input and output directives, you can define multiple parameters (in this case, do not forget the comma between each entry!) and they can be named (in practice, unknown parameters are unexplicit and easily confusing, so parameters should always be named!). It also helps readability and clarity to use the params section to name and assign parameters and variables for your shell command. Here is an example on how to use params : rule example : input : 'data/example.tsv' output : 'results/example.txt' params : lines = 5 shell : 'head -n {params.lines} {input} > {output} ' Parameters arguments In contrast to the input directive, the params directive can optionally take more arguments than only wildcards, namely input, output, threads, and resources. Exercise: Replace the two hard-coded paths mentioned earlier by params . Hint Add a params directive to the rules, name the parameter and replace the path by the placeholder in the shell command. Answer Note: for clarity, only the lines that changed are shown below. rule read_mapping params : index = 'resources/genome_indices/Scerevisiae_index' shell : 'hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal -x {params.index} --threads {threads} -1 {input.trim1} -2 {input.trim2} -S {output.sam} --summary-file {output.report} 2>> {log} ' rule reads_quantification_genes params : annotations = 'resources/Scerevisiae.gtf' shell : 'featureCounts -t exon -g gene_id -s 2 -p -B -C --largestOverlap --verbose -F GTF -a {params.annotations} -T {threads} -o {output.gene_level} {input.bam_once_sorted} &>> {log} ' Snakemake re-run behaviour If you try to re-run only the last rule with snakemake --cores 4 -r -p -f results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv , Snakemake will actually try to re-run 3 rules in total. This is because the code changed in 2 rules (see reason field in Snakemake\u2019s log), which triggered an update of the inputs in the 3rd rule ( sam_to_bam ). To avoid this, first touch the files with snakemake --cores 1 --touch -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv then re-run the last rule. Config files That being said, there is an even better way to handle parameters like the we just modified: instead of hard-coding parameter values in the Snakefile, Snakemake allows to define parameters and their values in config files. 
The config files will be parsed by Snakemake when executing the workflow, and parameters and their values will be stored in a Python dictionary named config . The path to the config file can be specified either in the Snakefile with the line configfile: at the top of the file, or it can be specified at runtime with the execution parameter --configfile . Config files are stored in the config subfolder and written in the JSON or YAML format. We will use the latter for this course as it is the most user-friendly and the recommended one. Briefly, in the YAML format, parameters are defined with the syntax : . Values can be strings, integers, floating points, booleans \u2026 For a complete overview of available value types, see this list . A parameter can have multiple values, which are then each listed on an indented single line starting with \u201c - \u201d. These values will be stored in a Python list when Snakemake parses the config file. Finally, parameters can be nested on indented single lines, and they will be stored as a dictionary when Snakemake parses the config file. The example below shows a parameter with a single value ( lines_number ), a parameter with multiple values ( samples ), and an example of nested parameters ( resources ): # Parameter with a single value (string, int, float, bool ...) lines_number : 5 # Parameter with multiple values samples : - sample1 - sample2 # Nested parameters resources : threads : 4 memory : 4G Then, each parameter can be accessed in Snakefile with the following syntax: config [ 'lines_number' ] # --> 5 config [ 'samples' ] # --> ['sample1', 'sample2'] # Lists of parameters become list config [ 'resources' ] # --> {'threads': 4, 'memory': '4G'} # Lists of named parameters become dictionaries config [ 'resources' ][ 'threads' ] # --> 4 Accessing config values in shell Values stored in the config dictionary cannot be accessed directly within the shell directive. If you need to use a parameter value in shell , define the parameter in params and assign its value from the config dictionary. Exercise: Create a config file in YAML format and fill it with adapted variables and values to replace the 2 hard-coded parameters in rules read_mapping and reads_quantification_genes . Then replace the hard-coded parameters by values from the config file and add its path on top of your Snakefile. Answer Note: for clarity, only the lines that changed are shown below. The first step is to create the subfolder and an empty config file: mkdir config # Create a new folder touch config/config.yaml # Create an empty config file Then, fill the config file with the desired values: # Configuration options of RNAseq-analysis workflow # Location of the genome indices index : 'resources/genome_indices/Scerevisiae_index' # Location of the annotation file annotations : 'resources/Scerevisiae.gtf' Then, replace the params values in the Snakefile: rule read_mapping params : index = config [ 'index' ] rule reads_quantification_genes params : annotations = config [ 'annotations' ] Finally, add the file path on top of the Snakefile: configfile: 'config/config.yaml' Now, if we need to change these values, we can easily do it in the config file instead of modifying the code! Using non-conventional outputs Snakemake has several built-in utilities to assign properties to outputs that are deemed \u2018special\u2019. 
These properties are listed in the table below: Property Syntax Function Temporary temp('path/to/file.txt') File is deleted as soon as it is not required by any future jobs Protected protected('path/to/file.txt') File cannot be overwritten after the job ends (useful to prevent erasing a file by mistake) Ancient ancient('path/to/file.txt') File will not be re-created when running the pipeline (useful for files that require heavy computation) Directory directory('path/to/directory') Output is a directory instead of a file (use \u2018touch\u2019 instead if possible) Touch touch('path/to/file.txt') Create an empty flag file \u2018file.txt\u2019 regardless of the shell command (if the command finished without errors) The next paragraphs will show how to use some of these properties. Use-case of the temp() command Exercise: Can you think of a convenient use of temp() command? Answer The temp() command is extremely useful to automatically remove intermediary outputs that are no longer needed. Exercise: In your workflow, identify outputs that are intermediary and mark them as temporary with temp() . Answer The unsorted .bam and the .sam outputs seem like great candidates to be marked as temporary. One could also argue that the trimmed FASTQ files are also temporary, but we will keep them for now. Note: for clarity, only the lines that changed are shown below. rule read_mapping output : sam = temp ( 'results/ {sample} / {sample} _mapped_reads.sam' ), rule sam_to_bam output : bam = temp ( 'results/ {sample} / {sample} _mapped_reads.bam' ), Consequences of using temp() Removing temporary outputs is a great way to save a lot of storage space. If you look at the size of your current results/ folder ( du -bchd0 results/ ), you will notice that it drastically. Just removing these two files would allow to save ~1 GB. While it may not seem a lot, remember that you usually have much bigger files and many more samples! On the other hand, using temporary outputs might force you to re-run more jobs than necessary if an input changes, so carefully think about it before using it. Exercise: On the contrary, is there a file of your workflow that you would like to protect with protected() Answer This is debatable, but one could argue that the sorted .bam file is a good candidate for protection. rule sam_to_bam output : bam_sorted = protected ( 'results/ {sample} / {sample} _mapped_reads_sorted.bam' ), If you set this output as protected, be careful when you want to re-run your workflow and recreate the file! Use-case of the directory() command: the FastQC example FastQC is a program designed to spot potential problems in high-througput sequencing datasets. It is a very popular tool, notably because it runs quickly and does not require a lot of configuration. It runs a set of analyses on one or more raw sequence files in FASTQ or BAM format and produces a report with quality plots that summarises the results. It will highlight any areas where a dataset looks unusual and might require a closer look. FastQC can be run interactively or in batch mode, during which it saves results as an HTML file and a ZIP file. We will soon see that running FastQC in batch mode presents a little problem. Data types and FastQC FastQC does not differentiate between sequencing techniques and as such can be used to look at libraries coming from a large number of experiments (Genomic Sequencing, ChIP-Seq, RNAseq, BS-Seq etc\u2026). 
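The touch() property appears in the table above and is recommended further down as an alternative to directory(), but it is not demonstrated in these exercises, so here is a minimal, generic sketch of the flag-file pattern it enables (the rule name, paths and some_tool command are hypothetical). The flag file only records that the command finished successfully, which is convenient for tools whose real output names are hard to predict, and a downstream rule can simply list the flag among its inputs to enforce the execution order:

rule example_flag:
    input:
        'data/example.tsv'
    output:
        touch('results/example_done.flag')
    shell:
        'some_tool {input}'  # hypothetical command; its own output files are not tracked by Snakemake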
If you run fastqc -h , you will notice something a bit surprising (but not unusual in bioinformatics): -o --outdir Create all output files in the specified output directory. Please note that this directory must exist as the program will not create it. If this option is not set then the output file for each sequence file is created in the same directory as the sequence file which was processed. -d --dir Selects a directory to be used for temporary files written when generating report images. Defaults to system temp directory if not specified. Two files are produced for each FASTQ file and these files appear in the same directory as the input file: FastQC does not allow to specify the names of the output files! However, we can set an alternative output directory, even though it needs to be created before FastQC is run. There are different solutions to deal with this problem: Work with the default file names produced by FastQC and leave the reports in the same directory than the input files Create the outputs in a new directory and leave the reports with their default name Create the outputs in a new directory and tell Snakemake that the directory itself is the output Force a naming convention by renaming the FastQC output files within the rule For the sake of time, we will not test all 4 solutions, but rather try to apply the 3 rd or the 4 th solution. We\u2019ll briefly summarise solutions 1 and 2 here: This could work, but it\u2019s better not to put the reports in the same directory than the input sequences. As a general principle, when writing Snakemake rules, we prefer to be in charge of the output names and to have all the files linked to a sample in the same directory This involves manually constructing the output directory path to use with the -o option, which works but isn\u2019t very convenient The base of the FastQC command is the following: fastqc --format fastq --threads 2 -t/--threads : specify the number of files which can be processed simultaneously. Here, it will be 2 because the inputs are paired-end files The -o and -d will be used in the last 2 solutions that we will now see in details We will create a single rule to run FastQC on both the original and the trimmed FASTQ files Choose only one solution to implement: Solution 3 Solution 4 This option amounts to tell Snakemake not to worry about individual files at all and consider the output of the rule as an entire directory. Exercise: Implement a single rule to run FastQC on both the original and the trimmed FASTQ files (4 files in total) using directories as ouputs with the directory() command. Answer This makes the rule definition quite \u2018simple\u2019: rule fastq_qc_sol3 : ''' This rule performs a QC on paired-end fastq files before and after trimming. ''' input : reads1 = rules . fastq_trim . input . reads1 , reads2 = rules . fastq_trim . input . reads2 , trim1 = rules . fastq_trim . output . trim1 , trim2 = rules . fastq_trim . output . 
trim2 output : before_trim = directory ( 'results/ {sample} /fastqc_reports/before_trim/' ), after_trim = directory ( 'results/ {sample} /fastqc_reports/after_trim/' ) log : 'logs/ {sample} / {sample} _fastqc.log' benchmark : 'benchmarks/ {sample} / {sample} _atropos_fastqc.txt' resources : mem_gb = 1 threads : 2 shell : ''' echo \"Creating output directory <{output.before_trim}>\" > {log} mkdir -p {output.before_trim} 2>> {log} echo \"Performing QC of reads before trimming in <{input.reads1}> and <{input.reads2}>\" >> {log} fastqc --format fastq --threads {threads} --outdir {output.before_trim} --dir {output.before_trim} {input.reads1} {input.reads2} &>> {log} echo \"Results saved in <{output.before_trim}>\" >> {log} echo \"Creating output directory <{output.after_trim}>\" >> {log} mkdir -p {output.after_trim} 2>> {log} echo \"Performing QC of reads after trimming in <{input.trim1}> and <{input.trim2}>\" >> {log} fastqc --format fastq --threads {threads} --outdir {output.after_trim} --dir {output.after_trim} {input.trim1} {input.trim2} &>> {log} echo \"Results saved in <{output.after_trim}>\" >> {log} ''' .snakemake_timestamp When directory() is used, Snakemake creates an empty file called .snakemake_timestamp in the output directory. This is the marker it uses to know if it needs to re-run the rule producing the directory. Overall, this rule works well and allows for an easy rule definition. However, in this case, individual files are not explicitly named as outputs and this may cause problems to chain rules later. Also, remember that some applications won\u2019t give you any control at all over the outputs, which is why you need a back-up plan, i.e. solution 4: the most powerful solution is to use shell commands to move and/or rename the files to the names you want. Also, the Snakemake developers advise to use directory() as a last resort and to rather use the touch() flag instead. This option amounts to let FastQC follows its default behaviour but force the renaming of the files afterwards to obtain the exact outputs we require. Exercise: Implement a single rule to run FastQC on both the original and the trimmed FASTQ files (4 files in total) and rename the files created by FastQC to precise output names using the mv command. Answer This makes the rule definition (much) more complicated than the other solution: rule fastq_qc_sol4 : ''' This rule performs a QC on paired-end fastq files before and after trimming. ''' input : reads1 = rules . fastq_trim . input . reads1 , reads2 = rules . fastq_trim . input . reads2 , trim1 = rules . fastq_trim . output . trim1 , trim2 = rules . fastq_trim . output . 
trim2 output : # QC before trimming html1_before = 'results/ {sample} /fastqc_reports/ {sample} _before_trim_1.html' , zipfile1_before = 'results/ {sample} /fastqc_reports/ {sample} _before_trim_1.zip' , html2_before = 'results/ {sample} /fastqc_reports/ {sample} _before_trim_2.html' , zipfile2_before = 'results/ {sample} /fastqc_reports/ {sample} _before_trim_2.zip' , # QC after trimming html1_after = 'results/ {sample} /fastqc_reports/ {sample} _after_trim_1.html' , zipfile1_after = 'results/ {sample} /fastqc_reports/ {sample} _after_trim_1.zip' , html2_after = 'results/ {sample} /fastqc_reports/ {sample} _after_trim_2.html' , zipfile2_after = 'results/ {sample} /fastqc_reports/ {sample} _after_trim_2.zip' params : wd = 'results/ {sample} /fastqc_reports/' , # QC before trimming html1_before = 'results/ {sample} /fastqc_reports/ {sample} _1_fastqc.html' , zipfile1_before = 'results/ {sample} /fastqc_reports/ {sample} _1_fastqc.zip' , html2_before = 'results/ {sample} /fastqc_reports/ {sample} _2_fastqc.html' , zipfile2_before = 'results/ {sample} /fastqc_reports/ {sample} _2_fastqc.zip' , # QC after trimming html1_after = 'results/ {sample} /fastqc_reports/ {sample} _atropos_trimmed_1_fastqc.html' , zipfile1_after = 'results/ {sample} /fastqc_reports/ {sample} _atropos_trimmed_1_fastqc.zip' , html2_after = 'results/ {sample} /fastqc_reports/ {sample} _atropos_trimmed_2_fastqc.html' , zipfile2_after = 'results/ {sample} /fastqc_reports/ {sample} _atropos_trimmed_2_fastqc.zip' log : 'logs/ {sample} / {sample} _fastqc.log' benchmark : 'benchmarks/ {sample} / {sample} _atropos_fastqc.txt' resources : mem_gb = 1 threads : 2 shell : ''' echo \"Performing QC of reads before trimming in <{input.reads1}> and <{input.reads2}>\" >> {log} fastqc --format fastq --threads {threads} --outdir {params.wd} --dir {params.wd} {input.reads1} {input.reads2} &>> {log} echo \"Renaming results from original fastq analysis\" >> {log} # Renames files because we can't choose fastqc output mv {params.html1_before} {output.html1_before} 2>> {log} mv {params.zipfile1_before} {output.zipfile1_before} 2>> {log} mv {params.html2_before} {output.html2_before} 2>> {log} mv {params.zipfile2_before} {output.zipfile2_before} 2>> {log} echo \"Performing QC of reads after trimming in <{input.trim1}> and <{input.trim2}>\" >> {log} fastqc --format fastq --threads {threads} --outdir {params.wd} --dir {params.wd} {input.trim1} {input.trim2} &>> {log} echo \"Renaming results from trimmed fastq analysis\" >> {log} # Renames files because we can't choose fastqc output mv {params.html1_after} {output.html1_after} 2>> {log} mv {params.zipfile1_after} {output.zipfile1_after} 2>> {log} mv {params.html2_after} {output.html2_after} 2>> {log} mv {params.zipfile2_after} {output.zipfile2_after} 2>> {log} echo \"Results saved in \" >> {log} ''' This solution is very long and much more complicated than the other one. However, it makes up for the complexity by allowing a total control on what is happening: with this method, we can choose where the temporary files are saved and the names of the outputs. It could have been shortened by using -o . to tell FastQC to create the files in the current working directory instead of a specific one, but this would have created another problem: if we run multiple jobs in parallel, then Snakemake may potentially try to produce files from different jobs but with the same temporary destination. 
In this case, the different instances would be trying to write to the same temporary files at the same time, overwriting each other and corrupting the output files. Several interesting things are happening in both versions of this rule: Much like for the outputs, it is possible to refer to the inputs of a rule directly in another rule with the syntax rules..input. FastQC doesn\u2019t create the output directory by itself (other programs might insist that the output directory does not already exist), so we have to create it manually with mkdir in the shell command before running FastQC The -p flag of mkdir make parent directories as needed and does not return an error if the directory already exists Directory creation Remember that in most cases it is not necessary to manually create directories because Snakemake will do it for you. Even when using a directory( ) output, Snakemake will not create the directory itself but most applications will make the directory for you; FastQC is an exception. Hint If you want to make sure that a certain rule is executed before another, you can write the outputs of the first rule as inputs of the second one, even if you don\u2019t use them in the rule. For example, we could force the execution of FastQC before mapping the reads with only a few modifications to rule read_mapping : rule read_mapping: ''' This rule maps trimmed reads of a fastq on a reference assembly. ''' input: trim1 = rules.fastq_trim.output.trim1, trim2 = rules.fastq_trim.output.trim2, # Do not forget to add a comma here fastqc = rules.fastq_qc_sol4.output.html1_before # This single line will force the execution of FASTQC before read mapping output: sam = 'results/{sample}/{sample}_mapped_reads.sam', report = 'results/{sample}/{sample}_mapping_report.txt' params: index = 'resources/genome_indices/Scerevisiae_index' log: 'logs/{sample}/{sample}_mapping.log' benchmark: 'benchmarks/{sample}/{sample}_mapping.txt' resources: mem_gb = 2 threads: 4 shell: ''' echo \"Mapping the reads\" > {log} hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal -x {params.index} --threads {threads} -1 {input.trim1} -2 {input.trim2} -S {output.sam} --summary-file {output.report} 2>> {log} echo \"Mapped reads saved in <{output.sam}>\" >> {log} echo \"Mapping report saved in <{output.report}>\" >> {log} ''' Modularising a workflow If you keep developing a workflow long enough, you are bound to encounter some cluttering problems. Have a look at your current Snakefile: with only 5 rules, it is already almost 200 lines long. Imagine what happens when your workflow comprises dozens of rules?! The Snakefile may become messy and harder to maintain and edit. This is why it quickly becomes crucial to modularise your workflow; this is a common practice in programming in general. This approach also makes it easier to re-use pieces of workflow in the future. Modularisation comes at 4 different levels: The most fine-grained level are wrappers. Wrappers allow to quickly use popular tools and libraries in Snakemake workflows, thanks to the wrapper directive. Wrappers are automatically downloaded and deploy a conda environment when running the workflow, which increases reproducibility, however their implementation can sometimes be \u2018rigid\u2019 and you may have to write your own rule. 
See the official documentation for more explanations For larger, reusable parts belonging to the same workflow, it is recommended to write smaller snakefiles and include them into a main Snakefile with the include statement. Note that in this case, all rules share a common config file. See the official documentation for more explanations The next level of modularisation is provided via the module statement, which enables arbitrary combination and re-use of rules in the same workflow and between workflows. See the official documentation for more explanations Finally, Snakemake also provides a syntax to define subworkflows, but this syntax is currently being deprecated in favor of the module statement. See the official documentation for more explanations In this course, we will only use the 2 nd level of modularisation. In more details, the idea is to write a main Snakefile in workflow/Snakefile , to place the other snakefiles containing the rules in the subfolder workflow/rules (these \u2018sub-Snakefile\u2019 should end with .smk , the recommended file extension of Snakemake) and to tell Snakemake to import the modular snakefiles in the main Snakefile with the include: syntax. Rules organisation How to organize rules is up to you, but a common approach would be to create \u201cthematic\u201d modules, i.e. regroup rules involved in the same general step of the workflow. Exercise: Move your current Snakefile into the subfolder workflow/rules and rename it to read_mapping.smk . Then create a new Snakefile in workflow/ and import read_mapping.smk in it using the include syntax. You should also move the importation of the config file from the modular Snakefile to the main one. Answer We will solve this problem step by step. First, create the new file structure: mkdir workflow/rules # Create a new folder mv workflow/Snakefile workflow/rules/read_mapping.smk # Move and rename the modular snakefile touch workflow/Snakefile # Recreate the main Snakefile Then, fill the main Snakefile with include and configfile : ''' Main Snakefile of the RNAseq analysis workflow. This workflow can clean and map reads, and perform Differential Expression Analyses. ''' # Path of the config file configfile : 'config/config.yaml' # Rules to execute the workflow include : 'rules/read_mapping.smk' Finally, do not forget to remove the config file import ( configfile: 'config/config.yaml' ) from the snakefiles ( workflow/rules/read_mapping.smk ) Relative paths Includes are relative to the directory of the Snakefile in which they occur. For example, if the Snakefile resides in workflow , then Snakemake will search for the included snakefiles in workflow/path/to/other/snakefile , regardless of the working directory You can place snakefiles in a sub-directory without changing input and output paths, as these paths are relative to the working directory. However, you will need to edit paths to external scripts and conda environments, as these paths are relative to the snakefile from which they are called (this will be discussed in the last series of exercises) In practice, you can imagine that the line include: is replaced by the entire content of snakefile.smk in Snakefile . This means that syntaxes like rules..output. can still be used in snakefiles, even if the rule was defined in another snakefile, as long as the snakefile in which is defined is included before the snakefile that uses rules..output . This also works for input and output functions. 
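To make this last point concrete, here is a minimal sketch of a main Snakefile including two modular snakefiles, where a rule of the second module reuses an output of a rule defined in the first one (the counting.smk module and the counting_example rule are hypothetical names used only for illustration); the cross-reference resolves only because read_mapping.smk is included before counting.smk:

# workflow/Snakefile (main snakefile)
configfile: 'config/config.yaml'
include: 'rules/read_mapping.smk'
include: 'rules/counting.smk'

# workflow/rules/counting.smk (hypothetical second module)
rule counting_example:
    input:
        bam_sorted = rules.sam_to_bam.output.bam_sorted  # rule defined in rules/read_mapping.smk
    output:
        'results/{sample}/{sample}_example_counts.tsv'
    shell:
        'touch {output}'  # placeholder command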
Using a target rule and aggregating outputs Creating a target rule Modularisation also offers a great opportunity to facilitate the execution of the workflow. By default, if no target is given at the command line, Snakemake executes the first rule in the Snakefile. Hence, we have always executed the workflow by specifying a target file in the command line to avoid this behaviour. But we can actually use this property to make the execution easier by writing a pseudo-rule (also called target-rule and usually named rule all ) in the Snakefile which has all the desired outputs (or a particular subsets of them) files as input files. This rule will look like this: rule all : input : 'path/to/ouput1' , 'path/to/ouput2' Order of rules in Snakefile/snakefiles Apart from Snakemake considering the first rule of the workflow as the default target, the order of rules in the Snakefile/snakefiles is arbitrary and does not influence the DAG of jobs. Exercise: Implement a special rule in the Snakefile so that the final output is generated by default when running snakemake without specifying a target, then test your workflow with a dry-run. Hint Remember that a rule is not required to have an output nor a shell command The inputs of rule all should be the final outputs that you want to generate (those from the last rule you wrote) Answer If we consider that the last outputs are the ones produced by rule reads_quantification_genes , we can write the target rule like this: # Master rule that launches the workflow rule all : ''' Dummy rule to automatically generate the required outputs. ''' input : 'results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv' Note that we used only one of the two outputs of rule reads_quantification_genes . We do this because it is enough to trigger the execution and if the rule didn\u2019t produce both outputs, Snakemake would crash and report it this error. Now, let\u2019s try to do a dry-run with this new rule: snakemake --cores 4 -F -r -p -n . You should see all the rules appearing thanks to the -F flag, including: localrule all: input: results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv jobid: 0 reason: Input files updated by another job: results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv resources: tmpdir = /tmp Job stats: job count -------------------------- ------- all 1 fastq_qc_sol4 1 fastq_trim 1 read_mapping 1 reads_quantification_genes 1 sam_to_bam 1 total 6 Aggregating outputs Using a target rule like the one presented in the previous paragraph gives another opportunity to make things easier. In the rule we just created, we used a hard-coded input and by now, you should know that this is not an optimal solution and that we should avoid this as much as possible, especially if you have many samples to process. To solve this problem, we will rely on the expand function . Exercise: Write an expand() syntax to generate a list of outputs from rule reads_quantification_genes with all the RNAseq samples . What do you need to write this? Answer The output of rule reads_quantification_genes has the following syntax: 'results/{sample}/{sample}_genes_read_quantification.tsv' . 
First, we need to create a Python list containing all the values that the {sample} wildcards can take: SAMPLES = ['highCO2_sample1', 'highCO2_sample2', 'highCO2_sample3', 'lowCO2_sample1', 'lowCO2_sample2', 'lowCO2_sample3'] Then, we can transform the output syntax with expand() : expand('results/{sample}/{sample}_genes_read_quantification.tsv', sample=SAMPLES) Exercise: Use these two elements (the list of samples and the expand() syntax) in the target rule to ask Snakemake to generate all the outputs. Answer You need to add the sample list to the Snakefile before the rule all and replace the value of the input directive: # Sample list SAMPLES = [ 'highCO2_sample1' , 'highCO2_sample2' , 'highCO2_sample3' , 'lowCO2_sample1' , 'lowCO2_sample2' , 'lowCO2_sample3' ] # Master rule that launches the workflow rule all : ''' Dummy rule to automatically generate the required outputs. ''' input : expand ( 'results/ {sample} / {sample} _genes_read_quantification.tsv' , sample = SAMPLES ) If you launch the workflow in dry-run mode with this new rule: snakemake --cores 4 -F -r -p -n . You should see all the rules appearing 5 times (1 for each sample that hasn\u2019t been processed yet): localrule all: input: results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv, results/highCO2_sample2/highCO2_sample2_genes_read_quantification.tsv, results/highCO2_sample3/highCO2_sample3_genes_read_quantification.tsv, results/lowCO2_sample1/lowCO2_sample1_genes_read_quantification.tsv, results/lowCO2_sample2/lowCO2_sample2_genes_read_quantification.tsv, results/lowCO2_sample3/lowCO2_sample3_genes_read_quantification.tsv jobid: 0 reason: Input files updated by another job: results/lowCO2_sample1/lowCO2_sample1_genes_read_quantification.tsv, results/lowCO2_sample2/lowCO2_sample2_genes_read_quantification.tsv, results/lowCO2_sample3/lowCO2_sample3_genes_read_quantification.tsv, results/highCO2_sample3/highCO2_sample3_genes_read_quantification.tsv, results/highCO2_sample2/highCO2_sample2_genes_read_quantification.tsv resources: tmpdir = /tmp Job stats: job count -------------------------- ------- all 1 fastq_qc_sol4 5 fastq_trim 5 read_mapping 5 reads_quantification_genes 5 sam_to_bam 5 total 26 But we can do even better! At the moment, samples are defined in a list at the top of the Snakefile. To further improve the workflow\u2019s usability, we can define samples in the config file, so they can easily be added, removed, or modified by the user. Exercise: Implement a parameter in the config file to specify sample names and modify rule all to use this parameter in the expand() syntax. Answer First, we need to modify the config file: # Configuration options of RNAseq-analysis workflow # Location of the genome indices index : 'resources/genome_indices/Scerevisiae_index' # Location of the annotation file annotations : 'resources/Scerevisiae.gtf' # Sample names samples : - highCO2_sample1 - highCO2_sample2 - highCO2_sample3 - lowCO2_sample1 - lowCO2_sample2 - lowCO2_sample3 Then, we need to use the config file in the expand() syntax (and remove SAMPLES from the Snakefile, because we don\u2019t need this variable anymore): # Master rule that launches the workflow rule all : ''' Dummy rule to automatically generate the required outputs. ''' input : expand ( 'results/ {sample} / {sample} _genes_read_quantification.tsv' , sample = config [ 'samples' ]) Here, config['samples'] is a Python list containing strings, each string being a sample name. 
This is because a list of parameter values becomes a Python list during the config file parsing. An even more Snakemake-idiomatic solution There is an even better and more Snakemake-idiomatic version of the expand() syntax: expand(rules.reads_quantification_genes.output.gene_level, sample=config['samples']) . While it may not seem easy to read at first, this entirely removes the need to write the output paths! Running the other samples of the workflow Exercise: Touch the files already present in your workflow to avoid re-creating them and then run your workflow on the 5 other samples. Answer To touch the existing files, you can use: snakemake --cores 1 --touch To run the workflow, you can use snakemake --cores 4 -r -p Thanks to the parallelisation, the workflow execution should take less than 10 min in total to process all the samples! Exercise: Generate the workflow DAG and filegraph. Answer Generate the DAG: snakemake --cores 1 -F -r -p --rulegraph | dot -Tpng > images/all_samples_rulegraph.png Generate the filegraph: snakemake --cores 1 -F -r -p --filegraph | dot -Tpng > images/all_samples_filegraph.png Your DAG should resemble this: And your filegraph, this:","title":"Decorating and optimising a Snakemake workflow"},{"location":"course_material/day2/4_decorating_workflow/#learning-outcomes","text":"After having completed this chapter you will be able to: Optimise a workflow by multi-threading Use non-file parameters and config files in rules Create rules with non-conventional outputs Modularise a workflow Make a workflow process a list of files rather than one file at a time","title":"Learning outcomes"},{"location":"course_material/day2/4_decorating_workflow/#exercises","text":"In this series of exercises, we will create only one new rule to add to our workflow, because this part mainly aims to show how to improve and \u2018decorate\u2019 the rules we previously wrote. Development and back-up During this session, we will modify our Snakefile quite heavily, so it may be a good idea to start by making a back-up: cp workflow/Snakefile workflow/Snakefile_backup . As a general rule, if you have a doubt about the code you are developing, do not hesitate to make a back-up.","title":"Exercises"},{"location":"course_material/day2/4_decorating_workflow/#optimising-a-workflow-by-multi-threading","text":"When working with real datasets, most processes are very long and computationally expensive. Fortunately, they can be parallelised very efficiently to decrease the computation time by using several threads for a single job. Exercise: Parallelise as many processes as possible using the threads directive and test its effect: Identify which software can make use of parallelisation Identify in each software the parameter that controls multi-threading Implement the multi-threading Hint Check the software documentation and parameters with the -h/--help flags Remember that multi-threading only applies to software that can make use of a threads parameter; Snakemake itself cannot parallelise a software automatically Remember that you need to add threads to the Snakemake rule but also to the commands! Just increasing the number of threads in Snakemake will not magically run a command with multiple threads Remember that you have 4 threads in total, so even if you ask for more in a rule, Snakemake will cap this value at 4. And if you use 4 threads in a rule, that means that no other job can run in parallel! 
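Before checking the full answer, the general pattern for adding multi-threading to a rule looks like this; this is a minimal sketch with a hypothetical rule, tool name and options, not one of the course rules:
rule example_multithreading:
    input:
        'data/example_input.txt'
    output:
        'results/example_output.txt'
    threads: 2    # number of threads Snakemake reserves for each job of this rule
    shell:
        # the same value must be passed explicitly to the tool itself
        'some_tool --threads {threads} --in {input} --out {output}'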
Answer It turns out that all the software except samtools index can handle multi-threading: atropos trim , hisat2 , samtools view , and samtools sort use the --threads option featureCounts uses the -T option Let\u2019s use 4 threads for the mapping step and 2 for the other steps. Your Snakefile should look like this: rule fastq_trim: ''' This rule trims paired-end reads to improve their quality. Specifically, it removes: - Low quality bases - A stretches longer than 20 bases - N stretches ''' input: reads1 = 'data/{sample}_1.fastq', reads2 = 'data/{sample}_2.fastq', output: trim1 = 'results/{sample}/{sample}_atropos_trimmed_1.fastq', trim2 = 'results/{sample}/{sample}_atropos_trimmed_2.fastq' log: 'logs/{sample}/{sample}_atropos_trimming.log' benchmark: 'benchmarks/{sample}/{sample}_atropos_trimming.txt' resources: mem_mb = 500 threads: 2 shell: ''' echo \"Trimming reads in <{input.reads1}> and <{input.reads2}>\" > {log} atropos trim -q 20,20 --minimum-length 25 --trim-n --preserve-order --max-n 10 --no-cache-adapters -a \"A{{20}}\" -A \"A{{20}}\" --threads {threads} -pe1 {input.reads1} -pe2 {input.reads2} -o {output.trim1} -p {output.trim2} &>> {log} echo \"Trimmed files saved in <{output.trim1}> and <{output.trim2}> respectively\" >> {log} echo \"Trimming report saved in <{log}>\" >> {log} ''' rule read_mapping: ''' This rule maps trimmed reads of a fastq on a reference assembly. ''' input: trim1 = rules.fastq_trim.output.trim1, trim2 = rules.fastq_trim.output.trim2 output: sam = 'results/{sample}/{sample}_mapped_reads.sam', report = 'results/{sample}/{sample}_mapping_report.txt' log: 'logs/{sample}/{sample}_mapping.log' benchmark: 'benchmarks/{sample}/{sample}_mapping.txt' resources: mem_gb = 2 threads: 4 shell: ''' echo \"Mapping the reads\" > {log} hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal -x resources/genome_indices/Scerevisiae_index --threads {threads} -1 {input.trim1} -2 {input.trim2} -S {output.sam} --summary-file {output.report} 2>> {log} echo \"Mapped reads saved in <{output.sam}>\" >> {log} echo \"Mapping report saved in <{output.report}>\" >> {log} ''' rule sam_to_bam: ''' This rule converts a sam file to bam format, sorts it and indexes it. ''' input: sam = rules.read_mapping.output.sam output: bam = 'results/{sample}/{sample}_mapped_reads.bam', bam_sorted = 'results/{sample}/{sample}_mapped_reads_sorted.bam', index = 'results/{sample}/{sample}_mapped_reads_sorted.bam.bai' log: 'logs/{sample}/{sample}_mapping_sam_to_bam.log' benchmark: 'benchmarks/{sample}/{sample}_mapping_sam_to_bam.txt' resources: mem_mb = 250 threads: 2 shell: ''' echo \"Converting <{input.sam}> to BAM format\" > {log} samtools view {input.sam} --threads {threads} -b -o {output.bam} 2>> {log} echo \"Sorting BAM file\" >> {log} samtools sort {output.bam} --threads {threads} -O bam -o {output.bam_sorted} 2>> {log} echo \"Indexing the sorted BAM file\" >> {log} samtools index -b {output.bam_sorted} -o {output.index} 2>> {log} echo \"Sorted file saved in <{output.bam_sorted}>\" >> {log} ''' rule reads_quantification_genes: ''' This rule quantifies the reads of a bam file mapping on genes and produces a count table for all genes of the assembly. The strandedness parameter is determined by get_strandedness(). 
''' input: bam_once_sorted = rules.sam_to_bam.output.bam_sorted, output: gene_level = 'results/{sample}/{sample}_genes_read_quantification.tsv', gene_summary = 'results/{sample}/{sample}_genes_read_quantification.summary' log: 'logs/{sample}/{sample}_genes_read_quantification.log' benchmark: 'benchmarks/{sample}/{sample}_genes_read_quantification.txt' resources: mem_mb = 500 threads: 2 shell: ''' echo \"Counting reads mapping on genes in <{input.bam_once_sorted}>\" > {log} featureCounts -t exon -g gene_id -s 2 -p -B -C --largestOverlap --verbose -F GTF -a resources/Scerevisiae.gtf -T {threads} -o {output.gene_level} {input.bam_once_sorted} &>> {log} echo \"Renaming output files\" >> {log} mv {output.gene_level}.summary {output.gene_summary} echo \"Results saved in <{output.gene_level}>\" >> {log} echo \"Report saved in <{output.gene_summary}>\" >> {log} ''' Exercise: Finally, test the effect of the number of threads on the workflow\u2019s runtime. What command will you use to run the workflow? Does the workflow run faster? Answer The command to use is snakemake --cores 4 -F -r -p results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv . Do not forget to provide additional cores to Snakemake in the execution command with --cores 4 . Note that the number of threads allocated to all jobs running at a given time cannot exceed the value specified with --cores . Therefore, if you leave this number at 1, Snakemake will not be able to use multiple threads. Also note that increasing --cores allows Snakemake to run multiple jobs in parallel (for example, running 2 jobs using 2 threads each). The workflow now takes ~6 min to run, compared to ~10 min before ( i.e. a 40% decrease!). This gives you an idea of how powerful multi-threading is when the datasets and computing power get bigger! Explicit is better than implicit Even if a software cannot multi-thread, it is useful to add threads: 1 in the rule to keep the rule consistency and clearly state that the software works with a single thread. Keep in mind when using parallel execution Parallel jobs will use more RAM. If you run out then either your OS will swap data to disk, or a process will crash The on-screen output from parallel jobs will be mixed, so save any output to log files instead","title":"Optimising a workflow by multi-threading"},{"location":"course_material/day2/4_decorating_workflow/#using-non-file-parameters-and-config-files","text":"","title":"Using non-file parameters and config files"},{"location":"course_material/day2/4_decorating_workflow/#non-file-parameters","text":"As we have seen, Snakemake\u2019s execution is based around inputs and outputs of each step of the workflow. However, a lot of software rely on additional non-file parameters. In the previous presentation and series of exercises, we advocated (rightfully so!) against using hard-coded filepaths. Yet, if you look back at the rules we have implemented, you will find 2 occurrences of this behaviour in the shell command: In the rule read_mapping , the index parameter -x resources/genome_indices/Scerevisiae_index In the rule reads_quantification_genes , the annotation parameter -a resources/Scerevisiae.gtf This reduces readability and also makes it very hard to change the value of these parameters. The params directive was designed for this purpose: it allows to specify additional parameters that can also depend on the wildcards values and use input functions (see Session 4 for more information on this). 
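As a quick illustration of a params value that depends on the wildcards, a rule can compute it with a small lambda function; the sketch below uses hypothetical paths and is not part of the course workflow:
rule example_wildcard_param:
    input:
        'data/{sample}.txt'
    output:
        'results/{sample}/{sample}_labelled.txt'
    params:
        # computed from the wildcards at run time
        label = lambda wildcards: wildcards.sample.upper()
    shell:
        'cat {input} > {output} && echo Sample label: {params.label} >> {output}'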
params values can be of any type (integer, string, list etc\u2026) and, similarly to the {input} and {output} placeholders, they can also be accessed from the shell command with the {params} placeholder. Just like for the input and output directives, you can define multiple parameters (in this case, do not forget the comma between each entry!) and they can be named (in practice, unnamed parameters are not explicit and easily confusing, so parameters should always be named!). It also helps readability and clarity to use the params section to name and assign parameters and variables for your shell command. Here is an example of how to use params : rule example : input : 'data/example.tsv' output : 'results/example.txt' params : lines = 5 shell : 'head -n {params.lines} {input} > {output} ' Parameters arguments In contrast to the input directive, the params directive can optionally take more arguments than only wildcards, namely input, output, threads, and resources. Exercise: Replace the two hard-coded paths mentioned earlier with params . Hint Add a params directive to the rules, name the parameter and replace the path with the placeholder in the shell command. Answer Note: for clarity, only the lines that changed are shown below. rule read_mapping params : index = 'resources/genome_indices/Scerevisiae_index' shell : 'hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal -x {params.index} --threads {threads} -1 {input.trim1} -2 {input.trim2} -S {output.sam} --summary-file {output.report} 2>> {log} ' rule reads_quantification_genes params : annotations = 'resources/Scerevisiae.gtf' shell : 'featureCounts -t exon -g gene_id -s 2 -p -B -C --largestOverlap --verbose -F GTF -a {params.annotations} -T {threads} -o {output.gene_level} {input.bam_once_sorted} &>> {log} ' Snakemake re-run behaviour If you try to re-run only the last rule with snakemake --cores 4 -r -p -f results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv , Snakemake will actually try to re-run 3 rules in total. This is because the code changed in 2 rules (see the reason field in Snakemake\u2019s log), which triggered an update of the inputs in the 3rd rule ( sam_to_bam ). To avoid this, first touch the files with snakemake --cores 1 --touch -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv then re-run the last rule.","title":"Non-file parameters"},{"location":"course_material/day2/4_decorating_workflow/#config-files","text":"That being said, there is an even better way to handle parameters like the ones we just modified: instead of hard-coding parameter values in the Snakefile, Snakemake allows you to define parameters and their values in config files. The config files will be parsed by Snakemake when executing the workflow, and parameters and their values will be stored in a Python dictionary named config . The path to the config file can be specified either in the Snakefile with a configfile: line at the top of the file, or it can be specified at runtime with the execution parameter --configfile . Config files are stored in the config subfolder and written in the JSON or YAML format. We will use the latter for this course as it is the most user-friendly and the recommended one. Briefly, in the YAML format, parameters are defined with the syntax parameter_name: value . Values can be strings, integers, floating points, booleans \u2026 For a complete overview of available value types, see this list . 
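For instance, the different value types could look like this in a config file (the parameter names below are hypothetical, purely for illustration):
# Hypothetical examples of YAML value types
aligner: 'hisat2'          # string
min_read_length: 25        # integer
max_error_rate: 0.1        # float
keep_intermediate: false   # boolean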
A parameter can have multiple values, which are then each listed on an indented single line starting with \u201c - \u201d. These values will be stored in a Python list when Snakemake parses the config file. Finally, parameters can be nested on indented single lines, and they will be stored as a dictionary when Snakemake parses the config file. The example below shows a parameter with a single value ( lines_number ), a parameter with multiple values ( samples ), and an example of nested parameters ( resources ): # Parameter with a single value (string, int, float, bool ...) lines_number : 5 # Parameter with multiple values samples : - sample1 - sample2 # Nested parameters resources : threads : 4 memory : 4G Then, each parameter can be accessed in Snakefile with the following syntax: config [ 'lines_number' ] # --> 5 config [ 'samples' ] # --> ['sample1', 'sample2'] # Lists of parameters become list config [ 'resources' ] # --> {'threads': 4, 'memory': '4G'} # Lists of named parameters become dictionaries config [ 'resources' ][ 'threads' ] # --> 4 Accessing config values in shell Values stored in the config dictionary cannot be accessed directly within the shell directive. If you need to use a parameter value in shell , define the parameter in params and assign its value from the config dictionary. Exercise: Create a config file in YAML format and fill it with adapted variables and values to replace the 2 hard-coded parameters in rules read_mapping and reads_quantification_genes . Then replace the hard-coded parameters by values from the config file and add its path on top of your Snakefile. Answer Note: for clarity, only the lines that changed are shown below. The first step is to create the subfolder and an empty config file: mkdir config # Create a new folder touch config/config.yaml # Create an empty config file Then, fill the config file with the desired values: # Configuration options of RNAseq-analysis workflow # Location of the genome indices index : 'resources/genome_indices/Scerevisiae_index' # Location of the annotation file annotations : 'resources/Scerevisiae.gtf' Then, replace the params values in the Snakefile: rule read_mapping params : index = config [ 'index' ] rule reads_quantification_genes params : annotations = config [ 'annotations' ] Finally, add the file path on top of the Snakefile: configfile: 'config/config.yaml' Now, if we need to change these values, we can easily do it in the config file instead of modifying the code!","title":"Config files"},{"location":"course_material/day2/4_decorating_workflow/#using-non-conventional-outputs","text":"Snakemake has several built-in utilities to assign properties to outputs that are deemed \u2018special\u2019. 
These properties are listed in the table below: Property Syntax Function Temporary temp('path/to/file.txt') File is deleted as soon as it is not required by any future jobs Protected protected('path/to/file.txt') File cannot be overwritten after the job ends (useful to prevent erasing a file by mistake) Ancient ancient('path/to/file.txt') File will not be re-created when running the pipeline (useful for files that require heavy computation) Directory directory('path/to/directory') Output is a directory instead of a file (use \u2018touch\u2019 instead if possible) Touch touch('path/to/file.txt') Create an empty flag file \u2018file.txt\u2019 regardless of the shell command (if the command finished without errors) The next paragraphs will show how to use some of these properties.","title":"Using non-conventional outputs"},{"location":"course_material/day2/4_decorating_workflow/#use-case-of-the-temp-command","text":"Exercise: Can you think of a convenient use of temp() command? Answer The temp() command is extremely useful to automatically remove intermediary outputs that are no longer needed. Exercise: In your workflow, identify outputs that are intermediary and mark them as temporary with temp() . Answer The unsorted .bam and the .sam outputs seem like great candidates to be marked as temporary. One could also argue that the trimmed FASTQ files are also temporary, but we will keep them for now. Note: for clarity, only the lines that changed are shown below. rule read_mapping output : sam = temp ( 'results/ {sample} / {sample} _mapped_reads.sam' ), rule sam_to_bam output : bam = temp ( 'results/ {sample} / {sample} _mapped_reads.bam' ), Consequences of using temp() Removing temporary outputs is a great way to save a lot of storage space. If you look at the size of your current results/ folder ( du -bchd0 results/ ), you will notice that it drastically. Just removing these two files would allow to save ~1 GB. While it may not seem a lot, remember that you usually have much bigger files and many more samples! On the other hand, using temporary outputs might force you to re-run more jobs than necessary if an input changes, so carefully think about it before using it. Exercise: On the contrary, is there a file of your workflow that you would like to protect with protected() Answer This is debatable, but one could argue that the sorted .bam file is a good candidate for protection. rule sam_to_bam output : bam_sorted = protected ( 'results/ {sample} / {sample} _mapped_reads_sorted.bam' ), If you set this output as protected, be careful when you want to re-run your workflow and recreate the file!","title":"Use-case of the temp() command"},{"location":"course_material/day2/4_decorating_workflow/#use-case-of-the-directory-command-the-fastqc-example","text":"FastQC is a program designed to spot potential problems in high-througput sequencing datasets. It is a very popular tool, notably because it runs quickly and does not require a lot of configuration. It runs a set of analyses on one or more raw sequence files in FASTQ or BAM format and produces a report with quality plots that summarises the results. It will highlight any areas where a dataset looks unusual and might require a closer look. FastQC can be run interactively or in batch mode, during which it saves results as an HTML file and a ZIP file. We will soon see that running FastQC in batch mode presents a little problem. 
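If you want to try FastQC by hand before reading on, a minimal batch-mode invocation could look like this (the paths are illustrative and the output directory must already exist, which hints at the problem discussed below):
mkdir -p results/fastqc_test    # FastQC will not create the output directory itself
fastqc --outdir results/fastqc_test data/highCO2_sample1_1.fastq data/highCO2_sample1_2.fastq
# produces highCO2_sample1_1_fastqc.html/.zip and highCO2_sample1_2_fastqc.html/.zip in results/fastqc_test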
Data types and FastQC FastQC does not differentiate between sequencing techniques and as such can be used to look at libraries coming from a large number of experiments (Genomic Sequencing, ChIP-Seq, RNAseq, BS-Seq etc\u2026). If you run fastqc -h , you will notice something a bit surprising (but not unusual in bioinformatics): -o --outdir Create all output files in the specified output directory. Please note that this directory must exist as the program will not create it. If this option is not set then the output file for each sequence file is created in the same directory as the sequence file which was processed. -d --dir Selects a directory to be used for temporary files written when generating report images. Defaults to system temp directory if not specified. Two files are produced for each FASTQ file and these files appear in the same directory as the input file: FastQC does not allow to specify the names of the output files! However, we can set an alternative output directory, even though it needs to be created before FastQC is run. There are different solutions to deal with this problem: Work with the default file names produced by FastQC and leave the reports in the same directory than the input files Create the outputs in a new directory and leave the reports with their default name Create the outputs in a new directory and tell Snakemake that the directory itself is the output Force a naming convention by renaming the FastQC output files within the rule For the sake of time, we will not test all 4 solutions, but rather try to apply the 3 rd or the 4 th solution. We\u2019ll briefly summarise solutions 1 and 2 here: This could work, but it\u2019s better not to put the reports in the same directory than the input sequences. As a general principle, when writing Snakemake rules, we prefer to be in charge of the output names and to have all the files linked to a sample in the same directory This involves manually constructing the output directory path to use with the -o option, which works but isn\u2019t very convenient The base of the FastQC command is the following: fastqc --format fastq --threads 2 -t/--threads : specify the number of files which can be processed simultaneously. Here, it will be 2 because the inputs are paired-end files The -o and -d will be used in the last 2 solutions that we will now see in details We will create a single rule to run FastQC on both the original and the trimmed FASTQ files Choose only one solution to implement: Solution 3 Solution 4 This option amounts to tell Snakemake not to worry about individual files at all and consider the output of the rule as an entire directory. Exercise: Implement a single rule to run FastQC on both the original and the trimmed FASTQ files (4 files in total) using directories as ouputs with the directory() command. Answer This makes the rule definition quite \u2018simple\u2019: rule fastq_qc_sol3 : ''' This rule performs a QC on paired-end fastq files before and after trimming. ''' input : reads1 = rules . fastq_trim . input . reads1 , reads2 = rules . fastq_trim . input . reads2 , trim1 = rules . fastq_trim . output . trim1 , trim2 = rules . fastq_trim . output . 
trim2 output : before_trim = directory ( 'results/ {sample} /fastqc_reports/before_trim/' ), after_trim = directory ( 'results/ {sample} /fastqc_reports/after_trim/' ) log : 'logs/ {sample} / {sample} _fastqc.log' benchmark : 'benchmarks/ {sample} / {sample} _atropos_fastqc.txt' resources : mem_gb = 1 threads : 2 shell : ''' echo \"Creating output directory <{output.before_trim}>\" > {log} mkdir -p {output.before_trim} 2>> {log} echo \"Performing QC of reads before trimming in <{input.reads1}> and <{input.reads2}>\" >> {log} fastqc --format fastq --threads {threads} --outdir {output.before_trim} --dir {output.before_trim} {input.reads1} {input.reads2} &>> {log} echo \"Results saved in <{output.before_trim}>\" >> {log} echo \"Creating output directory <{output.after_trim}>\" >> {log} mkdir -p {output.after_trim} 2>> {log} echo \"Performing QC of reads after trimming in <{input.trim1}> and <{input.trim2}>\" >> {log} fastqc --format fastq --threads {threads} --outdir {output.after_trim} --dir {output.after_trim} {input.trim1} {input.trim2} &>> {log} echo \"Results saved in <{output.after_trim}>\" >> {log} ''' .snakemake_timestamp When directory() is used, Snakemake creates an empty file called .snakemake_timestamp in the output directory. This is the marker it uses to know if it needs to re-run the rule producing the directory. Overall, this rule works well and allows for an easy rule definition. However, in this case, individual files are not explicitly named as outputs and this may cause problems to chain rules later. Also, remember that some applications won\u2019t give you any control at all over the outputs, which is why you need a back-up plan, i.e. solution 4: the most powerful solution is to use shell commands to move and/or rename the files to the names you want. Also, the Snakemake developers advise to use directory() as a last resort and to rather use the touch() flag instead. This option amounts to let FastQC follows its default behaviour but force the renaming of the files afterwards to obtain the exact outputs we require. Exercise: Implement a single rule to run FastQC on both the original and the trimmed FASTQ files (4 files in total) and rename the files created by FastQC to precise output names using the mv command. Answer This makes the rule definition (much) more complicated than the other solution: rule fastq_qc_sol4 : ''' This rule performs a QC on paired-end fastq files before and after trimming. ''' input : reads1 = rules . fastq_trim . input . reads1 , reads2 = rules . fastq_trim . input . reads2 , trim1 = rules . fastq_trim . output . trim1 , trim2 = rules . fastq_trim . output . 
trim2 output : # QC before trimming html1_before = 'results/ {sample} /fastqc_reports/ {sample} _before_trim_1.html' , zipfile1_before = 'results/ {sample} /fastqc_reports/ {sample} _before_trim_1.zip' , html2_before = 'results/ {sample} /fastqc_reports/ {sample} _before_trim_2.html' , zipfile2_before = 'results/ {sample} /fastqc_reports/ {sample} _before_trim_2.zip' , # QC after trimming html1_after = 'results/ {sample} /fastqc_reports/ {sample} _after_trim_1.html' , zipfile1_after = 'results/ {sample} /fastqc_reports/ {sample} _after_trim_1.zip' , html2_after = 'results/ {sample} /fastqc_reports/ {sample} _after_trim_2.html' , zipfile2_after = 'results/ {sample} /fastqc_reports/ {sample} _after_trim_2.zip' params : wd = 'results/ {sample} /fastqc_reports/' , # QC before trimming html1_before = 'results/ {sample} /fastqc_reports/ {sample} _1_fastqc.html' , zipfile1_before = 'results/ {sample} /fastqc_reports/ {sample} _1_fastqc.zip' , html2_before = 'results/ {sample} /fastqc_reports/ {sample} _2_fastqc.html' , zipfile2_before = 'results/ {sample} /fastqc_reports/ {sample} _2_fastqc.zip' , # QC after trimming html1_after = 'results/ {sample} /fastqc_reports/ {sample} _atropos_trimmed_1_fastqc.html' , zipfile1_after = 'results/ {sample} /fastqc_reports/ {sample} _atropos_trimmed_1_fastqc.zip' , html2_after = 'results/ {sample} /fastqc_reports/ {sample} _atropos_trimmed_2_fastqc.html' , zipfile2_after = 'results/ {sample} /fastqc_reports/ {sample} _atropos_trimmed_2_fastqc.zip' log : 'logs/ {sample} / {sample} _fastqc.log' benchmark : 'benchmarks/ {sample} / {sample} _atropos_fastqc.txt' resources : mem_gb = 1 threads : 2 shell : ''' echo \"Performing QC of reads before trimming in <{input.reads1}> and <{input.reads2}>\" >> {log} fastqc --format fastq --threads {threads} --outdir {params.wd} --dir {params.wd} {input.reads1} {input.reads2} &>> {log} echo \"Renaming results from original fastq analysis\" >> {log} # Renames files because we can't choose fastqc output mv {params.html1_before} {output.html1_before} 2>> {log} mv {params.zipfile1_before} {output.zipfile1_before} 2>> {log} mv {params.html2_before} {output.html2_before} 2>> {log} mv {params.zipfile2_before} {output.zipfile2_before} 2>> {log} echo \"Performing QC of reads after trimming in <{input.trim1}> and <{input.trim2}>\" >> {log} fastqc --format fastq --threads {threads} --outdir {params.wd} --dir {params.wd} {input.trim1} {input.trim2} &>> {log} echo \"Renaming results from trimmed fastq analysis\" >> {log} # Renames files because we can't choose fastqc output mv {params.html1_after} {output.html1_after} 2>> {log} mv {params.zipfile1_after} {output.zipfile1_after} 2>> {log} mv {params.html2_after} {output.html2_after} 2>> {log} mv {params.zipfile2_after} {output.zipfile2_after} 2>> {log} echo \"Results saved in \" >> {log} ''' This solution is very long and much more complicated than the other one. However, it makes up for the complexity by allowing a total control on what is happening: with this method, we can choose where the temporary files are saved and the names of the outputs. It could have been shortened by using -o . to tell FastQC to create the files in the current working directory instead of a specific one, but this would have created another problem: if we run multiple jobs in parallel, then Snakemake may potentially try to produce files from different jobs but with the same temporary destination. 
In this case, the different instances would be trying to write to the same temporary files at the same time, overwriting each other and corrupting the output files. Several interesting things are happening in both versions of this rule: Much like for the outputs, it is possible to refer to the inputs of a rule directly in another rule with the syntax rules..input. FastQC doesn\u2019t create the output directory by itself (other programs might insist that the output directory does not already exist), so we have to create it manually with mkdir in the shell command before running FastQC The -p flag of mkdir make parent directories as needed and does not return an error if the directory already exists Directory creation Remember that in most cases it is not necessary to manually create directories because Snakemake will do it for you. Even when using a directory( ) output, Snakemake will not create the directory itself but most applications will make the directory for you; FastQC is an exception. Hint If you want to make sure that a certain rule is executed before another, you can write the outputs of the first rule as inputs of the second one, even if you don\u2019t use them in the rule. For example, we could force the execution of FastQC before mapping the reads with only a few modifications to rule read_mapping : rule read_mapping: ''' This rule maps trimmed reads of a fastq on a reference assembly. ''' input: trim1 = rules.fastq_trim.output.trim1, trim2 = rules.fastq_trim.output.trim2, # Do not forget to add a comma here fastqc = rules.fastq_qc_sol4.output.html1_before # This single line will force the execution of FASTQC before read mapping output: sam = 'results/{sample}/{sample}_mapped_reads.sam', report = 'results/{sample}/{sample}_mapping_report.txt' params: index = 'resources/genome_indices/Scerevisiae_index' log: 'logs/{sample}/{sample}_mapping.log' benchmark: 'benchmarks/{sample}/{sample}_mapping.txt' resources: mem_gb = 2 threads: 4 shell: ''' echo \"Mapping the reads\" > {log} hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal -x {params.index} --threads {threads} -1 {input.trim1} -2 {input.trim2} -S {output.sam} --summary-file {output.report} 2>> {log} echo \"Mapped reads saved in <{output.sam}>\" >> {log} echo \"Mapping report saved in <{output.report}>\" >> {log} '''","title":"Use-case of the directory() command: the FastQC example"},{"location":"course_material/day2/4_decorating_workflow/#modularising-a-workflow","text":"If you keep developing a workflow long enough, you are bound to encounter some cluttering problems. Have a look at your current Snakefile: with only 5 rules, it is already almost 200 lines long. Imagine what happens when your workflow comprises dozens of rules?! The Snakefile may become messy and harder to maintain and edit. This is why it quickly becomes crucial to modularise your workflow; this is a common practice in programming in general. This approach also makes it easier to re-use pieces of workflow in the future. Modularisation comes at 4 different levels: The most fine-grained level are wrappers. Wrappers allow to quickly use popular tools and libraries in Snakemake workflows, thanks to the wrapper directive. Wrappers are automatically downloaded and deploy a conda environment when running the workflow, which increases reproducibility, however their implementation can sometimes be \u2018rigid\u2019 and you may have to write your own rule. 
See the official documentation for more explanations For larger, reusable parts belonging to the same workflow, it is recommended to write smaller snakefiles and include them into a main Snakefile with the include statement. Note that in this case, all rules share a common config file. See the official documentation for more explanations The next level of modularisation is provided via the module statement, which enables arbitrary combination and re-use of rules in the same workflow and between workflows. See the official documentation for more explanations Finally, Snakemake also provides a syntax to define subworkflows, but this syntax is currently being deprecated in favor of the module statement. See the official documentation for more explanations In this course, we will only use the 2 nd level of modularisation. In more details, the idea is to write a main Snakefile in workflow/Snakefile , to place the other snakefiles containing the rules in the subfolder workflow/rules (these \u2018sub-Snakefile\u2019 should end with .smk , the recommended file extension of Snakemake) and to tell Snakemake to import the modular snakefiles in the main Snakefile with the include: syntax. Rules organisation How to organize rules is up to you, but a common approach would be to create \u201cthematic\u201d modules, i.e. regroup rules involved in the same general step of the workflow. Exercise: Move your current Snakefile into the subfolder workflow/rules and rename it to read_mapping.smk . Then create a new Snakefile in workflow/ and import read_mapping.smk in it using the include syntax. You should also move the importation of the config file from the modular Snakefile to the main one. Answer We will solve this problem step by step. First, create the new file structure: mkdir workflow/rules # Create a new folder mv workflow/Snakefile workflow/rules/read_mapping.smk # Move and rename the modular snakefile touch workflow/Snakefile # Recreate the main Snakefile Then, fill the main Snakefile with include and configfile : ''' Main Snakefile of the RNAseq analysis workflow. This workflow can clean and map reads, and perform Differential Expression Analyses. ''' # Path of the config file configfile : 'config/config.yaml' # Rules to execute the workflow include : 'rules/read_mapping.smk' Finally, do not forget to remove the config file import ( configfile: 'config/config.yaml' ) from the snakefiles ( workflow/rules/read_mapping.smk ) Relative paths Includes are relative to the directory of the Snakefile in which they occur. For example, if the Snakefile resides in workflow , then Snakemake will search for the included snakefiles in workflow/path/to/other/snakefile , regardless of the working directory You can place snakefiles in a sub-directory without changing input and output paths, as these paths are relative to the working directory. However, you will need to edit paths to external scripts and conda environments, as these paths are relative to the snakefile from which they are called (this will be discussed in the last series of exercises) In practice, you can imagine that the line include: is replaced by the entire content of snakefile.smk in Snakefile . This means that syntaxes like rules..output. can still be used in snakefiles, even if the rule was defined in another snakefile, as long as the snakefile in which is defined is included before the snakefile that uses rules..output . 
This also works for input and output functions.","title":"Modularising a workflow"},{"location":"course_material/day2/4_decorating_workflow/#using-a-target-rule-and-aggregating-outputs","text":"","title":"Using a target rule and aggregating outputs"},{"location":"course_material/day2/4_decorating_workflow/#creating-a-target-rule","text":"Modularisation also offers a great opportunity to facilitate the execution of the workflow. By default, if no target is given at the command line, Snakemake executes the first rule in the Snakefile. Hence, we have always executed the workflow by specifying a target file in the command line to avoid this behaviour. But we can actually use this property to make the execution easier by writing a pseudo-rule (also called target-rule and usually named rule all ) in the Snakefile which has all the desired outputs (or a particular subsets of them) files as input files. This rule will look like this: rule all : input : 'path/to/ouput1' , 'path/to/ouput2' Order of rules in Snakefile/snakefiles Apart from Snakemake considering the first rule of the workflow as the default target, the order of rules in the Snakefile/snakefiles is arbitrary and does not influence the DAG of jobs. Exercise: Implement a special rule in the Snakefile so that the final output is generated by default when running snakemake without specifying a target, then test your workflow with a dry-run. Hint Remember that a rule is not required to have an output nor a shell command The inputs of rule all should be the final outputs that you want to generate (those from the last rule you wrote) Answer If we consider that the last outputs are the ones produced by rule reads_quantification_genes , we can write the target rule like this: # Master rule that launches the workflow rule all : ''' Dummy rule to automatically generate the required outputs. ''' input : 'results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv' Note that we used only one of the two outputs of rule reads_quantification_genes . We do this because it is enough to trigger the execution and if the rule didn\u2019t produce both outputs, Snakemake would crash and report it this error. Now, let\u2019s try to do a dry-run with this new rule: snakemake --cores 4 -F -r -p -n . You should see all the rules appearing thanks to the -F flag, including: localrule all: input: results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv jobid: 0 reason: Input files updated by another job: results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv resources: tmpdir = /tmp Job stats: job count -------------------------- ------- all 1 fastq_qc_sol4 1 fastq_trim 1 read_mapping 1 reads_quantification_genes 1 sam_to_bam 1 total 6","title":"Creating a target rule"},{"location":"course_material/day2/4_decorating_workflow/#aggregating-outputs","text":"Using a target rule like the one presented in the previous paragraph gives another opportunity to make things easier. In the rule we just created, we used a hard-coded input and by now, you should know that this is not an optimal solution and that we should avoid this as much as possible, especially if you have many samples to process. To solve this problem, we will rely on the expand function . Exercise: Write an expand() syntax to generate a list of outputs from rule reads_quantification_genes with all the RNAseq samples . What do you need to write this? 
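As a reminder of how expand() works in general, here is a small sketch with toy values, unrelated to the course files, before you look at the answer:
# expand() builds a list of strings by filling in every combination of the placeholders
expand('results/{cond}_{rep}.txt', cond=['ctrl', 'treat'], rep=[1, 2])
# gives the four combinations:
# ['results/ctrl_1.txt', 'results/ctrl_2.txt', 'results/treat_1.txt', 'results/treat_2.txt']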
Answer The output of rule reads_quantification_genes has the following syntax: 'results/{sample}/{sample}_genes_read_quantification.tsv' . First, we need to create a Python list containing all the values that the {sample} wildcards can take: SAMPLES = ['highCO2_sample1', 'highCO2_sample2', 'highCO2_sample3', 'lowCO2_sample1', 'lowCO2_sample2', 'lowCO2_sample3'] Then, we can transform the output syntax with expand() : expand('results/{sample}/{sample}_genes_read_quantification.tsv', sample=SAMPLES) Exercise: Use these two elements (the list of samples and the expand() syntax) in the target rule to ask Snakemake to generate all the outputs. Answer You need to add the sample list to the Snakefile before the rule all and replace the value of the input directive: # Sample list SAMPLES = [ 'highCO2_sample1' , 'highCO2_sample2' , 'highCO2_sample3' , 'lowCO2_sample1' , 'lowCO2_sample2' , 'lowCO2_sample3' ] # Master rule that launches the workflow rule all : ''' Dummy rule to automatically generate the required outputs. ''' input : expand ( 'results/ {sample} / {sample} _genes_read_quantification.tsv' , sample = SAMPLES ) If you launch the workflow in dry-run mode with this new rule: snakemake --cores 4 -F -r -p -n . You should see all the rules appearing 5 times (1 for each sample that hasn\u2019t been processed yet): localrule all: input: results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv, results/highCO2_sample2/highCO2_sample2_genes_read_quantification.tsv, results/highCO2_sample3/highCO2_sample3_genes_read_quantification.tsv, results/lowCO2_sample1/lowCO2_sample1_genes_read_quantification.tsv, results/lowCO2_sample2/lowCO2_sample2_genes_read_quantification.tsv, results/lowCO2_sample3/lowCO2_sample3_genes_read_quantification.tsv jobid: 0 reason: Input files updated by another job: results/lowCO2_sample1/lowCO2_sample1_genes_read_quantification.tsv, results/lowCO2_sample2/lowCO2_sample2_genes_read_quantification.tsv, results/lowCO2_sample3/lowCO2_sample3_genes_read_quantification.tsv, results/highCO2_sample3/highCO2_sample3_genes_read_quantification.tsv, results/highCO2_sample2/highCO2_sample2_genes_read_quantification.tsv resources: tmpdir = /tmp Job stats: job count -------------------------- ------- all 1 fastq_qc_sol4 5 fastq_trim 5 read_mapping 5 reads_quantification_genes 5 sam_to_bam 5 total 26 But we can do even better! At the moment, samples are defined in a list at the top of the Snakefile. To further improve the workflow\u2019s usability, we can define samples in the config file, so they can easily be added, removed, or modified by the user. Exercise: Implement a parameter in the config file to specify sample names and modify rule all to use this parameter in the expand() syntax. Answer First, we need to modify the config file: # Configuration options of RNAseq-analysis workflow # Location of the genome indices index : 'resources/genome_indices/Scerevisiae_index' # Location of the annotation file annotations : 'resources/Scerevisiae.gtf' # Sample names samples : - highCO2_sample1 - highCO2_sample2 - highCO2_sample3 - lowCO2_sample1 - lowCO2_sample2 - lowCO2_sample3 Then, we need to use the config file in the expand() syntax (and remove SAMPLES from the Snakefile, because we don\u2019t need this variable anymore): # Master rule that launches the workflow rule all : ''' Dummy rule to automatically generate the required outputs. 
''' input : expand ( 'results/ {sample} / {sample} _genes_read_quantification.tsv' , sample = config [ 'samples' ]) Here, config['samples'] is a Python list containing strings, each string being a sample name. This is because a list of parameters become a list during the config file parsing. An even more Snakemake-idiomatic solution There is an even better and more Snakemake-idiomatic version of the expand() syntax: expand(rules.reads_quantification_genes.output.gene_level, sample=config['samples']) . While it may not seem easy to use and understand, this entirely removes the need to write the output paths!","title":"Aggregating outputs"},{"location":"course_material/day2/4_decorating_workflow/#running-the-other-samples-of-the-workflow","text":"Exercise: Touch the files already present in your workflow to avoid re-creating them and then run your workflow on the 5 other samples. Answer To touch the existing files, you can use: snakemake --cores 1 --touch To run the workflow, you can use snakemake --cores 4 -r -p Thanks to the parallelisation, the workflow execution should take less than 10 min in total to process all the samples! Exercise: Generate the workflow DAG and filegraph. Answer Generate the DAG: snakemake --cores 1 -F -r -p --rulegraph | dot -Tpng > images/all_samples_rulegraph.png Generate the filegraph: snakemake --cores 1 -F -r -p --filegraph | dot -Tpng > images/all_samples_filegraph.png Your DAG should resemble this: And your filegraph, this:","title":"Running the other samples of the workflow"}]} \ No newline at end of file +{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Course website Teachers Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Antonin Thi\u00e9baut .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Damir Zhakparov Authors Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Antonin Thi\u00e9baut .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Patricia Palagi .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Attribution This course is partly inspired by the Carpentries Docker course and the official Snakemake tutorial . License & copyright License: CC BY-SA 4.0 Copyright: SIB Swiss Institute of Bioinformatics Material This website Google doc (through mail) Learning outcomes General learning outcomes After this course, you will be able to: Understand the basic concepts and terminology associated with virtualization with containers Customize, store, manage and share containerized environments with Docker Use Singularity to run containers on a shared computer environment (e.g. a HPC cluster) Understand the basic concepts and terminology associated with workflow management systems Create a computational workflow that uses containers and package managers with Snakemake Learning outcomes explained To reach the general learning outcomes above, we have set a number of smaller learning outcomes. Each chapter (found at Course material ) starts with these smaller learning outcomes. Use these at the start of a chapter to get an idea what you will learn. Use them also at the end of a chapter to evaluate whether you have learned what you were expected to learn. Learning experiences To reach the learning outcomes we will use lectures, exercises, polls and group work. During exercises, you are free to discuss with other participants. During lectures, focus on the lecture only. Exercises Each block has practical work involved. Some more than others. 
The practicals are subdivided into chapters, and we\u2019ll have a (short) discussion after each chapter. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different. Asking questions During lectures, you are encouraged to raise your hand if you have questions. A main source of communication will be our slack channel . Ask background questions that interest you personally at #background . During the exercises, e.g. if you are stuck or don\u2019t understand what is going on, use the slack channel #q-and-a . This channel is not only meant for asking questions but also for answering questions of other participants. If you are replying to a question, use the \u201creply in thread\u201d option: The teachers will review the answers, and add/modify if necessary. If you\u2019re really stuck and need specific tutor support, write the teachers or helpers personally. To summarise: During lectures: raise hand Personal interest questions: #background on slack During exercises: raise hand/ #q-and-a on slack","title":"Home"},{"location":"#course-website","text":"","title":"Course website"},{"location":"#teachers","text":"Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Antonin Thi\u00e9baut .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Damir Zhakparov","title":"Teachers"},{"location":"#authors","text":"Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Antonin Thi\u00e9baut .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Patricia Palagi .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;}","title":"Authors"},{"location":"#attribution","text":"This course is partly inspired by the Carpentries Docker course and the official Snakemake tutorial .","title":"Attribution"},{"location":"#license-copyright","text":"License: CC BY-SA 4.0 Copyright: SIB Swiss Institute of Bioinformatics","title":"License & copyright"},{"location":"#material","text":"This website Google doc (through mail)","title":"Material"},{"location":"#learning-outcomes","text":"","title":"Learning outcomes"},{"location":"#general-learning-outcomes","text":"After this course, you will be able to: Understand the basic concepts and terminology associated with virtualization with containers Customize, store, manage and share containerized environments with Docker Use Singularity to run containers on a shared computer environment (e.g. a HPC cluster) Understand the basic concepts and terminology associated with workflow management systems Create a computational workflow that uses containers and package managers with Snakemake","title":"General learning outcomes"},{"location":"#learning-outcomes-explained","text":"To reach the general learning outcomes above, we have set a number of smaller learning outcomes. Each chapter (found at Course material ) starts with these smaller learning outcomes. Use these at the start of a chapter to get an idea what you will learn. Use them also at the end of a chapter to evaluate whether you have learned what you were expected to learn.","title":"Learning outcomes explained"},{"location":"#learning-experiences","text":"To reach the learning outcomes we will use lectures, exercises, polls and group work. During exercises, you are free to discuss with other participants. During lectures, focus on the lecture only.","title":"Learning experiences"},{"location":"#exercises","text":"Each block has practical work involved. Some more than others. 
The practicals are subdivided into chapters, and we\u2019ll have a (short) discussion after each chapter. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different.","title":"Exercises"},{"location":"#asking-questions","text":"During lectures, you are encouraged to raise your hand if you have questions. A main source of communication will be our slack channel . Ask background questions that interest you personally at #background . During the exercises, e.g. if you are stuck or don\u2019t understand what is going on, use the slack channel #q-and-a . This channel is not only meant for asking questions but also for answering questions of other participants. If you are replying to a question, use the \u201creply in thread\u201d option: The teachers will review the answers, and add/modify if necessary. If you\u2019re really stuck and need specific tutor support, write the teachers or helpers personally. To summarise: During lectures: raise hand Personal interest questions: #background on slack During exercises: raise hand/ #q-and-a on slack","title":"Asking questions"},{"location":"course_schedule/","text":"Day 1 - Containers Block Start End subject Block 1 9:00 AM 10:30 AM Introduction to containers 10:30 AM 11:00 AM BREAK Block 2 11:00 AM 12:30 PM Managing containers and images 12:30 PM 1:30 PM BREAK Block 3 1:30 PM 3:00 PM Working with dockerfiles 3:00 PM 3:30 PM BREAK Block 4 3:30 PM 5:00 PM Running containers with singularity Day 2 - Snakemake Block Start End subject Block 1 9:00 AM 10:30 AM Introduction to Snakemake 10:30 AM 11:00 AM BREAK Block 2 11:00 AM 12:30 PM Generalising a Snakemake workflow 12:30 PM 1:30 PM BREAK Block 3 1:30 PM 3:00 PM Decorating a Snakemake workflow 3:00 PM 3:30 PM BREAK Block 4 3:30 PM 4:30 PM Snakemake, package managers and containers 4:30 PM 5:00 PM Wrap-up & Open Q&A","title":"Course schedule"},{"location":"course_schedule/#day-1-containers","text":"Block Start End subject Block 1 9:00 AM 10:30 AM Introduction to containers 10:30 AM 11:00 AM BREAK Block 2 11:00 AM 12:30 PM Managing containers and images 12:30 PM 1:30 PM BREAK Block 3 1:30 PM 3:00 PM Working with dockerfiles 3:00 PM 3:30 PM BREAK Block 4 3:30 PM 5:00 PM Running containers with singularity","title":"Day 1 - Containers"},{"location":"course_schedule/#day-2-snakemake","text":"Block Start End subject Block 1 9:00 AM 10:30 AM Introduction to Snakemake 10:30 AM 11:00 AM BREAK Block 2 11:00 AM 12:30 PM Generalising a Snakemake workflow 12:30 PM 1:30 PM BREAK Block 3 1:30 PM 3:00 PM Decorating a Snakemake workflow 3:00 PM 3:30 PM BREAK Block 4 3:30 PM 4:30 PM Snakemake, package managers and containers 4:30 PM 5:00 PM Wrap-up & Open Q&A","title":"Day 2 - Snakemake"},{"location":"precourse/","text":"UNIX As is stated in the course prerequisites at the announcement web page . We expect participants to have a basic understanding of working with the command line on UNIX-based systems. You can test your UNIX skills with a quiz here . If you don\u2019t have experience with UNIX command line, or if you\u2019re unsure whether you meet the prerequisites, follow our online UNIX tutorial . Software Install Docker on your local computer and create an account on dockerhub . You can find instructions here . 
Note that you need admin rights to install and use Docker, and if you are installing Docker on Windows, you need a recent Windows version. You should also have a modern code editor installed, like Sublime Text or VScode . If working with Windows During the course exercises you will be mainly interacting with docker through the command line. Although windows powershell is suitable for that, it is easier to follow the exercises if you have UNIX or \u2018UNIX-like\u2019 terminal. You can get this by using WSL2 . Make sure you install the latest versions before installing docker. If installing Docker is a problem During the course, we can give only limited support for installation issues. If you do not manage to install Docker before the course, you can still do almost all exercises on Play with Docker . A Docker login is required. In addition to your local computer, we will be working on an Amazon Web Services ( AWS ) Elastic Cloud (EC2) server. Our Ubuntu server behaves like a \u2018normal\u2019 remote server, and can be approached through ssh with a username, key and IP address. All participants will be granted access to a personal home directory.","title":"Precourse preparations"},{"location":"precourse/#unix","text":"As is stated in the course prerequisites at the announcement web page . We expect participants to have a basic understanding of working with the command line on UNIX-based systems. You can test your UNIX skills with a quiz here . If you don\u2019t have experience with UNIX command line, or if you\u2019re unsure whether you meet the prerequisites, follow our online UNIX tutorial .","title":"UNIX"},{"location":"precourse/#software","text":"Install Docker on your local computer and create an account on dockerhub . You can find instructions here . Note that you need admin rights to install and use Docker, and if you are installing Docker on Windows, you need a recent Windows version. You should also have a modern code editor installed, like Sublime Text or VScode . If working with Windows During the course exercises you will be mainly interacting with docker through the command line. Although windows powershell is suitable for that, it is easier to follow the exercises if you have UNIX or \u2018UNIX-like\u2019 terminal. You can get this by using WSL2 . Make sure you install the latest versions before installing docker. If installing Docker is a problem During the course, we can give only limited support for installation issues. If you do not manage to install Docker before the course, you can still do almost all exercises on Play with Docker . A Docker login is required. In addition to your local computer, we will be working on an Amazon Web Services ( AWS ) Elastic Cloud (EC2) server. Our Ubuntu server behaves like a \u2018normal\u2019 remote server, and can be approached through ssh with a username, key and IP address. All participants will be granted access to a personal home directory.","title":"Software"},{"location":"course_material/day1/dockerfiles/","text":"Learning outcomes After having completed this chapter you will be able to: Build an image based on a dockerfile Use the basic dockerfile syntax Change the default command of an image and validate the change Map ports to a container to display interactive content through a browser Material Official Dockerfile reference Ten simple rules for writing dockerfiles Exercises To make your images shareable and adjustable, it\u2019s good practice to work with a Dockerfile . 
This is a script with a set of instructions to build your image from an existing image. Basic Dockerfile You can generate an image from a Dockerfile using the command docker build . A Dockerfile has its own syntax for giving instructions. Luckily, they are rather simple. The script always contains a line starting with FROM that takes the image name from which the new image will be built. After that you usually want to run some commands to e.g. configure and/or install software. The instruction to run these commands during building starts with RUN . In our figlet example that would be: FROM ubuntu:jammy-20230308 RUN apt-get update RUN apt-get install figlet On writing reproducible Dockerfiles At the FROM statement in the above Dockerfile you see that we have added a specific tag to the image (i.e. jammy-20230308 ). We could also have written: FROM ubuntu RUN apt-get update RUN apt-get install figlet This will automatically pull the image with the tag latest . However, if the maintainer of the ubuntu images decides to tag another ubuntu version as latest , rebuilding with the above Dockerfile will not give you the same result. Therefore it\u2019s always good practice to add the (stable) tag to the image in a Dockerfile . More rules on making your Dockerfiles more reproducible here . Exercise: Create a file on your computer called Dockerfile , and paste the above instruction lines in that file. Make the directory containing the Dockerfile your current directory. Build a new image based on that Dockerfile with: x86_64 / AMD64 ARM64 (MacOS M1 chip) docker build . docker build --platform amd64 . If using an Apple M1 chip (newer Macs) If you are using a computer with an Apple M1 chip, you have the less common ARM system architecture, which can limit transferability of images to (more common) x86_64/AMD64 machines. When building images on a Mac with an M1 chip (especially if you have sharing in mind), it\u2019s best to specify the --platform amd64 flag. The argument of docker build The command docker build takes a directory as input (providing . means the current directory). This directory should contain the Dockerfile , but it can also contain more of the build context, e.g. (python, R, shell) scripts that are required to build the image. What has happened? What is the name of the build image? Answer A new image was created based on the Dockerfile . You can check it with: docker image ls , which gives something like: REPOSITORY TAG IMAGE ID CREATED SIZE 92c980b09aad 7 seconds ago 101MB ubuntu-figlet latest e08b999c7978 About an hour ago 101MB ubuntu latest f63181f19b2f 30 hours ago 72.9MB It has created an image without a name or tag. That\u2019s a bit inconvenient. Exercise: Build a new image with a specific name. You can do that with adding the option -t to docker build . Before that, remove the nameless image. Hint An image without a name is usually a \u201cdangling image\u201d. You can remove those with docker image prune . Answer Remove the nameless image with docker image prune . After that, rebuild an image with a name: x86_64 / AMD64 ARM (MacOS M1 chip) docker build -t ubuntu-figlet:v2 . docker build --platform amd64 -t ubuntu-figlet:v2 . Using CMD As you might remember the second positional argument of docker run is a command (i.e. docker run IMAGE [CMD] ). If you leave it empty, it uses the default command. You can change the default command in the Dockerfile with an instruction starting with CMD . 
For example: FROM ubuntu:jammy-20230308 RUN apt-get update RUN apt-get install figlet CMD figlet My image works! Exercise: Build a new image based on the above Dockerfile . Can you validate the change using docker image inspect ? Can you overwrite this default with docker run ? Answer Copy the new line to your Dockerfile , and build the new image like this: x86_64 / AMD64 ARM64 (MacOS M1 chip) docker build -t ubuntu-figlet:v3 . docker build --platform amd64 -t ubuntu-figlet:v3 . The command docker inspect ubuntu-figlet:v3 will give: \"Cmd\": [ \"/bin/sh\", \"-c\", \"figlet My image works!\" ] So the default command ( /bin/bash ) has changed to figlet My image works! Running the image (with clean-up ( --rm )): docker run --rm ubuntu-figlet:v3 Will result in: __ __ _ _ _ | \\/ |_ _ (_)_ __ ___ __ _ __ _ ___ __ _____ _ __| | _____| | | |\\/| | | | | | | '_ ` _ \\ / _` |/ _` |/ _ \\ \\ \\ /\\ / / _ \\| '__| |/ / __| | | | | | |_| | | | | | | | | (_| | (_| | __/ \\ V V / (_) | | | <\\__ \\_| |_| |_|\\__, | |_|_| |_| |_|\\__,_|\\__, |\\___| \\_/\\_/ \\___/|_| |_|\\_\\___(_) |___/ |___/ And of course you can overwrite the default command: docker run --rm ubuntu-figlet:v3 figlet another text Resulting in: _ _ _ _ __ _ _ __ ___ | |_| |__ ___ _ __ | |_ _____ _| |_ / _` | '_ \\ / _ \\| __| '_ \\ / _ \\ '__| | __/ _ \\ \\/ / __| | (_| | | | | (_) | |_| | | | __/ | | || __/> <| |_ \\__,_|_| |_|\\___/ \\__|_| |_|\\___|_| \\__\\___/_/\\_\\\\__| Two flavours of CMD You have seen in the output of docker inspect that docker translates the command (i.e. figlet \"my image works!\" ) into this: [\"/bin/sh\", \"-c\", \"figlet 'My image works!'\"] . The notation we used in the Dockerfile is the shell notation while the notation with the square brackets ( [] ) is the exec-notation . You can use both notations in your Dockerfile . Altough the shell notation is more readable, the exec notation is directly used by the image, and therefore less ambiguous. A Dockerfile with shell notation: FROM ubuntu:jammy-20230308 RUN apt-get update RUN apt-get install figlet CMD figlet My image works! A Dockerfile with exec notation: FROM ubuntu:jammy-20230308 RUN apt-get update RUN apt-get install figlet CMD [ \"/bin/sh\" , \"-c\" , \"figlet My image works!\" ] Exercise: Now push our created image (with a version tag) to docker hub. We will use it later for the singularity exercises . Answer docker tag ubuntu-figlet:v3 [ USER NAME ] /ubuntu-figlet:v3 docker push [ USER NAME ] /ubuntu-figlet:v3 Build an image for your own script Often containers are built for a specific purpose. For example, you can use a container to ship all dependencies together with your developed set of scripts/programs. For that you will need to add your scripts to the container. That is quite easily done with the instruction COPY . However, in order to make your container more user-friendly, there are several additional instructions that can come in useful. We will treat the most frequently used ones below. Depending on your preference, either choose R or Python below. In the exercises will use a script called test_deseq2.R . This script will: Load the DESeq2 and optparse packages Load some additional packages to test their installations. We will use those packages later on in the course. 
Create and parse an option called --rows with optparse Create a dummy count matrix Run DESeq2 on the dummy count matrix Print the results to stdout You can download it here , or copy-paste it: test_deseq2.R #!/usr/bin/env Rscript # load packages required for this script write ( \"Loading packages required for this script\" , stderr ()) suppressPackageStartupMessages ({ library ( DESeq2 ) library ( optparse ) }) # load dependency packages for testing installations write ( \"Loading dependency packages for testing installations\" , stderr ()) suppressPackageStartupMessages ({ library ( apeglm ) library ( IHW ) library ( limma ) library ( data.table ) library ( ggplot2 ) library ( ggrepel ) library ( pheatmap ) library ( RColorBrewer ) library ( scales ) library ( stringr ) }) # parse options with optparse option_list <- list ( make_option ( c ( \"--rows\" ), type = \"integer\" , help = \"Number of rows in dummy matrix [default = %default]\" , default = 100 ) ) opt_parser <- OptionParser ( option_list = option_list , description = \"Runs DESeq2 on dummy data\" ) opt <- parse_args ( opt_parser ) # create a random dummy count matrix cnts <- matrix ( rnbinom ( n = opt $ row * 10 , mu = 100 , size = 1 / 0.5 ), ncol = 10 ) cond <- factor ( rep ( 1 : 2 , each = 5 )) # object construction dds <- DESeqDataSetFromMatrix ( cnts , DataFrame ( cond ), ~ cond ) # standard analysis dds <- DESeq ( dds ) res <- results ( dds ) # print results to stdout print ( res ) After you have downloaded it, make sure to set the permissions to executable: chmod +x test_deseq2.R It is a relatively simple script that runs DESeq2 on a dummy dataset. An example for execution would be: ./test_deseq2.R --rows 100 Here, --rows is a optional arguments that specifies the number of rows generated in the input count matrix. When running the script, it will return a bunch of messages and at the end an overview of differential gene expression analysis results: baseMean log2FoldChange lfcSE stat pvalue padj 1 66.1249 0.281757 0.727668 0.387206 0.698604 0.989804 2 76.9682 0.305763 0.619209 0.493796 0.621451 0.989804 3 64.7843 -0.694525 0.479445 -1.448603 0.147448 0.931561 4 123.0252 0.631247 0.688564 0.916758 0.359269 0.931561 5 93.2002 -0.453430 0.686043 -0.660936 0.508653 0.941951 ... ... ... ... ... ... ... 96 64.0177 0.757585137 0.682683 1.109718054 0.267121 0.931561 97 114.3689 -0.580010850 0.640313 -0.905823841 0.365029 0.931561 98 79.9620 0.000100617 0.612442 0.000164288 0.999869 0.999869 99 92.6614 0.563514308 0.716109 0.786910869 0.431334 0.939106 100 96.4410 -0.155268696 0.534400 -0.290547708 0.771397 0.989804 From the script you can see it has DESeq2 and optparse as dependencies. If we want to run the script inside a container, we would have to install them. We do this in the Dockerfile below. We give it the following instructions: use the r2u base image version jammy install the package DESeq2 , optparse and some additional packages we will need later on. We perform the installations with install2.r , which is a helper command that is present inside most rocker images. More info here . copy the script test_deseq2.R to /opt inside the container: FROM rocker/r2u:jammy RUN install2.r \\ DESeq2 \\ optparse \\ apeglm \\ IHW \\ limma \\ data.table \\ ggrepel \\ pheatmap \\ stringr COPY test_deseq2.R /opt Note In order to use COPY , the file that needs to be copied needs to be in the same directory as the Dockerfile or one of its subdirectories. R image stack The most used R image stack is from the rocker project . 
It contains many different base images (e.g. with shiny, Rstudio, tidyverse etc.). It depends on the type of image whether installations with apt-get or install2.r are possible. To understand more about how to install R packages in different containers, check it this cheat sheet , or visit rocker-project.org . Exercise: Download the test_deseq2.R and build the image with docker build . Name the image deseq2 . After that, start an interactive session and execute the script inside the container. Hint Make an interactive session with the options -i and -t and use /bin/bash as the command. Answer Build the container: x86_64 / AMD64 ARM64 (MacOS M1 chip) docker build -t deseq2 . docker build --platform amd64 -t deseq2 . Run the container: docker run -it --rm deseq2 /bin/bash Inside the container we look up the script: cd /opt ls This should return test_deseq2.R . Now you can execute it from inside the container: ./test_deseq2.R --rows 100 That\u2019s kind of nice. We can ship our R script inside our container. However, we don\u2019t want to run it interactively every time. So let\u2019s make some changes to make it easy to run it as an executable. For example, we can add /opt to the global $PATH variable with ENV . The $PATH variable The path variable is a special variable that consists of a list of path seperated by colons ( : ). These paths are searched if you are trying to run an executable. More info this topic at e.g. wikipedia . FROM rocker/r2u:jammy RUN install2.r \\ DESeq2 \\ optparse \\ apeglm \\ IHW \\ limma \\ data.table \\ ggrepel \\ pheatmap \\ stringr COPY test_deseq2.R /opt ENV PATH = /opt: $PATH Note The ENV instruction can be used to set any variable. Exercise : Rebuild the image and start an interactive bash session inside the new image. Is the path variable updated? (i.e. can we execute test_deseq2.R from anywhere?) Answer After re-building we start an interactive session: docker run -it --rm deseq2 /bin/bash The path is upated, /opt is appended to the beginning of the variable: echo $PATH returns: /opt:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin Now you can try to execute it from the root directory (or any other): test_deseq2.R Instead of starting an interactive session with /bin/bash we can now more easily run the script non-interactively: docker run --rm deseq2 test_deseq2.R --rows 100 Now it will directly print the output of test_deseq2.R to stdout. In the case you want to pack your script inside a container, you are building a container specifically for your script, meaning you almost want the container to behave as the program itself. In order to do that, you can use ENTRYPOINT . ENTRYPOINT is similar to CMD , but has two important differences: ENTRYPOINT can not be overwritten by the positional arguments (i.e. docker run image [CMD] ), but has to be overwritten by --entrypoint . The positional arguments (or CMD ) are pasted to the ENTRYPOINT command. This means that you can use ENTRYPOINT as the executable and the positional arguments (or CMD ) as the options. 
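As a generic illustration of how the two combine (a sketch using a hypothetical image called my-echo, not one of the course images), a Dockerfile containing ENTRYPOINT ["echo"] and CMD ["default text"] behaves like this:
# no positional arguments: CMD is appended to ENTRYPOINT, so this prints "default text"
docker run --rm my-echo
# positional arguments replace CMD but not ENTRYPOINT, so this prints "hello"
docker run --rm my-echo hello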
Let\u2019s try it out: FROM rocker/r2u:jammy RUN install2.r \\ DESeq2 \\ optparse \\ apeglm \\ IHW \\ limma \\ data.table \\ ggrepel \\ pheatmap \\ stringr COPY test_deseq2.R /opt ENV PATH = /opt: $PATH # note that if you want to be able to combine the two # both ENTRYPOINT and CMD need to written in the exec form ENTRYPOINT [ \"test_deseq2.R\" ] # default option (if positional arguments are not specified) CMD [ \"--rows\" , \"100\" ] Exercise : Re-build, and run the container non-interactively without any positional arguments. After that, try to pass a different number of rows to --rows . How do the commands look? Answer Just running the container non-interactively would be: docker run --rm deseq2 Passing a different argument (i.e. overwriting CMD ) would be: docker run --rm deseq2 --rows 200 Here, the container behaves as the executable itself to which you can pass arguments. Most containerized applications need multiple build steps. Often, you want to perform these steps and executions in a specific directory. Therefore, it can be in convenient to specify a working directory. You can do that with WORKDIR . This instruction will set the default directory for all other instructions (like RUN , COPY etc.). It will also change the directory in which you will land if you run the container interactively. FROM rocker/r2u:jammy RUN install2.r \\ DESeq2 \\ optparse \\ apeglm \\ IHW \\ limma \\ data.table \\ ggrepel \\ pheatmap \\ stringr WORKDIR /opt COPY test_deseq2.R . ENV PATH = /opt: $PATH # note that if you want to be able to combine the two # both ENTRYPOINT and CMD need to written in the exec form ENTRYPOINT [ \"test_deseq2.R\" ] # default option (if positional arguments are not specified) CMD [ \"--rows\" , \"100\" ] Exercise : build the image, and start the container interactively. Has the default directory changed? After that, push the image to dockerhub, so we can use it later with the singularity exercises. Note You can overwrite ENTRYPOINT with --entrypoint as an argument to docker run . Answer Running the container interactively would be: docker run -it --rm --entrypoint /bin/bash deseq2 Which should result in a terminal looking something like this: root@9a27da455fb1:/opt# Meaning that indeed the default directory has changed to /opt Pushing it to dockerhub: docker tag deseq2 [ USER NAME ] /deseq2:v1 docker push [ USER NAME ] /deseq2:v1 Get information on your image with docker inspect We have used docker inspect already in the previous chapter to find the default Cmd of the ubuntu image. However we can get more info on the image: e.g. the entrypoint, environmental variables, cmd, workingdir etc., you can use the Config record from the output of docker inspect . 
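If you only need a single field rather than the full JSON, docker inspect also accepts a Go template via --format. A minimal sketch for the deseq2 image built above (the field names follow the Config record shown below):
# print just the entrypoint of the image
docker inspect --format '{{json .Config.Entrypoint}}' deseq2
# print just the working directory
docker inspect --format '{{.Config.WorkingDir}}' deseq2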
For our image this looks like: \"Config\" : { \"Hostname\" : \"\" , \"Domainname\" : \"\" , \"User\" : \"\" , \"AttachStdin\" : false , \"AttachStdout\" : false , \"AttachStderr\" : false , \"Tty\" : false , \"OpenStdin\" : false , \"StdinOnce\" : false , \"Env\" : [ \"PATH=/opt:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\" , \"LC_ALL=en_US.UTF-8\" , \"LANG=en_US.UTF-8\" , \"DEBIAN_FRONTEND=noninteractive\" , \"TZ=UTC\" ], \"Cmd\" : [ \"--rows\" , \"100\" ], \"ArgsEscaped\" : true , \"Image\" : \"\" , \"Volumes\" : null , \"WorkingDir\" : \"/opt\" , \"Entrypoint\" : [ \"test_deseq2.R\" ], \"OnBuild\" : null , \"Labels\" : { \"maintainer\" : \"Dirk Eddelbuettel \" , \"org.label-schema.license\" : \"GPL-2.0\" , \"org.label-schema.vcs-url\" : \"https://github.com/rocker-org/\" , \"org.label-schema.vendor\" : \"Rocker Project\" } } Adding metadata to your image You can annotate your Dockerfile and the image by using the instruction LABEL . You can give it any key and value with = . However, it is recommended to use the Open Container Initiative (OCI) keys . Exercise : Annotate our Dockerfile with the OCI keys on the creation date, author and description. After that, check whether this has been passed to the actual image with docker inspect . Note You can type LABEL for each key-value pair, but you can also have it on one line by seperating the key-value pairs by a space, e.g.: LABEL keyx = \"valuex\" keyy = \"valuey\" Answer The Dockerfile would look like: FROM rocker/r2u:jammy LABEL org.opencontainers.image.created = \"2023-04-12\" \\ org.opencontainers.image.authors = \"Geert van Geest\" \\ org.opencontainers.image.description = \"Container with DESeq2 and friends\" RUN install2.r \\ DESeq2 \\ optparse \\ apeglm \\ IHW \\ limma \\ data.table \\ ggrepel \\ pheatmap \\ stringr WORKDIR /opt COPY test_deseq2.R . ENV PATH = /opt: $PATH # note that if you want to be able to combine the two # both ENTRYPOINT and CMD need to written in the exec form ENTRYPOINT [ \"test_deseq2.R\" ] # default option (if positional arguments are not specified) CMD [ \"--rows\" , \"100\" ] The Config record in the output of docker inspect was updated with: \"Labels\" : { \"org.opencontainers.image.authors\" : \"Geert van Geest\" , \"org.opencontainers.image.created\" : \"2023-04-12\" , \"org.opencontainers.image.description\" : \"Container with DESeq2 and friends\" , \"org.opencontainers.image.licenses\" : \"GPL-2.0-or-later\" , \"org.opencontainers.image.source\" : \"https://github.com/rocker-org/rocker\" , \"org.opencontainers.image.vendor\" : \"Rocker Project\" } Building an image with a browser interface In this exercise, we will use a different base image ( rocker/rstudio:4 ), and we\u2019ll install the same packages. Rstudio server is a nice browser interface that you can use for a.o. programming in R. With the image we are creating we will be able to run Rstudio server inside a container. Check out the Dockerfile : FROM rocker/rstudio:4 RUN apt-get update && \\ apt-get install -y libz-dev RUN install2.r \\ optparse \\ BiocManager RUN R -q -e 'BiocManager::install(\"biomaRt\")' This will create an image from the existing rstudio image. It will also install libz-dev with apt-get , BiocManager with install2.r and DESeq2 with an R command. Despite we\u2019re installing the same packages, the installation steps need to be different from the r-base image. This is because in the rocker/rstudio images R is installed from source, and therefore you can\u2019t install packages with apt-get . 
More information on how to install R packages in R containers in this cheat sheet , or visit rocker-project.org . Installation will take a while The installation of CRAN packages will go relatively quickly, because can use the binary packages supplied by Posit Public Package Manager . However, the installation of Bioconductor packages will take a while, because they need to be installed from source. If you don\u2019t have time, you can skip the DESeq2 installation by removing the last line of the Dockerfile . Exercise: Build an image based on this Dockerfile and give it a meaningful name. Answer x86_64 / AMD64 ARM64 (MacOS M1 chip) docker build -t rstudio-server . docker build --platform amd64 -t rstudio-server . You can now run a container from the image. However, you will have to tell docker where to publish port 8787 from the docker container with -p [HOSTPORT:CONTAINERPORT] . We choose to publish it to the same port number: docker run --rm -it -p 8787 :8787 rstudio-server Networking More info on docker container networking here By running the above command, a container will be started exposing rstudio server at port 8787 at localhost. You can approach the instance of Rstudio server by typing localhost:8787 in your browser. You will be asked for a password. You can find this password in the terminal from which you have started the container. We can make this even more interesting by mounting a local directory to the container running the Rstudio image: docker run \\ -it \\ --rm \\ -p 8787 :8787 \\ --mount type = bind,source = /Users/myusername/working_dir,target = /home/rstudio/working_dir \\ rstudio-server By doing this you have a completely isolated and shareable R environment running Rstudio server, but with your local files available to it. Pretty neat right?","title":"Working with dockerfiles"},{"location":"course_material/day1/dockerfiles/#learning-outcomes","text":"After having completed this chapter you will be able to: Build an image based on a dockerfile Use the basic dockerfile syntax Change the default command of an image and validate the change Map ports to a container to display interactive content through a browser","title":"Learning outcomes"},{"location":"course_material/day1/dockerfiles/#material","text":"Official Dockerfile reference Ten simple rules for writing dockerfiles","title":"Material"},{"location":"course_material/day1/dockerfiles/#exercises","text":"To make your images shareable and adjustable, it\u2019s good practice to work with a Dockerfile . This is a script with a set of instructions to build your image from an existing image.","title":"Exercises"},{"location":"course_material/day1/dockerfiles/#basic-dockerfile","text":"You can generate an image from a Dockerfile using the command docker build . A Dockerfile has its own syntax for giving instructions. Luckily, they are rather simple. The script always contains a line starting with FROM that takes the image name from which the new image will be built. After that you usually want to run some commands to e.g. configure and/or install software. The instruction to run these commands during building starts with RUN . In our figlet example that would be: FROM ubuntu:jammy-20230308 RUN apt-get update RUN apt-get install figlet On writing reproducible Dockerfiles At the FROM statement in the above Dockerfile you see that we have added a specific tag to the image (i.e. jammy-20230308 ). 
We could also have written: FROM ubuntu RUN apt-get update RUN apt-get install figlet This will automatically pull the image with the tag latest . However, if the maintainer of the ubuntu images decides to tag another ubuntu version as latest , rebuilding with the above Dockerfile will not give you the same result. Therefore it\u2019s always good practice to add the (stable) tag to the image in a Dockerfile . More rules on making your Dockerfiles more reproducible here . Exercise: Create a file on your computer called Dockerfile , and paste the above instruction lines in that file. Make the directory containing the Dockerfile your current directory. Build a new image based on that Dockerfile with: x86_64 / AMD64 ARM64 (MacOS M1 chip) docker build . docker build --platform amd64 . If using an Apple M1 chip (newer Macs) If you are using a computer with an Apple M1 chip, you have the less common ARM system architecture, which can limit transferability of images to (more common) x86_64/AMD64 machines. When building images on a Mac with an M1 chip (especially if you have sharing in mind), it\u2019s best to specify the --platform amd64 flag. The argument of docker build The command docker build takes a directory as input (providing . means the current directory). This directory should contain the Dockerfile , but it can also contain more of the build context, e.g. (python, R, shell) scripts that are required to build the image. What has happened? What is the name of the build image? Answer A new image was created based on the Dockerfile . You can check it with: docker image ls , which gives something like: REPOSITORY TAG IMAGE ID CREATED SIZE 92c980b09aad 7 seconds ago 101MB ubuntu-figlet latest e08b999c7978 About an hour ago 101MB ubuntu latest f63181f19b2f 30 hours ago 72.9MB It has created an image without a name or tag. That\u2019s a bit inconvenient. Exercise: Build a new image with a specific name. You can do that with adding the option -t to docker build . Before that, remove the nameless image. Hint An image without a name is usually a \u201cdangling image\u201d. You can remove those with docker image prune . Answer Remove the nameless image with docker image prune . After that, rebuild an image with a name: x86_64 / AMD64 ARM (MacOS M1 chip) docker build -t ubuntu-figlet:v2 . docker build --platform amd64 -t ubuntu-figlet:v2 .","title":"Basic Dockerfile"},{"location":"course_material/day1/dockerfiles/#using-cmd","text":"As you might remember the second positional argument of docker run is a command (i.e. docker run IMAGE [CMD] ). If you leave it empty, it uses the default command. You can change the default command in the Dockerfile with an instruction starting with CMD . For example: FROM ubuntu:jammy-20230308 RUN apt-get update RUN apt-get install figlet CMD figlet My image works! Exercise: Build a new image based on the above Dockerfile . Can you validate the change using docker image inspect ? Can you overwrite this default with docker run ? Answer Copy the new line to your Dockerfile , and build the new image like this: x86_64 / AMD64 ARM64 (MacOS M1 chip) docker build -t ubuntu-figlet:v3 . docker build --platform amd64 -t ubuntu-figlet:v3 . The command docker inspect ubuntu-figlet:v3 will give: \"Cmd\": [ \"/bin/sh\", \"-c\", \"figlet My image works!\" ] So the default command ( /bin/bash ) has changed to figlet My image works! 
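As a side note (not required for the exercise), docker history shows what each Dockerfile instruction contributed to the image, including the CMD layer; the exact columns depend on your Docker version:
# list the layers of the image and the instructions that created them
docker history ubuntu-figlet:v3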
Running the image (with clean-up ( --rm )): docker run --rm ubuntu-figlet:v3 Will result in: __ __ _ _ _ | \\/ |_ _ (_)_ __ ___ __ _ __ _ ___ __ _____ _ __| | _____| | | |\\/| | | | | | | '_ ` _ \\ / _` |/ _` |/ _ \\ \\ \\ /\\ / / _ \\| '__| |/ / __| | | | | | |_| | | | | | | | | (_| | (_| | __/ \\ V V / (_) | | | <\\__ \\_| |_| |_|\\__, | |_|_| |_| |_|\\__,_|\\__, |\\___| \\_/\\_/ \\___/|_| |_|\\_\\___(_) |___/ |___/ And of course you can overwrite the default command: docker run --rm ubuntu-figlet:v3 figlet another text Resulting in: _ _ _ _ __ _ _ __ ___ | |_| |__ ___ _ __ | |_ _____ _| |_ / _` | '_ \\ / _ \\| __| '_ \\ / _ \\ '__| | __/ _ \\ \\/ / __| | (_| | | | | (_) | |_| | | | __/ | | || __/> <| |_ \\__,_|_| |_|\\___/ \\__|_| |_|\\___|_| \\__\\___/_/\\_\\\\__| Two flavours of CMD You have seen in the output of docker inspect that docker translates the command (i.e. figlet \"my image works!\" ) into this: [\"/bin/sh\", \"-c\", \"figlet 'My image works!'\"] . The notation we used in the Dockerfile is the shell notation while the notation with the square brackets ( [] ) is the exec-notation . You can use both notations in your Dockerfile . Altough the shell notation is more readable, the exec notation is directly used by the image, and therefore less ambiguous. A Dockerfile with shell notation: FROM ubuntu:jammy-20230308 RUN apt-get update RUN apt-get install figlet CMD figlet My image works! A Dockerfile with exec notation: FROM ubuntu:jammy-20230308 RUN apt-get update RUN apt-get install figlet CMD [ \"/bin/sh\" , \"-c\" , \"figlet My image works!\" ] Exercise: Now push our created image (with a version tag) to docker hub. We will use it later for the singularity exercises . Answer docker tag ubuntu-figlet:v3 [ USER NAME ] /ubuntu-figlet:v3 docker push [ USER NAME ] /ubuntu-figlet:v3","title":"Using CMD"},{"location":"course_material/day1/dockerfiles/#build-an-image-for-your-own-script","text":"Often containers are built for a specific purpose. For example, you can use a container to ship all dependencies together with your developed set of scripts/programs. For that you will need to add your scripts to the container. That is quite easily done with the instruction COPY . However, in order to make your container more user-friendly, there are several additional instructions that can come in useful. We will treat the most frequently used ones below. Depending on your preference, either choose R or Python below. In the exercises will use a script called test_deseq2.R . This script will: Load the DESeq2 and optparse packages Load some additional packages to test their installations. We will use those packages later on in the course. 
Create and parse an option called --rows with optparse Create a dummy count matrix Run DESeq2 on the dummy count matrix Print the results to stdout You can download it here , or copy-paste it: test_deseq2.R #!/usr/bin/env Rscript # load packages required for this script write ( \"Loading packages required for this script\" , stderr ()) suppressPackageStartupMessages ({ library ( DESeq2 ) library ( optparse ) }) # load dependency packages for testing installations write ( \"Loading dependency packages for testing installations\" , stderr ()) suppressPackageStartupMessages ({ library ( apeglm ) library ( IHW ) library ( limma ) library ( data.table ) library ( ggplot2 ) library ( ggrepel ) library ( pheatmap ) library ( RColorBrewer ) library ( scales ) library ( stringr ) }) # parse options with optparse option_list <- list ( make_option ( c ( \"--rows\" ), type = \"integer\" , help = \"Number of rows in dummy matrix [default = %default]\" , default = 100 ) ) opt_parser <- OptionParser ( option_list = option_list , description = \"Runs DESeq2 on dummy data\" ) opt <- parse_args ( opt_parser ) # create a random dummy count matrix cnts <- matrix ( rnbinom ( n = opt $ row * 10 , mu = 100 , size = 1 / 0.5 ), ncol = 10 ) cond <- factor ( rep ( 1 : 2 , each = 5 )) # object construction dds <- DESeqDataSetFromMatrix ( cnts , DataFrame ( cond ), ~ cond ) # standard analysis dds <- DESeq ( dds ) res <- results ( dds ) # print results to stdout print ( res ) After you have downloaded it, make sure to set the permissions to executable: chmod +x test_deseq2.R It is a relatively simple script that runs DESeq2 on a dummy dataset. An example for execution would be: ./test_deseq2.R --rows 100 Here, --rows is a optional arguments that specifies the number of rows generated in the input count matrix. When running the script, it will return a bunch of messages and at the end an overview of differential gene expression analysis results: baseMean log2FoldChange lfcSE stat pvalue padj 1 66.1249 0.281757 0.727668 0.387206 0.698604 0.989804 2 76.9682 0.305763 0.619209 0.493796 0.621451 0.989804 3 64.7843 -0.694525 0.479445 -1.448603 0.147448 0.931561 4 123.0252 0.631247 0.688564 0.916758 0.359269 0.931561 5 93.2002 -0.453430 0.686043 -0.660936 0.508653 0.941951 ... ... ... ... ... ... ... 96 64.0177 0.757585137 0.682683 1.109718054 0.267121 0.931561 97 114.3689 -0.580010850 0.640313 -0.905823841 0.365029 0.931561 98 79.9620 0.000100617 0.612442 0.000164288 0.999869 0.999869 99 92.6614 0.563514308 0.716109 0.786910869 0.431334 0.939106 100 96.4410 -0.155268696 0.534400 -0.290547708 0.771397 0.989804 From the script you can see it has DESeq2 and optparse as dependencies. If we want to run the script inside a container, we would have to install them. We do this in the Dockerfile below. We give it the following instructions: use the r2u base image version jammy install the package DESeq2 , optparse and some additional packages we will need later on. We perform the installations with install2.r , which is a helper command that is present inside most rocker images. More info here . copy the script test_deseq2.R to /opt inside the container: FROM rocker/r2u:jammy RUN install2.r \\ DESeq2 \\ optparse \\ apeglm \\ IHW \\ limma \\ data.table \\ ggrepel \\ pheatmap \\ stringr COPY test_deseq2.R /opt Note In order to use COPY , the file that needs to be copied needs to be in the same directory as the Dockerfile or one of its subdirectories. R image stack The most used R image stack is from the rocker project . 
It contains many different base images (e.g. with shiny, Rstudio, tidyverse etc.). It depends on the type of image whether installations with apt-get or install2.r are possible. To understand more about how to install R packages in different containers, check it this cheat sheet , or visit rocker-project.org . Exercise: Download the test_deseq2.R and build the image with docker build . Name the image deseq2 . After that, start an interactive session and execute the script inside the container. Hint Make an interactive session with the options -i and -t and use /bin/bash as the command. Answer Build the container: x86_64 / AMD64 ARM64 (MacOS M1 chip) docker build -t deseq2 . docker build --platform amd64 -t deseq2 . Run the container: docker run -it --rm deseq2 /bin/bash Inside the container we look up the script: cd /opt ls This should return test_deseq2.R . Now you can execute it from inside the container: ./test_deseq2.R --rows 100 That\u2019s kind of nice. We can ship our R script inside our container. However, we don\u2019t want to run it interactively every time. So let\u2019s make some changes to make it easy to run it as an executable. For example, we can add /opt to the global $PATH variable with ENV . The $PATH variable The path variable is a special variable that consists of a list of path seperated by colons ( : ). These paths are searched if you are trying to run an executable. More info this topic at e.g. wikipedia . FROM rocker/r2u:jammy RUN install2.r \\ DESeq2 \\ optparse \\ apeglm \\ IHW \\ limma \\ data.table \\ ggrepel \\ pheatmap \\ stringr COPY test_deseq2.R /opt ENV PATH = /opt: $PATH Note The ENV instruction can be used to set any variable. Exercise : Rebuild the image and start an interactive bash session inside the new image. Is the path variable updated? (i.e. can we execute test_deseq2.R from anywhere?) Answer After re-building we start an interactive session: docker run -it --rm deseq2 /bin/bash The path is upated, /opt is appended to the beginning of the variable: echo $PATH returns: /opt:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin Now you can try to execute it from the root directory (or any other): test_deseq2.R Instead of starting an interactive session with /bin/bash we can now more easily run the script non-interactively: docker run --rm deseq2 test_deseq2.R --rows 100 Now it will directly print the output of test_deseq2.R to stdout. In the case you want to pack your script inside a container, you are building a container specifically for your script, meaning you almost want the container to behave as the program itself. In order to do that, you can use ENTRYPOINT . ENTRYPOINT is similar to CMD , but has two important differences: ENTRYPOINT can not be overwritten by the positional arguments (i.e. docker run image [CMD] ), but has to be overwritten by --entrypoint . The positional arguments (or CMD ) are pasted to the ENTRYPOINT command. This means that you can use ENTRYPOINT as the executable and the positional arguments (or CMD ) as the options. 
Let\u2019s try it out: FROM rocker/r2u:jammy RUN install2.r \\ DESeq2 \\ optparse \\ apeglm \\ IHW \\ limma \\ data.table \\ ggrepel \\ pheatmap \\ stringr COPY test_deseq2.R /opt ENV PATH = /opt: $PATH # note that if you want to be able to combine the two # both ENTRYPOINT and CMD need to written in the exec form ENTRYPOINT [ \"test_deseq2.R\" ] # default option (if positional arguments are not specified) CMD [ \"--rows\" , \"100\" ] Exercise : Re-build, and run the container non-interactively without any positional arguments. After that, try to pass a different number of rows to --rows . How do the commands look? Answer Just running the container non-interactively would be: docker run --rm deseq2 Passing a different argument (i.e. overwriting CMD ) would be: docker run --rm deseq2 --rows 200 Here, the container behaves as the executable itself to which you can pass arguments. Most containerized applications need multiple build steps. Often, you want to perform these steps and executions in a specific directory. Therefore, it can be in convenient to specify a working directory. You can do that with WORKDIR . This instruction will set the default directory for all other instructions (like RUN , COPY etc.). It will also change the directory in which you will land if you run the container interactively. FROM rocker/r2u:jammy RUN install2.r \\ DESeq2 \\ optparse \\ apeglm \\ IHW \\ limma \\ data.table \\ ggrepel \\ pheatmap \\ stringr WORKDIR /opt COPY test_deseq2.R . ENV PATH = /opt: $PATH # note that if you want to be able to combine the two # both ENTRYPOINT and CMD need to written in the exec form ENTRYPOINT [ \"test_deseq2.R\" ] # default option (if positional arguments are not specified) CMD [ \"--rows\" , \"100\" ] Exercise : build the image, and start the container interactively. Has the default directory changed? After that, push the image to dockerhub, so we can use it later with the singularity exercises. Note You can overwrite ENTRYPOINT with --entrypoint as an argument to docker run . Answer Running the container interactively would be: docker run -it --rm --entrypoint /bin/bash deseq2 Which should result in a terminal looking something like this: root@9a27da455fb1:/opt# Meaning that indeed the default directory has changed to /opt Pushing it to dockerhub: docker tag deseq2 [ USER NAME ] /deseq2:v1 docker push [ USER NAME ] /deseq2:v1","title":"Build an image for your own script"},{"location":"course_material/day1/dockerfiles/#get-information-on-your-image-with-docker-inspect","text":"We have used docker inspect already in the previous chapter to find the default Cmd of the ubuntu image. However we can get more info on the image: e.g. the entrypoint, environmental variables, cmd, workingdir etc., you can use the Config record from the output of docker inspect . 
For our image this looks like: \"Config\" : { \"Hostname\" : \"\" , \"Domainname\" : \"\" , \"User\" : \"\" , \"AttachStdin\" : false , \"AttachStdout\" : false , \"AttachStderr\" : false , \"Tty\" : false , \"OpenStdin\" : false , \"StdinOnce\" : false , \"Env\" : [ \"PATH=/opt:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\" , \"LC_ALL=en_US.UTF-8\" , \"LANG=en_US.UTF-8\" , \"DEBIAN_FRONTEND=noninteractive\" , \"TZ=UTC\" ], \"Cmd\" : [ \"--rows\" , \"100\" ], \"ArgsEscaped\" : true , \"Image\" : \"\" , \"Volumes\" : null , \"WorkingDir\" : \"/opt\" , \"Entrypoint\" : [ \"test_deseq2.R\" ], \"OnBuild\" : null , \"Labels\" : { \"maintainer\" : \"Dirk Eddelbuettel \" , \"org.label-schema.license\" : \"GPL-2.0\" , \"org.label-schema.vcs-url\" : \"https://github.com/rocker-org/\" , \"org.label-schema.vendor\" : \"Rocker Project\" } }","title":"Get information on your image with docker inspect"},{"location":"course_material/day1/dockerfiles/#adding-metadata-to-your-image","text":"You can annotate your Dockerfile and the image by using the instruction LABEL . You can give it any key and value with = . However, it is recommended to use the Open Container Initiative (OCI) keys . Exercise : Annotate our Dockerfile with the OCI keys on the creation date, author and description. After that, check whether this has been passed to the actual image with docker inspect . Note You can type LABEL for each key-value pair, but you can also have it on one line by seperating the key-value pairs by a space, e.g.: LABEL keyx = \"valuex\" keyy = \"valuey\" Answer The Dockerfile would look like: FROM rocker/r2u:jammy LABEL org.opencontainers.image.created = \"2023-04-12\" \\ org.opencontainers.image.authors = \"Geert van Geest\" \\ org.opencontainers.image.description = \"Container with DESeq2 and friends\" RUN install2.r \\ DESeq2 \\ optparse \\ apeglm \\ IHW \\ limma \\ data.table \\ ggrepel \\ pheatmap \\ stringr WORKDIR /opt COPY test_deseq2.R . ENV PATH = /opt: $PATH # note that if you want to be able to combine the two # both ENTRYPOINT and CMD need to written in the exec form ENTRYPOINT [ \"test_deseq2.R\" ] # default option (if positional arguments are not specified) CMD [ \"--rows\" , \"100\" ] The Config record in the output of docker inspect was updated with: \"Labels\" : { \"org.opencontainers.image.authors\" : \"Geert van Geest\" , \"org.opencontainers.image.created\" : \"2023-04-12\" , \"org.opencontainers.image.description\" : \"Container with DESeq2 and friends\" , \"org.opencontainers.image.licenses\" : \"GPL-2.0-or-later\" , \"org.opencontainers.image.source\" : \"https://github.com/rocker-org/rocker\" , \"org.opencontainers.image.vendor\" : \"Rocker Project\" }","title":"Adding metadata to your image"},{"location":"course_material/day1/dockerfiles/#building-an-image-with-a-browser-interface","text":"In this exercise, we will use a different base image ( rocker/rstudio:4 ), and we\u2019ll install the same packages. Rstudio server is a nice browser interface that you can use for a.o. programming in R. With the image we are creating we will be able to run Rstudio server inside a container. Check out the Dockerfile : FROM rocker/rstudio:4 RUN apt-get update && \\ apt-get install -y libz-dev RUN install2.r \\ optparse \\ BiocManager RUN R -q -e 'BiocManager::install(\"biomaRt\")' This will create an image from the existing rstudio image. It will also install libz-dev with apt-get , BiocManager with install2.r and DESeq2 with an R command. 
Despite we\u2019re installing the same packages, the installation steps need to be different from the r-base image. This is because in the rocker/rstudio images R is installed from source, and therefore you can\u2019t install packages with apt-get . More information on how to install R packages in R containers in this cheat sheet , or visit rocker-project.org . Installation will take a while The installation of CRAN packages will go relatively quickly, because can use the binary packages supplied by Posit Public Package Manager . However, the installation of Bioconductor packages will take a while, because they need to be installed from source. If you don\u2019t have time, you can skip the DESeq2 installation by removing the last line of the Dockerfile . Exercise: Build an image based on this Dockerfile and give it a meaningful name. Answer x86_64 / AMD64 ARM64 (MacOS M1 chip) docker build -t rstudio-server . docker build --platform amd64 -t rstudio-server . You can now run a container from the image. However, you will have to tell docker where to publish port 8787 from the docker container with -p [HOSTPORT:CONTAINERPORT] . We choose to publish it to the same port number: docker run --rm -it -p 8787 :8787 rstudio-server Networking More info on docker container networking here By running the above command, a container will be started exposing rstudio server at port 8787 at localhost. You can approach the instance of Rstudio server by typing localhost:8787 in your browser. You will be asked for a password. You can find this password in the terminal from which you have started the container. We can make this even more interesting by mounting a local directory to the container running the Rstudio image: docker run \\ -it \\ --rm \\ -p 8787 :8787 \\ --mount type = bind,source = /Users/myusername/working_dir,target = /home/rstudio/working_dir \\ rstudio-server By doing this you have a completely isolated and shareable R environment running Rstudio server, but with your local files available to it. Pretty neat right?","title":"Building an image with a browser interface"},{"location":"course_material/day1/introduction_containers/","text":"Learning outcomes After having completed this chapter you will be able to: Discriminate between an image and a container Run a docker container from dockerhub interactively Validate the available containers and their status Material General introduction: Download the presentation Introduction to containers: Download the presentation Exercises We recommend using a code editor like VScode or Sublime text. If you don\u2019t know which one to chose, take VScode as we can provide most support for this editor. If working on Windows If you are working on Windows, it is easiest to work with WSL2 . With VScode use the WSL extension . Make sure you install the latest versions before you install docker. In principle, you can also use a native shell like PowerShell, but this might result into some issues with bind mounting directories. Work in projects We recommend to work in a project folder. This will make it easier to find your files and to share them with others. You can create a project folder anywhere on your computer. For example, you can create a folder projects in your home directory and then create a subfolder docker-snakemake-course in it. You can then open this folder in VScode. Let\u2019s create our first container from an existing image. We do this with the image ubuntu , generating an environment with a minimal installation of ubuntu. 
docker run -it ubuntu This will give you an interactive shell into the created container (this interactivity was invoked by the options -i and -t ) . Exercise: Check out the operating system of the container by typing cat /etc/os-release in the container\u2019s shell. Are we really in an ubuntu environment? Answer Yes: root@27f7d11608de:/# cat /etc/os-release NAME=\"Ubuntu\" VERSION=\"20.04.1 LTS (Focal Fossa)\" ID=ubuntu ID_LIKE=debian PRETTY_NAME=\"Ubuntu 20.04.1 LTS\" VERSION_ID=\"20.04\" HOME_URL=\"https://www.ubuntu.com/\" SUPPORT_URL=\"https://help.ubuntu.com/\" BUG_REPORT_URL=\"https://bugs.launchpad.net/ubuntu/\" PRIVACY_POLICY_URL=\"https://www.ubuntu.com/legal/terms-and-policies/privacy-policy\" VERSION_CODENAME=focal UBUNTU_CODENAME=focal Where does the image come from? If the image ubuntu was not on your computer yet, docker will search and try to get them from dockerhub , and download it. Exercise: Run the command whoami in the docker container. Who are you? Answer The command whoami returns the current user. In the container whoami will return root . This means you are the root user i.e. within the container you are admin and can basically change anything. Check out the container panel at the Docker dashboard (the Docker gui) or open another host terminal and type: docker container ls -a Exercise: What is the container status? Answer In Docker dashboard you can see that the shell is running: The output of docker container ls -a is: CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 27f7d11608de ubuntu \"/bin/bash\" 7 minutes ago Up 6 minutes great_moser Also showing you that the STATUS is Up . Now let\u2019s install some software in our ubuntu environment. We\u2019ll install some simple software called figlet . Type into the container shell: apt-get update apt-get install figlet This will give some warnings This installation will give some warnings. It\u2019s safe to ignore them. Now let\u2019s try it out. Type into the container shell: figlet 'SIB courses are great!' Now you have installed and used software figlet in an ubuntu environment (almost) completely separated from your host computer. This already gives you an idea of the power of containerization. Exit the shell by typing exit . Check out the container panel of Docker dashboard or type: docker container ls -a Exercise: What is the container status? Answer docker container ls -a gives: CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 27f7d11608de ubuntu \"/bin/bash\" 15 minutes ago Exited (0) 8 seconds ago great_moser Showing that the container has exited, meaning it\u2019s not running.","title":"Introduction to containers"},{"location":"course_material/day1/introduction_containers/#learning-outcomes","text":"After having completed this chapter you will be able to: Discriminate between an image and a container Run a docker container from dockerhub interactively Validate the available containers and their status","title":"Learning outcomes"},{"location":"course_material/day1/introduction_containers/#material","text":"General introduction: Download the presentation Introduction to containers: Download the presentation","title":"Material"},{"location":"course_material/day1/introduction_containers/#exercises","text":"We recommend using a code editor like VScode or Sublime text. If you don\u2019t know which one to chose, take VScode as we can provide most support for this editor. If working on Windows If you are working on Windows, it is easiest to work with WSL2 . With VScode use the WSL extension . 
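To check which distributions you have and whether they run under WSL 2, you can run the following from PowerShell (a sketch; the flag requires a reasonably recent Windows build):
# list installed distributions and the WSL version each one uses
wsl --list --verbose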
Make sure you install the latest versions before you install docker. In principle, you can also use a native shell like PowerShell, but this might result into some issues with bind mounting directories. Work in projects We recommend to work in a project folder. This will make it easier to find your files and to share them with others. You can create a project folder anywhere on your computer. For example, you can create a folder projects in your home directory and then create a subfolder docker-snakemake-course in it. You can then open this folder in VScode. Let\u2019s create our first container from an existing image. We do this with the image ubuntu , generating an environment with a minimal installation of ubuntu. docker run -it ubuntu This will give you an interactive shell into the created container (this interactivity was invoked by the options -i and -t ) . Exercise: Check out the operating system of the container by typing cat /etc/os-release in the container\u2019s shell. Are we really in an ubuntu environment? Answer Yes: root@27f7d11608de:/# cat /etc/os-release NAME=\"Ubuntu\" VERSION=\"20.04.1 LTS (Focal Fossa)\" ID=ubuntu ID_LIKE=debian PRETTY_NAME=\"Ubuntu 20.04.1 LTS\" VERSION_ID=\"20.04\" HOME_URL=\"https://www.ubuntu.com/\" SUPPORT_URL=\"https://help.ubuntu.com/\" BUG_REPORT_URL=\"https://bugs.launchpad.net/ubuntu/\" PRIVACY_POLICY_URL=\"https://www.ubuntu.com/legal/terms-and-policies/privacy-policy\" VERSION_CODENAME=focal UBUNTU_CODENAME=focal Where does the image come from? If the image ubuntu was not on your computer yet, docker will search and try to get them from dockerhub , and download it. Exercise: Run the command whoami in the docker container. Who are you? Answer The command whoami returns the current user. In the container whoami will return root . This means you are the root user i.e. within the container you are admin and can basically change anything. Check out the container panel at the Docker dashboard (the Docker gui) or open another host terminal and type: docker container ls -a Exercise: What is the container status? Answer In Docker dashboard you can see that the shell is running: The output of docker container ls -a is: CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 27f7d11608de ubuntu \"/bin/bash\" 7 minutes ago Up 6 minutes great_moser Also showing you that the STATUS is Up . Now let\u2019s install some software in our ubuntu environment. We\u2019ll install some simple software called figlet . Type into the container shell: apt-get update apt-get install figlet This will give some warnings This installation will give some warnings. It\u2019s safe to ignore them. Now let\u2019s try it out. Type into the container shell: figlet 'SIB courses are great!' Now you have installed and used software figlet in an ubuntu environment (almost) completely separated from your host computer. This already gives you an idea of the power of containerization. Exit the shell by typing exit . Check out the container panel of Docker dashboard or type: docker container ls -a Exercise: What is the container status? 
Answer docker container ls -a gives: CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 27f7d11608de ubuntu \"/bin/bash\" 15 minutes ago Exited (0) 8 seconds ago great_moser Showing that the container has exited, meaning it\u2019s not running.","title":"Exercises"},{"location":"course_material/day1/managing_docker/","text":"Learning outcomes After having completed this chapter you will be able to: Explain the concept of layers in the context of docker containers and images Use the command line to restart and re-attach to an exited container Create a new image with docker commit List locally available images with docker image ls Run a command inside a container non-interactively Use docker image inspect to get more information on an image Use the command line to prune dangling images and stopped containers Rename and tag a docker image Push a newly created image to dockerhub Use the option --mount to bind mount a host directory to a container Material Download the presentation Overview of how docker works More on bind mounts Docker volumes in general Exercises Restarting an exited container If you would like to go back to your container with the figlet installation, you could try to run again: docker run -it ubuntu Exercise: Run the above command. Is your figlet installation still there? Why? Hint Check the status of your containers: docker container ls -a Answer No, the installation is gone. Another container was created from the same ubuntu image, without the figlet installation. Running the command docker container ls -a results in: CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 8d7c4c611b70 ubuntu \"/bin/bash\" About a minute ago Up About a minute kind_mendel 27f7d11608de ubuntu \"/bin/bash\" 27 minutes ago Exited (0) 2 minutes ago great_moser In this case the container great_moser contains the figlet installation. But we have exited that container. We created a new container ( kind_mendel in this case) with a fresh environment created from the original ubuntu image. To restart your first created container, you\u2019ll have to look up its name. You can find it in the Docker dashboard, or with docker container ls -a . Container names The container name is the funny combination of two words separated by _ , e.g.: nifty_sinoussi . Alternatively you can use the container ID (the first column of the output of docker container ls ) To restart a container you can use: docker start [ CONTAINER NAME ] And after that to re-attach to the shell: docker attach [ CONTAINER NAME ] And you\u2019re back in the container shell. Exercise: Run the docker start and docker attach commands for the container that is supposed to contain the figlet installation. Is the installation of figlet still there? Answer yes: figlet 'try some more text!' Should give you output. docker attach and docker exec In addition to docker attach , you can also \u201cre-attach\u201d a container with docker exec . However, these two are quite different. While docker attach gets you back to your stopped shell process, docker exec creates a new one (more information on stackoverflow ). The command docker exec enables you therefore to have multiple shells open in the same container. That can be convenient if you have one shell open with a program running in the foreground, and another one for e.g. monitoring. An example for using docker exec on a running container: docker exec -it [ CONTAINER NAME ] /bin/bash Note that docker exec requires a CMD, it doesn\u2019t use the default. 
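For example, while one terminal stays attached to the container's main shell, a second terminal on the host could run one-off commands in the same running container. A sketch, assuming the container name great_moser from the earlier output (yours will differ):
# run a one-off command in the running container without attaching to its main shell
docker exec great_moser figlet 'hello from a second shell'
# or open a second, independent interactive shell in the same running container
docker exec -it great_moser /bin/bash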
Creating a new image You can store your changes and create a new image based on the ubuntu image like this: docker commit [ CONTAINER NAME ] ubuntu-figlet Exercise: Run the above command with the name of the container containing the figlet installation. Check out docker image ls . What have we just created? Answer A new image called ubuntu-figlet based on the status of the container. The output of docker image ls should look like: REPOSITORY TAG IMAGE ID CREATED SIZE ubuntu-figlet latest e08b999c7978 4 seconds ago 101MB ubuntu latest f63181f19b2f 29 hours ago 72.9MB Now you can generate a new container based on the new image: docker run -it ubuntu-figlet Exercise: Run the above command. Is the figlet installation in the created container? Answer yes Commands The second positional argument of docker run can be a command followed by its arguments. So, we could run a container non-interactively (without -it ), and just let it run a single command: docker run ubuntu-figlet figlet 'non-interactive run' Resulting in just the output of the figlet command. In the previous exercises we have run containers without a command as positional argument. This doesn\u2019t mean that no command has been run, because the container would do nothing without a command. The default command is stored in the image, and you can find it by docker image inspect [IMAGE NAME] . Exercise: Have a look at the output of docker image inspect , particularly at \"Config\" (ignore \"ContainerConfig\" for now). What is the default command ( CMD ) of the ubuntu image? Answer Running docker image inspect ubuntu gives (amongst other information): \"Cmd\" : [ \"/bin/bash\" ] , In the case of the ubuntu the default command is bash , returning a shell in bash (i.e. Bourne again shell ). Adding the options -i and -t ( -it ) to your docker run command will therefore result in an interactive bash shell. You can modify this default behaviour. More on that later, when we will work on Dockerfiles . The difference between Config and ContainerConfig The configuration at Config represents the image, the configuration at ContainerConfig the last step during the build of the image, i.e. the last layer. More info e.g. at this post at stackoverflow . Removing containers In the meantime, with every call of docker run we have created a new container (check your containers with docker container ls -a ). You probably don\u2019t want to remove those one-by-one. These two commands are very useful to clean up your Docker cache: docker container prune : removes stopped containers docker image prune : removes dangling images (i.e. images without a name) So, remove your stopped containers with: docker container prune Unless you\u2019re developing further on a container, or you\u2019re using it for an analysis, you probably want to get rid of it once you have exited the container. You can do this with adding --rm to your docker run command, e.g.: docker run --rm ubuntu-figlet figlet 'non-interactive run' Pushing to dockerhub Now that we have created our first own docker image, we can store it and share it with the world on docker hub. Before we get there, we first have to (re)name and tag it. Before pushing an image to dockerhub, docker has to know to which user and which repository the image should be added. That information should be in the name of the image, like this: user/imagename . We can rename an image with docker tag (which is a bit of misleading name for the command). 
So we could push to dockerhub like this: docker tag ubuntu-figlet [USER NAME]/ubuntu-figlet docker push [USER NAME]/ubuntu-figlet If on Linux If you are on Linux and haven\u2019t connected to docker hub before, you will have login first. To do that, run: docker login How docker makes money All images pushed to dockerhub are open to the world. With a free account you can have one image on dockerhub that is private. Paid accounts can have more private images, and are therefore popular for commercial organisations. As an alternative to dockerhub, you can store images locally with docker save . We didn\u2019t specify the tag for our new image. That\u2019s why docker tag gave it the default tag called latest . Pushing an image without a tag will overwrite the current image with the tag latest (more on (not) using latest here ). If you want to maintain multiple versions of your image, you will have to add a tag, and push the image with that tag to dockerhub: docker tag ubuntu-figlet [USER NAME]/ubuntu-figlet:v1 docker push [USER NAME]/ubuntu-figlet:v1 Mounting a directory For many analyses you do calculations with files or scripts that are on your host (local) computer. But how do you make them available to a docker container? You can do that in several ways, but here we will use bind-mount. You can bind-mount a directory with -v ( --volume ) or --mount . Most old-school docker users will use -v , but --mount syntax is easier to understand and now recommended, so we will use the latter here: docker run \\ --mount type = bind,source = /host/source/path,target = /path/in/container \\ [ IMAGE ] The target directory will be created if it does not yet exist. The source directory should exist. MobaXterm users You can specify your local path with the Windows syntax (e.g. C:\\Users\\myusername ). However, you will have to use forward slashes ( / ) instead of backward slashes ( \\ ). Therefore, mounting a directory would look like: docker run \\ --mount type = bind,source = C:/Users/myusername,target = /path/in/container \\ [ IMAGE ] Do not use autocompletion or variable substitution (e.g. $PWD ) in MobaXterm, since these point to \u2018emulated\u2019 paths, and are not passed properly to the docker command. Using docker from Windows PowerShell Most of the syntax for docker is the same for both PowerShell and UNIX-based systems. However, there are some differences, e.g. in Windows, directories in file paths are separated by \\ instead of / . Also, line breaks are not escaped by \\ but by `. Exercise: Mount a host (local) directory to a target directory /working_dir in a container created from the ubuntu-figlet image and run it interactively. Check whether the target directory has been created. Answer e.g. on Mac OS this would be: docker run \\ -it \\ --mount type = bind,source = /Users/myusername/working_dir,target = /working_dir/ \\ ubuntu-figlet This creates a directory called working_dir in the root directory ( / ): root@8d80a8698865:/# ls bin dev home lib32 libx32 mnt proc run srv tmp var boot etc lib lib64 media opt root sbin sys usr working_dir This mounted directory is both available for the host (locally) and for the container. You can therefore e.g. copy files in there, and write output generated by the container. Exercise: Write the output of figlet \"testing mounted dir\" to a file in /working_dir . Check whether it is available on the host (locally) in the source directory. 
Hint You can write the output of figlet to a file like this: figlet 'some string' > file.txt Answer root@8d80a8698865:/# figlet 'testing mounted dir' > /working_dir/figlet_output.txt This should create a file in both your host (local) source directory and the target directory in the container called figlet_output.txt . Using files on the host This of course also works the other way around. If you would have a file on the host with e.g. a text, you can copy it into your mounted directory, and it will be available to the container. Managing permissions (extra) Depending on your system, the user ID and group ID will be taken over from the user inside the container. If the user inside the container is root, this will be root. That\u2019s a bit inconvenient if you just want to run the container as a regular user (for example in certain circumstances your container could write in / ). To do that, use the -u option, and specify the group ID and user ID like this: docker run -u [ uid ] : [ gid ] So, e.g.: docker run \\ -it \\ -u 1000 :1000 \\ --mount type = bind,source = /Users/myusername/working_dir,target = /working_dir/ \\ ubuntu-figlet If you want docker to take over your current uid and gid, you can use: docker run -u \"$(id -u):$(id -g)\" This behaviour is different on MacOS and MobaXterm On MacOS and in the local shell of MobaXterm the uid and gid are taken over from the user running the container (even if you set -u as 0:0), i.e. your current ID. More info on stackoverflow . Exercise: Start an interactive container based on the ubuntu-figlet image, bind-mount a local directory and take over your current uid and gid . Write the output of a figlet command to a file in the mounted directory. Who and which group owns the file inside the container? And outside the container? Answer the same question but now run the container without setting -u . Answer Linux MacOS MobaXterm Running ubuntu-figlet interactively while taking over uid and gid and mounting my current directory: docker run -it --mount type = bind,source = $PWD ,target = /data -u \" $( id -u ) : $( id -g ) \" ubuntu-figlet Inside container: I have no name!@e808d7c36e7c:/$ id uid=1000 gid=1000 groups=1000 So, I have taken over uid 1000 and gid 1000. I have no name!@e808d7c36e7c:/$ cd /data I have no name!@e808d7c36e7c:/data$ figlet 'uid set' > uid_set.txt I have no name!@e808d7c36e7c:/data$ ls -lh -rw-r--r-- 1 1000 1000 0 Mar 400 13:37 uid_set.txt So the file belongs to user 1000, and group 1000. Outside container: ubuntu@ip-172-31-33-21:~$ ls -lh -rw-r--r-- 1 ubuntu ubuntu 400 Mar 5 13:37 uid_set.txt Which makes sense: ubuntu@ip-172-31-33-21:~$ id uid=1000(ubuntu) gid=1000(ubuntu) groups=1000(ubuntu) Running ubuntu-figlet interactively without taking over uid and gid : docker run -it --mount type = bind,source = $PWD ,target = /data ubuntu-figlet Inside container: root@fface8afb220:/# id uid=0(root) gid=0(root) groups=0(root) So, uid and gid are root . root@fface8afb220:/# cd /data root@fface8afb220:/data# figlet 'uid unset' > uid_unset.txt root@fface8afb220:/data# ls -lh -rw-r--r-- 1 1000 1000 400 Mar 5 13:37 uid_set.txt -rw-r--r-- 1 root root 400 Mar 5 13:40 uid_unset.txt Outside container: ubuntu@ip-172-31-33-21:~$ ls -lh -rw-r--r-- 1 ubuntu ubuntu 0 Mar 5 13:37 uid_set.txt -rw-r--r-- 1 root root 0 Mar 5 13:40 uid_unset.txt So, the uid and gid 0 (root:root) are taken over. 
Running ubuntu-figlet interactively while taking over uid and gid and mounting my current directory: docker run -it --mount type = bind,source = $PWD ,target = /data -u \" $( id -u ) : $( id -g ) \" ubuntu-figlet Inside container: I have no name!@e808d7c36e7c:/$ id uid=503 gid=20(dialout) groups=20(dialout) So, the container has taken over uid 503 and group 20 I have no name!@e808d7c36e7c:/$ cd /data I have no name!@e808d7c36e7c:/data$ figlet 'uid set' > uid_set.txt I have no name!@e808d7c36e7c:/data$ ls -lh -rw-r--r-- 1 503 dialout 400 Mar 5 13:11 uid_set.txt So the file belongs to user 503, and the group dialout . Outside container: mac-34392:~ geertvangeest$ ls -lh -rw-r--r-- 1 geertvangeest staff 400B Mar 5 14:11 uid_set.txt Which are the same as inside the container: mac-34392:~ geertvangeest$ echo \"$(id -u):$(id -g)\" 503:20 The uid 503 was nameless in the docker container. However the group 20 already existed in the ubuntu container, and was named dialout . Running ubuntu-figlet interactively without taking over uid and gid : docker run -it --mount type = bind,source = $PWD ,target = /data ubuntu-figlet Inside container: root@fface8afb220:/# id uid=0(root) gid=0(root) groups=0(root) So, inside the container I am root . Creating new files will lead to ownership of root inside the container: root@fface8afb220:/# cd /data root@fface8afb220:/data# figlet 'uid unset' > uid_unset.txt root@fface8afb220:/data# ls -lh -rw-r--r-- 1 503 dialout 400 Mar 5 13:11 uid_set.txt -rw-r--r-- 1 root root 400 Mar 5 13:25 uid_unset.txt Outside container: mac-34392:~ geertvangeest$ ls -lh -rw-r--r-- 1 geertvangeest staff 400B Mar 5 14:11 uid_set.txt -rw-r--r-- 1 geertvangeest staff 400B Mar 5 14:15 uid_unset.txt So, the uid and gid 0 (root:root) are not taken over. Instead, the uid and gid of the user running docker were used. Running ubuntu-figlet interactively while taking over uid and gid and mounting to a specfied directory: docker run -it --mount type = bind,source = C:/Users/geert/data,target = /data -u \" $( id -u ) : $( id -g ) \" ubuntu-figlet Inside container: I have no name!@e808d7c36e7c:/$ id uid=1003 gid=513 groups=513 So, the container has taken over uid 1003 and group 513 I have no name!@e808d7c36e7c:/$ cd /data I have no name!@e808d7c36e7c:/data$ figlet 'uid set' > uid_set.txt I have no name!@e808d7c36e7c:/data$ ls -lh -rw-r--r-- 1 1003 513 400 Mar 5 13:11 uid_set.txt So the file belongs to user 1003, and the group 513. Outside container: /home/mobaxterm/data$ ls -lh -rwx------ 1 geert UserGrp 400 Mar 5 14:11 uid_set.txt Which are the same as inside the container: /home/mobaxterm/data$ echo \"$(id -u):$(id -g)\" 1003:513 Running ubuntu-figlet interactively without taking over uid and gid : docker run -it --mount type = bind,source = C:/Users/geert/data,target = /data ubuntu-figlet Inside container: root@fface8afb220:/# id uid=0(root) gid=0(root) groups=0(root) So, inside the container I am root . Creating new files will lead to ownership of root inside the container: root@fface8afb220:/# cd /data root@fface8afb220:/data# figlet 'uid unset' > uid_unset.txt root@fface8afb220:/data# ls -lh -rw-r--r-- 1 1003 503 400 Mar 5 13:11 uid_set.txt -rw-r--r-- 1 root root 400 Mar 5 13:25 uid_unset.txt Outside container: /home/mobaxterm/data$ ls -lh -rwx------ 1 geert UserGrp 400 Mar 5 14:11 uid_set.txt -rwx------ 1 geert UserGrp 400 Mar 5 14:15 uid_unset.txt So, the uid and gid 0 (root:root) are not taken over. 
Instead, the uid and gid of the user running docker were used.","title":"Managing containers and images"},{"location":"course_material/day1/managing_docker/#learning-outcomes","text":"After having completed this chapter you will be able to: Explain the concept of layers in the context of docker containers and images Use the command line to restart and re-attach to an exited container Create a new image with docker commit List locally available images with docker image ls Run a command inside a container non-interactively Use docker image inspect to get more information on an image Use the command line to prune dangling images and stopped containers Rename and tag a docker image Push a newly created image to dockerhub Use the option --mount to bind mount a host directory to a container","title":"Learning outcomes"},{"location":"course_material/day1/managing_docker/#material","text":"Download the presentation Overview of how docker works More on bind mounts Docker volumes in general","title":"Material"},{"location":"course_material/day1/managing_docker/#exercises","text":"","title":"Exercises"},{"location":"course_material/day1/managing_docker/#restarting-an-exited-container","text":"If you would like to go back to your container with the figlet installation, you could try to run again: docker run -it ubuntu Exercise: Run the above command. Is your figlet installation still there? Why? Hint Check the status of your containers: docker container ls -a Answer No, the installation is gone. Another container was created from the same ubuntu image, without the figlet installation. Running the command docker container ls -a results in: CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 8d7c4c611b70 ubuntu \"/bin/bash\" About a minute ago Up About a minute kind_mendel 27f7d11608de ubuntu \"/bin/bash\" 27 minutes ago Exited (0) 2 minutes ago great_moser In this case the container great_moser contains the figlet installation. But we have exited that container. We created a new container ( kind_mendel in this case) with a fresh environment created from the original ubuntu image. To restart your first created container, you\u2019ll have to look up its name. You can find it in the Docker dashboard, or with docker container ls -a . Container names The container name is the funny combination of two words separated by _ , e.g.: nifty_sinoussi . Alternatively you can use the container ID (the first column of the output of docker container ls ) To restart a container you can use: docker start [ CONTAINER NAME ] And after that to re-attach to the shell: docker attach [ CONTAINER NAME ] And you\u2019re back in the container shell. Exercise: Run the docker start and docker attach commands for the container that is supposed to contain the figlet installation. Is the installation of figlet still there? Answer yes: figlet 'try some more text!' Should give you output. docker attach and docker exec In addition to docker attach , you can also \u201cre-attach\u201d a container with docker exec . However, these two are quite different. While docker attach gets you back to your stopped shell process, docker exec creates a new one (more information on stackoverflow ). The command docker exec enables you therefore to have multiple shells open in the same container. That can be convenient if you have one shell open with a program running in the foreground, and another one for e.g. monitoring. 
An example for using docker exec on a running container: docker exec -it [ CONTAINER NAME ] /bin/bash Note that docker exec requires a CMD, it doesn\u2019t use the default.","title":"Restarting an exited container"},{"location":"course_material/day1/managing_docker/#creating-a-new-image","text":"You can store your changes and create a new image based on the ubuntu image like this: docker commit [ CONTAINER NAME ] ubuntu-figlet Exercise: Run the above command with the name of the container containing the figlet installation. Check out docker image ls . What have we just created? Answer A new image called ubuntu-figlet based on the status of the container. The output of docker image ls should look like: REPOSITORY TAG IMAGE ID CREATED SIZE ubuntu-figlet latest e08b999c7978 4 seconds ago 101MB ubuntu latest f63181f19b2f 29 hours ago 72.9MB Now you can generate a new container based on the new image: docker run -it ubuntu-figlet Exercise: Run the above command. Is the figlet installation in the created container? Answer yes","title":"Creating a new image"},{"location":"course_material/day1/managing_docker/#commands","text":"The second positional argument of docker run can be a command followed by its arguments. So, we could run a container non-interactively (without -it ), and just let it run a single command: docker run ubuntu-figlet figlet 'non-interactive run' Resulting in just the output of the figlet command. In the previous exercises we have run containers without a command as positional argument. This doesn\u2019t mean that no command has been run, because the container would do nothing without a command. The default command is stored in the image, and you can find it by docker image inspect [IMAGE NAME] . Exercise: Have a look at the output of docker image inspect , particularly at \"Config\" (ignore \"ContainerConfig\" for now). What is the default command ( CMD ) of the ubuntu image? Answer Running docker image inspect ubuntu gives (amongst other information): \"Cmd\" : [ \"/bin/bash\" ] , In the case of the ubuntu the default command is bash , returning a shell in bash (i.e. Bourne again shell ). Adding the options -i and -t ( -it ) to your docker run command will therefore result in an interactive bash shell. You can modify this default behaviour. More on that later, when we will work on Dockerfiles . The difference between Config and ContainerConfig The configuration at Config represents the image, the configuration at ContainerConfig the last step during the build of the image, i.e. the last layer. More info e.g. at this post at stackoverflow .","title":"Commands"},{"location":"course_material/day1/managing_docker/#removing-containers","text":"In the meantime, with every call of docker run we have created a new container (check your containers with docker container ls -a ). You probably don\u2019t want to remove those one-by-one. These two commands are very useful to clean up your Docker cache: docker container prune : removes stopped containers docker image prune : removes dangling images (i.e. images without a name) So, remove your stopped containers with: docker container prune Unless you\u2019re developing further on a container, or you\u2019re using it for an analysis, you probably want to get rid of it once you have exited the container. 
You can do this with adding --rm to your docker run command, e.g.: docker run --rm ubuntu-figlet figlet 'non-interactive run'","title":"Removing containers"},{"location":"course_material/day1/managing_docker/#pushing-to-dockerhub","text":"Now that we have created our first own docker image, we can store it and share it with the world on docker hub. Before we get there, we first have to (re)name and tag it. Before pushing an image to dockerhub, docker has to know to which user and which repository the image should be added. That information should be in the name of the image, like this: user/imagename . We can rename an image with docker tag (which is a bit of misleading name for the command). So we could push to dockerhub like this: docker tag ubuntu-figlet [USER NAME]/ubuntu-figlet docker push [USER NAME]/ubuntu-figlet If on Linux If you are on Linux and haven\u2019t connected to docker hub before, you will have login first. To do that, run: docker login How docker makes money All images pushed to dockerhub are open to the world. With a free account you can have one image on dockerhub that is private. Paid accounts can have more private images, and are therefore popular for commercial organisations. As an alternative to dockerhub, you can store images locally with docker save . We didn\u2019t specify the tag for our new image. That\u2019s why docker tag gave it the default tag called latest . Pushing an image without a tag will overwrite the current image with the tag latest (more on (not) using latest here ). If you want to maintain multiple versions of your image, you will have to add a tag, and push the image with that tag to dockerhub: docker tag ubuntu-figlet [USER NAME]/ubuntu-figlet:v1 docker push [USER NAME]/ubuntu-figlet:v1","title":"Pushing to dockerhub"},{"location":"course_material/day1/managing_docker/#mounting-a-directory","text":"For many analyses you do calculations with files or scripts that are on your host (local) computer. But how do you make them available to a docker container? You can do that in several ways, but here we will use bind-mount. You can bind-mount a directory with -v ( --volume ) or --mount . Most old-school docker users will use -v , but --mount syntax is easier to understand and now recommended, so we will use the latter here: docker run \\ --mount type = bind,source = /host/source/path,target = /path/in/container \\ [ IMAGE ] The target directory will be created if it does not yet exist. The source directory should exist. MobaXterm users You can specify your local path with the Windows syntax (e.g. C:\\Users\\myusername ). However, you will have to use forward slashes ( / ) instead of backward slashes ( \\ ). Therefore, mounting a directory would look like: docker run \\ --mount type = bind,source = C:/Users/myusername,target = /path/in/container \\ [ IMAGE ] Do not use autocompletion or variable substitution (e.g. $PWD ) in MobaXterm, since these point to \u2018emulated\u2019 paths, and are not passed properly to the docker command. Using docker from Windows PowerShell Most of the syntax for docker is the same for both PowerShell and UNIX-based systems. However, there are some differences, e.g. in Windows, directories in file paths are separated by \\ instead of / . Also, line breaks are not escaped by \\ but by `. Exercise: Mount a host (local) directory to a target directory /working_dir in a container created from the ubuntu-figlet image and run it interactively. Check whether the target directory has been created. Answer e.g. 
on Mac OS this would be: docker run \\ -it \\ --mount type = bind,source = /Users/myusername/working_dir,target = /working_dir/ \\ ubuntu-figlet This creates a directory called working_dir in the root directory ( / ): root@8d80a8698865:/# ls bin dev home lib32 libx32 mnt proc run srv tmp var boot etc lib lib64 media opt root sbin sys usr working_dir This mounted directory is both available for the host (locally) and for the container. You can therefore e.g. copy files in there, and write output generated by the container. Exercise: Write the output of figlet \"testing mounted dir\" to a file in /working_dir . Check whether it is available on the host (locally) in the source directory. Hint You can write the output of figlet to a file like this: figlet 'some string' > file.txt Answer root@8d80a8698865:/# figlet 'testing mounted dir' > /working_dir/figlet_output.txt This should create a file in both your host (local) source directory and the target directory in the container called figlet_output.txt . Using files on the host This of course also works the other way around. If you would have a file on the host with e.g. a text, you can copy it into your mounted directory, and it will be available to the container.","title":"Mounting a directory"},{"location":"course_material/day1/managing_docker/#managing-permissions-extra","text":"Depending on your system, the user ID and group ID will be taken over from the user inside the container. If the user inside the container is root, this will be root. That\u2019s a bit inconvenient if you just want to run the container as a regular user (for example in certain circumstances your container could write in / ). To do that, use the -u option, and specify the group ID and user ID like this: docker run -u [ uid ] : [ gid ] So, e.g.: docker run \\ -it \\ -u 1000 :1000 \\ --mount type = bind,source = /Users/myusername/working_dir,target = /working_dir/ \\ ubuntu-figlet If you want docker to take over your current uid and gid, you can use: docker run -u \"$(id -u):$(id -g)\" This behaviour is different on MacOS and MobaXterm On MacOS and in the local shell of MobaXterm the uid and gid are taken over from the user running the container (even if you set -u as 0:0), i.e. your current ID. More info on stackoverflow . Exercise: Start an interactive container based on the ubuntu-figlet image, bind-mount a local directory and take over your current uid and gid . Write the output of a figlet command to a file in the mounted directory. Who and which group owns the file inside the container? And outside the container? Answer the same question but now run the container without setting -u . Answer Linux MacOS MobaXterm Running ubuntu-figlet interactively while taking over uid and gid and mounting my current directory: docker run -it --mount type = bind,source = $PWD ,target = /data -u \" $( id -u ) : $( id -g ) \" ubuntu-figlet Inside container: I have no name!@e808d7c36e7c:/$ id uid=1000 gid=1000 groups=1000 So, I have taken over uid 1000 and gid 1000. I have no name!@e808d7c36e7c:/$ cd /data I have no name!@e808d7c36e7c:/data$ figlet 'uid set' > uid_set.txt I have no name!@e808d7c36e7c:/data$ ls -lh -rw-r--r-- 1 1000 1000 0 Mar 400 13:37 uid_set.txt So the file belongs to user 1000, and group 1000. 
Outside container: ubuntu@ip-172-31-33-21:~$ ls -lh -rw-r--r-- 1 ubuntu ubuntu 400 Mar 5 13:37 uid_set.txt Which makes sense: ubuntu@ip-172-31-33-21:~$ id uid=1000(ubuntu) gid=1000(ubuntu) groups=1000(ubuntu) Running ubuntu-figlet interactively without taking over uid and gid : docker run -it --mount type = bind,source = $PWD ,target = /data ubuntu-figlet Inside container: root@fface8afb220:/# id uid=0(root) gid=0(root) groups=0(root) So, uid and gid are root . root@fface8afb220:/# cd /data root@fface8afb220:/data# figlet 'uid unset' > uid_unset.txt root@fface8afb220:/data# ls -lh -rw-r--r-- 1 1000 1000 400 Mar 5 13:37 uid_set.txt -rw-r--r-- 1 root root 400 Mar 5 13:40 uid_unset.txt Outside container: ubuntu@ip-172-31-33-21:~$ ls -lh -rw-r--r-- 1 ubuntu ubuntu 0 Mar 5 13:37 uid_set.txt -rw-r--r-- 1 root root 0 Mar 5 13:40 uid_unset.txt So, the uid and gid 0 (root:root) are taken over. Running ubuntu-figlet interactively while taking over uid and gid and mounting my current directory: docker run -it --mount type = bind,source = $PWD ,target = /data -u \" $( id -u ) : $( id -g ) \" ubuntu-figlet Inside container: I have no name!@e808d7c36e7c:/$ id uid=503 gid=20(dialout) groups=20(dialout) So, the container has taken over uid 503 and group 20 I have no name!@e808d7c36e7c:/$ cd /data I have no name!@e808d7c36e7c:/data$ figlet 'uid set' > uid_set.txt I have no name!@e808d7c36e7c:/data$ ls -lh -rw-r--r-- 1 503 dialout 400 Mar 5 13:11 uid_set.txt So the file belongs to user 503, and the group dialout . Outside container: mac-34392:~ geertvangeest$ ls -lh -rw-r--r-- 1 geertvangeest staff 400B Mar 5 14:11 uid_set.txt Which are the same as inside the container: mac-34392:~ geertvangeest$ echo \"$(id -u):$(id -g)\" 503:20 The uid 503 was nameless in the docker container. However the group 20 already existed in the ubuntu container, and was named dialout . Running ubuntu-figlet interactively without taking over uid and gid : docker run -it --mount type = bind,source = $PWD ,target = /data ubuntu-figlet Inside container: root@fface8afb220:/# id uid=0(root) gid=0(root) groups=0(root) So, inside the container I am root . Creating new files will lead to ownership of root inside the container: root@fface8afb220:/# cd /data root@fface8afb220:/data# figlet 'uid unset' > uid_unset.txt root@fface8afb220:/data# ls -lh -rw-r--r-- 1 503 dialout 400 Mar 5 13:11 uid_set.txt -rw-r--r-- 1 root root 400 Mar 5 13:25 uid_unset.txt Outside container: mac-34392:~ geertvangeest$ ls -lh -rw-r--r-- 1 geertvangeest staff 400B Mar 5 14:11 uid_set.txt -rw-r--r-- 1 geertvangeest staff 400B Mar 5 14:15 uid_unset.txt So, the uid and gid 0 (root:root) are not taken over. Instead, the uid and gid of the user running docker were used. Running ubuntu-figlet interactively while taking over uid and gid and mounting to a specfied directory: docker run -it --mount type = bind,source = C:/Users/geert/data,target = /data -u \" $( id -u ) : $( id -g ) \" ubuntu-figlet Inside container: I have no name!@e808d7c36e7c:/$ id uid=1003 gid=513 groups=513 So, the container has taken over uid 1003 and group 513 I have no name!@e808d7c36e7c:/$ cd /data I have no name!@e808d7c36e7c:/data$ figlet 'uid set' > uid_set.txt I have no name!@e808d7c36e7c:/data$ ls -lh -rw-r--r-- 1 1003 513 400 Mar 5 13:11 uid_set.txt So the file belongs to user 1003, and the group 513. 
Outside container: /home/mobaxterm/data$ ls -lh -rwx------ 1 geert UserGrp 400 Mar 5 14:11 uid_set.txt Which are the same as inside the container: /home/mobaxterm/data$ echo \"$(id -u):$(id -g)\" 1003:513 Running ubuntu-figlet interactively without taking over uid and gid : docker run -it --mount type = bind,source = C:/Users/geert/data,target = /data ubuntu-figlet Inside container: root@fface8afb220:/# id uid=0(root) gid=0(root) groups=0(root) So, inside the container I am root . Creating new files will lead to ownership of root inside the container: root@fface8afb220:/# cd /data root@fface8afb220:/data# figlet 'uid unset' > uid_unset.txt root@fface8afb220:/data# ls -lh -rw-r--r-- 1 1003 503 400 Mar 5 13:11 uid_set.txt -rw-r--r-- 1 root root 400 Mar 5 13:25 uid_unset.txt Outside container: /home/mobaxterm/data$ ls -lh -rwx------ 1 geert UserGrp 400 Mar 5 14:11 uid_set.txt -rwx------ 1 geert UserGrp 400 Mar 5 14:15 uid_unset.txt So, the uid and gid 0 (root:root) are not taken over. Instead, the uid and gid of the user running docker were used.","title":"Managing permissions (extra)"},{"location":"course_material/day1/singularity/","text":"Learning outcomes After having completed this chapter you will be able to: Login to a remote machine with ssh Use apptainer pull to convert an image from dockerhub to the \u2018apptainer image format\u2019 ( .sif ) Execute a apptainer container Explain the difference in default mounting behaviour between docker and apptainer Use apptainer shell to generate an interactive shell inside a .sif image Search and use images with both docker and apptainer from bioconda Material Download the presentation Apptainer documentation Apptainer hub An article on Docker vs Apptainer Using conda and containers with snakemake Exercises Login to remote If you are enrolled in the course, you have received an e-mail with an IP, username, private key and password. To do the Apptainer exercises we will login to a remote server. Below you can find instructions on how to login. VScode is a code editor that can be used to edit files and run commands locally, but also on a remote server. In this subchapter we will set up VScode to work remotely. If not working with VScode If you are not working with VScode, you can login to the remote server with the following command: ssh -i key_username.pem If you want to edit files directly on the server, you can mount a directory with sshfs . Required installations For this exercise it is easiest if you use VScode . In addition you would need to have followed the instructions to set up remote-ssh: OpenSSH compatible client . This is usually pre-installed on your OS. You can check whether the command ssh exists. The Remote-SSH extension. To install, open VSCode and click on the extensions icon (four squares) on the left side of the window. Search for Remote-SSH and click on Install . Windows mac OS/Linux Open a PowerShell and cd to the directory where you have stored your private key. After that, move it to ~\\.ssh : mv .\\ key_username . pem ~\\. ssh Open a terminal, and cd to the directory where you have stored your private key. After that, change the file permissions of the key and move it to ~/.ssh : chmod 400 key_username.pem mv key_username.pem ~/.ssh Open VScode and click on the green or blue button in the bottom left corner. Select Connect to Host... , and then on Configure SSH Host... . Specify a the location for the config file. Use the same directory as where your keys are stored (so ~/.ssh ). A skeleton config file will be provided. 
Edit it, so it looks like this (replace username with your username, and specify the correct IP at HostName ): Windows MacOS/Linux Host sib_course_remote User username HostName 123.456.789.123 IdentityFile ~\\.ssh\\key_username.pem Host sib_course_remote User username HostName 123.456.789.123 IdentityFile ~/.ssh/key_username.pem Save and close the config file. Now click again the green or blue button in the bottom left corner. Select Connect to Host... , and then on sib_course_remote . You will be asked which operating system is used on the remote. Specify \u2018Linux\u2019. Pulling an image Apptainer can take several image formats (e.g. a docker image), and convert them into it\u2019s own .sif format. Unlike docker this image doesn\u2019t live in a local image cache, but it\u2019s stored as an actual file. Exercise: On the remote server, pull the docker image that has the adjusted default CMD that we have pushed to dockerhub in this exercise ( ubuntu-figlet-df:v3 ) with apptainer pull . The syntax is: apptainer pull docker:// [ USER NAME ] / [ IMAGE NAME ] : [ TAG ] Answer apptainer pull docker:// [ USER NAME ] /ubuntu-figlet:v3 This will result in a file called ubuntu-figlet_v3.sif Note If you weren\u2019t able to push the image in the previous exercises to your docker hub, you can use geertvangeest as username to pull the image. Executing an image These .sif files can be run as standalone executables: ./ubuntu-figlet_v3.sif Note This is shorthand for: apptainer run ubuntu-figlet_v3.sif And you can overwrite the default command like this: apptainer run [ IMAGE NAME ] .sif [ COMMAND ] Note In this case, you can also use ./ [ IMAGE NAME ] .sif [ COMMAND ] However, most applications require apptainer run . Especially if you want to provide options like --bind (for mounting directories). Exercise: Run the .sif file without a command, and with a command that runs figlet . Do you get expected output? Do the same for the R image you\u2019ve created in the previous chapter. Entrypoint and apptainer The daterange image has an entrypoint set, and apptainer run does not overwrite it. In order to ignore both the entrypoint and cmd use apptainer exec . Answer Running it without a command ( ./ubuntu-figlet_v3.sif ) should give: __ __ _ _ _ | \\/ |_ _ (_)_ __ ___ __ _ __ _ ___ __ _____ _ __| | _____| | | |\\/| | | | | | | '_ ` _ \\ / _` |/ _` |/ _ \\ \\ \\ /\\ / / _ \\| '__| |/ / __| | | | | | |_| | | | | | | | | (_| | (_| | __/ \\ V V / (_) | | | <\\__ \\_| |_| |_|\\__, | |_|_| |_| |_|\\__,_|\\__, |\\___| \\_/\\_/ \\___/|_| |_|\\_\\___(_) |___/ |___/ Which is the default command that we changed in the Dockerfile . Running with a another figlet command: ./ubuntu-figlet_v3.sif figlet 'Something else' Should give: ____ _ _ _ _ / ___| ___ _ __ ___ ___| |_| |__ (_)_ __ __ _ ___| |___ ___ \\___ \\ / _ \\| '_ ` _ \\ / _ \\ __| '_ \\| | '_ \\ / _` | / _ \\ / __|/ _ \\ ___) | (_) | | | | | | __/ |_| | | | | | | | (_| | | __/ \\__ \\ __/ |____/ \\___/|_| |_| |_|\\___|\\__|_| |_|_|_| |_|\\__, | \\___|_|___/\\___| |___/ Pulling the deseq2 image: apptainer pull docker:// [ USER NAME ] /deseq2:v1 Running it without command: ./deseq2.sif Running with a command: ./deseq2.sif --rows 100 To overwrite both entrypoint and the command: apptainer exec deseq2.sif test_deseq2.R --rows 200 Mounting with Apptainer Apptainer is also different from Docker in the way it handles mounting. By default, Apptainer binds your home directory and a number of paths in the root directory to the container. 
This results in behaviour that is almost like if you are working on the directory structure of the host. If your directory is not mounted by default It depends on the apptainer settings whether most directories are mounted by default to the container. If your directory is not mounted, you can do that with the --bind option of apptainer exec : apptainer exec --bind /my/dir/to/mount/ [ IMAGE NAME ] .sif [ COMMAND ] Running the command pwd (full name of current working directory) will therefore result in a path on the host machine: ./ubuntu-figlet_v3.sif pwd Exercise: Run the above command. What is the output? How would the output look like if you would run a similar command with Docker? Hint A similar Docker command would look like (run this on your local computer): docker run --rm ubuntu-figlet:v3 pwd Answer The output of ./ubuntu-figlet_v3.sif pwd is the current directory on the host: i.e. /home/username if you have it in your home directory. The output of docker run --rm ubuntu-figlet:v3 pwd (on the local host) would be / , which is the default workdir (root directory) of the container. As we did not mount any host directory, this directory exists only within the container (i.e. separated from the host). Interactive shell If you want to debug or inspect an image, it can be helpful to have a shell inside the container. You can do that with apptainer shell : apptainer shell ubuntu-figlet_v3.sif Note To exit the shell type exit . Exercise: Can you run figlet inside this shell? Answer Yes: Apptainer> figlet test _ _ | |_ ___ ___| |_ | __/ _ \\/ __| __| | || __/\\__ \\ |_ \\__\\___||___/\\__| During the lecture you have learned that apptainer takes over the user privileges of the user on the host. You can get user information with command like whoami , id , groups etc. Exercise: Run the figlet container interactively. Do you have the same user privileges as if you were on the host? How is that with docker ? Answer A command like whoami will result in your username printed at stdout: Apptainer> whoami myusername Apptainer> id uid=1030(myusername) gid=1031(myusername) groups=1031(myusername),1001(condausers) Apptainer> groups myusername condausers With apptainer, you have the same privileges inside the apptainer container as on the host. If you do this in the docker container (based on the same image), you\u2019ll get output like this: root@a3d6e59dc19d:/# whoami root root@a3d6e59dc19d:/# groups root root@a3d6e59dc19d:/# id uid=0(root) gid=0(root) groups=0(root) A bioinformatics example (extra) All bioconda packages also have a pre-built container. Have a look at the bioconda website , and search for fastqc . In the search results, click on the appropriate record (i.e. package \u2018fastqc\u2019). Now, scroll down and find the namespace and tag for the latest fastqc image. Now we can pull it with apptainer like this: apptainer pull docker://quay.io/biocontainers/fastqc:0.11.9--hdfd78af_1 Let\u2019s test the image. Download some sample reads first: mkdir reads cd reads wget https://introduction-containers.s3.eu-central-1.amazonaws.com/ecoli_reads.tar.gz tar -xzvf ecoli_reads.tar.gz rm ecoli_reads.tar.gz Now you can simply run the image as an executable preceding the commands you would like to run within the container. E.g. running fastqc would look like: cd ./fastqc_0.11.9--hdfd78af_1.sif fastqc ./reads/ecoli_*.fastq.gz This will result in html files in the directory ./reads . These are quality reports for the sequence reads. If you\u2019d like to view them, you can download them with scp or e.g. 
FileZilla , and view them with your local browser.","title":"Running containers with singularity"},{"location":"course_material/day1/singularity/#learning-outcomes","text":"After having completed this chapter you will be able to: Login to a remote machine with ssh Use apptainer pull to convert an image from dockerhub to the \u2018apptainer image format\u2019 ( .sif ) Execute a apptainer container Explain the difference in default mounting behaviour between docker and apptainer Use apptainer shell to generate an interactive shell inside a .sif image Search and use images with both docker and apptainer from bioconda","title":"Learning outcomes"},{"location":"course_material/day1/singularity/#material","text":"Download the presentation Apptainer documentation Apptainer hub An article on Docker vs Apptainer Using conda and containers with snakemake","title":"Material"},{"location":"course_material/day1/singularity/#exercises","text":"","title":"Exercises"},{"location":"course_material/day1/singularity/#login-to-remote","text":"If you are enrolled in the course, you have received an e-mail with an IP, username, private key and password. To do the Apptainer exercises we will login to a remote server. Below you can find instructions on how to login. VScode is a code editor that can be used to edit files and run commands locally, but also on a remote server. In this subchapter we will set up VScode to work remotely. If not working with VScode If you are not working with VScode, you can login to the remote server with the following command: ssh -i key_username.pem If you want to edit files directly on the server, you can mount a directory with sshfs . Required installations For this exercise it is easiest if you use VScode . In addition you would need to have followed the instructions to set up remote-ssh: OpenSSH compatible client . This is usually pre-installed on your OS. You can check whether the command ssh exists. The Remote-SSH extension. To install, open VSCode and click on the extensions icon (four squares) on the left side of the window. Search for Remote-SSH and click on Install . Windows mac OS/Linux Open a PowerShell and cd to the directory where you have stored your private key. After that, move it to ~\\.ssh : mv .\\ key_username . pem ~\\. ssh Open a terminal, and cd to the directory where you have stored your private key. After that, change the file permissions of the key and move it to ~/.ssh : chmod 400 key_username.pem mv key_username.pem ~/.ssh Open VScode and click on the green or blue button in the bottom left corner. Select Connect to Host... , and then on Configure SSH Host... . Specify a the location for the config file. Use the same directory as where your keys are stored (so ~/.ssh ). A skeleton config file will be provided. Edit it, so it looks like this (replace username with your username, and specify the correct IP at HostName ): Windows MacOS/Linux Host sib_course_remote User username HostName 123.456.789.123 IdentityFile ~\\.ssh\\key_username.pem Host sib_course_remote User username HostName 123.456.789.123 IdentityFile ~/.ssh/key_username.pem Save and close the config file. Now click again the green or blue button in the bottom left corner. Select Connect to Host... , and then on sib_course_remote . You will be asked which operating system is used on the remote. Specify \u2018Linux\u2019.","title":"Login to remote"},{"location":"course_material/day1/singularity/#pulling-an-image","text":"Apptainer can take several image formats (e.g. 
a docker image), and convert them into it\u2019s own .sif format. Unlike docker this image doesn\u2019t live in a local image cache, but it\u2019s stored as an actual file. Exercise: On the remote server, pull the docker image that has the adjusted default CMD that we have pushed to dockerhub in this exercise ( ubuntu-figlet-df:v3 ) with apptainer pull . The syntax is: apptainer pull docker:// [ USER NAME ] / [ IMAGE NAME ] : [ TAG ] Answer apptainer pull docker:// [ USER NAME ] /ubuntu-figlet:v3 This will result in a file called ubuntu-figlet_v3.sif Note If you weren\u2019t able to push the image in the previous exercises to your docker hub, you can use geertvangeest as username to pull the image.","title":"Pulling an image"},{"location":"course_material/day1/singularity/#executing-an-image","text":"These .sif files can be run as standalone executables: ./ubuntu-figlet_v3.sif Note This is shorthand for: apptainer run ubuntu-figlet_v3.sif And you can overwrite the default command like this: apptainer run [ IMAGE NAME ] .sif [ COMMAND ] Note In this case, you can also use ./ [ IMAGE NAME ] .sif [ COMMAND ] However, most applications require apptainer run . Especially if you want to provide options like --bind (for mounting directories). Exercise: Run the .sif file without a command, and with a command that runs figlet . Do you get expected output? Do the same for the R image you\u2019ve created in the previous chapter. Entrypoint and apptainer The daterange image has an entrypoint set, and apptainer run does not overwrite it. In order to ignore both the entrypoint and cmd use apptainer exec . Answer Running it without a command ( ./ubuntu-figlet_v3.sif ) should give: __ __ _ _ _ | \\/ |_ _ (_)_ __ ___ __ _ __ _ ___ __ _____ _ __| | _____| | | |\\/| | | | | | | '_ ` _ \\ / _` |/ _` |/ _ \\ \\ \\ /\\ / / _ \\| '__| |/ / __| | | | | | |_| | | | | | | | | (_| | (_| | __/ \\ V V / (_) | | | <\\__ \\_| |_| |_|\\__, | |_|_| |_| |_|\\__,_|\\__, |\\___| \\_/\\_/ \\___/|_| |_|\\_\\___(_) |___/ |___/ Which is the default command that we changed in the Dockerfile . Running with a another figlet command: ./ubuntu-figlet_v3.sif figlet 'Something else' Should give: ____ _ _ _ _ / ___| ___ _ __ ___ ___| |_| |__ (_)_ __ __ _ ___| |___ ___ \\___ \\ / _ \\| '_ ` _ \\ / _ \\ __| '_ \\| | '_ \\ / _` | / _ \\ / __|/ _ \\ ___) | (_) | | | | | | __/ |_| | | | | | | | (_| | | __/ \\__ \\ __/ |____/ \\___/|_| |_| |_|\\___|\\__|_| |_|_|_| |_|\\__, | \\___|_|___/\\___| |___/ Pulling the deseq2 image: apptainer pull docker:// [ USER NAME ] /deseq2:v1 Running it without command: ./deseq2.sif Running with a command: ./deseq2.sif --rows 100 To overwrite both entrypoint and the command: apptainer exec deseq2.sif test_deseq2.R --rows 200","title":"Executing an image"},{"location":"course_material/day1/singularity/#mounting-with-apptainer","text":"Apptainer is also different from Docker in the way it handles mounting. By default, Apptainer binds your home directory and a number of paths in the root directory to the container. This results in behaviour that is almost like if you are working on the directory structure of the host. If your directory is not mounted by default It depends on the apptainer settings whether most directories are mounted by default to the container. 
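A quick way to check is to try listing the directory from inside the container; a minimal sketch, assuming the ubuntu-figlet_v3.sif image and a directory path of your own:
# if this prints the directory contents, it is already bound;
# an error like 'No such file or directory' means it is not
apptainer exec ubuntu-figlet_v3.sif ls /my/dir/to/mount/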
If your directory is not mounted, you can do that with the --bind option of apptainer exec : apptainer exec --bind /my/dir/to/mount/ [ IMAGE NAME ] .sif [ COMMAND ] Running the command pwd (full name of current working directory) will therefore result in a path on the host machine: ./ubuntu-figlet_v3.sif pwd Exercise: Run the above command. What is the output? How would the output look like if you would run a similar command with Docker? Hint A similar Docker command would look like (run this on your local computer): docker run --rm ubuntu-figlet:v3 pwd Answer The output of ./ubuntu-figlet_v3.sif pwd is the current directory on the host: i.e. /home/username if you have it in your home directory. The output of docker run --rm ubuntu-figlet:v3 pwd (on the local host) would be / , which is the default workdir (root directory) of the container. As we did not mount any host directory, this directory exists only within the container (i.e. separated from the host).","title":"Mounting with Apptainer"},{"location":"course_material/day1/singularity/#interactive-shell","text":"If you want to debug or inspect an image, it can be helpful to have a shell inside the container. You can do that with apptainer shell : apptainer shell ubuntu-figlet_v3.sif Note To exit the shell type exit . Exercise: Can you run figlet inside this shell? Answer Yes: Apptainer> figlet test _ _ | |_ ___ ___| |_ | __/ _ \\/ __| __| | || __/\\__ \\ |_ \\__\\___||___/\\__| During the lecture you have learned that apptainer takes over the user privileges of the user on the host. You can get user information with command like whoami , id , groups etc. Exercise: Run the figlet container interactively. Do you have the same user privileges as if you were on the host? How is that with docker ? Answer A command like whoami will result in your username printed at stdout: Apptainer> whoami myusername Apptainer> id uid=1030(myusername) gid=1031(myusername) groups=1031(myusername),1001(condausers) Apptainer> groups myusername condausers With apptainer, you have the same privileges inside the apptainer container as on the host. If you do this in the docker container (based on the same image), you\u2019ll get output like this: root@a3d6e59dc19d:/# whoami root root@a3d6e59dc19d:/# groups root root@a3d6e59dc19d:/# id uid=0(root) gid=0(root) groups=0(root)","title":"Interactive shell"},{"location":"course_material/day1/singularity/#a-bioinformatics-example-extra","text":"All bioconda packages also have a pre-built container. Have a look at the bioconda website , and search for fastqc . In the search results, click on the appropriate record (i.e. package \u2018fastqc\u2019). Now, scroll down and find the namespace and tag for the latest fastqc image. Now we can pull it with apptainer like this: apptainer pull docker://quay.io/biocontainers/fastqc:0.11.9--hdfd78af_1 Let\u2019s test the image. Download some sample reads first: mkdir reads cd reads wget https://introduction-containers.s3.eu-central-1.amazonaws.com/ecoli_reads.tar.gz tar -xzvf ecoli_reads.tar.gz rm ecoli_reads.tar.gz Now you can simply run the image as an executable preceding the commands you would like to run within the container. E.g. running fastqc would look like: cd ./fastqc_0.11.9--hdfd78af_1.sif fastqc ./reads/ecoli_*.fastq.gz This will result in html files in the directory ./reads . These are quality reports for the sequence reads. If you\u2019d like to view them, you can download them with scp or e.g. 
FileZilla , and view them with your local browser.","title":"A bioinformatics example (extra)"},{"location":"course_material/day2/1_guidelines/","text":"Workshop goal Over the course of the workshop, you will implement and improve a workflow to trim bulk RNAseq reads, align them on a genome, perform some quality checks (QC), count mapped reads, and identify Differentially Expressed Genes (DEG). The goal of the workshop is that after the last series of exercises, you will have implemented a simple workflow with commonly used Snakemake features. You will be able to use this workflow as a reference to implement your own workflows in the future. Software All the software needed in this workflow is either: Already installed in the snake_course conda environment Already installed in a Docker container Will be installed via a conda environment during today\u2019s exercises Exercises Each series of exercises is divided in multiple questions. We first provide a general explanation on the context behind each question; we then explicitly describe the task and provide details when they are required. We also provide hints that should help you with the most challenging parts of some questions. You should first try to solve the problems without using these hints! Do not hesitate to modify and overwrite your code from previous questions when specified in an exercise, as the solutions for each series of exercises are provided. If something is not clear at any point, please call us and we will do our best to answer your questions. You can also check the official Snakemake documentation for more information.","title":"General guidelines"},{"location":"course_material/day2/1_guidelines/#workshop-goal","text":"Over the course of the workshop, you will implement and improve a workflow to trim bulk RNAseq reads, align them on a genome, perform some quality checks (QC), count mapped reads, and identify Differentially Expressed Genes (DEG). The goal of the workshop is that after the last series of exercises, you will have implemented a simple workflow with commonly used Snakemake features. You will be able to use this workflow as a reference to implement your own workflows in the future.","title":"Workshop goal"},{"location":"course_material/day2/1_guidelines/#software","text":"All the software needed in this workflow is either: Already installed in the snake_course conda environment Already installed in a Docker container Will be installed via a conda environment during today\u2019s exercises","title":"Software"},{"location":"course_material/day2/1_guidelines/#exercises","text":"Each series of exercises is divided in multiple questions. We first provide a general explanation on the context behind each question; we then explicitly describe the task and provide details when they are required. We also provide hints that should help you with the most challenging parts of some questions. You should first try to solve the problems without using these hints! Do not hesitate to modify and overwrite your code from previous questions when specified in an exercise, as the solutions for each series of exercises are provided. If something is not clear at any point, please call us and we will do our best to answer your questions. 
You can also check the official Snakemake documentation for more information.","title":"Exercises"},{"location":"course_material/day2/2_introduction_snakemake/","text":"Learning outcomes After having completed this chapter you will be able to: Understand the structure of a Snakemake workflow Write rules and Snakefiles to produce the desired outputs Chain rules together Run a Snakemake workflow Structuring a workflow It is advised to implement your code in a directory called workflow (more information about this in the next series of exercises). You are free to choose the names and location of files for the different steps of your workflow, but, for now, we recommend that you at least group all outputs from the workflow in a results folder within the workflow directory. A small reminder about conda environment If you try to run a command and get an error such as Command 'snakemake' not found , you are probably not in the right environment. To list them, use conda env list . Then activate the right environment with conda activate . You can deactivate an environment with conda deactivate . To list the packages installed in an environment, activate it and use conda list . Exercises This series of exercises will bear no biological meaning, on purpose: it is designed to show you the fundamentals of Snakemake. Creating a basic rule Rules are the basic blocks of a Snakemake workflow. A rule is like a recipe indicating how to produce a specific output . The actual application of a rule to create an output is called a job . A rule is defined in a Snakefile with the keyword rule and contains directives which indicate the rule\u2019s properties. To create the simplest rule possible, we need at least two directives : output : path of the output file for this rule shell : shell commands to execute in order to generate the output We will see other directives later in the course. Exercise: The following example shows the minimal syntax to implement a rule. What do you think it does? Does it create a file? If so, how is it called? rule first_step : output : 'results/first_step.txt' shell : 'echo \u201csnakemake\u201d > results/first_step.txt' Answer This rule uses the echo shell command to print the line \u201csnakemake\u201d in an output file called first_step.txt , located in the results folder. Rules are defined and written in a file called Snakefile (note the capital S and the absence of extension in the filename). This file should be located at the root of the workflow directory (here, workflow/Snakefile ). Exercise: Create a Snakefile and copy the previous rule in it. Because the Snakemake language is built on top of Python, spaces and indents are essential, so do not forget to keep the indentation as is and use space characters in the indents instead of tabs. Executing a workflow with a precise output It is now time to execute your first worklow! To do this, you need to tell Snakemake what is your target, i.e. what is the output that you want to generate. Exercise: Execute the workflow with snakemake --cores 1 . What value should you use for ? Once Snakemake execution is finished, can you locate the output file? Answer Execute the workflow: snakemake --cores 1 results/first_step.txt Visualise the content of the results folder: ls -alh results/ Check the output content: cat results/first_step.txt Note that during the execution of the workflow, Snakemake automatically created the missing folder ( results/ ) in the output path. 
If several folders are missing (for example, test1/test2/test3/first_step.txt ), Snakemake will create all of them . Exercise: Re-run the exact same command. What happens? Answer Nothing! We get a message saying that Snakemake did not run anything: Building DAG of jobs... Nothing to be done (all requested files are present and up to date). By default, Snakemake only runs a job if: * A target file explicitly requested in the snakemake command is missing * An intermediate file is missing and is required produce a target file * It notices input files newer than output files, based on file modification dates. In this case, Snakemake will generate again the existing outputs. We can change this behaviour and force the re-run of a specific target by using the -f option: snakemake --cores 1 -f results/first_step.txt or force recreate ALL the outputs of the workflow using the -F option: snakemake --cores 1 -F . In practice, we can also alter Snakemake (re-)run policy, but we will not cover this topic in the course (see \u2013rerun-triggers option in Snakemake\u2019s CLI help and this git issue for more information). In the previous example, the values of the two rule directives are strings . For the shell directive (we will see other types of directive values later in the course), long string can be written on multiple lines for clarity, simply using a set of quotes for each line: rule first_step : output : 'results/first_step.txt' shell : 'echo \"I want to print a very very very very very very ' 'very very very very long string in my output\" > results/first_step.txt' Here, Snakemake will simply concatenate the two lines (paste each line one after the other) and execute the resulting command. Adding an input directive The next directive used by most rules is input . It indicates the path to a file that is required by the rule to generate the output. In the following example, we modified the previous rule to use the file previously created results/first_step.tsv as an input, and copy this file to results/second_step.txt : rule second_step : input : 'results/first_step.txt' output : 'results/second_step.txt' shell : 'cp results/first_step.txt results/second_step.txt' Note that with this rule definition, Snakemake will not run if results/first_step.tsv does not exist! Exercise: Modify your first rule to add an input directive and execute the workflow. Check that the output was created and that the files are identical. If you get a Missing input files for rule error, that means that the input file is missing and cannot be created. How can you solve this problem? Answer Execute the workflow: snakemake --cores 1 results/second_step.txt Visualise the content of the results folder: ls -alh results/ Check that the files are identical: diff results/first_step.txt results/second_step.txt If the input file is missing, you can create it with echo \u201csnakemake\u201d > results/first_step.txt and then execute the workflow. We will see later why this happened and how to avoid it! Creating a workflow with several rules Creating one Snakefile per rule does not seem like a good solution, so let\u2019s try to improve this. Exercise: Delete the results/ folder, copy the two previous rules ( first_step and second_step ) in the same Snakefile (place the first_step rule first) and try to run the workflow without specifying an output . What happens? 
Answer Delete the results folder: using the graphic interface or rm -rf results/ Execute the workflow without output: snakemake --cores 1 Only the first output, results/first_step.txt , is created. During its execution, Snakemake tries to generate a specific output called the target and to resolve all dependencies based on this target. A target can be any output that can be generated by any rule in the workflow. When you do not specify a target, the default one is the output of the first rule in the Snakefile, here results/first_step.txt of rule first_step . Exercise: With this in mind, instead of one target, use a space-separated list of targets in your command, to generate multiple targets. Use the -F flag to force the re-run of the whole workflow or delete your results/ folder beforehand. Answer Delete the results folder: using the graphic interface or rm -rf results/ Execute the workflow with multiple targets: snakemake --cores 1 results/first_step.txt results/second_step.txt We should now see Snakemake execute the 2 rules and produce both targets/outputs. Chaining rules Once again, writing all the outputs in the snakemake command does not look like a good solution: it is very time-consuming, error-prone (and annoying)! Imagine what happens when your workflow generates tens of outputs! Fortunately, there is a way to simplify this, which relies on rules dependency. The core principle of Snakemake\u2019s execution is to compute a Directed Acyclic Graph (DAG) that summarizes dependencies between all the inputs and outputs required to generate the final desired outputs. For each job, starting from the jobs generating the final outputs, Snakemake checks if the required inputs exist. If they do not, the software looks for a rule that generates these inputs. This process is repeated until all dependencies are resolved. This is why Snakemake is said to have a \u2018bottom-up\u2019 approach: it starts from the last outputs and goes back to the first inputs. Hint Your Snakefile should look like this: rule first_step : output : 'results/first_step.txt' shell : 'echo \u201csnakemake\u201d > results/first_step.txt' rule second_step : input : 'results/first_step.txt' output : 'results/second_step.txt' shell : 'cp results/first_step.txt results/second_step.txt' Exercise: Delete the results/ folder, identify your final output(s) and execute the workflow specifying only this(these) output(s) in the command. Answer Delete the results folder: using the graphic interface or rm -rf results/ Execute the workflow: snakemake --cores 1 results/second_step.txt Visualise the content of the results folder: ls -alh results/ You should now see Snakemake executing the two rules and producing both outputs. To generate the output results/second_step.txt , Snakemake requires the input results/first_step.txt . Before the workflow is executed, this file does not exist; therefore, Snakemake looks for a rule that generates results/first_step.txt , in this case the rule first_step . The process is then repeated for first_step . In this case, the rule does not require any input, so all dependencies are resolved, and Snakemake can generate the DAG. Important notes on rules dependency Rules must produce unique outputs Because of the rules dependency process, by default, an output can only be generated by a single rule. Otherwise, Snakemake cannot decide which rule to use to generate this output, and the rules are considered ambiguous . 
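For example, a Snakefile containing the two rules sketched below (the rule names and the data/template.txt input are hypothetical) would be ambiguous, because both rules claim results/first_step.txt as their output; requesting that file would typically make Snakemake stop with an AmbiguousRuleException:

    rule write_first_step:
        # First candidate recipe for results/first_step.txt
        output: 'results/first_step.txt'
        shell: 'echo snakemake > results/first_step.txt'

    rule copy_first_step:
        # Second candidate recipe for the very same output file
        input: 'data/template.txt'
        output: 'results/first_step.txt'
        shell: 'cp data/template.txt results/first_step.txt'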
In practice, there are ways to deal with ambiguous rules, but they should be avoided as much as possible and we will not cover them in this course (see the relevant section in the official documentation for more information). Rules dependency can be written more easily It is possible to refer to the output of a rule directly in another rule with the syntax rules..output . Note that you don\u2019t need quotes around this statement, because it is a Snakemake object. The following example implements this syntax for the two rule defined above: rule first_step : output : 'results/first_step.txt' shell : 'echo \u201csnakemake\u201d > results/first_step.txt' rule second_step : input : rules . first_step . output output : 'results/second_step.txt' shell : 'cp results/first_step.txt results/second_step.txt' This method has several advantages, among which: It limits the risk of error because we do not have to write the same filename at several locations A change in output name will be automatically propagated to rules that depend on it, i.e. the name only has to be changed once This makes the code much clearer and easier to understand: with this syntax, we instantly know the object type ( rule ), how/where it is created ( first_step ), and what it is ( output )","title":"Introduction to Snakemake"},{"location":"course_material/day2/2_introduction_snakemake/#learning-outcomes","text":"After having completed this chapter you will be able to: Understand the structure of a Snakemake workflow Write rules and Snakefiles to produce the desired outputs Chain rules together Run a Snakemake workflow","title":"Learning outcomes"},{"location":"course_material/day2/2_introduction_snakemake/#structuring-a-workflow","text":"It is advised to implement your code in a directory called workflow (more information about this in the next series of exercises). You are free to choose the names and location of files for the different steps of your workflow, but, for now, we recommend that you at least group all outputs from the workflow in a results folder within the workflow directory. A small reminder about conda environment If you try to run a command and get an error such as Command 'snakemake' not found , you are probably not in the right environment. To list them, use conda env list . Then activate the right environment with conda activate . You can deactivate an environment with conda deactivate . To list the packages installed in an environment, activate it and use conda list .","title":"Structuring a workflow"},{"location":"course_material/day2/2_introduction_snakemake/#exercises","text":"This series of exercises will bear no biological meaning, on purpose: it is designed to show you the fundamentals of Snakemake.","title":"Exercises"},{"location":"course_material/day2/2_introduction_snakemake/#creating-a-basic-rule","text":"Rules are the basic blocks of a Snakemake workflow. A rule is like a recipe indicating how to produce a specific output . The actual application of a rule to create an output is called a job . A rule is defined in a Snakefile with the keyword rule and contains directives which indicate the rule\u2019s properties. To create the simplest rule possible, we need at least two directives : output : path of the output file for this rule shell : shell commands to execute in order to generate the output We will see other directives later in the course. Exercise: The following example shows the minimal syntax to implement a rule. What do you think it does? Does it create a file? If so, how is it called? 
rule first_step : output : 'results/first_step.txt' shell : 'echo \u201csnakemake\u201d > results/first_step.txt' Answer This rule uses the echo shell command to print the line \u201csnakemake\u201d in an output file called first_step.txt , located in the results folder. Rules are defined and written in a file called Snakefile (note the capital S and the absence of extension in the filename). This file should be located at the root of the workflow directory (here, workflow/Snakefile ). Exercise: Create a Snakefile and copy the previous rule in it. Because the Snakemake language is built on top of Python, spaces and indents are essential, so do not forget to keep the indentation as is and use space characters in the indents instead of tabs.","title":"Creating a basic rule"},{"location":"course_material/day2/2_introduction_snakemake/#executing-a-workflow-with-a-precise-output","text":"It is now time to execute your first worklow! To do this, you need to tell Snakemake what is your target, i.e. what is the output that you want to generate. Exercise: Execute the workflow with snakemake --cores 1 . What value should you use for ? Once Snakemake execution is finished, can you locate the output file? Answer Execute the workflow: snakemake --cores 1 results/first_step.txt Visualise the content of the results folder: ls -alh results/ Check the output content: cat results/first_step.txt Note that during the execution of the workflow, Snakemake automatically created the missing folder ( results/ ) in the output path. If several folders are missing (for example, test1/test2/test3/first_step.txt ), Snakemake will create all of them . Exercise: Re-run the exact same command. What happens? Answer Nothing! We get a message saying that Snakemake did not run anything: Building DAG of jobs... Nothing to be done (all requested files are present and up to date). By default, Snakemake only runs a job if: * A target file explicitly requested in the snakemake command is missing * An intermediate file is missing and is required produce a target file * It notices input files newer than output files, based on file modification dates. In this case, Snakemake will generate again the existing outputs. We can change this behaviour and force the re-run of a specific target by using the -f option: snakemake --cores 1 -f results/first_step.txt or force recreate ALL the outputs of the workflow using the -F option: snakemake --cores 1 -F . In practice, we can also alter Snakemake (re-)run policy, but we will not cover this topic in the course (see \u2013rerun-triggers option in Snakemake\u2019s CLI help and this git issue for more information). In the previous example, the values of the two rule directives are strings . For the shell directive (we will see other types of directive values later in the course), long string can be written on multiple lines for clarity, simply using a set of quotes for each line: rule first_step : output : 'results/first_step.txt' shell : 'echo \"I want to print a very very very very very very ' 'very very very very long string in my output\" > results/first_step.txt' Here, Snakemake will simply concatenate the two lines (paste each line one after the other) and execute the resulting command.","title":"Executing a workflow with a precise output"},{"location":"course_material/day2/2_introduction_snakemake/#adding-an-input-directive","text":"The next directive used by most rules is input . It indicates the path to a file that is required by the rule to generate the output. 
In the following example, we modified the previous rule to use the file previously created results/first_step.tsv as an input, and copy this file to results/second_step.txt : rule second_step : input : 'results/first_step.txt' output : 'results/second_step.txt' shell : 'cp results/first_step.txt results/second_step.txt' Note that with this rule definition, Snakemake will not run if results/first_step.tsv does not exist! Exercise: Modify your first rule to add an input directive and execute the workflow. Check that the output was created and that the files are identical. If you get a Missing input files for rule error, that means that the input file is missing and cannot be created. How can you solve this problem? Answer Execute the workflow: snakemake --cores 1 results/second_step.txt Visualise the content of the results folder: ls -alh results/ Check that the files are identical: diff results/first_step.txt results/second_step.txt If the input file is missing, you can create it with echo \u201csnakemake\u201d > results/first_step.txt and then execute the workflow. We will see later why this happened and how to avoid it!","title":"Adding an input directive"},{"location":"course_material/day2/2_introduction_snakemake/#creating-a-workflow-with-several-rules","text":"Creating one Snakefile per rule does not seem like a good solution, so let\u2019s try to improve this. Exercise: Delete the results/ folder, copy the two previous rules ( first_step and second_step ) in the same Snakefile (place the first_step rule first) and try to run the workflow without specifying an output . What happens? Answer Delete the results folder: using the graphic interface or rm -rf results/ Execute the workflow without output: snakemake --cores 1 Only the first output, results/first_step.txt , is created. During its execution, Snakemake tries to generate a specific output called target and resolve all dependencies based on this target. A target can be any output that can be generated by any rule in the workflow. When you do not specify a target, the default one is the output of the first rule in the Snakefile, here results/first_step.txt of rule first_step . Exercise: With this in mind, instead of one target, use a space-separated list of targets in your command, to generate multiple targets. Use the -F to force the re-run of the whole workflow or delete your results/ folder beforehand. Answer Delete the results folder: using the graphic interface or rm -rf results/ Execute the workflow with multiple targets: snakemake --cores 1 results/first_step.txt results/second_step.txt We should now see Snakemake execute the 2 rules and produce both targets/outputs.","title":"Creating a workflow with several rules"},{"location":"course_material/day2/2_introduction_snakemake/#chaining-rules","text":"Once again, writing all the outputs in the snakemake command does not look like a good solution: it is very time-consuming, error-prone (and annoying)! Imagine what happens when your workflow generate tens of outputs?! Fortunately, there is a way to simplify this, which relies on rules dependency. The core principle of Snakemake\u2019s execution is to compute a Directed Acyclic Graph (DAG) that summarizes dependencies between all the inputs and outputs required to generate the final desired outputs. For each job, starting from the jobs generating the final outputs, Snakemake checks if the required inputs exist. If they do not, the software looks for a rule that generates these inputs. 
This process is repeated until all dependencies are resolved. This is why Snakemake is said to have a \u2018bottom-up\u2019 approach: it starts from the last outputs and go back to the first inputs. Hint Your Snakefile should look like this: rule first_step : output : 'results/first_step.txt' shell : 'echo \u201csnakemake\u201d > results/first_step.txt' rule second_step : input : 'results/first_step.txt' output : 'results/second_step.txt' shell : 'cp results/first_step.txt results/second_step.txt' Exercise: Delete the results/ folder, identify your final output(s) and execute the workflow specifying only this(these) output(s) in the command. Answer Delete the results folder: using the graphic interface or rm -rf results/ Execute the workflow: snakemake --cores 1 results/second_step.txt Visualise the content of the results folder: ls -alh results/ You should now see Snakemake executing the two rules and producing both outputs. To generate the output results/second_step.txt , Snakemake requires the input results/first_step.txt . Before the workflow is executed, this file does not exist, therefore, Snakemake looks for a rule that generates results/first_step.txt , in this case the rule first_step . The process is then repeated for first_step . In this case, the rule does not require any input, so all dependencies are resolved, and Snakemake can generate the DAG.","title":"Chaining rules"},{"location":"course_material/day2/2_introduction_snakemake/#important-notes-on-rules-dependency","text":"","title":"Important notes on rules dependency"},{"location":"course_material/day2/2_introduction_snakemake/#rules-must-produce-unique-outputs","text":"Because of the rules dependency process, by default, an output can only be generated by a single rule. Otherwise, Snakemake cannot decide which rule to use to generate this output, and the rules are considered ambiguous . In practice, there are ways to deal with ambiguous rules, but they should be avoided as much as possible and we will not cover them in this course (see the relevant section in the official documentation for more information).","title":"Rules must produce unique outputs"},{"location":"course_material/day2/2_introduction_snakemake/#rules-dependency-can-be-written-more-easily","text":"It is possible to refer to the output of a rule directly in another rule with the syntax rules..output . Note that you don\u2019t need quotes around this statement, because it is a Snakemake object. The following example implements this syntax for the two rule defined above: rule first_step : output : 'results/first_step.txt' shell : 'echo \u201csnakemake\u201d > results/first_step.txt' rule second_step : input : rules . first_step . output output : 'results/second_step.txt' shell : 'cp results/first_step.txt results/second_step.txt' This method has several advantages, among which: It limits the risk of error because we do not have to write the same filename at several locations A change in output name will be automatically propagated to rules that depend on it, i.e. 
the name only has to be changed once This makes the code much clearer and easier to understand: with this syntax, we instantly know the object type ( rule ), how/where it is created ( first_step ), and what it is ( output )","title":"Rules dependency can be written more easily"},{"location":"course_material/day2/3_generalising_snakemake/","text":"Learning outcomes After having completed this chapter you will be able to: Create rules with multiple inputs and outputs Make the code shorter and more general by using placeholders and wildcards Optimise the memory usage of a workflow and checking its performances Visualise a workflow DAG Data origin The data we will use during the exercises was produced in this work . Briefly, the team studied the transcriptional response of a strain of baker\u2019s yeast, Saccharomyces cerevisiae , facing environments with different amount of CO 2 . To this end, they performed 150 bp paired-end sequencing of mRNA-enriched samples. Detailed information on all the samples are available here , but just know that for the purpose of the course, we selected 6 samples ( 3 replicates per condition , low and high CO 2 ) and down-sampled them to 1 million read-pairs each to reduce computation times. Exercises One of the aims of today\u2019s course is to develop a basic, yet efficient, workflow to analyse RNAseq data. This workflow takes reads coming from RNA sequencing as inputs and outputs a list of genes that are differentially expressed between two conditions. The files containing the reads are in FASTQ format and the output will be a tab-separated file containing a list of genes with expression changes, results of statistical tests\u2026 In this series of exercises, we will create the \u2018backbone\u2019 of the workflow, i.e. the rules that are the most computationally expensive, namely: A rule to trim poor-quality reads A rule to map the trimmed reads on a reference genome A rule to convert and sort files from the SAM format to the BAM format A rule to count the reads mapping on each gene At the end of this series of exercises, the DAG of your workflow should look like this: ![backbone_rulegraph](../../../assets/images/backbone_rulegraph.png) Rulegraph of the workflow at the end of the session Designing and debugging a workflow If you have problems designing your Snakemake workflow or debugging it, you can find some help here . General instructions and reminders In each rule, you should try (as much as possible) to: Choose meaningful rule names Use rules dependency, with the syntax rules..output If you use numbered outputs, the syntax becomes rules..output[n] (with n starting at 0) If you use named outputs, the syntax becomes rules..output. Use placeholders Use wildcards Choose meaningful wildcard names The output , log , and benchmark directives must have the same wildcard names! You can use the same wildcard names in multiple rules for consistency and readability, but Snakemake will treat them as independent wildcards and their values will not be shared: rules are self-contained and wildcards are local to each rule ( see a very nice summary on wildcards ) Use multiple inputs/outputs (when needed/possible) Create a log file with the log directive Create a benchmark file with the benchmark directive If you have a doubt, do not hesitate to test your workflow logic with a dry-run (the -n flag): snakemake --cores 1 -n . Snakemake will then display all the jobs required to generate the target. 
To obtain additional information on why a specific job is necessary, run Snakemake with the -r flag (which can be -and usually is- combined with -n ): snakemake --cores 1 -n -r . For each job, Snakemake will print a reason field explaining why the job was required. To visualize the exact command executed by each job (with the placeholders and wildcards replaced by their values), run snakemake with the -p flag: snakemake --cores 1 -n -r -p . Downloading the data and setting up the directory structure In this part, we will download the data and start building the directory structure of our workflow according to the official recommendations . We already starting doing so in the previous series of exercises and ultimately, it should resemble this: \u2502\u2500\u2500 .gitignore \u2502\u2500\u2500 README.md \u2502\u2500\u2500 LICENSE.md \u2502\u2500\u2500 benchmarks \u2502 \u2502\u2500\u2500 sample1.fastq \u2502 \u2514\u2500\u2500 sample2.fastq \u2502\u2500\u2500 config \u2502 \u2502\u2500\u2500 config.yaml \u2502 \u2514\u2500\u2500 some-sheet.tsv \u2502\u2500\u2500 data \u2502 \u2502\u2500\u2500 sample1.fastq \u2502 \u2514\u2500\u2500 sample2.fastq \u2502\u2500\u2500 images \u2502 \u2514\u2500\u2500 rulegraph.svg \u2502\u2500\u2500 logs \u2502 \u2502\u2500\u2500 sample1.log \u2502 \u2514\u2500\u2500 sample2.log \u2502\u2500\u2500 results \u2502 \u2502\u2500\u2500 sample1 \u2502 \u2502 \u2514\u2500\u2500 sample1.bam \u2502 \u2502\u2500\u2500 sample2 \u2502 \u2502 \u2514\u2500\u2500 sample2.bam \u2502 \u2514\u2500\u2500 DEG_list.tsv \u2502\u2500\u2500 resources \u2502 \u2502\u2500\u2500 Scerevisiae.fasta \u2502 \u2514\u2500\u2500 Scerevisiae.gtf \u2514\u2500\u2500 workflow \u2502\u2500\u2500 envs \u2502 \u2502\u2500\u2500 tool1.yaml \u2502 \u2514\u2500\u2500 tool2.yaml \u2502\u2500\u2500 rules \u2502 \u2502\u2500\u2500 module1.smk \u2502 \u2514\u2500\u2500 module2.smk \u2502\u2500\u2500 scripts \u2502 \u2502\u2500\u2500 script1.py \u2502 \u2514\u2500\u2500 script2.R \u2514\u2500\u2500 Snakefile For now, the main thing to remember is that the workflow code goes into a subfolder called workflow and the rest is mostly input/output files, except for the config subfolder, which will be explained later. All output files generated in the workflow should be stored under results/ . Now, let\u2019s download the data, uncompress it and build the first part of the directory structure. wget https://apollo.vital-it.ch/trackvis/snakemake_rnaseq.tar.gz # Download the data tar -xvf snakemake_rnaseq.tar.gz # Uncompress the archive rm snakemake_rnaseq.tar.gz # Delete the archive cd snakemake_rnaseq/ # Start developing in a new folder In this new folder, you should now see 2 subfolders: data/ , which contains the data to analyse resources/ , which contains retrieved resources, here the assembly, the genome indices and the annotation file of S. cerevisiae . It may also contain small resources delivered along with the workflow via git Let\u2019s create another subfolder, this time to host all the files containing the code, as well as the Snakefile: mkdir workflow # Create a new folder touch workflow/Snakefile # Create an empty Snakefile The Snakefile marks the entrypoint of the workflow. It will be automatically discovered when running Snakemake from the root of the structure, here snakemake_rnaseq/ . We can also tell Snakemake to use a specific Snakefile with the -s flag: snakemake --cores 1 -s , but it is highly discouraged as it hampers reproducibility. 
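In other words, the snakemake commands of this session are meant to be run from the root of the project, snakemake_rnaseq/ . As a minimal sketch (the target path is illustrative and matches the answers given further below), a typical first invocation could be:

    cd snakemake_rnaseq/    # project root, where workflow/Snakefile is discovered automatically
    snakemake --cores 1 -n -r -p results/highCO2_sample1/highCO2_sample1_atropos_trimmed_1.fastq    # dry-run to check the plan
    snakemake --cores 1 -r -p results/highCO2_sample1/highCO2_sample1_atropos_trimmed_1.fastq       # actual run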
If you followed the general instructions , Snakemake should create all the other missing folders by itself (except one that you will discover at the end of this series of exercises), so it is now time to create the rules mentioned earlier . Have a look here for a few pieces of advice on workflow design. \u2018bottom-up\u2019 or \u2018top-down\u2019 development? Even if it is often easier to start from the final outputs and work backwards to the first inputs, the next exercises are presented in the opposite direction (first inputs to last outputs) to make the session easier to understand. That being said, feel free to work and develop your code in the order you prefer! Even if we asked you to use wildards, do not try to process all the samples yet. Choose and work with one sample (which means two .fastq files because reads are paired-end) in this series of exercises. We will see an efficient way to process list of files in the next series of exercises. Creating a rule to trim reads Usually, the first step in dealing with sequencing data is to improve the reads quality by removing low quality bases, stretches of As and Ns and reads that are too short. Adapters trimming In theory, trimming also removes sequencing adapters, but we will not do it here to keep computation time low and avoid having to parse other files to extract the adapter sequences. Exercise: Implement a rule to trim the reads contained in .fastq files using atropos . Hint You can find information on how to use atropos and its parameters with atropos trim -h The files to trim are located in data/ The base of the trimming command is atropos trim -q 20,20 --minimum-length 25 --trim-n --preserve-order --max-n 10 --no-cache-adapters -a \"A{{20}}\" -A \"A{{20}}\" If you are interested in what these options mean, see below for an explanation The paths of the files to trim ( i.e. input files, in FASTQ format) are specified with the options -pe1 (first read) and -pe2 (second read) The paths of the trimmed files ( i.e. output files, also in FASTQ format) are specified with the options -o (first read) and -p (second read) atropos outputs some information as well as its trimming report in the terminal (stdout to be exact); do not forget to redirect these information to the log file with >> {log} Please give it a try before looking at the answer! Answer This is one way of writing this rule, but definitely not the only way! This is true for all the rules presented here. rule fastq_trim : ''' This rule trims paired-end reads to improve their quality. 
Specifically, it removes: - Low quality bases - A stretches longer than 20 bases - N stretches ''' input : reads1 = 'data/ {sample} _1.fastq' , reads2 = 'data/ {sample} _2.fastq' , output : trim1 = 'results/ {sample} / {sample} _atropos_trimmed_1.fastq' , trim2 = 'results/ {sample} / {sample} _atropos_trimmed_2.fastq' log : 'logs/ {sample} / {sample} _atropos_trimming.log' benchmark : 'benchmarks/ {sample} / {sample} _atropos_trimming.txt' resources : mem_mb = 500 shell : ''' echo \"Trimming reads in <{input.reads1}> and <{input.reads2}>\" > {log} atropos trim -q 20,20 --minimum-length 25 --trim-n --preserve-order --max-n 10 \\ --no-cache-adapters -a \"A{{20}}\" -A \"A{{20}}\" \\ -pe1 {input.reads1} -pe2 {input.reads2} -o {output.trim1} -p {output.trim2} &>> {log} echo \"Trimmed files saved in <{output.trim1}> and <{output.trim2}> respectively\" >> {log} echo \"Trimming report saved in <{log}>\" >> {log} ''' Note the three things that are happening here: We used the {sample} wildcards twice in the output paths. This is because we prefer to have all the files linked to a sample in the same directory We added a memory limit for this job: 500 MB. Because we have limited resources in this server compared to a High Performance Computing cluster (HPC), this will help Snakemake to better allocate resources and parallelise jobs. You can determine the maximum amount of memory used by a rule thanks to the max_rss column in a benchmark result (results are shown in MB). More information here We used a backslash \\ to split a very long line in smaller lines. This is purely \u2018cosmetic\u2019, to avoid very long lines that are painful to read, copy\u2026 Paths in Snakemake All the paths in the Snakefile are relative to the working directory in which the snakemake command is executed. If you execute Snakemake in snakemake_rnaseq/ , the relative path to the input files in the rule is data/.fastq If you execute Snakemake in snakemake_rnaseq/workflow/ , the relative path to the input files in the rule is ../data/.fastq Exercise: If you had to run the workflow by specifying only one output, what command would you use? Answer snakemake --cores 1 -r -p results/highCO2_sample1/highCO2_sample1_atropos_trimmed_1.fastq If you run it now, don\u2019t forget to have a look at the log and benchmark files! atropos options -q 20,20 : trim low-quality bases from 5\u2019, 3\u2019 ends of each read before adapter removal --minimum-length 25 : discard trimmed reads that are shorter than 25 bp --trim-n : trim N\u2019s on ends of reads --preserve-order : preserve order of reads in input files --max-n 10 : discard reads with more than 10 N --no-cache-adapters : do not cache adapters list as \u2018.adapters\u2019 in the working directory -a \"A{{20}}\" -A \"A{{20}}\" : remove series of 20 As in the adapter sequence ( -a for the first read of the pair, -A for the second one) The usual command-line syntax is -a \"A{20}\" . Here, brackets were doubled to prevent Snakemake from interpreting {20} as a wildcard Creating a rule to map trimmed reads on a reference genome Once we have trimmed reads, the next step is to map those reads onto a reference assembly, here S. cerevisiae strain S288C, to eventually obtain read counts. The assembly used in this exercise is RefSeq GCF_000146045.2 and was retrieved via the NCBI genome website. Exercise: Implement a rule to map the trimmed reads on the S. cerevisiae assembly using HISAT2 . HISAT2 genome index To align reads on a genome, HISAT2 relies on a graph-based index. 
We built the genome index for you, using the command hisat2-build -p 24 -f Scerevisiae.fasta resources/genome_indices/Scerevisiae_index . -p is the number of threads to use, -f is the genomic sequence in FASTA format and Scerevisiae_genome_index is the global name shared by all the index files. Hint You can find information on how to use HISAT2 and its parameters with hisat2 -h The base of the mapping command is hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal If you are interested in what these options mean, see below for an explanation The path of the genome indices ( i.e. input files, in binary format) is specified with the option -x . The files have a shared title of resources/genome_indices/Scerevisiae_genome_index , which is the value you need to use for -x The paths of the trimmed files ( i.e. input files) are specified with the options -1 (first read) and -2 (second read) The path of the mapped reads file ( i.e. output file, in SAM format) is specified with the option -S (do not forget the .sam extension to the filename) The path of the mapping report ( i.e. output file, in text format) is specified with the option --summary-file HISAT2 also outputs information in the terminal (stderr to be exact); do not forget to redirect these to the log file with 2>> {log} This step is the longest of the workflow. With the current settings, it should take ~6 min to complete. If you decide to run it now, you should launch it and start working on the next rules Please give it a try before looking at the answer! Answer rule read_mapping : ''' This rule maps trimmed reads of a fastq on a reference assembly. ''' input : trim1 = rules . fastq_trim . output . trim1 , trim2 = rules . fastq_trim . output . trim2 output : sam = 'results/ {sample} / {sample} _mapped_reads.sam' , report = 'results/ {sample} / {sample} _mapping_report.txt' log : 'logs/ {sample} / {sample} _mapping.log' benchmark : 'benchmarks/ {sample} / {sample} _mapping.txt' resources : mem_gb = 2 shell : ''' echo \"Mapping the reads\" > {log} hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal \\ -x resources/genome_indices/Scerevisiae_genome_index \\ -1 {input.trim1} -2 {input.trim2} -S {output.sam} --summary-file {output.report} 2>> {log} echo \"Mapped reads saved in <{output.sam}>\" >> {log} echo \"Mapping report saved in <{output.report}>\" >> {log} ''' Exercise: If you had to run the workflow by specifying only one output, what command would you use? Answer snakemake --cores 1 -r -p results/highCO2_sample1/highCO2_sample1_mapped_reads.sam If you run it now, don\u2019t forget to have a look at the log and benchmark files! HISAT2 options --dta : report alignments tailored for transcript assemblers --fr : set alignment of -1, -2 mates to forward/reverse (position of reads in a pair relatively to each other) --no-mixed : remove unpaired alignments for paired reads --no-discordant : remove discordant alignments for paired reads --time : print wall-clock time taken by search phases --new-summary : print alignment summary in a new style --no-unal : suppress SAM records for reads that failed to align Creating a rule to convert and sort SAM files to BAM HISAT2 only outputs mapped reads in the SAM format . However, most downstream analysis tools use the BAM format , which is the compressed binary version of the SAM format and, as such, is much smaller, easier to manipulate and transfer and allows a faster data retrieval. 
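The conversion itself boils down to a single Samtools call; as a sketch (file paths purely illustrative, the real ones will come from your rule's placeholders):

    # -b asks samtools view to write BAM instead of SAM
    samtools view results/highCO2_sample1/highCO2_sample1_mapped_reads.sam -b -o results/highCO2_sample1/highCO2_sample1_mapped_reads.bam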
Additionally, many analyses require that BAM files are sorted by genomic coordinates and indexed, because sorted BAM files can be processed much more easily and quickly than unsorted ones. Alignment data files More information on alignment data files and other formats on the official github repository of the formats. Exercise: Implement a single rule to: Convert SAM files to BAM using Samtools Sort the BAM files using Samtools Index the sorted BAM files using Samtools Hint You can find information on how to use Samtools and its parameters with samtools --help You need to write 3 commands that will be executed sequentially: the output of command 1 will be the input of command 2 etc\u2026 No panic! These commands are pretty simple and do not use many options! To convert SAM format to the BAM format, use the command samtools view -b -o To sort a BAM file, use the command samtools sort -O bam -o To index a BAM file, use the command samtools index -b -o The index must have the exact same name than its associated BAM file, except it finishes with the extension .bam.bai instead of .bam If you are interested in what these options mean, see below for an explanation To catch potential information and errors, do not forget to redirect stderr to the log file with 2>> {log} Please give it a try before looking at the answer! Answer rule sam_to_bam : ''' This rule converts a sam file to bam format, sorts it and indexes it. ''' input : sam = rules . read_mapping . output . sam output : bam = 'results/ {sample} / {sample} _mapped_reads.bam' , bam_sorted = 'results/ {sample} / {sample} _mapped_reads_sorted.bam' , index = 'results/ {sample} / {sample} _mapped_reads_sorted.bam.bai' log : 'logs/ {sample} / {sample} _mapping_sam_to_bam.log' benchmark : 'benchmarks/ {sample} / {sample} _mapping_sam_to_bam.txt' resources : mem_mb = 250 shell : ''' echo \"Converting <{input.sam}> to BAM format\" > {log} samtools view {input.sam} -b -o {output.bam} 2>> {log} echo \"Sorting BAM file\" >> {log} samtools sort {output.bam} -O bam -o {output.bam_sorted} 2>> {log} echo \"Indexing the sorted BAM file\" >> {log} samtools index -b {output.bam_sorted} -o {output.index} 2>> {log} echo \"Sorted file saved in <{output.bam_sorted}>\" >> {log} ''' Exercise: If you had to run the workflow by specifying only one output, what command would you use? Answer snakemake --cores 1 -r -p results/highCO2_sample1/highCO2_sample1_mapped_reads_sorted.bam If you run it now, don\u2019t forget to have a look at the log and benchmark files! Samtools options samtools view -b : flag to tell Samtools to create an output in BAM format -o : path of the output file samtools view -O bam : flag to tell Samtools to create an output in BAM format -o : path of the output file samtools index -b : flag to tell Samtools to create an index in BAI format Creating a rule to count mapped reads Most of the analyses happening downstream the alignment step, including Differential Expression Analyses, are starting off read counts, either by exon or gene. However, we are still missing those counts! Counting reads on exons/genes To count reads mapping on genomic features, we first need a definition of those features. In this case, we picked one of the best-known model organism, S. cerevisiae , which has been annotated for a long time. These annotations are easily available on the NCBI or the Saccharomyces Genome Database . 
If your organism has not been annotated yet, there are ways to work around this problem, but this is an entirely different field that we won\u2019t discuss here! Chromosome names If you are working with genome sequences and annotations from different sources, remember that they must contain the chromosome names, otherwise counting will not work. Exercise: Implement a rule to count the reads mapping on each gene of the S. cerevisiae genome using featureCounts . Hint You can find information on how to use featureCounts and its parameters with featureCounts -h The base of the mapping command is featureCounts -t exon -g gene_id -s 2 -p -B -C --largestOverlap --verbose -F GTF If you are interested in what these options mean, see below for an explanation The path of the file containing the annotations ( i.e. input files, in GTF format) is specified with the -a option. This file is located at resources/Scerevisiae.gtf There are two main annotations format: GTF and GFF . The former is lighter and easier to work with, so that is the one we will use The paths of the sorted BAM file(s) ( i.e. input file(s)) are not specified with an option, they are simply added at the end of the command The path of the file containing the count results ( i.e. output file, in tsv format) is specified with the option -o featureCounts will also output a separate file (in tsv format) including summary statistics of counting results, with the name .summary. For example, if the output is test.tsv , the summary will be printed in test.tsv.summary . Do not forget this output in your rule featureCounts also outputs information in the terminal (stderr to be exact); do not forget to redirect these to the log file with 2>> {log} Please give it a try before looking at the answer! Answer rule reads_quantification_genes : ''' This rule quantifies the reads of a bam file mapping on genes and produces a count table for all genes of the assembly. ''' input : bam_once_sorted = rules . sam_to_bam . output . bam_sorted , output : gene_level = 'results/ {sample} / {sample} _genes_read_quantification.tsv' , gene_summary = 'results/ {sample} / {sample} _genes_read_quantification.summary' log : 'logs/ {sample} / {sample} _genes_read_quantification.log' benchmark : 'benchmarks/ {sample} / {sample} _genes_read_quantification.txt' resources : mem_mb = 500 shell : ''' echo \"Counting reads mapping on genes in <{input.bam_once_sorted}>\" > {log} featureCounts -t exon -g gene_id -s 2 -p -B -C --largestOverlap --verbose -F GTF \\ -a resources/Scerevisiae.gtf -o {output.gene_level} {input.bam_once_sorted} &>> {log} echo \"Renaming output files\" >> {log} mv {output.gene_level}.summary {output.gene_summary} echo \"Results saved in <{output.gene_level}>\" >> {log} echo \"Report saved in <{output.gene_summary}>\" >> {log} ''' featureCounts options -t : specify on which feature type to count the reads -g : specify if and how to gather feature counts. Here, reads are counted by exon ( -t ) and the exon counts are gathered by genes \u2018meta-features\u2019 ( -g ) -s : perform strand-specific read counting Strandedness is determined by looking at the mRNA library preparation kit. It can also be determined a posteriori with scripts such as infer_experiment.py from the RSeQC package -p : count fragments instead of reads. 
If you don\u2019t use this option with paired-end reads, featureCounts won\u2019t be able to assign the read-pairs to features -B : only count read pairs that have both ends aligned -C : do not count read pairs that have their two ends mapping to different chromosomes or mapping on the same chromosome but on different strands --largestOverlap : assign reads to the meta-feature/feature that has the largest number of overlapping bases -F : specify format of the provided annotation file --verbose : output verbose information, such as unmatched chromosome/contig names Running the workflow Exercise: If you have not done it after each step, it is now time to run the entire workflow on your sample of choice. What command will you use to run it? Answer Because all the rules are chained together, you only need to specify one of the final outputs to trigger the execution of all the previous rules: snakemake --cores 1 -F -r -p results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv . You can add the -F flag to force an entire re-run. The entire run should take about ~10 min to complete. Exercise: Check Snakemake\u2019s log in .snakemake/log/ . Is everything as you expected, especially the wildcards values, input and output names etc\u2026? Answer cat .snakemake/log/ Visualising the DAG of the workflow We have now implemented and run the main steps of our workflow. It is always a good idea to visualise the whole process to check for errors and inconsistencies. Snakemake\u2019s has a built-in workflow visualisation feature to do this. Exercise: Visualise the entire workflow\u2019s Directed Acyclic Graph using the --dag flag. Do you need to specify a target? Hint Try to follow the official recommendations on workflow structure, which states that images are supposed to go in the images/ subfolder Snakemake prints a DAG in text format, so we need to use the dot command to transform it into a picture Save the result as a PNG picture Answer If we run the command without target: snakemake --cores 1 --dag -F | dot -Tpng > images/dag.png , we will get a Target rules may not contain wildcards. error, which means we need to add a target. Same as before, it makes sense to use one of the final outputs to get the entire workflow: snakemake --cores 1 --dag -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv | dot -Tpng > images/dag.png . But once again, we will get an error: BrokenPipeError: [Errno 32] Broken pipe . This is because we are piping the command output to a folder ( images/ ) that does not exist yet The folder is not created by Snakemake because it isn\u2019t handled as part of an actual run. So we have to create the folder before generating the DAG: mkdir images snakemake --cores 1 --dag -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv | dot -Tpng > images/dag.png Some explanations on the command: -F : force to show the entire worklow and ensures all jobs are shown. You can also use -f to show fewer jobs dot : tool that is a part of the graphviz package and is used to draw hierarchical or layered drawings of directed graphs, i.e. graphs in which edges (arrows) have a direction -T : control the image format. Available formats are listed here DAG aspect If you already computed all the outputs of the workflow, steps in the DAG will have dotted lines. To visualise the DAG before running the workflow, add -F/--forceall to the snakemake command to force the execution of all jobs. 
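Since -T controls the output format, the same graph can be exported to any format supported by dot ; for instance (output file name purely illustrative):

    # Same DAG as above, exported as an SVG instead of a PNG
    snakemake --cores 1 --dag -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv | dot -Tsvg > images/dag.svg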
DAG = dry-run The --dag flag implicitly activates the --dry-run/--dryrun/-n option, which means that no jobs are executed during the plot creation. There are actually 3 types of DAG: A DAG, created with the --dag option A filegraph, created with the --filegraph option A rulegraph, created with the --rulegraph option Exercise: Generate the filegraph and rulegraph of your workflow. Feel free to try different pictures format. What are the differences between the plots? Answer Generate the rulegraph: snakemake --cores 1 --rulegraph -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv | dot -Tpdf > images/rulegraph.pdf Generate the filegraph: snakemake --cores 1 --filegraph -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv | dot -Tjpg > images/filegraph.jpg You should obtain the 3 following figures: ![backbone_dag](../../../assets/images/backbone_dag.png){ width=\"30%\" height=\"450\" } ![backbone_rulegraph](../../../assets/images/backbone_rulegraph.png){ width=\"30%\" height=\"450\" } ![backbone_filegraph](../../../assets/images/backbone_filegraph.png){ width=\"30%\" height=\"450\" } DAG, rulegraph and filegraph (respectively) of the workflow at the end of the session The differences between these plots are: --dag : dependency graph of all the jobs --filegraph : dependency graph of rules with inputs and outputs (rule appears once, with wildcards) --rulegraph : dependency graph of rules (rule appears once) Designing a Snakemake workflow\u2026 and debugging it! Designing a workflow There are many ways to design a new workflow, but these few pieces of advice will be useful in most cases: Start with a pen and paper: try to find out how many rules you will need and how they depend on each other. In other terms, start by sketching the DAG of your workflow! Remember that Snakemake has a bottom-up approach (it goes from the final outputs to the first input), so it may be easier for you to work in that order as well and write your last rule first Determine which rules (if any) aggregate or split inputs and create input functions accordingly (we will see how these functions work in session 4) Make sure your input and output directives are right before worrying about anything else, especially the shell sections. Remember that Snakemake builds the DAG before running the shell commands, so you can use the --dryrun option to test the workflow before running it. You can even do that without writing all the shell commands! List any parameters or settings that might need to be adjusted Choose meaningful and easy-to-understand names for your inputs, outputs, parameters, wildcards\u2026 to make your Snakefile as readable as possible. This is true for every script, piece of code, variable etc\u2026 and Snakemake is no exception! Have a look at The Zen of Python for more information Debugging a workflow It is very likely you will see bugs and errors the first time you try to run a new Snakefile: don\u2019t be discouraged, this is normal! Order of operations in Snakemake The topic was tackled when DAGs were mentioned, but to efficiently debug a workflow, it is worth taking a deeper look at what Snakemake does when you execute the command snakemake --cores 1 . 
There are 3 main phases: Prepare to run: Read all the rule definitions from the Snakefile Resolve the DAG (when Snakemake says \u2018Building DAG of jobs\u2019): Check what output(s) are required Look for a matching rule by looking at the outputs of all the rules Fill in the wildcards to determine the input of the matching rule Check whether this input is available; if not, repeat Step 2 until everything is resolved Run: If needed, create the folder for the output(s) If needed, remove the outdated output(s) Run the shell command with the placeholders replaced Check that the command ran without errors and produced the expected output(s) Debugging advice Sometimes, Snakemake will give you a precise error report, but other times less so. Try to identify which phase of execution failed (see previous paragraph on order of operations) and double-check the most common error causes for that phase: Parsing phase failures (phase 1): Syntax errors, among which (but not limited to): This errors can be easily solved using a text editor with Python/Snakemake text colouring Missing commas/colons/semicolons Unbalanced quotes/brackets/parenthesis Wrong indentation Failure to evaluate expressions Problems in functions ( expand() , input functions\u2026) in input/output directives Python logic added outside of rules Other problems with rule definition Invalid rule names/directives Invalid wildcard names Mismatched wildcards DAG building failures (phase 2, before Snakemake tries to run any job): Failure to determine the target Ambiguous rules making the same output(s) On the contrary, no rule making the required output(s) Circular dependency (violating the \u2018Acyclic\u2019 property of a D A G). Write-protected output(s) DAG running failures (phase 3, --dry-run works and builds the DAG, but the real execution fails): When a job fails, Snakemake reports an error, deletes all output file(s) for that job (potential corruption), and stops Shell command returning non-zero status Missing output file(s) after the commands have run Reference to a $shell_variable before it was set Use of a wrong/unknown placeholder inside { }","title":"Making a more general-purpose Snakemake workflow"},{"location":"course_material/day2/3_generalising_snakemake/#learning-outcomes","text":"After having completed this chapter you will be able to: Create rules with multiple inputs and outputs Make the code shorter and more general by using placeholders and wildcards Optimise the memory usage of a workflow and checking its performances Visualise a workflow DAG","title":"Learning outcomes"},{"location":"course_material/day2/3_generalising_snakemake/#data-origin","text":"The data we will use during the exercises was produced in this work . Briefly, the team studied the transcriptional response of a strain of baker\u2019s yeast, Saccharomyces cerevisiae , facing environments with different amount of CO 2 . To this end, they performed 150 bp paired-end sequencing of mRNA-enriched samples. Detailed information on all the samples are available here , but just know that for the purpose of the course, we selected 6 samples ( 3 replicates per condition , low and high CO 2 ) and down-sampled them to 1 million read-pairs each to reduce computation times.","title":"Data origin"},{"location":"course_material/day2/3_generalising_snakemake/#exercises","text":"One of the aims of today\u2019s course is to develop a basic, yet efficient, workflow to analyse RNAseq data. 
This workflow takes reads coming from RNA sequencing as inputs and outputs a list of genes that are differentially expressed between two conditions. The files containing the reads are in FASTQ format and the output will be a tab-separated file containing a list of genes with expression changes, results of statistical tests\u2026 In this series of exercises, we will create the \u2018backbone\u2019 of the workflow, i.e. the rules that are the most computationally expensive, namely: A rule to trim poor-quality reads A rule to map the trimmed reads on a reference genome A rule to convert and sort files from the SAM format to the BAM format A rule to count the reads mapping on each gene At the end of this series of exercises, the DAG of your workflow should look like this: ![backbone_rulegraph](../../../assets/images/backbone_rulegraph.png) Rulegraph of the workflow at the end of the session Designing and debugging a workflow If you have problems designing your Snakemake workflow or debugging it, you can find some help here .","title":"Exercises"},{"location":"course_material/day2/3_generalising_snakemake/#general-instructions-and-reminders","text":"In each rule, you should try (as much as possible) to: Choose meaningful rule names Use rules dependency, with the syntax rules..output If you use numbered outputs, the syntax becomes rules..output[n] (with n starting at 0) If you use named outputs, the syntax becomes rules..output. Use placeholders Use wildcards Choose meaningful wildcard names The output , log , and benchmark directives must have the same wildcard names! You can use the same wildcard names in multiple rules for consistency and readability, but Snakemake will treat them as independent wildcards and their values will not be shared: rules are self-contained and wildcards are local to each rule ( see a very nice summary on wildcards ) Use multiple inputs/outputs (when needed/possible) Create a log file with the log directive Create a benchmark file with the benchmark directive If you have a doubt, do not hesitate to test your workflow logic with a dry-run (the -n flag): snakemake --cores 1 -n . Snakemake will then display all the jobs required to generate the target. To obtain additional information on why a specific job is necessary, run Snakemake with the -r flag (which can be -and usually is- combined with -n ): snakemake --cores 1 -n -r . For each job, Snakemake will print a reason field explaining why the job was required. To visualize the exact command executed by each job (with the placeholders and wildcards replaced by their values), run snakemake with the -p flag: snakemake --cores 1 -n -r -p .","title":"General instructions and reminders"},{"location":"course_material/day2/3_generalising_snakemake/#downloading-the-data-and-setting-up-the-directory-structure","text":"In this part, we will download the data and start building the directory structure of our workflow according to the official recommendations . 
We already starting doing so in the previous series of exercises and ultimately, it should resemble this: \u2502\u2500\u2500 .gitignore \u2502\u2500\u2500 README.md \u2502\u2500\u2500 LICENSE.md \u2502\u2500\u2500 benchmarks \u2502 \u2502\u2500\u2500 sample1.fastq \u2502 \u2514\u2500\u2500 sample2.fastq \u2502\u2500\u2500 config \u2502 \u2502\u2500\u2500 config.yaml \u2502 \u2514\u2500\u2500 some-sheet.tsv \u2502\u2500\u2500 data \u2502 \u2502\u2500\u2500 sample1.fastq \u2502 \u2514\u2500\u2500 sample2.fastq \u2502\u2500\u2500 images \u2502 \u2514\u2500\u2500 rulegraph.svg \u2502\u2500\u2500 logs \u2502 \u2502\u2500\u2500 sample1.log \u2502 \u2514\u2500\u2500 sample2.log \u2502\u2500\u2500 results \u2502 \u2502\u2500\u2500 sample1 \u2502 \u2502 \u2514\u2500\u2500 sample1.bam \u2502 \u2502\u2500\u2500 sample2 \u2502 \u2502 \u2514\u2500\u2500 sample2.bam \u2502 \u2514\u2500\u2500 DEG_list.tsv \u2502\u2500\u2500 resources \u2502 \u2502\u2500\u2500 Scerevisiae.fasta \u2502 \u2514\u2500\u2500 Scerevisiae.gtf \u2514\u2500\u2500 workflow \u2502\u2500\u2500 envs \u2502 \u2502\u2500\u2500 tool1.yaml \u2502 \u2514\u2500\u2500 tool2.yaml \u2502\u2500\u2500 rules \u2502 \u2502\u2500\u2500 module1.smk \u2502 \u2514\u2500\u2500 module2.smk \u2502\u2500\u2500 scripts \u2502 \u2502\u2500\u2500 script1.py \u2502 \u2514\u2500\u2500 script2.R \u2514\u2500\u2500 Snakefile For now, the main thing to remember is that the workflow code goes into a subfolder called workflow and the rest is mostly input/output files, except for the config subfolder, which will be explained later. All output files generated in the workflow should be stored under results/ . Now, let\u2019s download the data, uncompress it and build the first part of the directory structure. wget https://apollo.vital-it.ch/trackvis/snakemake_rnaseq.tar.gz # Download the data tar -xvf snakemake_rnaseq.tar.gz # Uncompress the archive rm snakemake_rnaseq.tar.gz # Delete the archive cd snakemake_rnaseq/ # Start developing in a new folder In this new folder, you should now see 2 subfolders: data/ , which contains the data to analyse resources/ , which contains retrieved resources, here the assembly, the genome indices and the annotation file of S. cerevisiae . It may also contain small resources delivered along with the workflow via git Let\u2019s create another subfolder, this time to host all the files containing the code, as well as the Snakefile: mkdir workflow # Create a new folder touch workflow/Snakefile # Create an empty Snakefile The Snakefile marks the entrypoint of the workflow. It will be automatically discovered when running Snakemake from the root of the structure, here snakemake_rnaseq/ . We can also tell Snakemake to use a specific Snakefile with the -s flag: snakemake --cores 1 -s , but it is highly discouraged as it hampers reproducibility. If you followed the general instructions , Snakemake should create all the other missing folders by itself (except one that you will discover at the end of this series of exercises), so it is now time to create the rules mentioned earlier . Have a look here for a few pieces of advice on workflow design. \u2018bottom-up\u2019 or \u2018top-down\u2019 development? Even if it is often easier to start from the final outputs and work backwards to the first inputs, the next exercises are presented in the opposite direction (first inputs to last outputs) to make the session easier to understand. That being said, feel free to work and develop your code in the order you prefer! 
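Whichever order you develop in, a dry-run of the final output is a cheap way to check that your rules chain together as expected before anything is executed (the target path below is illustrative and matches the answers of this session):

    # -n: dry-run, -r: show the reason for each job, -p: print the shell commands
    snakemake --cores 1 -n -r -p results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv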
Even if we asked you to use wildards, do not try to process all the samples yet. Choose and work with one sample (which means two .fastq files because reads are paired-end) in this series of exercises. We will see an efficient way to process list of files in the next series of exercises.","title":"Downloading the data and setting up the directory structure"},{"location":"course_material/day2/3_generalising_snakemake/#creating-a-rule-to-trim-reads","text":"Usually, the first step in dealing with sequencing data is to improve the reads quality by removing low quality bases, stretches of As and Ns and reads that are too short. Adapters trimming In theory, trimming also removes sequencing adapters, but we will not do it here to keep computation time low and avoid having to parse other files to extract the adapter sequences. Exercise: Implement a rule to trim the reads contained in .fastq files using atropos . Hint You can find information on how to use atropos and its parameters with atropos trim -h The files to trim are located in data/ The base of the trimming command is atropos trim -q 20,20 --minimum-length 25 --trim-n --preserve-order --max-n 10 --no-cache-adapters -a \"A{{20}}\" -A \"A{{20}}\" If you are interested in what these options mean, see below for an explanation The paths of the files to trim ( i.e. input files, in FASTQ format) are specified with the options -pe1 (first read) and -pe2 (second read) The paths of the trimmed files ( i.e. output files, also in FASTQ format) are specified with the options -o (first read) and -p (second read) atropos outputs some information as well as its trimming report in the terminal (stdout to be exact); do not forget to redirect these information to the log file with >> {log} Please give it a try before looking at the answer! Answer This is one way of writing this rule, but definitely not the only way! This is true for all the rules presented here. rule fastq_trim : ''' This rule trims paired-end reads to improve their quality. Specifically, it removes: - Low quality bases - A stretches longer than 20 bases - N stretches ''' input : reads1 = 'data/ {sample} _1.fastq' , reads2 = 'data/ {sample} _2.fastq' , output : trim1 = 'results/ {sample} / {sample} _atropos_trimmed_1.fastq' , trim2 = 'results/ {sample} / {sample} _atropos_trimmed_2.fastq' log : 'logs/ {sample} / {sample} _atropos_trimming.log' benchmark : 'benchmarks/ {sample} / {sample} _atropos_trimming.txt' resources : mem_mb = 500 shell : ''' echo \"Trimming reads in <{input.reads1}> and <{input.reads2}>\" > {log} atropos trim -q 20,20 --minimum-length 25 --trim-n --preserve-order --max-n 10 \\ --no-cache-adapters -a \"A{{20}}\" -A \"A{{20}}\" \\ -pe1 {input.reads1} -pe2 {input.reads2} -o {output.trim1} -p {output.trim2} &>> {log} echo \"Trimmed files saved in <{output.trim1}> and <{output.trim2}> respectively\" >> {log} echo \"Trimming report saved in <{log}>\" >> {log} ''' Note the three things that are happening here: We used the {sample} wildcards twice in the output paths. This is because we prefer to have all the files linked to a sample in the same directory We added a memory limit for this job: 500 MB. Because we have limited resources in this server compared to a High Performance Computing cluster (HPC), this will help Snakemake to better allocate resources and parallelise jobs. You can determine the maximum amount of memory used by a rule thanks to the max_rss column in a benchmark result (results are shown in MB). 
More information here We used a backslash \\ to split a very long line in smaller lines. This is purely \u2018cosmetic\u2019, to avoid very long lines that are painful to read, copy\u2026 Paths in Snakemake All the paths in the Snakefile are relative to the working directory in which the snakemake command is executed. If you execute Snakemake in snakemake_rnaseq/ , the relative path to the input files in the rule is data/.fastq If you execute Snakemake in snakemake_rnaseq/workflow/ , the relative path to the input files in the rule is ../data/.fastq Exercise: If you had to run the workflow by specifying only one output, what command would you use? Answer snakemake --cores 1 -r -p results/highCO2_sample1/highCO2_sample1_atropos_trimmed_1.fastq If you run it now, don\u2019t forget to have a look at the log and benchmark files!","title":"Creating a rule to trim reads"},{"location":"course_material/day2/3_generalising_snakemake/#atropos-options","text":"-q 20,20 : trim low-quality bases from 5\u2019, 3\u2019 ends of each read before adapter removal --minimum-length 25 : discard trimmed reads that are shorter than 25 bp --trim-n : trim N\u2019s on ends of reads --preserve-order : preserve order of reads in input files --max-n 10 : discard reads with more than 10 N --no-cache-adapters : do not cache adapters list as \u2018.adapters\u2019 in the working directory -a \"A{{20}}\" -A \"A{{20}}\" : remove series of 20 As in the adapter sequence ( -a for the first read of the pair, -A for the second one) The usual command-line syntax is -a \"A{20}\" . Here, brackets were doubled to prevent Snakemake from interpreting {20} as a wildcard","title":"atropos options"},{"location":"course_material/day2/3_generalising_snakemake/#creating-a-rule-to-map-trimmed-reads-on-a-reference-genome","text":"Once we have trimmed reads, the next step is to map those reads onto a reference assembly, here S. cerevisiae strain S288C, to eventually obtain read counts. The assembly used in this exercise is RefSeq GCF_000146045.2 and was retrieved via the NCBI genome website. Exercise: Implement a rule to map the trimmed reads on the S. cerevisiae assembly using HISAT2 . HISAT2 genome index To align reads on a genome, HISAT2 relies on a graph-based index. We built the genome index for you, using the command hisat2-build -p 24 -f Scerevisiae.fasta resources/genome_indices/Scerevisiae_index . -p is the number of threads to use, -f is the genomic sequence in FASTA format and Scerevisiae_genome_index is the global name shared by all the index files. Hint You can find information on how to use HISAT2 and its parameters with hisat2 -h The base of the mapping command is hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal If you are interested in what these options mean, see below for an explanation The path of the genome indices ( i.e. input files, in binary format) is specified with the option -x . The files have a shared title of resources/genome_indices/Scerevisiae_genome_index , which is the value you need to use for -x The paths of the trimmed files ( i.e. input files) are specified with the options -1 (first read) and -2 (second read) The path of the mapped reads file ( i.e. output file, in SAM format) is specified with the option -S (do not forget the .sam extension to the filename) The path of the mapping report ( i.e. 
output file, in text format) is specified with the option --summary-file HISAT2 also outputs information in the terminal (stderr to be exact); do not forget to redirect these to the log file with 2>> {log} This step is the longest of the workflow. With the current settings, it should take ~6 min to complete. If you decide to run it now, you should launch it and start working on the next rules Please give it a try before looking at the answer! Answer rule read_mapping : ''' This rule maps trimmed reads of a fastq on a reference assembly. ''' input : trim1 = rules . fastq_trim . output . trim1 , trim2 = rules . fastq_trim . output . trim2 output : sam = 'results/ {sample} / {sample} _mapped_reads.sam' , report = 'results/ {sample} / {sample} _mapping_report.txt' log : 'logs/ {sample} / {sample} _mapping.log' benchmark : 'benchmarks/ {sample} / {sample} _mapping.txt' resources : mem_gb = 2 shell : ''' echo \"Mapping the reads\" > {log} hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal \\ -x resources/genome_indices/Scerevisiae_genome_index \\ -1 {input.trim1} -2 {input.trim2} -S {output.sam} --summary-file {output.report} 2>> {log} echo \"Mapped reads saved in <{output.sam}>\" >> {log} echo \"Mapping report saved in <{output.report}>\" >> {log} ''' Exercise: If you had to run the workflow by specifying only one output, what command would you use? Answer snakemake --cores 1 -r -p results/highCO2_sample1/highCO2_sample1_mapped_reads.sam If you run it now, don\u2019t forget to have a look at the log and benchmark files!","title":"Creating a rule to map trimmed reads on a reference genome"},{"location":"course_material/day2/3_generalising_snakemake/#hisat2-options","text":"--dta : report alignments tailored for transcript assemblers --fr : set alignment of -1, -2 mates to forward/reverse (position of reads in a pair relatively to each other) --no-mixed : remove unpaired alignments for paired reads --no-discordant : remove discordant alignments for paired reads --time : print wall-clock time taken by search phases --new-summary : print alignment summary in a new style --no-unal : suppress SAM records for reads that failed to align","title":"HISAT2 options"},{"location":"course_material/day2/3_generalising_snakemake/#creating-a-rule-to-convert-and-sort-sam-files-to-bam","text":"HISAT2 only outputs mapped reads in the SAM format . However, most downstream analysis tools use the BAM format , which is the compressed binary version of the SAM format and, as such, is much smaller, easier to manipulate and transfer and allows a faster data retrieval. Additionally, many analyses require that BAM files are sorted by genomic coordinates and indexed, because sorted BAM files can be processed much more easily and quickly than unsorted ones. Alignment data files More information on alignment data files and other formats on the official github repository of the formats. Exercise: Implement a single rule to: Convert SAM files to BAM using Samtools Sort the BAM files using Samtools Index the sorted BAM files using Samtools Hint You can find information on how to use Samtools and its parameters with samtools --help You need to write 3 commands that will be executed sequentially: the output of command 1 will be the input of command 2 etc\u2026 No panic! These commands are pretty simple and do not use many options! 
To convert SAM format to the BAM format, use the command samtools view -b -o To sort a BAM file, use the command samtools sort -O bam -o To index a BAM file, use the command samtools index -b -o The index must have the exact same name than its associated BAM file, except it finishes with the extension .bam.bai instead of .bam If you are interested in what these options mean, see below for an explanation To catch potential information and errors, do not forget to redirect stderr to the log file with 2>> {log} Please give it a try before looking at the answer! Answer rule sam_to_bam : ''' This rule converts a sam file to bam format, sorts it and indexes it. ''' input : sam = rules . read_mapping . output . sam output : bam = 'results/ {sample} / {sample} _mapped_reads.bam' , bam_sorted = 'results/ {sample} / {sample} _mapped_reads_sorted.bam' , index = 'results/ {sample} / {sample} _mapped_reads_sorted.bam.bai' log : 'logs/ {sample} / {sample} _mapping_sam_to_bam.log' benchmark : 'benchmarks/ {sample} / {sample} _mapping_sam_to_bam.txt' resources : mem_mb = 250 shell : ''' echo \"Converting <{input.sam}> to BAM format\" > {log} samtools view {input.sam} -b -o {output.bam} 2>> {log} echo \"Sorting BAM file\" >> {log} samtools sort {output.bam} -O bam -o {output.bam_sorted} 2>> {log} echo \"Indexing the sorted BAM file\" >> {log} samtools index -b {output.bam_sorted} -o {output.index} 2>> {log} echo \"Sorted file saved in <{output.bam_sorted}>\" >> {log} ''' Exercise: If you had to run the workflow by specifying only one output, what command would you use? Answer snakemake --cores 1 -r -p results/highCO2_sample1/highCO2_sample1_mapped_reads_sorted.bam If you run it now, don\u2019t forget to have a look at the log and benchmark files!","title":"Creating a rule to convert and sort SAM files to BAM"},{"location":"course_material/day2/3_generalising_snakemake/#samtools-options","text":"samtools view -b : flag to tell Samtools to create an output in BAM format -o : path of the output file samtools view -O bam : flag to tell Samtools to create an output in BAM format -o : path of the output file samtools index -b : flag to tell Samtools to create an index in BAI format","title":"Samtools options"},{"location":"course_material/day2/3_generalising_snakemake/#creating-a-rule-to-count-mapped-reads","text":"Most of the analyses happening downstream the alignment step, including Differential Expression Analyses, are starting off read counts, either by exon or gene. However, we are still missing those counts! Counting reads on exons/genes To count reads mapping on genomic features, we first need a definition of those features. In this case, we picked one of the best-known model organism, S. cerevisiae , which has been annotated for a long time. These annotations are easily available on the NCBI or the Saccharomyces Genome Database . If your organism has not been annotated yet, there are ways to work around this problem, but this is an entirely different field that we won\u2019t discuss here! Chromosome names If you are working with genome sequences and annotations from different sources, remember that they must contain the chromosome names, otherwise counting will not work. Exercise: Implement a rule to count the reads mapping on each gene of the S. cerevisiae genome using featureCounts . 
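Before tackling featureCounts, here is a short aside on the SAM-to-BAM step you just implemented. If you do not need to keep the unsorted BAM file at all, the conversion and the sorting can in principle be piped together so that the intermediate file never touches the disk. The rule below is only a sketch of this alternative, not the expected answer of the previous exercise; if you wanted to use it, it would have to replace the sam_to_bam rule entirely, because two rules cannot produce the same output file.

```
# Alternative sketch (it would replace sam_to_bam; do not keep both rules):
# pipe samtools view into samtools sort to skip the unsorted BAM on disk
rule sam_to_sorted_bam_piped:
    input:
        sam = rules.read_mapping.output.sam
    output:
        bam_sorted = 'results/{sample}/{sample}_mapped_reads_sorted.bam',
        index = 'results/{sample}/{sample}_mapped_reads_sorted.bam.bai'
    log:
        'logs/{sample}/{sample}_sam_to_sorted_bam.log'
    benchmark:
        'benchmarks/{sample}/{sample}_sam_to_sorted_bam.txt'
    resources:
        mem_mb = 250
    shell:
        '''
        echo "Converting and sorting <{input.sam}>" > {log}
        samtools view -b {input.sam} 2>> {log} | samtools sort -O bam -o {output.bam_sorted} 2>> {log}
        echo "Indexing the sorted BAM file" >> {log}
        samtools index -b {output.bam_sorted} -o {output.index} 2>> {log}
        echo "Sorted file saved in <{output.bam_sorted}>" >> {log}
        '''
```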
Hint You can find information on how to use featureCounts and its parameters with featureCounts -h The base of the mapping command is featureCounts -t exon -g gene_id -s 2 -p -B -C --largestOverlap --verbose -F GTF If you are interested in what these options mean, see below for an explanation The path of the file containing the annotations ( i.e. input files, in GTF format) is specified with the -a option. This file is located at resources/Scerevisiae.gtf There are two main annotations format: GTF and GFF . The former is lighter and easier to work with, so that is the one we will use The paths of the sorted BAM file(s) ( i.e. input file(s)) are not specified with an option, they are simply added at the end of the command The path of the file containing the count results ( i.e. output file, in tsv format) is specified with the option -o featureCounts will also output a separate file (in tsv format) including summary statistics of counting results, with the name .summary. For example, if the output is test.tsv , the summary will be printed in test.tsv.summary . Do not forget this output in your rule featureCounts also outputs information in the terminal (stderr to be exact); do not forget to redirect these to the log file with 2>> {log} Please give it a try before looking at the answer! Answer rule reads_quantification_genes : ''' This rule quantifies the reads of a bam file mapping on genes and produces a count table for all genes of the assembly. ''' input : bam_once_sorted = rules . sam_to_bam . output . bam_sorted , output : gene_level = 'results/ {sample} / {sample} _genes_read_quantification.tsv' , gene_summary = 'results/ {sample} / {sample} _genes_read_quantification.summary' log : 'logs/ {sample} / {sample} _genes_read_quantification.log' benchmark : 'benchmarks/ {sample} / {sample} _genes_read_quantification.txt' resources : mem_mb = 500 shell : ''' echo \"Counting reads mapping on genes in <{input.bam_once_sorted}>\" > {log} featureCounts -t exon -g gene_id -s 2 -p -B -C --largestOverlap --verbose -F GTF \\ -a resources/Scerevisiae.gtf -o {output.gene_level} {input.bam_once_sorted} &>> {log} echo \"Renaming output files\" >> {log} mv {output.gene_level}.summary {output.gene_summary} echo \"Results saved in <{output.gene_level}>\" >> {log} echo \"Report saved in <{output.gene_summary}>\" >> {log} '''","title":"Creating a rule to count mapped reads"},{"location":"course_material/day2/3_generalising_snakemake/#featurecounts-options","text":"-t : specify on which feature type to count the reads -g : specify if and how to gather feature counts. Here, reads are counted by exon ( -t ) and the exon counts are gathered by genes \u2018meta-features\u2019 ( -g ) -s : perform strand-specific read counting Strandedness is determined by looking at the mRNA library preparation kit. It can also be determined a posteriori with scripts such as infer_experiment.py from the RSeQC package -p : count fragments instead of reads. 
If you don\u2019t use this option with paired-end reads, featureCounts won\u2019t be able to assign the read-pairs to features -B : only count read pairs that have both ends aligned -C : do not count read pairs that have their two ends mapping to different chromosomes or mapping on the same chromosome but on different strands --largestOverlap : assign reads to the meta-feature/feature that has the largest number of overlapping bases -F : specify format of the provided annotation file --verbose : output verbose information, such as unmatched chromosome/contig names","title":"featureCounts options"},{"location":"course_material/day2/3_generalising_snakemake/#running-the-workflow","text":"Exercise: If you have not done it after each step, it is now time to run the entire workflow on your sample of choice. What command will you use to run it? Answer Because all the rules are chained together, you only need to specify one of the final outputs to trigger the execution of all the previous rules: snakemake --cores 1 -F -r -p results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv . You can add the -F flag to force an entire re-run. The entire run should take about ~10 min to complete. Exercise: Check Snakemake\u2019s log in .snakemake/log/ . Is everything as you expected, especially the wildcards values, input and output names etc\u2026? Answer cat .snakemake/log/","title":"Running the workflow"},{"location":"course_material/day2/3_generalising_snakemake/#visualising-the-dag-of-the-workflow","text":"We have now implemented and run the main steps of our workflow. It is always a good idea to visualise the whole process to check for errors and inconsistencies. Snakemake\u2019s has a built-in workflow visualisation feature to do this. Exercise: Visualise the entire workflow\u2019s Directed Acyclic Graph using the --dag flag. Do you need to specify a target? Hint Try to follow the official recommendations on workflow structure, which states that images are supposed to go in the images/ subfolder Snakemake prints a DAG in text format, so we need to use the dot command to transform it into a picture Save the result as a PNG picture Answer If we run the command without target: snakemake --cores 1 --dag -F | dot -Tpng > images/dag.png , we will get a Target rules may not contain wildcards. error, which means we need to add a target. Same as before, it makes sense to use one of the final outputs to get the entire workflow: snakemake --cores 1 --dag -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv | dot -Tpng > images/dag.png . But once again, we will get an error: BrokenPipeError: [Errno 32] Broken pipe . This is because we are piping the command output to a folder ( images/ ) that does not exist yet The folder is not created by Snakemake because it isn\u2019t handled as part of an actual run. So we have to create the folder before generating the DAG: mkdir images snakemake --cores 1 --dag -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv | dot -Tpng > images/dag.png Some explanations on the command: -F : force to show the entire worklow and ensures all jobs are shown. You can also use -f to show fewer jobs dot : tool that is a part of the graphviz package and is used to draw hierarchical or layered drawings of directed graphs, i.e. graphs in which edges (arrows) have a direction -T : control the image format. 
Available formats are listed here DAG aspect If you already computed all the outputs of the workflow, steps in the DAG will have dotted lines. To visualise the DAG before running the workflow, add -F/--forceall to the snakemake command to force the execution of all jobs. DAG = dry-run The --dag flag implicitly activates the --dry-run/--dryrun/-n option, which means that no jobs are executed during the plot creation. There are actually 3 types of DAG: A DAG, created with the --dag option A filegraph, created with the --filegraph option A rulegraph, created with the --rulegraph option Exercise: Generate the filegraph and rulegraph of your workflow. Feel free to try different pictures format. What are the differences between the plots? Answer Generate the rulegraph: snakemake --cores 1 --rulegraph -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv | dot -Tpdf > images/rulegraph.pdf Generate the filegraph: snakemake --cores 1 --filegraph -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv | dot -Tjpg > images/filegraph.jpg You should obtain the 3 following figures: ![backbone_dag](../../../assets/images/backbone_dag.png){ width=\"30%\" height=\"450\" } ![backbone_rulegraph](../../../assets/images/backbone_rulegraph.png){ width=\"30%\" height=\"450\" } ![backbone_filegraph](../../../assets/images/backbone_filegraph.png){ width=\"30%\" height=\"450\" } DAG, rulegraph and filegraph (respectively) of the workflow at the end of the session The differences between these plots are: --dag : dependency graph of all the jobs --filegraph : dependency graph of rules with inputs and outputs (rule appears once, with wildcards) --rulegraph : dependency graph of rules (rule appears once)","title":"Visualising the DAG of the workflow"},{"location":"course_material/day2/3_generalising_snakemake/#designing-a-snakemake-workflow-and-debugging-it","text":"","title":"Designing a Snakemake workflow... and debugging it!"},{"location":"course_material/day2/3_generalising_snakemake/#designing-a-workflow","text":"There are many ways to design a new workflow, but these few pieces of advice will be useful in most cases: Start with a pen and paper: try to find out how many rules you will need and how they depend on each other. In other terms, start by sketching the DAG of your workflow! Remember that Snakemake has a bottom-up approach (it goes from the final outputs to the first input), so it may be easier for you to work in that order as well and write your last rule first Determine which rules (if any) aggregate or split inputs and create input functions accordingly (we will see how these functions work in session 4) Make sure your input and output directives are right before worrying about anything else, especially the shell sections. Remember that Snakemake builds the DAG before running the shell commands, so you can use the --dryrun option to test the workflow before running it. You can even do that without writing all the shell commands! List any parameters or settings that might need to be adjusted Choose meaningful and easy-to-understand names for your inputs, outputs, parameters, wildcards\u2026 to make your Snakefile as readable as possible. This is true for every script, piece of code, variable etc\u2026 and Snakemake is no exception! 
Have a look at The Zen of Python for more information","title":"Designing a workflow"},{"location":"course_material/day2/3_generalising_snakemake/#debugging-a-workflow","text":"It is very likely you will see bugs and errors the first time you try to run a new Snakefile: don\u2019t be discouraged, this is normal! Order of operations in Snakemake The topic was tackled when DAGs were mentioned, but to efficiently debug a workflow, it is worth taking a deeper look at what Snakemake does when you execute the command snakemake --cores 1 . There are 3 main phases: Prepare to run: Read all the rule definitions from the Snakefile Resolve the DAG (when Snakemake says \u2018Building DAG of jobs\u2019): Check what output(s) are required Look for a matching rule by looking at the outputs of all the rules Fill in the wildcards to determine the input of the matching rule Check whether this input is available; if not, repeat Step 2 until everything is resolved Run: If needed, create the folder for the output(s) If needed, remove the outdated output(s) Run the shell command with the placeholders replaced Check that the command ran without errors and produced the expected output(s) Debugging advice Sometimes, Snakemake will give you a precise error report, but other times less so. Try to identify which phase of execution failed (see previous paragraph on order of operations) and double-check the most common error causes for that phase: Parsing phase failures (phase 1): Syntax errors, among which (but not limited to): This errors can be easily solved using a text editor with Python/Snakemake text colouring Missing commas/colons/semicolons Unbalanced quotes/brackets/parenthesis Wrong indentation Failure to evaluate expressions Problems in functions ( expand() , input functions\u2026) in input/output directives Python logic added outside of rules Other problems with rule definition Invalid rule names/directives Invalid wildcard names Mismatched wildcards DAG building failures (phase 2, before Snakemake tries to run any job): Failure to determine the target Ambiguous rules making the same output(s) On the contrary, no rule making the required output(s) Circular dependency (violating the \u2018Acyclic\u2019 property of a D A G). Write-protected output(s) DAG running failures (phase 3, --dry-run works and builds the DAG, but the real execution fails): When a job fails, Snakemake reports an error, deletes all output file(s) for that job (potential corruption), and stops Shell command returning non-zero status Missing output file(s) after the commands have run Reference to a $shell_variable before it was set Use of a wrong/unknown placeholder inside { }","title":"Debugging a workflow"},{"location":"course_material/day2/4_decorating_workflow/","text":"Learning outcomes After having completed this chapter you will be able to: Optimise a workflow by multi-threading Use non-file parameters and config files in rules Create rules with non-conventional outputs Modularise a workflow Make a workflow process a list of files rather than one file at a time Exercises In this series of exercises, we will create only one new rule to add to our workflow, because this part aims mainly to show how to improve and \u2018decorate\u2019 the rules we previously wrote. Development and back-up During this session, we will modify our Snakefile quite heavily, so it may be a good idea to start by making a back-up: cp worklow/Snakefile worklow/Snakefile_backup . 
As a general rule, if you have a doubt on the code you are developing, do not hesitate to make a back-up. Optimising a workflow by multi-threading When working with real datasets, most processes are very long and computationally expensive. Fortunately, they can be parallelised very efficiently to decrease the computation time by using several threads for a single job. Exercise: Parallelise as much processes as possible using the threads directive and test its effect: Identify which software can make use of parallelisation Identify in each software the parameter that controls multi-threading Implement the multi-threading Hint Check the software documentation and parameters with the -h/--help flags Remember that multi-threading only applies to software that can make use of a threads parameters, Snakemake itself cannot parallelise a software automatically Remember that you need to add threads to the Snakemake rule but also to the commands! Just increasing the number of threads in Snakemake will not magically run a command with multiple threads Remember that you have 4 threads in total, so even if you ask for more in a rule, Snakemake will cap this value at 4. And if you use 4 threads in a rule, that means that no other job can run parallel! Answer It turns out that all the software except samtools index can handle multi-threading: atropos trim , hisat2 , samtools view , and samtools sort use the --threads option featureCounts uses the -T option Let\u2019s use 4 threads for the mapping step and 2 for the other steps. Your Snakefile should look like this: rule fastq_trim: ''' This rule trims paired-end reads to improve their quality. Specifically, it removes: - Low quality bases - A stretches longer than 20 bases - N stretches ''' input: reads1 = 'data/{sample}_1.fastq', reads2 = 'data/{sample}_2.fastq', output: trim1 = 'results/{sample}/{sample}_atropos_trimmed_1.fastq', trim2 = 'results/{sample}/{sample}_atropos_trimmed_2.fastq' log: 'logs/{sample}/{sample}_atropos_trimming.log' benchmark: 'benchmarks/{sample}/{sample}_atropos_trimming.txt' resources: mem_mb = 500 threads: 2 shell: ''' echo \"Trimming reads in <{input.reads1}> and <{input.reads2}>\" > {log} atropos trim -q 20,20 --minimum-length 25 --trim-n --preserve-order --max-n 10 \\ --no-cache-adapters -a \"A{{20}}\" -A \"A{{20}}\" --threads {threads} \\ -pe1 {input.reads1} -pe2 {input.reads2} -o {output.trim1} -p {output.trim2} &>> {log} echo \"Trimmed files saved in <{output.trim1}> and <{output.trim2}> respectively\" >> {log} echo \"Trimming report saved in <{log}>\" >> {log} ''' rule read_mapping: ''' This rule maps trimmed reads of a fastq on a reference assembly. ''' input: trim1 = rules.fastq_trim.output.trim1, trim2 = rules.fastq_trim.output.trim2 output: sam = 'results/{sample}/{sample}_mapped_reads.sam', report = 'results/{sample}/{sample}_mapping_report.txt' log: 'logs/{sample}/{sample}_mapping.log' benchmark: 'benchmarks/{sample}/{sample}_mapping.txt' resources: mem_gb = 2 threads: 4 shell: ''' echo \"Mapping the reads\" > {log} hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal \\ -x resources/genome_indices/Scerevisiae_index --threads {threads} \\ -1 {input.trim1} -2 {input.trim2} -S {output.sam} --summary-file {output.report} 2>> {log} echo \"Mapped reads saved in <{output.sam}>\" >> {log} echo \"Mapping report saved in <{output.report}>\" >> {log} ''' rule sam_to_bam: ''' This rule converts a sam file to bam format, sorts it and indexes it. 
''' input: sam = rules.read_mapping.output.sam output: bam = 'results/{sample}/{sample}_mapped_reads.bam', bam_sorted = 'results/{sample}/{sample}_mapped_reads_sorted.bam', index = 'results/{sample}/{sample}_mapped_reads_sorted.bam.bai' log: 'logs/{sample}/{sample}_mapping_sam_to_bam.log' benchmark: 'benchmarks/{sample}/{sample}_mapping_sam_to_bam.txt' resources: mem_mb = 250 threads: 2 shell: ''' echo \"Converting <{input.sam}> to BAM format\" > {log} samtools view {input.sam} --threads {threads} -b -o {output.bam} 2>> {log} echo \"Sorting BAM file\" >> {log} samtools sort {output.bam} --threads {threads} -O bam -o {output.bam_sorted} 2>> {log} echo \"Indexing the sorted BAM file\" >> {log} samtools index -b {output.bam_sorted} -o {output.index} 2>> {log} echo \"Sorted file saved in <{output.bam_sorted}>\" >> {log} ''' rule reads_quantification_genes: ''' This rule quantifies the reads of a bam file mapping on genes and produces a count table for all genes of the assembly. The strandedness parameter is determined by get_strandedness(). ''' input: bam_once_sorted = rules.sam_to_bam.output.bam_sorted, output: gene_level = 'results/{sample}/{sample}_genes_read_quantification.tsv', gene_summary = 'results/{sample}/{sample}_genes_read_quantification.summary' log: 'logs/{sample}/{sample}_genes_read_quantification.log' benchmark: 'benchmarks/{sample}/{sample}_genes_read_quantification.txt' resources: mem_mb = 500 threads: 2 shell: ''' echo \"Counting reads mapping on genes in <{input.bam_once_sorted}>\" > {log} featureCounts -t exon -g gene_id -s 2 -p -B -C --largestOverlap --verbose -F GTF \\ -a resources/Scerevisiae.gtf -T {threads} -o {output.gene_level} {input.bam_once_sorted} &>> {log} echo \"Renaming output files\" >> {log} mv {output.gene_level}.summary {output.gene_summary} echo \"Results saved in <{output.gene_level}>\" >> {log} echo \"Report saved in <{output.gene_summary}>\" >> {log} ''' Exercise: Finally, test the effect of the number of threads on the workflow\u2019s runtime. What command will you use to run the workflow? Does the workflow run faster? Answer The command to use is: snakemake --cores 4 -F -r -p results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv Do not forget to provide additional cores to Snakemake in the execution command with --cores 4 . Note that the number of threads allocated to all jobs running at a given time cannot exceed the value specified with --cores . Therefore, if you leave this number at 1, Snakemake will not be able to use multiple threads. Also note that increasing --cores allows Snakemake to run multiple jobs in parallel (for example, running 2 jobs using 2 threads each). The workflow now takes ~6 min to run, compared to ~10 min before ( i.e. a 40% decrease!). This gives you an idea of how powerful multi-threading is when the datasets and computing power get bigger! Explicit is better than implicit Even if a software cannot multi-thread, it is useful to add threads: 1 in the rule to keep the rule consistency and clearly state that the software works with a single thread. Things to keep in mind when using parallel execution Parallel jobs will use more RAM. If you run out then either your OS will swap data to disk, or a process will crash The on-screen output from parallel jobs will be mixed, so save any output to log files instead Using non-file parameters and config files Non-file parameters As we have seen, Snakemake\u2019s execution is based around inputs and outputs of each step of the workflow. 
However, a lot of software rely on additional non-file parameters. In the previous presentation and series of exercises, we advocated (rightfully so!) against using hard-coded filepaths. Yet, if you look back at the rules we have implemented, you will find 2 occurrences of this behaviour in the shell command: In the rule read_mapping , the index parameter -x resources/genome_indices/Scerevisiae_index In the rule reads_quantification_genes , the annotation parameter -a resources/Scerevisiae.gtf This reduces readability and also makes it very hard to change the value of these parameters. The params directive was designed for this purpose: it allows to specify additional parameters that can also depend on the wildcards values and use input functions (see Session 4 for more information on this). params values can be of any type (integer, string, list etc\u2026) and similarly to the {input} and {output} placeholders, they can also be accessed from the shell command with the placeholder {params} . Just like for the input and output directives, you can define multiple parameters (in this case, do not forget the comma between each entry!) and they can be named (in practice, unknown parameters are unexplicit and easily confusing, so parameters should always be named!). It also helps readability and clarity to use the params section to name and assign parameters and variables for your shell command. Here is an example on how to use params : rule example : input : 'data/example.tsv' output : 'results/example.txt' params : lines = 5 shell : 'head -n {params.lines} {input} > {output} ' Parameters arguments In contrast to the input directive, the params directive can optionally take more arguments than only wildcards , namely input , output , threads , and resources . Exercise: Replace the two hard-coded paths mentioned earlier by params . Hint Add a params directive to the rules, name the parameter and replace the path by the placeholder in the shell command. Answer Note: for clarity, only the lines that changed are shown below. rule read_mapping params : index = 'resources/genome_indices/Scerevisiae_index' shell : 'hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal \\ -x {params.index} --threads {threads} \\ -1 {input.trim1} -2 {input.trim2} -S {output.sam} --summary-file {output.report} 2>> {log} ' rule reads_quantification_genes params : annotations = 'resources/Scerevisiae.gtf' shell : 'featureCounts -t exon -g gene_id -s 2 -p -B -C --largestOverlap --verbose -F GTF \\ -a {params.annotations} -T {threads} -o {output.gene_level} {input.bam_once_sorted} &>> {log} ' Snakemake re-run behaviour If you try to re-run only the last rule with snakemake --cores 4 -r -p -f results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv , Snakemake will actually try to re-run 3 rules in total. This is because the code changed in 2 rules (see reason field in Snakemake\u2019s log), which triggered an update of the inputs in the 3rd rule ( sam_to_bam ). To avoid this, first touch the files with snakemake --cores 1 --touch -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv then re-run the last rule. Config files That being said, there is an even better way to handle parameters like the we just modified: instead of hard-coding parameter values in the Snakefile, Snakemake allows to define parameters and their values in config files. 
The config files will be parsed by Snakemake when executing the workflow, and parameters and their values will be stored in a Python dictionary named config . The path to the config file can be specified either in the Snakefile with the line configfile: at the top of the file, or it can be specified at runtime with the execution parameter --configfile . Config files are stored in the config subfolder and written in the JSON or YAML format. We will use the latter for this course as it is the most user-friendly and the recommended one. Briefly, in the YAML format, parameters are defined with the syntax : . Values can be strings, integers, floating points, booleans \u2026 For a complete overview of available value types, see this list . A parameter can have multiple values, which are then each listed on an indented single line starting with \u201c - \u201d. These values will be stored in a Python list when Snakemake parses the config file. Finally, parameters can be nested on indented single lines, and they will be stored as a dictionary when Snakemake parses the config file. The example below shows a parameter with a single value ( lines_number ), a parameter with multiple values ( samples ), and an example of nested parameters ( resources ): # Parameter with a single value (string, int, float, bool ...) lines_number : 5 # Parameter with multiple values samples : - sample1 - sample2 # Nested parameters resources : threads : 4 memory : 4G Then, each parameter can be accessed in Snakefile with the following syntax: config [ 'lines_number' ] # --> 5 config [ 'samples' ] # --> ['sample1', 'sample2'] # Lists of parameters become list config [ 'resources' ] # --> {'threads': 4, 'memory': '4G'} # Lists of named parameters become dictionaries config [ 'resources' ][ 'threads' ] # --> 4 Accessing config values in shell Values stored in the config dictionary cannot be accessed directly within the shell directive. If you need to use a parameter value in shell , define the parameter in params and assign its value from the config dictionary. Exercise: Create a config file in YAML format and fill it with adapted variables and values to replace the 2 hard-coded parameters in rules read_mapping and reads_quantification_genes . Then replace the hard-coded parameters by values from the config file and add its path on top of your Snakefile. Answer Note: for clarity, only the lines that changed are shown below. The first step is to create the subfolder and an empty config file: mkdir config # Create a new folder touch config/config.yaml # Create an empty config file Then, fill the config file with the desired values: # Configuration options of RNAseq-analysis workflow # Location of the genome indices index : 'resources/genome_indices/Scerevisiae_index' # Location of the annotation file annotations : 'resources/Scerevisiae.gtf' Then, replace the params values in the Snakefile: rule read_mapping params : index = config [ 'index' ] rule reads_quantification_genes params : annotations = config [ 'annotations' ] Finally, add the file path on top of the Snakefile: configfile: 'config/config.yaml' Now, if we need to change these values, we can easily do it in the config file instead of modifying the code! Using non-conventional outputs Snakemake has several built-in utilities to assign properties to outputs that are deemed \u2018special\u2019. 
These properties are listed in the table below: Property Syntax Function Temporary temp('path/to/file.txt') File is deleted as soon as it is not required by any future jobs Protected protected('path/to/file.txt') File cannot be overwritten after the job ends (useful to prevent erasing a file by mistake) Ancient ancient('path/to/file.txt') File will not be re-created when running the pipeline (useful for files that require heavy computation) Directory directory('path/to/directory') Output is a directory instead of a file (use \u2018touch\u2019 instead if possible) Touch touch('path/to/file.txt') Create an empty flag file \u2018file.txt\u2019 regardless of the shell command (if the command finished without errors) The next paragraphs will show how to use some of these properties. Use-case of the temp() command Exercise: Can you think of a convenient use of temp() command? Answer The temp() command is extremely useful to automatically remove intermediary outputs that are no longer needed. Exercise: In your workflow, identify outputs that are intermediary and mark them as temporary with temp() . Answer The unsorted .bam and the .sam outputs seem like great candidates to be marked as temporary. One could also argue that the trimmed FASTQ files are also temporary, but we will keep them for now. Note: for clarity, only the lines that changed are shown below. rule read_mapping output : sam = temp ( 'results/ {sample} / {sample} _mapped_reads.sam' ), rule sam_to_bam output : bam = temp ( 'results/ {sample} / {sample} _mapped_reads.bam' ), Consequences of using temp() Removing temporary outputs is a great way to save a lot of storage space. If you look at the size of your current results/ folder ( du -bchd0 results/ ), you will notice that it drastically. Just removing these two files would allow to save ~1 GB. While it may not seem a lot, remember that you usually have much bigger files and many more samples! On the other hand, using temporary outputs might force you to re-run more jobs than necessary if an input changes, so carefully think about it before using it. Exercise: On the contrary, is there a file of your workflow that you would like to protect with protected() Answer This is debatable, but one could argue that the sorted .bam file is a good candidate for protection. rule sam_to_bam output : bam_sorted = protected ( 'results/ {sample} / {sample} _mapped_reads_sorted.bam' ), If you set this output as protected, be careful when you want to re-run your workflow and recreate the file! Use-case of the directory() command: the FastQC example FastQC is a program designed to spot potential problems in high-througput sequencing datasets. It is a very popular tool, notably because it runs quickly and does not require a lot of configuration. It runs a set of analyses on one or more raw sequence files in FASTQ or BAM format and produces a report with quality plots that summarises the results. It will highlight any areas where a dataset looks unusual and might require a closer look. FastQC can be run interactively or in batch mode, during which it saves results as an HTML file and a ZIP file. We will soon see that running FastQC in batch mode presents a little problem. Data types and FastQC FastQC does not differentiate between sequencing techniques and as such can be used to look at libraries coming from a large number of experiments (Genomic Sequencing, ChIP-Seq, RNAseq, BS-Seq etc\u2026). 
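Before looking at FastQC's options, here is a quick aside on the touch() flag from the table above, which does not appear anywhere in the exercises but whose recommendation over directory() is mentioned later in this section. The rule below is purely hypothetical (sometool is an imaginary program whose output location we cannot control): Snakemake creates the empty .done file only if the command finishes without errors, and that flag file can then be used as the input of downstream rules to enforce execution order.

```
# Purely hypothetical rule illustrating the touch() flag
rule run_sometool:
    input:
        bam_sorted = rules.sam_to_bam.output.bam_sorted
    output:
        # Empty flag file created by Snakemake if the command succeeds
        flag = touch('results/{sample}/{sample}_sometool.done')
    log:
        'logs/{sample}/{sample}_sometool.log'
    shell:
        'sometool {input.bam_sorted} &> {log}'
```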
If you run fastqc -h , you will notice something a bit surprising (but not unusual in bioinformatics): -o --outdir Create all output files in the specified output directory. Please note that this directory must exist as the program will not create it. If this option is not set then the output file for each sequence file is created in the same directory as the sequence file which was processed. -d --dir Selects a directory to be used for temporary files written when generating report images. Defaults to system temp directory if not specified. Two files are produced for each FASTQ file and these files appear in the same directory as the input file: FastQC does not allow to specify the names of the output files! However, we can set an alternative output directory, even though it needs to be created before FastQC is run. There are different solutions to deal with this problem: Work with the default file names produced by FastQC and leave the reports in the same directory than the input files Create the outputs in a new directory and leave the reports with their default name Create the outputs in a new directory and tell Snakemake that the directory itself is the output Force a naming convention by renaming the FastQC output files within the rule For the sake of time, we will not test all 4 solutions, but rather try to apply the 3 rd or the 4 th solution. We\u2019ll briefly summarise solutions 1 and 2 here: This could work, but it\u2019s better not to put the reports in the same directory than the input sequences. As a general principle, when writing Snakemake rules, we prefer to be in charge of the output names and to have all the files linked to a sample in the same directory This involves manually constructing the output directory path to use with the -o option, which works but isn\u2019t very convenient The base of the FastQC command is the following: fastqc --format fastq --threads 2 -t/--threads : specify the number of files which can be processed simultaneously. Here, it will be 2 because the inputs are paired-end files The -o and -d will be used in the last 2 solutions that we will now see in details We will create a single rule to run FastQC on both the original and the trimmed FASTQ files Choose only one solution to implement: Solution 3 Solution 4 This option amounts to tell Snakemake not to worry about individual files at all and consider the output of the rule as an entire directory. Exercise: Implement a single rule to run FastQC on both the original and the trimmed FASTQ files (4 files in total) using directories as ouputs with the directory() command. Answer This makes the rule definition quite \u2018simple\u2019: rule fastq_qc_sol3 : ''' This rule performs a QC on paired-end fastq files before and after trimming. ''' input : reads1 = rules . fastq_trim . input . reads1 , reads2 = rules . fastq_trim . input . reads2 , trim1 = rules . fastq_trim . output . trim1 , trim2 = rules . fastq_trim . output . 
trim2 output : before_trim = directory ( 'results/ {sample} /fastqc_reports/before_trim/' ), after_trim = directory ( 'results/ {sample} /fastqc_reports/after_trim/' ) log : 'logs/ {sample} / {sample} _fastqc.log' benchmark : 'benchmarks/ {sample} / {sample} _atropos_fastqc.txt' resources : mem_gb = 1 threads : 2 shell : ''' echo \"Creating output directory <{output.before_trim}>\" > {log} mkdir -p {output.before_trim} 2>> {log} echo \"Performing QC of reads before trimming in <{input.reads1}> and <{input.reads2}>\" >> {log} fastqc --format fastq --threads {threads} --outdir {output.before_trim} \\ --dir {output.before_trim} {input.reads1} {input.reads2} &>> {log} echo \"Results saved in <{output.before_trim}>\" >> {log} echo \"Creating output directory <{output.after_trim}>\" >> {log} mkdir -p {output.after_trim} 2>> {log} echo \"Performing QC of reads after trimming in <{input.trim1}> and <{input.trim2}>\" >> {log} fastqc --format fastq --threads {threads} --outdir {output.after_trim} \\ --dir {output.after_trim} {input.trim1} {input.trim2} &>> {log} echo \"Results saved in <{output.after_trim}>\" >> {log} ''' .snakemake_timestamp When directory() is used, Snakemake creates an empty file called .snakemake_timestamp in the output directory. This is the marker it uses to know if it needs to re-run the rule producing the directory. Overall, this rule works well and allows for an easy rule definition. However, in this case, individual files are not explicitly named as outputs and this may cause problems to chain rules later. Also, remember that some applications won\u2019t give you any control at all over the outputs, which is why you need a back-up plan, i.e. solution 4: the most powerful solution is to use shell commands to move and/or rename the files to the names you want. Also, the Snakemake developers advise to use directory() as a last resort and to rather use the touch() flag instead. This option amounts to let FastQC follows its default behaviour but force the renaming of the files afterwards to obtain the exact outputs we require. Exercise: Implement a single rule to run FastQC on both the original and the trimmed FASTQ files (4 files in total) and rename the files created by FastQC to precise output names using the mv command. Answer This makes the rule definition (much) more complicated than the other solution: rule fastq_qc_sol4 : ''' This rule performs a QC on paired-end fastq files before and after trimming. ''' input : reads1 = rules . fastq_trim . input . reads1 , reads2 = rules . fastq_trim . input . reads2 , trim1 = rules . fastq_trim . output . trim1 , trim2 = rules . fastq_trim . output . 
trim2 output : # QC before trimming html1_before = 'results/ {sample} /fastqc_reports/ {sample} _before_trim_1.html' , zipfile1_before = 'results/ {sample} /fastqc_reports/ {sample} _before_trim_1.zip' , html2_before = 'results/ {sample} /fastqc_reports/ {sample} _before_trim_2.html' , zipfile2_before = 'results/ {sample} /fastqc_reports/ {sample} _before_trim_2.zip' , # QC after trimming html1_after = 'results/ {sample} /fastqc_reports/ {sample} _after_trim_1.html' , zipfile1_after = 'results/ {sample} /fastqc_reports/ {sample} _after_trim_1.zip' , html2_after = 'results/ {sample} /fastqc_reports/ {sample} _after_trim_2.html' , zipfile2_after = 'results/ {sample} /fastqc_reports/ {sample} _after_trim_2.zip' params : wd = 'results/ {sample} /fastqc_reports/' , # QC before trimming html1_before = 'results/ {sample} /fastqc_reports/ {sample} _1_fastqc.html' , zipfile1_before = 'results/ {sample} /fastqc_reports/ {sample} _1_fastqc.zip' , html2_before = 'results/ {sample} /fastqc_reports/ {sample} _2_fastqc.html' , zipfile2_before = 'results/ {sample} /fastqc_reports/ {sample} _2_fastqc.zip' , # QC after trimming html1_after = 'results/ {sample} /fastqc_reports/ {sample} _atropos_trimmed_1_fastqc.html' , zipfile1_after = 'results/ {sample} /fastqc_reports/ {sample} _atropos_trimmed_1_fastqc.zip' , html2_after = 'results/ {sample} /fastqc_reports/ {sample} _atropos_trimmed_2_fastqc.html' , zipfile2_after = 'results/ {sample} /fastqc_reports/ {sample} _atropos_trimmed_2_fastqc.zip' log : 'logs/ {sample} / {sample} _fastqc.log' benchmark : 'benchmarks/ {sample} / {sample} _atropos_fastqc.txt' resources : mem_gb = 1 threads : 2 shell : ''' echo \"Performing QC of reads before trimming in <{input.reads1}> and <{input.reads2}>\" >> {log} fastqc --format fastq --threads {threads} --outdir {params.wd} \\ --dir {params.wd} {input.reads1} {input.reads2} &>> {log} echo \"Renaming results from original fastq analysis\" >> {log} # Renames files because we can't choose fastqc output mv {params.html1_before} {output.html1_before} 2>> {log} mv {params.zipfile1_before} {output.zipfile1_before} 2>> {log} mv {params.html2_before} {output.html2_before} 2>> {log} mv {params.zipfile2_before} {output.zipfile2_before} 2>> {log} echo \"Performing QC of reads after trimming in <{input.trim1}> and <{input.trim2}>\" >> {log} fastqc --format fastq --threads {threads} --outdir {params.wd} \\ --dir {params.wd} {input.trim1} {input.trim2} &>> {log} echo \"Renaming results from trimmed fastq analysis\" >> {log} # Renames files because we can't choose fastqc output mv {params.html1_after} {output.html1_after} 2>> {log} mv {params.zipfile1_after} {output.zipfile1_after} 2>> {log} mv {params.html2_after} {output.html2_after} 2>> {log} mv {params.zipfile2_after} {output.zipfile2_after} 2>> {log} echo \"Results saved in \" >> {log} ''' This solution is very long and much more complicated than the other one. However, it makes up for the complexity by allowing a total control on what is happening: with this method, we can choose where the temporary files are saved and the names of the outputs. It could have been shortened by using -o . to tell FastQC to create the files in the current working directory instead of a specific one, but this would have created another problem: if we run multiple jobs in parallel, then Snakemake may potentially try to produce files from different jobs but with the same temporary destination. 
In this case, the different instances would be trying to write to the same temporary files at the same time, overwriting each other and corrupting the output files. Several interesting things are happening in both versions of this rule: Much like for the outputs, it is possible to refer to the inputs of a rule directly in another rule with the syntax rules..input. FastQC doesn\u2019t create the output directory by itself (other programs might insist that the output directory does not already exist), so we have to create it manually with mkdir in the shell command before running FastQC The -p flag of mkdir make parent directories as needed and does not return an error if the directory already exists Directory creation Remember that in most cases it is not necessary to manually create directories because Snakemake will do it for you. Even when using a directory( ) output, Snakemake will not create the directory itself but most applications will make the directory for you; FastQC is an exception. Hint If you want to make sure that a certain rule is executed before another, you can write the outputs of the first rule as inputs of the second one, even if you don\u2019t use them in the rule. For example, we could force the execution of FastQC before mapping the reads with only a few modifications to rule read_mapping : rule read_mapping: ''' This rule maps trimmed reads of a fastq on a reference assembly. ''' input: trim1 = rules.fastq_trim.output.trim1, trim2 = rules.fastq_trim.output.trim2, # Do not forget to add a comma here fastqc = rules.fastq_qc_sol4.output.html1_before # This single line will force the execution of FASTQC before read mapping output: sam = 'results/{sample}/{sample}_mapped_reads.sam', report = 'results/{sample}/{sample}_mapping_report.txt' params: index = 'resources/genome_indices/Scerevisiae_index' log: 'logs/{sample}/{sample}_mapping.log' benchmark: 'benchmarks/{sample}/{sample}_mapping.txt' resources: mem_gb = 2 threads: 4 shell: ''' echo \"Mapping the reads\" > {log} hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal \\ -x {params.index} --threads {threads} \\ -1 {input.trim1} -2 {input.trim2} -S {output.sam} --summary-file {output.report} 2>> {log} echo \"Mapped reads saved in <{output.sam}>\" >> {log} echo \"Mapping report saved in <{output.report}>\" >> {log} ''' Modularising a workflow If you keep developing a workflow long enough, you are bound to encounter some cluttering problems. Have a look at your current Snakefile: with only 5 rules, it is already almost 200 lines long. Imagine what happens when your workflow comprises dozens of rules?! The Snakefile may become messy and harder to maintain and edit. This is why it quickly becomes crucial to modularise your workflow; this is a common practice in programming in general. This approach also makes it easier to re-use pieces of workflow in the future. Modularisation comes at 4 different levels: The most fine-grained level are wrappers. Wrappers allow to quickly use popular tools and libraries in Snakemake workflows, thanks to the wrapper directive. Wrappers are automatically downloaded and deploy a conda environment when running the workflow, which increases reproducibility, however their implementation can sometimes be \u2018rigid\u2019 and you may have to write your own rule. 
See the official documentation for more explanations For larger, reusable parts belonging to the same workflow, it is recommended to write smaller snakefiles and include them into a main Snakefile with the include statement. Note that in this case, all rules share a common config file. See the official documentation for more explanations The next level of modularisation is provided via the module statement, which enables arbitrary combination and re-use of rules in the same workflow and between workflows. See the official documentation for more explanations Finally, Snakemake also provides a syntax to define subworkflows, but this syntax is currently being deprecated in favor of the module statement. See the official documentation for more explanations In this course, we will only use the 2 nd level of modularisation. In more details, the idea is to write a main Snakefile in workflow/Snakefile , to place the other snakefiles containing the rules in the subfolder workflow/rules (these \u2018sub-Snakefile\u2019 should end with .smk , the recommended file extension of Snakemake) and to tell Snakemake to import the modular snakefiles in the main Snakefile with the include: syntax. Rules organisation How to organize rules is up to you, but a common approach would be to create \u201cthematic\u201d modules, i.e. regroup rules involved in the same general step of the workflow. Exercise: Move your current Snakefile into the subfolder workflow/rules and rename it to read_mapping.smk . Then create a new Snakefile in workflow/ and import read_mapping.smk in it using the include syntax. You should also move the importation of the config file from the modular Snakefile to the main one. Answer We will solve this problem step by step. First, create the new file structure: mkdir workflow/rules # Create a new folder mv workflow/Snakefile workflow/rules/read_mapping.smk # Move and rename the modular snakefile touch workflow/Snakefile # Recreate the main Snakefile Then, fill the main Snakefile with include and configfile : ''' Main Snakefile of the RNAseq analysis workflow. This workflow can clean and map reads, and perform Differential Expression Analyses. ''' # Path of the config file configfile : 'config/config.yaml' # Rules to execute the workflow include : 'rules/read_mapping.smk' Finally, do not forget to remove the config file import ( configfile: 'config/config.yaml' ) from the snakefiles ( workflow/rules/read_mapping.smk ) Relative paths Includes are relative to the directory of the Snakefile in which they occur. For example, if the Snakefile resides in workflow , then Snakemake will search for the included snakefiles in workflow/path/to/other/snakefile , regardless of the working directory You can place snakefiles in a sub-directory without changing input and output paths, as these paths are relative to the working directory. However, you will need to edit paths to external scripts and conda environments, as these paths are relative to the snakefile from which they are called (this will be discussed in the last series of exercises) In practice, you can imagine that the line include: is replaced by the entire content of snakefile.smk in Snakefile . This means that syntaxes like rules..output. can still be used in snakefiles, even if the rule was defined in another snakefile, as long as the snakefile in which is defined is included before the snakefile that uses rules..output . This also works for input and output functions. 
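The include statement is the only modularisation level used in this course, but for completeness, here is a hedged sketch of the module statement mentioned in the list above. The external workflow path and the prefix are hypothetical; the idea is simply to import all the rules of another workflow and prefix their names to avoid clashes:

```
# Hypothetical use of the module statement (Snakemake >= 6.0)
module rnaseq_backbone:
    snakefile: '../other_project/workflow/Snakefile'
    config: config

# Re-use every rule of that module, prefixed to avoid name clashes
use rule * from rnaseq_backbone as backbone_*
```

Just like rules brought in with include, rules imported this way become part of the current workflow and appear in its DAG.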
Using a target rule and aggregating outputs Creating a target rule Modularisation also offers a great opportunity to facilitate the execution of the workflow. By default, if no target is given at the command line, Snakemake executes the first rule in the Snakefile. Hence, we have always executed the workflow by specifying a target file in the command line to avoid this behaviour. But we can actually use this property to make the execution easier by writing a pseudo-rule (also called target rule and usually named rule all ) in the Snakefile which takes all the desired output files (or a particular subset of them) as input files. This rule will look like this: rule all : input : 'path/to/output1' , 'path/to/output2' Order of rules in Snakefile/snakefiles Apart from Snakemake considering the first rule of the workflow as the default target, the order of rules in the Snakefile/snakefiles is arbitrary and does not influence the DAG of jobs. Exercise: Implement a special rule in the Snakefile so that the final output is generated by default when running snakemake without specifying a target, then test your workflow with a dry-run. Hint Remember that a rule is not required to have an output or a shell command The inputs of rule all should be the final outputs that you want to generate (those from the last rule you wrote) Answer If we consider that the last outputs are the ones produced by rule reads_quantification_genes , we can write the target rule like this: # Master rule that launches the workflow rule all : ''' Dummy rule to automatically generate the required outputs. ''' input : 'results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv' Note that we used only one of the two outputs of rule reads_quantification_genes . We do this because it is enough to trigger the execution, and if the rule didn\u2019t produce both outputs, Snakemake would crash and report the error. Now, let\u2019s try to do a dry-run with this new rule: snakemake --cores 4 -F -r -p -n . You should see all the rules appearing thanks to the -F flag, including: localrule all: input: results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv jobid: 0 reason: Input files updated by another job: results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv resources: tmpdir = /tmp Job stats: job count -------------------------- ------- all 1 fastq_qc_sol4 1 fastq_trim 1 read_mapping 1 reads_quantification_genes 1 sam_to_bam 1 total 6 Aggregating outputs Using a target rule like the one presented in the previous paragraph gives another opportunity to make things easier. In the rule we just created, we used a hard-coded input and by now, you should know that this is not an optimal solution and that we should avoid it as much as possible, especially if you have many samples to process. To solve this problem, we will rely on the expand function . Exercise: Write an expand() syntax to generate a list of outputs from rule reads_quantification_genes with all the RNAseq samples . What do you need to write this? Answer The output of rule reads_quantification_genes has the following syntax: 'results/{sample}/{sample}_genes_read_quantification.tsv' . 
First, we need to create a Python list containing all the values that the {sample} wildcards can take: SAMPLES = ['highCO2_sample1', 'highCO2_sample2', 'highCO2_sample3', 'lowCO2_sample1', 'lowCO2_sample2', 'lowCO2_sample3'] Then, we can transform the output syntax with expand() : expand('results/{sample}/{sample}_genes_read_quantification.tsv', sample=SAMPLES) Exercise: Use these two elements (the list of samples and the expand() syntax) in the target rule to ask Snakemake to generate all the outputs. Answer You need to add the sample list to the Snakefile before the rule all and replace the value of the input directive: # Sample list SAMPLES = [ 'highCO2_sample1' , 'highCO2_sample2' , 'highCO2_sample3' , 'lowCO2_sample1' , 'lowCO2_sample2' , 'lowCO2_sample3' ] # Master rule that launches the workflow rule all : ''' Dummy rule to automatically generate the required outputs. ''' input : expand ( 'results/ {sample} / {sample} _genes_read_quantification.tsv' , sample = SAMPLES ) If you launch the workflow in dry-run mode with this new rule: snakemake --cores 4 -F -r -p -n . You should see all the rules appearing 5 times (1 for each sample that hasn\u2019t been processed yet): localrule all: input: results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv, results/highCO2_sample2/highCO2_sample2_genes_read_quantification.tsv, results/highCO2_sample3/highCO2_sample3_genes_read_quantification.tsv, results/lowCO2_sample1/lowCO2_sample1_genes_read_quantification.tsv, results/lowCO2_sample2/lowCO2_sample2_genes_read_quantification.tsv, results/lowCO2_sample3/lowCO2_sample3_genes_read_quantification.tsv jobid: 0 reason: Input files updated by another job: results/lowCO2_sample1/lowCO2_sample1_genes_read_quantification.tsv, results/lowCO2_sample2/lowCO2_sample2_genes_read_quantification.tsv, results/lowCO2_sample3/lowCO2_sample3_genes_read_quantification.tsv, results/highCO2_sample3/highCO2_sample3_genes_read_quantification.tsv, results/highCO2_sample2/highCO2_sample2_genes_read_quantification.tsv resources: tmpdir = /tmp Job stats: job count -------------------------- ------- all 1 fastq_qc_sol4 5 fastq_trim 5 read_mapping 5 reads_quantification_genes 5 sam_to_bam 5 total 26 But we can do even better! At the moment, samples are defined in a list at the top of the Snakefile. To further improve the workflow\u2019s usability, we can define samples in the config file, so they can easily be added, removed, or modified by the user. Exercise: Implement a parameter in the config file to specify sample names and modify rule all to use this parameter in the expand() syntax. Answer First, we need to modify the config file: # Configuration options of RNAseq-analysis workflow # Location of the genome indices index : 'resources/genome_indices/Scerevisiae_index' # Location of the annotation file annotations : 'resources/Scerevisiae.gtf' # Sample names samples : - highCO2_sample1 - highCO2_sample2 - highCO2_sample3 - lowCO2_sample1 - lowCO2_sample2 - lowCO2_sample3 Then, we need to use the config file in the expand() syntax (and remove SAMPLES from the Snakefile, because we don\u2019t need this variable anymore): # Master rule that launches the workflow rule all : ''' Dummy rule to automatically generate the required outputs. ''' input : expand ( 'results/ {sample} / {sample} _genes_read_quantification.tsv' , sample = config [ 'samples' ]) Here, config['samples'] is a Python list containing strings, each string being a sample name. 
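If you want to see concretely what this expand() call evaluates to (a plain Python list of paths), here is a small sketch that can be run in a regular Python session; the import path reflects where expand() currently lives in Snakemake, so adjust it if your installation differs, and the sample list is shortened for the example:

from snakemake.io import expand

config = {'samples': ['highCO2_sample1', 'lowCO2_sample1']}  # shortened list, for illustration only

paths = expand('results/{sample}/{sample}_genes_read_quantification.tsv', sample=config['samples'])
print(paths)
# ['results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv',
#  'results/lowCO2_sample1/lowCO2_sample1_genes_read_quantification.tsv']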
This is because a parameter with multiple values is parsed into a Python list when Snakemake reads the config file. An even more Snakemake-idiomatic solution There is an even better and more Snakemake-idiomatic version of the expand() syntax: expand(rules.reads_quantification_genes.output.gene_level, sample=config['samples']) . While it may not seem easy to use and understand, this entirely removes the need to write the output paths! Running the other samples of the workflow Exercise: Touch the files already present in your workflow to avoid re-creating them and then run your workflow on the 5 other samples. Answer Touch the existing files: snakemake --cores 1 --touch Run the workflow: snakemake --cores 4 -r -p Thanks to the parallelisation, the workflow execution should take less than 10 min in total to process all the samples! Exercise: Generate the workflow DAG and filegraph. Answer Generate the DAG: snakemake --cores 1 -F -r -p --rulegraph | dot -Tpng > images/all_samples_rulegraph.png Generate the filegraph: snakemake --cores 1 -F -r -p --filegraph | dot -Tpng > images/all_samples_filegraph.png Your DAG should resemble this: And this should be your filegraph:","title":"Decorating and optimising a Snakemake workflow"},{"location":"course_material/day2/4_decorating_workflow/#learning-outcomes","text":"After having completed this chapter you will be able to: Optimise a workflow by multi-threading Use non-file parameters and config files in rules Create rules with non-conventional outputs Modularise a workflow Make a workflow process a list of files rather than one file at a time","title":"Learning outcomes"},{"location":"course_material/day2/4_decorating_workflow/#exercises","text":"In this series of exercises, we will create only one new rule to add to our workflow, because this part aims mainly to show how to improve and \u2018decorate\u2019 the rules we previously wrote. Development and back-up During this session, we will modify our Snakefile quite heavily, so it may be a good idea to start by making a back-up: cp workflow/Snakefile workflow/Snakefile_backup . As a general rule, if you have a doubt about the code you are developing, do not hesitate to make a back-up.","title":"Exercises"},{"location":"course_material/day2/4_decorating_workflow/#optimising-a-workflow-by-multi-threading","text":"When working with real datasets, most processes are very long and computationally expensive. Fortunately, they can be parallelised very efficiently to decrease the computation time by using several threads for a single job. Exercise: Parallelise as many processes as possible using the threads directive and test its effect: Identify which software can make use of parallelisation Identify in each software the parameter that controls multi-threading Implement the multi-threading Hint Check the software documentation and parameters with the -h/--help flags Remember that multi-threading only applies to software that can make use of a threads parameter, Snakemake itself cannot parallelise a software automatically Remember that you need to add threads to the Snakemake rule but also to the commands! Just increasing the number of threads in Snakemake will not magically run a command with multiple threads (a generic sketch of this pattern is shown below) Remember that you have 4 threads in total, so even if you ask for more in a rule, Snakemake will cap this value at 4. And if you use 4 threads in a rule, that means that no other job can run in parallel! 
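As a generic illustration of the pattern described in the hint above, the sketch below shows a rule where the thread count is both declared for Snakemake and forwarded to the command; some_tool and its --threads option are placeholders, not a program used in this workflow:

rule generic_multithreaded_step:
    input:
        'data/{sample}.fastq'
    output:
        'results/{sample}/{sample}_processed.fastq'
    threads: 2  # what Snakemake reserves for this job (capped by --cores)
    shell:
        # The same value must be passed to the command itself via {threads}
        'some_tool --threads {threads} -i {input} -o {output}'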
Answer It turns out that all the software except samtools index can handle multi-threading: atropos trim , hisat2 , samtools view , and samtools sort use the --threads option featureCounts uses the -T option Let\u2019s use 4 threads for the mapping step and 2 for the other steps. Your Snakefile should look like this: rule fastq_trim: ''' This rule trims paired-end reads to improve their quality. Specifically, it removes: - Low quality bases - A stretches longer than 20 bases - N stretches ''' input: reads1 = 'data/{sample}_1.fastq', reads2 = 'data/{sample}_2.fastq', output: trim1 = 'results/{sample}/{sample}_atropos_trimmed_1.fastq', trim2 = 'results/{sample}/{sample}_atropos_trimmed_2.fastq' log: 'logs/{sample}/{sample}_atropos_trimming.log' benchmark: 'benchmarks/{sample}/{sample}_atropos_trimming.txt' resources: mem_mb = 500 threads: 2 shell: ''' echo \"Trimming reads in <{input.reads1}> and <{input.reads2}>\" > {log} atropos trim -q 20,20 --minimum-length 25 --trim-n --preserve-order --max-n 10 \\ --no-cache-adapters -a \"A{{20}}\" -A \"A{{20}}\" --threads {threads} \\ -pe1 {input.reads1} -pe2 {input.reads2} -o {output.trim1} -p {output.trim2} &>> {log} echo \"Trimmed files saved in <{output.trim1}> and <{output.trim2}> respectively\" >> {log} echo \"Trimming report saved in <{log}>\" >> {log} ''' rule read_mapping: ''' This rule maps trimmed reads of a fastq on a reference assembly. ''' input: trim1 = rules.fastq_trim.output.trim1, trim2 = rules.fastq_trim.output.trim2 output: sam = 'results/{sample}/{sample}_mapped_reads.sam', report = 'results/{sample}/{sample}_mapping_report.txt' log: 'logs/{sample}/{sample}_mapping.log' benchmark: 'benchmarks/{sample}/{sample}_mapping.txt' resources: mem_gb = 2 threads: 4 shell: ''' echo \"Mapping the reads\" > {log} hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal \\ -x resources/genome_indices/Scerevisiae_index --threads {threads} \\ -1 {input.trim1} -2 {input.trim2} -S {output.sam} --summary-file {output.report} 2>> {log} echo \"Mapped reads saved in <{output.sam}>\" >> {log} echo \"Mapping report saved in <{output.report}>\" >> {log} ''' rule sam_to_bam: ''' This rule converts a sam file to bam format, sorts it and indexes it. ''' input: sam = rules.read_mapping.output.sam output: bam = 'results/{sample}/{sample}_mapped_reads.bam', bam_sorted = 'results/{sample}/{sample}_mapped_reads_sorted.bam', index = 'results/{sample}/{sample}_mapped_reads_sorted.bam.bai' log: 'logs/{sample}/{sample}_mapping_sam_to_bam.log' benchmark: 'benchmarks/{sample}/{sample}_mapping_sam_to_bam.txt' resources: mem_mb = 250 threads: 2 shell: ''' echo \"Converting <{input.sam}> to BAM format\" > {log} samtools view {input.sam} --threads {threads} -b -o {output.bam} 2>> {log} echo \"Sorting BAM file\" >> {log} samtools sort {output.bam} --threads {threads} -O bam -o {output.bam_sorted} 2>> {log} echo \"Indexing the sorted BAM file\" >> {log} samtools index -b {output.bam_sorted} -o {output.index} 2>> {log} echo \"Sorted file saved in <{output.bam_sorted}>\" >> {log} ''' rule reads_quantification_genes: ''' This rule quantifies the reads of a bam file mapping on genes and produces a count table for all genes of the assembly. The strandedness parameter is determined by get_strandedness(). 
''' input: bam_once_sorted = rules.sam_to_bam.output.bam_sorted, output: gene_level = 'results/{sample}/{sample}_genes_read_quantification.tsv', gene_summary = 'results/{sample}/{sample}_genes_read_quantification.summary' log: 'logs/{sample}/{sample}_genes_read_quantification.log' benchmark: 'benchmarks/{sample}/{sample}_genes_read_quantification.txt' resources: mem_mb = 500 threads: 2 shell: ''' echo \"Counting reads mapping on genes in <{input.bam_once_sorted}>\" > {log} featureCounts -t exon -g gene_id -s 2 -p -B -C --largestOverlap --verbose -F GTF \\ -a resources/Scerevisiae.gtf -T {threads} -o {output.gene_level} {input.bam_once_sorted} &>> {log} echo \"Renaming output files\" >> {log} mv {output.gene_level}.summary {output.gene_summary} echo \"Results saved in <{output.gene_level}>\" >> {log} echo \"Report saved in <{output.gene_summary}>\" >> {log} ''' Exercise: Finally, test the effect of the number of threads on the workflow\u2019s runtime. What command will you use to run the workflow? Does the workflow run faster? Answer The command to use is: snakemake --cores 4 -F -r -p results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv Do not forget to provide additional cores to Snakemake in the execution command with --cores 4 . Note that the number of threads allocated to all jobs running at a given time cannot exceed the value specified with --cores . Therefore, if you leave this number at 1, Snakemake will not be able to use multiple threads. Also note that increasing --cores allows Snakemake to run multiple jobs in parallel (for example, running 2 jobs using 2 threads each). The workflow now takes ~6 min to run, compared to ~10 min before ( i.e. a 40% decrease!). This gives you an idea of how powerful multi-threading is when the datasets and computing power get bigger! Explicit is better than implicit Even if a software cannot multi-thread, it is useful to add threads: 1 in the rule to keep the rule consistency and clearly state that the software works with a single thread. Things to keep in mind when using parallel execution Parallel jobs will use more RAM. If you run out then either your OS will swap data to disk, or a process will crash The on-screen output from parallel jobs will be mixed, so save any output to log files instead","title":"Optimising a workflow by multi-threading"},{"location":"course_material/day2/4_decorating_workflow/#using-non-file-parameters-and-config-files","text":"","title":"Using non-file parameters and config files"},{"location":"course_material/day2/4_decorating_workflow/#non-file-parameters","text":"As we have seen, Snakemake\u2019s execution is based around inputs and outputs of each step of the workflow. However, a lot of software rely on additional non-file parameters. In the previous presentation and series of exercises, we advocated (rightfully so!) against using hard-coded filepaths. Yet, if you look back at the rules we have implemented, you will find 2 occurrences of this behaviour in the shell command: In the rule read_mapping , the index parameter -x resources/genome_indices/Scerevisiae_index In the rule reads_quantification_genes , the annotation parameter -a resources/Scerevisiae.gtf This reduces readability and also makes it very hard to change the value of these parameters. The params directive was designed for this purpose: it allows to specify additional parameters that can also depend on the wildcards values and use input functions (see Session 4 for more information on this). 
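Because params entries can depend on wildcard values, a minimal sketch of that capability could look as follows; the rule name, the prefix parameter and some_tool are invented for illustration and are not part of this workflow:

rule example_wildcard_dependent_param:
    input:
        'data/{sample}.tsv'
    output:
        'results/{sample}/{sample}_summary.txt'
    params:
        # A function (here a lambda) receiving the wildcards lets the parameter
        # value change with the sample being processed
        prefix = lambda wildcards: f'results/{wildcards.sample}/{wildcards.sample}'
    shell:
        'some_tool --prefix {params.prefix} {input} > {output}'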
params values can be of any type (integer, string, list etc\u2026) and, similarly to the {input} and {output} placeholders, they can also be accessed from the shell command with the placeholder {params} . Just like for the input and output directives, you can define multiple parameters (in this case, do not forget the comma between each entry!) and they can be named (in practice, unnamed parameters are implicit and easily confusing, so parameters should always be named!). It also helps readability and clarity to use the params section to name and assign parameters and variables for your shell command. Here is an example of how to use params : rule example : input : 'data/example.tsv' output : 'results/example.txt' params : lines = 5 shell : 'head -n {params.lines} {input} > {output} ' Parameters arguments In contrast to the input directive, the params directive can optionally take more arguments than only wildcards , namely input , output , threads , and resources . Exercise: Replace the two hard-coded paths mentioned earlier with params . Hint Add a params directive to the rules, name the parameter and replace the path with the placeholder in the shell command. Answer Note: for clarity, only the lines that changed are shown below. rule read_mapping params : index = 'resources/genome_indices/Scerevisiae_index' shell : 'hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal \\ -x {params.index} --threads {threads} \\ -1 {input.trim1} -2 {input.trim2} -S {output.sam} --summary-file {output.report} 2>> {log} ' rule reads_quantification_genes params : annotations = 'resources/Scerevisiae.gtf' shell : 'featureCounts -t exon -g gene_id -s 2 -p -B -C --largestOverlap --verbose -F GTF \\ -a {params.annotations} -T {threads} -o {output.gene_level} {input.bam_once_sorted} &>> {log} ' Snakemake re-run behaviour If you try to re-run only the last rule with snakemake --cores 4 -r -p -f results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv , Snakemake will actually try to re-run 3 rules in total. This is because the code changed in 2 rules (see the reason field in Snakemake\u2019s log), which triggered an update of the inputs in the 3rd rule ( sam_to_bam ). To avoid this, first touch the files with snakemake --cores 1 --touch -F results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv , then re-run the last rule.","title":"Non-file parameters"},{"location":"course_material/day2/4_decorating_workflow/#config-files","text":"That being said, there is an even better way to handle parameters like the ones we just modified: instead of hard-coding parameter values in the Snakefile, Snakemake allows you to define parameters and their values in config files. The config files will be parsed by Snakemake when executing the workflow, and parameters and their values will be stored in a Python dictionary named config . The path to the config file can be specified either in the Snakefile with the line configfile: <path/to/file.yaml> at the top of the file, or it can be specified at runtime with the execution parameter --configfile . Config files are stored in the config subfolder and written in the JSON or YAML format. We will use the latter for this course as it is the most user-friendly and the recommended one. Briefly, in the YAML format, parameters are defined with the syntax <parameter>: <value> . Values can be strings, integers, floating-point numbers, booleans \u2026 For a complete overview of available value types, see this list . 
A parameter can have multiple values, which are then each listed on an indented single line starting with \u201c - \u201d. These values will be stored in a Python list when Snakemake parses the config file. Finally, parameters can be nested on indented single lines, and they will be stored as a dictionary when Snakemake parses the config file. The example below shows a parameter with a single value ( lines_number ), a parameter with multiple values ( samples ), and an example of nested parameters ( resources ): # Parameter with a single value (string, int, float, bool ...) lines_number : 5 # Parameter with multiple values samples : - sample1 - sample2 # Nested parameters resources : threads : 4 memory : 4G Then, each parameter can be accessed in Snakefile with the following syntax: config [ 'lines_number' ] # --> 5 config [ 'samples' ] # --> ['sample1', 'sample2'] # Lists of parameters become list config [ 'resources' ] # --> {'threads': 4, 'memory': '4G'} # Lists of named parameters become dictionaries config [ 'resources' ][ 'threads' ] # --> 4 Accessing config values in shell Values stored in the config dictionary cannot be accessed directly within the shell directive. If you need to use a parameter value in shell , define the parameter in params and assign its value from the config dictionary. Exercise: Create a config file in YAML format and fill it with adapted variables and values to replace the 2 hard-coded parameters in rules read_mapping and reads_quantification_genes . Then replace the hard-coded parameters by values from the config file and add its path on top of your Snakefile. Answer Note: for clarity, only the lines that changed are shown below. The first step is to create the subfolder and an empty config file: mkdir config # Create a new folder touch config/config.yaml # Create an empty config file Then, fill the config file with the desired values: # Configuration options of RNAseq-analysis workflow # Location of the genome indices index : 'resources/genome_indices/Scerevisiae_index' # Location of the annotation file annotations : 'resources/Scerevisiae.gtf' Then, replace the params values in the Snakefile: rule read_mapping params : index = config [ 'index' ] rule reads_quantification_genes params : annotations = config [ 'annotations' ] Finally, add the file path on top of the Snakefile: configfile: 'config/config.yaml' Now, if we need to change these values, we can easily do it in the config file instead of modifying the code!","title":"Config files"},{"location":"course_material/day2/4_decorating_workflow/#using-non-conventional-outputs","text":"Snakemake has several built-in utilities to assign properties to outputs that are deemed \u2018special\u2019. 
These properties are listed below: Temporary ( temp('path/to/file.txt') ): the file is deleted as soon as it is not required by any future jobs. Protected ( protected('path/to/file.txt') ): the file cannot be overwritten after the job ends (useful to prevent erasing a file by mistake). Ancient ( ancient('path/to/file.txt') ): the file will not be re-created when running the pipeline (useful for files that require heavy computation). Directory ( directory('path/to/directory') ): the output is a directory instead of a file (use \u2018touch\u2019 instead if possible). Touch ( touch('path/to/file.txt') ): creates an empty flag file \u2018file.txt\u2019 regardless of the shell command (if the command finished without errors). The next paragraphs will show how to use some of these properties.","title":"Using non-conventional outputs"},{"location":"course_material/day2/4_decorating_workflow/#use-case-of-the-temp-command","text":"Exercise: Can you think of a convenient use of the temp() command? Answer The temp() command is extremely useful to automatically remove intermediary outputs that are no longer needed. Exercise: In your workflow, identify outputs that are intermediary and mark them as temporary with temp() . Answer The unsorted .bam and the .sam outputs seem like great candidates to be marked as temporary. One could also argue that the trimmed FASTQ files are also temporary, but we will keep them for now. Note: for clarity, only the lines that changed are shown below. rule read_mapping output : sam = temp ( 'results/ {sample} / {sample} _mapped_reads.sam' ), rule sam_to_bam output : bam = temp ( 'results/ {sample} / {sample} _mapped_reads.bam' ), Consequences of using temp() Removing temporary outputs is a great way to save a lot of storage space. If you look at the size of your current results/ folder ( du -bchd0 results/ ), you will notice that it would shrink drastically. Just removing these two files would allow you to save ~1 GB. While it may not seem a lot, remember that you usually have much bigger files and many more samples! On the other hand, using temporary outputs might force you to re-run more jobs than necessary if an input changes, so carefully think about it before using it. Exercise: On the contrary, is there a file of your workflow that you would like to protect with protected() ? Answer This is debatable, but one could argue that the sorted .bam file is a good candidate for protection. rule sam_to_bam output : bam_sorted = protected ( 'results/ {sample} / {sample} _mapped_reads_sorted.bam' ), If you set this output as protected, be careful when you want to re-run your workflow and recreate the file!","title":"Use-case of the temp() command"},{"location":"course_material/day2/4_decorating_workflow/#use-case-of-the-directory-command-the-fastqc-example","text":"FastQC is a program designed to spot potential problems in high-throughput sequencing datasets. It is a very popular tool, notably because it runs quickly and does not require a lot of configuration. It runs a set of analyses on one or more raw sequence files in FASTQ or BAM format and produces a report with quality plots that summarises the results. It will highlight any areas where a dataset looks unusual and might require a closer look. FastQC can be run interactively or in batch mode, during which it saves results as an HTML file and a ZIP file. We will soon see that running FastQC in batch mode presents a little problem. 
Data types and FastQC FastQC does not differentiate between sequencing techniques and as such can be used to look at libraries coming from a large number of experiments (Genomic Sequencing, ChIP-Seq, RNAseq, BS-Seq etc\u2026). If you run fastqc -h , you will notice something a bit surprising (but not unusual in bioinformatics): -o --outdir Create all output files in the specified output directory. Please note that this directory must exist as the program will not create it. If this option is not set then the output file for each sequence file is created in the same directory as the sequence file which was processed. -d --dir Selects a directory to be used for temporary files written when generating report images. Defaults to system temp directory if not specified. Two files are produced for each FASTQ file and these files appear in the same directory as the input file: FastQC does not allow to specify the names of the output files! However, we can set an alternative output directory, even though it needs to be created before FastQC is run. There are different solutions to deal with this problem: Work with the default file names produced by FastQC and leave the reports in the same directory than the input files Create the outputs in a new directory and leave the reports with their default name Create the outputs in a new directory and tell Snakemake that the directory itself is the output Force a naming convention by renaming the FastQC output files within the rule For the sake of time, we will not test all 4 solutions, but rather try to apply the 3 rd or the 4 th solution. We\u2019ll briefly summarise solutions 1 and 2 here: This could work, but it\u2019s better not to put the reports in the same directory than the input sequences. As a general principle, when writing Snakemake rules, we prefer to be in charge of the output names and to have all the files linked to a sample in the same directory This involves manually constructing the output directory path to use with the -o option, which works but isn\u2019t very convenient The base of the FastQC command is the following: fastqc --format fastq --threads 2 -t/--threads : specify the number of files which can be processed simultaneously. Here, it will be 2 because the inputs are paired-end files The -o and -d will be used in the last 2 solutions that we will now see in details We will create a single rule to run FastQC on both the original and the trimmed FASTQ files Choose only one solution to implement: Solution 3 Solution 4 This option amounts to tell Snakemake not to worry about individual files at all and consider the output of the rule as an entire directory. Exercise: Implement a single rule to run FastQC on both the original and the trimmed FASTQ files (4 files in total) using directories as ouputs with the directory() command. Answer This makes the rule definition quite \u2018simple\u2019: rule fastq_qc_sol3 : ''' This rule performs a QC on paired-end fastq files before and after trimming. ''' input : reads1 = rules . fastq_trim . input . reads1 , reads2 = rules . fastq_trim . input . reads2 , trim1 = rules . fastq_trim . output . trim1 , trim2 = rules . fastq_trim . output . 
trim2 output : before_trim = directory ( 'results/ {sample} /fastqc_reports/before_trim/' ), after_trim = directory ( 'results/ {sample} /fastqc_reports/after_trim/' ) log : 'logs/ {sample} / {sample} _fastqc.log' benchmark : 'benchmarks/ {sample} / {sample} _atropos_fastqc.txt' resources : mem_gb = 1 threads : 2 shell : ''' echo \"Creating output directory <{output.before_trim}>\" > {log} mkdir -p {output.before_trim} 2>> {log} echo \"Performing QC of reads before trimming in <{input.reads1}> and <{input.reads2}>\" >> {log} fastqc --format fastq --threads {threads} --outdir {output.before_trim} \\ --dir {output.before_trim} {input.reads1} {input.reads2} &>> {log} echo \"Results saved in <{output.before_trim}>\" >> {log} echo \"Creating output directory <{output.after_trim}>\" >> {log} mkdir -p {output.after_trim} 2>> {log} echo \"Performing QC of reads after trimming in <{input.trim1}> and <{input.trim2}>\" >> {log} fastqc --format fastq --threads {threads} --outdir {output.after_trim} \\ --dir {output.after_trim} {input.trim1} {input.trim2} &>> {log} echo \"Results saved in <{output.after_trim}>\" >> {log} ''' .snakemake_timestamp When directory() is used, Snakemake creates an empty file called .snakemake_timestamp in the output directory. This is the marker it uses to know if it needs to re-run the rule producing the directory. Overall, this rule works well and allows for an easy rule definition. However, in this case, individual files are not explicitly named as outputs and this may cause problems to chain rules later. Also, remember that some applications won\u2019t give you any control at all over the outputs, which is why you need a back-up plan, i.e. solution 4: the most powerful solution is to use shell commands to move and/or rename the files to the names you want. Also, the Snakemake developers advise to use directory() as a last resort and to rather use the touch() flag instead. This option amounts to let FastQC follows its default behaviour but force the renaming of the files afterwards to obtain the exact outputs we require. Exercise: Implement a single rule to run FastQC on both the original and the trimmed FASTQ files (4 files in total) and rename the files created by FastQC to precise output names using the mv command. Answer This makes the rule definition (much) more complicated than the other solution: rule fastq_qc_sol4 : ''' This rule performs a QC on paired-end fastq files before and after trimming. ''' input : reads1 = rules . fastq_trim . input . reads1 , reads2 = rules . fastq_trim . input . reads2 , trim1 = rules . fastq_trim . output . trim1 , trim2 = rules . fastq_trim . output . 
trim2 output : # QC before trimming html1_before = 'results/ {sample} /fastqc_reports/ {sample} _before_trim_1.html' , zipfile1_before = 'results/ {sample} /fastqc_reports/ {sample} _before_trim_1.zip' , html2_before = 'results/ {sample} /fastqc_reports/ {sample} _before_trim_2.html' , zipfile2_before = 'results/ {sample} /fastqc_reports/ {sample} _before_trim_2.zip' , # QC after trimming html1_after = 'results/ {sample} /fastqc_reports/ {sample} _after_trim_1.html' , zipfile1_after = 'results/ {sample} /fastqc_reports/ {sample} _after_trim_1.zip' , html2_after = 'results/ {sample} /fastqc_reports/ {sample} _after_trim_2.html' , zipfile2_after = 'results/ {sample} /fastqc_reports/ {sample} _after_trim_2.zip' params : wd = 'results/ {sample} /fastqc_reports/' , # QC before trimming html1_before = 'results/ {sample} /fastqc_reports/ {sample} _1_fastqc.html' , zipfile1_before = 'results/ {sample} /fastqc_reports/ {sample} _1_fastqc.zip' , html2_before = 'results/ {sample} /fastqc_reports/ {sample} _2_fastqc.html' , zipfile2_before = 'results/ {sample} /fastqc_reports/ {sample} _2_fastqc.zip' , # QC after trimming html1_after = 'results/ {sample} /fastqc_reports/ {sample} _atropos_trimmed_1_fastqc.html' , zipfile1_after = 'results/ {sample} /fastqc_reports/ {sample} _atropos_trimmed_1_fastqc.zip' , html2_after = 'results/ {sample} /fastqc_reports/ {sample} _atropos_trimmed_2_fastqc.html' , zipfile2_after = 'results/ {sample} /fastqc_reports/ {sample} _atropos_trimmed_2_fastqc.zip' log : 'logs/ {sample} / {sample} _fastqc.log' benchmark : 'benchmarks/ {sample} / {sample} _atropos_fastqc.txt' resources : mem_gb = 1 threads : 2 shell : ''' echo \"Performing QC of reads before trimming in <{input.reads1}> and <{input.reads2}>\" >> {log} fastqc --format fastq --threads {threads} --outdir {params.wd} \\ --dir {params.wd} {input.reads1} {input.reads2} &>> {log} echo \"Renaming results from original fastq analysis\" >> {log} # Renames files because we can't choose fastqc output mv {params.html1_before} {output.html1_before} 2>> {log} mv {params.zipfile1_before} {output.zipfile1_before} 2>> {log} mv {params.html2_before} {output.html2_before} 2>> {log} mv {params.zipfile2_before} {output.zipfile2_before} 2>> {log} echo \"Performing QC of reads after trimming in <{input.trim1}> and <{input.trim2}>\" >> {log} fastqc --format fastq --threads {threads} --outdir {params.wd} \\ --dir {params.wd} {input.trim1} {input.trim2} &>> {log} echo \"Renaming results from trimmed fastq analysis\" >> {log} # Renames files because we can't choose fastqc output mv {params.html1_after} {output.html1_after} 2>> {log} mv {params.zipfile1_after} {output.zipfile1_after} 2>> {log} mv {params.html2_after} {output.html2_after} 2>> {log} mv {params.zipfile2_after} {output.zipfile2_after} 2>> {log} echo \"Results saved in \" >> {log} ''' This solution is very long and much more complicated than the other one. However, it makes up for the complexity by allowing a total control on what is happening: with this method, we can choose where the temporary files are saved and the names of the outputs. It could have been shortened by using -o . to tell FastQC to create the files in the current working directory instead of a specific one, but this would have created another problem: if we run multiple jobs in parallel, then Snakemake may potentially try to produce files from different jobs but with the same temporary destination. 
In this case, the different instances would be trying to write to the same temporary files at the same time, overwriting each other and corrupting the output files. Several interesting things are happening in both versions of this rule: Much like for the outputs, it is possible to refer to the inputs of a rule directly in another rule with the syntax rules..input. FastQC doesn\u2019t create the output directory by itself (other programs might insist that the output directory does not already exist), so we have to create it manually with mkdir in the shell command before running FastQC The -p flag of mkdir make parent directories as needed and does not return an error if the directory already exists Directory creation Remember that in most cases it is not necessary to manually create directories because Snakemake will do it for you. Even when using a directory( ) output, Snakemake will not create the directory itself but most applications will make the directory for you; FastQC is an exception. Hint If you want to make sure that a certain rule is executed before another, you can write the outputs of the first rule as inputs of the second one, even if you don\u2019t use them in the rule. For example, we could force the execution of FastQC before mapping the reads with only a few modifications to rule read_mapping : rule read_mapping: ''' This rule maps trimmed reads of a fastq on a reference assembly. ''' input: trim1 = rules.fastq_trim.output.trim1, trim2 = rules.fastq_trim.output.trim2, # Do not forget to add a comma here fastqc = rules.fastq_qc_sol4.output.html1_before # This single line will force the execution of FASTQC before read mapping output: sam = 'results/{sample}/{sample}_mapped_reads.sam', report = 'results/{sample}/{sample}_mapping_report.txt' params: index = 'resources/genome_indices/Scerevisiae_index' log: 'logs/{sample}/{sample}_mapping.log' benchmark: 'benchmarks/{sample}/{sample}_mapping.txt' resources: mem_gb = 2 threads: 4 shell: ''' echo \"Mapping the reads\" > {log} hisat2 --dta --fr --no-mixed --no-discordant --time --new-summary --no-unal \\ -x {params.index} --threads {threads} \\ -1 {input.trim1} -2 {input.trim2} -S {output.sam} --summary-file {output.report} 2>> {log} echo \"Mapped reads saved in <{output.sam}>\" >> {log} echo \"Mapping report saved in <{output.report}>\" >> {log} '''","title":"Use-case of the directory() command: the FastQC example"},{"location":"course_material/day2/4_decorating_workflow/#modularising-a-workflow","text":"If you keep developing a workflow long enough, you are bound to encounter some cluttering problems. Have a look at your current Snakefile: with only 5 rules, it is already almost 200 lines long. Imagine what happens when your workflow comprises dozens of rules?! The Snakefile may become messy and harder to maintain and edit. This is why it quickly becomes crucial to modularise your workflow; this is a common practice in programming in general. This approach also makes it easier to re-use pieces of workflow in the future. Modularisation comes at 4 different levels: The most fine-grained level are wrappers. Wrappers allow to quickly use popular tools and libraries in Snakemake workflows, thanks to the wrapper directive. Wrappers are automatically downloaded and deploy a conda environment when running the workflow, which increases reproducibility, however their implementation can sometimes be \u2018rigid\u2019 and you may have to write your own rule. 
See the official documentation for more explanations For larger, reusable parts belonging to the same workflow, it is recommended to write smaller snakefiles and include them into a main Snakefile with the include statement. Note that in this case, all rules share a common config file. See the official documentation for more explanations The next level of modularisation is provided via the module statement, which enables arbitrary combination and re-use of rules in the same workflow and between workflows. See the official documentation for more explanations Finally, Snakemake also provides a syntax to define subworkflows, but this syntax is currently being deprecated in favor of the module statement. See the official documentation for more explanations In this course, we will only use the 2 nd level of modularisation. In more details, the idea is to write a main Snakefile in workflow/Snakefile , to place the other snakefiles containing the rules in the subfolder workflow/rules (these \u2018sub-Snakefile\u2019 should end with .smk , the recommended file extension of Snakemake) and to tell Snakemake to import the modular snakefiles in the main Snakefile with the include: syntax. Rules organisation How to organize rules is up to you, but a common approach would be to create \u201cthematic\u201d modules, i.e. regroup rules involved in the same general step of the workflow. Exercise: Move your current Snakefile into the subfolder workflow/rules and rename it to read_mapping.smk . Then create a new Snakefile in workflow/ and import read_mapping.smk in it using the include syntax. You should also move the importation of the config file from the modular Snakefile to the main one. Answer We will solve this problem step by step. First, create the new file structure: mkdir workflow/rules # Create a new folder mv workflow/Snakefile workflow/rules/read_mapping.smk # Move and rename the modular snakefile touch workflow/Snakefile # Recreate the main Snakefile Then, fill the main Snakefile with include and configfile : ''' Main Snakefile of the RNAseq analysis workflow. This workflow can clean and map reads, and perform Differential Expression Analyses. ''' # Path of the config file configfile : 'config/config.yaml' # Rules to execute the workflow include : 'rules/read_mapping.smk' Finally, do not forget to remove the config file import ( configfile: 'config/config.yaml' ) from the snakefiles ( workflow/rules/read_mapping.smk ) Relative paths Includes are relative to the directory of the Snakefile in which they occur. For example, if the Snakefile resides in workflow , then Snakemake will search for the included snakefiles in workflow/path/to/other/snakefile , regardless of the working directory You can place snakefiles in a sub-directory without changing input and output paths, as these paths are relative to the working directory. However, you will need to edit paths to external scripts and conda environments, as these paths are relative to the snakefile from which they are called (this will be discussed in the last series of exercises) In practice, you can imagine that the line include: is replaced by the entire content of snakefile.smk in Snakefile . This means that syntaxes like rules..output. can still be used in snakefiles, even if the rule was defined in another snakefile, as long as the snakefile in which is defined is included before the snakefile that uses rules..output . 
This also works for input and output functions.","title":"Modularising a workflow"},{"location":"course_material/day2/4_decorating_workflow/#using-a-target-rule-and-aggregating-outputs","text":"","title":"Using a target rule and aggregating outputs"},{"location":"course_material/day2/4_decorating_workflow/#creating-a-target-rule","text":"Modularisation also offers a great opportunity to facilitate the execution of the workflow. By default, if no target is given at the command line, Snakemake executes the first rule in the Snakefile. Hence, we have always executed the workflow by specifying a target file in the command line to avoid this behaviour. But we can actually use this property to make the execution easier by writing a pseudo-rule (also called target-rule and usually named rule all ) in the Snakefile which has all the desired outputs (or a particular subsets of them) files as input files. This rule will look like this: rule all : input : 'path/to/ouput1' , 'path/to/ouput2' Order of rules in Snakefile/snakefiles Apart from Snakemake considering the first rule of the workflow as the default target, the order of rules in the Snakefile/snakefiles is arbitrary and does not influence the DAG of jobs. Exercise: Implement a special rule in the Snakefile so that the final output is generated by default when running snakemake without specifying a target, then test your workflow with a dry-run. Hint Remember that a rule is not required to have an output nor a shell command The inputs of rule all should be the final outputs that you want to generate (those from the last rule you wrote) Answer If we consider that the last outputs are the ones produced by rule reads_quantification_genes , we can write the target rule like this: # Master rule that launches the workflow rule all : ''' Dummy rule to automatically generate the required outputs. ''' input : 'results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv' Note that we used only one of the two outputs of rule reads_quantification_genes . We do this because it is enough to trigger the execution and if the rule didn\u2019t produce both outputs, Snakemake would crash and report it this error. Now, let\u2019s try to do a dry-run with this new rule: snakemake --cores 4 -F -r -p -n . You should see all the rules appearing thanks to the -F flag, including: localrule all: input: results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv jobid: 0 reason: Input files updated by another job: results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv resources: tmpdir = /tmp Job stats: job count -------------------------- ------- all 1 fastq_qc_sol4 1 fastq_trim 1 read_mapping 1 reads_quantification_genes 1 sam_to_bam 1 total 6","title":"Creating a target rule"},{"location":"course_material/day2/4_decorating_workflow/#aggregating-outputs","text":"Using a target rule like the one presented in the previous paragraph gives another opportunity to make things easier. In the rule we just created, we used a hard-coded input and by now, you should know that this is not an optimal solution and that we should avoid this as much as possible, especially if you have many samples to process. To solve this problem, we will rely on the expand function . Exercise: Write an expand() syntax to generate a list of outputs from rule reads_quantification_genes with all the RNAseq samples . What do you need to write this? 
Answer The output of rule reads_quantification_genes has the following syntax: 'results/{sample}/{sample}_genes_read_quantification.tsv' . First, we need to create a Python list containing all the values that the {sample} wildcards can take: SAMPLES = ['highCO2_sample1', 'highCO2_sample2', 'highCO2_sample3', 'lowCO2_sample1', 'lowCO2_sample2', 'lowCO2_sample3'] Then, we can transform the output syntax with expand() : expand('results/{sample}/{sample}_genes_read_quantification.tsv', sample=SAMPLES) Exercise: Use these two elements (the list of samples and the expand() syntax) in the target rule to ask Snakemake to generate all the outputs. Answer You need to add the sample list to the Snakefile before the rule all and replace the value of the input directive: # Sample list SAMPLES = [ 'highCO2_sample1' , 'highCO2_sample2' , 'highCO2_sample3' , 'lowCO2_sample1' , 'lowCO2_sample2' , 'lowCO2_sample3' ] # Master rule that launches the workflow rule all : ''' Dummy rule to automatically generate the required outputs. ''' input : expand ( 'results/ {sample} / {sample} _genes_read_quantification.tsv' , sample = SAMPLES ) If you launch the workflow in dry-run mode with this new rule: snakemake --cores 4 -F -r -p -n . You should see all the rules appearing 5 times (1 for each sample that hasn\u2019t been processed yet): localrule all: input: results/highCO2_sample1/highCO2_sample1_genes_read_quantification.tsv, results/highCO2_sample2/highCO2_sample2_genes_read_quantification.tsv, results/highCO2_sample3/highCO2_sample3_genes_read_quantification.tsv, results/lowCO2_sample1/lowCO2_sample1_genes_read_quantification.tsv, results/lowCO2_sample2/lowCO2_sample2_genes_read_quantification.tsv, results/lowCO2_sample3/lowCO2_sample3_genes_read_quantification.tsv jobid: 0 reason: Input files updated by another job: results/lowCO2_sample1/lowCO2_sample1_genes_read_quantification.tsv, results/lowCO2_sample2/lowCO2_sample2_genes_read_quantification.tsv, results/lowCO2_sample3/lowCO2_sample3_genes_read_quantification.tsv, results/highCO2_sample3/highCO2_sample3_genes_read_quantification.tsv, results/highCO2_sample2/highCO2_sample2_genes_read_quantification.tsv resources: tmpdir = /tmp Job stats: job count -------------------------- ------- all 1 fastq_qc_sol4 5 fastq_trim 5 read_mapping 5 reads_quantification_genes 5 sam_to_bam 5 total 26 But we can do even better! At the moment, samples are defined in a list at the top of the Snakefile. To further improve the workflow\u2019s usability, we can define samples in the config file, so they can easily be added, removed, or modified by the user. Exercise: Implement a parameter in the config file to specify sample names and modify rule all to use this parameter in the expand() syntax. Answer First, we need to modify the config file: # Configuration options of RNAseq-analysis workflow # Location of the genome indices index : 'resources/genome_indices/Scerevisiae_index' # Location of the annotation file annotations : 'resources/Scerevisiae.gtf' # Sample names samples : - highCO2_sample1 - highCO2_sample2 - highCO2_sample3 - lowCO2_sample1 - lowCO2_sample2 - lowCO2_sample3 Then, we need to use the config file in the expand() syntax (and remove SAMPLES from the Snakefile, because we don\u2019t need this variable anymore): # Master rule that launches the workflow rule all : ''' Dummy rule to automatically generate the required outputs. 
''' input : expand ( 'results/ {sample} / {sample} _genes_read_quantification.tsv' , sample = config [ 'samples' ]) Here, config['samples'] is a Python list containing strings, each string being a sample name. This is because a list of parameters become a list during the config file parsing. An even more Snakemake-idiomatic solution There is an even better and more Snakemake-idiomatic version of the expand() syntax: expand(rules.reads_quantification_genes.output.gene_level, sample=config['samples']) . While it may not seem easy to use and understand, this entirely removes the need to write the output paths!","title":"Aggregating outputs"},{"location":"course_material/day2/4_decorating_workflow/#running-the-other-samples-of-the-workflow","text":"Exercise: Touch the files already present in your workflow to avoid re-creating them and then run your workflow on the 5 other samples. Answer Touch the existing files: snakemake --cores 1 --touch Run the workflow snakemake --cores 4 -r -p Thanks to the parallelisation, the workflow execution should take less than 10 min in total to process all the samples! Exercise: Generate the workflow DAG and filegraph. Answer Generate the DAG: snakemake --cores 1 -F -r -p --rulegraph | dot -Tpng > images/all_samples_rulegraph.png Generate the filegraph: snakemake --cores 1 -F -r -p --filegraph | dot -Tpng > images/all_samples_filegraph.png Your DAG should resemble this: And this should be your filegraph:","title":"Running the other samples of the workflow"}]} \ No newline at end of file diff --git a/sitemap.xml b/sitemap.xml index 22fef6b..b99fb25 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -2,57 +2,57 @@ None - 2023-10-09 + 2023-10-10 daily None - 2023-10-09 + 2023-10-10 daily None - 2023-10-09 + 2023-10-10 daily None - 2023-10-09 + 2023-10-10 daily None - 2023-10-09 + 2023-10-10 daily None - 2023-10-09 + 2023-10-10 daily None - 2023-10-09 + 2023-10-10 daily None - 2023-10-09 + 2023-10-10 daily None - 2023-10-09 + 2023-10-10 daily None - 2023-10-09 + 2023-10-10 daily None - 2023-10-09 + 2023-10-10 daily \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 4b857c18c3579c82963ce02bbe63c07d4ad35ab1..21c6cf61be8c4fb7af1663d89129b671cd11110a 100644 GIT binary patch literal 203 zcmV;+05ty}iwFqH0VQPu|8r?{Wo=<_E_iKh0PU2|4#FS|#_v7_;Xb;}iyFpm9zE#; z5QZBGg9+&L?PY&r_6&xmN!zbq`n#3Z?_Q&qbY59g;ezBCNh55dOk3gG^_(4W&35$Z zw3h)zv+)(~LKsc}<2Vwpg6#R=iFGY_(Z#{TDnd?A=#>LA4d^VvVFQb=w8^ zp|=ey^KzdRdCX{8wT(@5+_duUkvGQkr(!4v#H;Y*9pnSB8