Commit

Deployed ce8002a to 2023.10 with MkDocs 1.4.1 and mike 1.1.2
Geert van Geest committed Oct 3, 2023
1 parent 9646393 commit 4f39cdf
Showing 5 changed files with 48 additions and 64 deletions.
60 changes: 32 additions & 28 deletions 2023.10/course_material/day2/introduction_snakemake/index.html
@@ -464,15 +464,15 @@
</li>

<li class="md-nav__item">
<a href="#using-the-input-directive" class="md-nav__link">
Using the input directive
<a href="#adding-an-input-directive" class="md-nav__link">
Adding an input directive
</a>

</li>

<li class="md-nav__item">
<a href="#using-several-rules-in-a-workflow" class="md-nav__link">
Using several rules in a workflow
<a href="#creating-a-workflow-with-several-rules" class="md-nav__link">
Creating a workflow with several rules
</a>

</li>
@@ -592,15 +592,15 @@
</li>

<li class="md-nav__item">
<a href="#using-the-input-directive" class="md-nav__link">
Using the input directive
<a href="#adding-an-input-directive" class="md-nav__link">
Adding an input directive
</a>

</li>

<li class="md-nav__item">
<a href="#using-several-rules-in-a-workflow" class="md-nav__link">
Using several rules in a workflow
<a href="#creating-a-workflow-with-several-rules" class="md-nav__link">
Creating a workflow with several rules
</a>

</li>
@@ -680,12 +680,13 @@ <h2 id="exercises">Exercises</h2>
<p>If you try to run a command and get an error such as <code>Command 'snakemake' not found</code>, you are probably not in the right environment. To list the available environments, use <code>mamba env list</code>. Then activate the right environment with <code>mamba activate &lt;env_name&gt;</code>. You can deactivate an environment with <code>mamba deactivate</code>. To list the packages installed in an environment, activate it and use <code>mamba list</code>.</p>
</div>
<h3 id="workflow-structure">Workflow structure</h3>
<p>It is strongly advised to implement your answers in a directory called <code>workflow</code> (the reason for this will be explained later). You are free to chose the names and location of files for the different steps of your workflow, but we recommend that you at least group all outputs from the workflow in a <code>results</code> directory within the <code>workflow</code> directory.</p>
<p>It is advised to implement your answers in a directory called <code>workflow</code> (more on this later). You are free to choose the names and location of files for the different steps of your workflow, but, for now, we recommend that you at least group all outputs from the workflow in a <code>results</code> directory within the <code>workflow</code> directory.</p>
<h3 id="creating-a-basic-rule">Creating a basic rule</h3>
<p>Rules are the basic blocks of a Snakemake workflow. A <strong>rule</strong> is like a recipe indicating how to produce a specific <strong>output</strong>; the actual application of a rule to create an output is called a <strong>job</strong>. A rule is defined in a Snakefile with the <em>keyword</em> <code>rule</code>, and contains <em>directives</em> which indicate the rule&rsquo;s properties. We will learn about other directives later in the course.</p>
<p>To create the simplest rule possible, we need at least two <em>directives</em>:
<p>Rules are the basic blocks of a Snakemake workflow. A <strong>rule</strong> is like a recipe indicating how to produce a specific <strong>output</strong>. The actual application of a rule to create an output is called a <strong>job</strong>. A rule is defined in a Snakefile with the <em>keyword</em> <code>rule</code>, and contains <em>directives</em> which indicate the rule&rsquo;s properties.</p>
<p>To create the simplest rule possible, you need at least two <em>directives</em>:
- <code>output</code>: path of the output file for this rule
- <code>shell</code>: shell commands to execute in order to generate the output</p>
<p>You will see other directives later in the course.</p>
<p><strong>Exercise:</strong> The following example shows the minimal syntax to implement a rule. What do you think it does? Does it create a file? If so, what is it called?</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">first_step</span><span class="p">:</span>
<span class="n">output</span><span class="p">:</span>
@@ -700,12 +701,12 @@ <h3 id="creating-a-basic-rule">Creating a basic rule</h3>
<p>Rules are defined and written in a file called <strong>Snakefile</strong> (note the capital <code>S</code> and the absence of extension in the filename). This file should be located at the root of the workflow directory (here, <code>workflow/Snakefile</code>).</p>
<div class="admonition note">
<p class="admonition-title">Paths in Snakemake</p>
<p>All the paths in Snakefile are relative to the directory containing the Snakefile.</p>
<p>All the paths in the Snakefile are relative to the directory containing the Snakefile.</p>
</div>
<p><strong>Exercise:</strong> Create a Snakefile and copy the rule into it. Because the Snakemake language is basically Python, <em>do not forget to keep the indentation as is and use space characters in the indents instead of tabs.</em></p>
<p><strong>Exercise:</strong> Create a Snakefile and copy the previous rule into it. Because the Snakemake language is built on top of Python, spaces and indents are essential, so <em>do not forget to keep the indentation as is and use space characters in the indents instead of tabs.</em></p>
<h3 id="executing-a-workflow-with-a-precise-output">Executing a workflow with a precise output</h3>
<p>It is now time to execute your first workflow! To do this, you need to tell Snakemake what your target is, <em>i.e.</em> which output you want to generate.</p>
<p><strong>Exercise:</strong> Execute the workflow with <code>snakemake --cores 1 &lt;target&gt;</code>. What value should you use for &lt;target&gt;? Once Snakemake execution is finished, can you locate the output file?</p>
<p><strong>Exercise:</strong> Execute the workflow with <code>snakemake --cores 1 &lt;target&gt;</code>. What value should you use for <code>&lt;target&gt;</code>? Once Snakemake execution is finished, can you locate the output file?</p>
<details class="done">
<summary>Answer</summary>
<p>Execute the workflow: <code>snakemake --cores 1 results/first_step.txt</code>
@@ -732,16 +733,16 @@ <h3 id="executing-a-workflow-with-a-precise-output">Executing a workflow with a
You can change this behaviour and force the re-run of a specific target by using the `-f` option: `snakemake --cores 1 -f results/first_step.txt` or force recreate ALL the outputs of the workflow using the `-F` option: `snakemake --cores 1 -F`. In practice, you can also alter Snakemake re-run policy, but we will not cover this topic in the course (see [--rerun-triggers option](https://snakemake.readthedocs.io/en/stable/executing/cli.html) in Snakemake&#39;s CLI help and [this git issue](https://github.com/snakemake/snakemake/issues/1694) for more information).
</code></pre></div>

<p>In the previous example, values for these two directives are <strong>strings</strong>. For the <code>shell</code> directive (we will see other types of directive values later in the course), long strings can be written on multiple lines for clarity, simply using a set of quotes for each line:</p>
<p>In the previous example, the values of the two rule directives are <strong>strings</strong>. For the <code>shell</code> directive (we will see other types of directive values later in the course), long strings can be written on multiple lines for clarity, simply using a set of quotes for each line:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">first_step</span><span class="p">:</span>
<span class="n">output</span><span class="p">:</span>
<span class="s1">&#39;results/first_step.txt&#39;</span>
<span class="n">shell</span><span class="p">:</span>
<span class="s1">&#39;echo &quot;I want to print a very very very very very very &#39;</span>
<span class="s1">&#39;very very very very long string in my output&quot; &gt; results/first_step.txt&#39;</span>
</code></pre></div>
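<p>The multi-line form works because Python joins adjacent string literals into a single string. The snippet below is a plain-Python sketch of that behaviour, outside of any Snakefile; the variable name <code>shell_command</code> is only illustrative and is not part of Snakemake.</p>

```python
# Adjacent string literals are concatenated by Python itself.
# Note the trailing space inside the first literal: without it,
# the words on either side of the line break would be glued together.
shell_command = (
    'echo "I want to print a very very very very very very '
    'very very very very long string in my output" > results/first_step.txt'
)

print(shell_command)
```

<p>Printing the variable shows one continuous command, which is what the <code>shell</code> directive receives when the string is split over several quoted lines.</p>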
<h3 id="using-the-input-directive">Using the input directive</h3>
<p>The next directive used by most rules is <code>input</code>. Like <code>output</code>, <code>input</code> indicates the path to a file that is required by the rule to generate the output. In the following example, we modified the previous rule to use the previously created file <code>results/first_step.txt</code> as an input and copy this file to <code>results/second_step.txt</code>:</p>
<h3 id="adding-an-input-directive">Adding an input directive</h3>
<p>The next directive used by most rules is <code>input</code>. Like <code>output</code>, <code>input</code> indicates the path to a file that is required by the rule to generate the output. In the following example, we modified the previous rule to use the previously created file <code>results/first_step.txt</code> as an input, and copy this file to <code>results/second_step.txt</code>:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">second_step</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s1">&#39;results/first_step.txt&#39;</span>
@@ -751,29 +752,30 @@ <h3 id="using-the-input-directive">Using the input directive</h3>
<span class="s1">&#39;cp results/first_step.txt results/second_step.txt&#39;</span>
</code></pre></div>
<p>Note that with this rule definition, Snakemake <strong>will not run</strong> if <code>results/first_step.txt</code> does not exist!</p>
<p><strong>Exercise:</strong> Modify your first rule to add an input and execute the workflow. Check that the output was created and that the files are identical.</p>
<p><strong>Exercise:</strong> Modify your first rule to add an input directive and execute the workflow. Check that the output was created and that the files are identical.</p>
<details class="done">
<summary>Answer</summary>
<p>Execute the workflow: <code>snakemake --cores 1 results/second_step.txt</code>
Visualise your directory content: <code>ls -alh results/</code>
Check that the files are identical: <code>diff results/first_step.txt results/second_step.txt</code></p>
</details>
<h3 id="using-several-rules-in-a-workflow">Using several rules in a workflow</h3>
<h3 id="creating-a-workflow-with-several-rules">Creating a workflow with several rules</h3>
<p>Creating one Snakefile per rule does not seem like a good solution, so let&rsquo;s try to improve this.</p>
<p><strong>Exercise:</strong> Delete the <code>results/</code> folder, gather the two previous rules in the same Snakefile (place the <code>first_step</code> rule first) and try to run the workflow <strong>without specifying an output</strong>. What happens?</p>
<p><strong>Exercise:</strong> Delete the <code>results/</code> folder, copy the two previous rules (<code>first_step</code> and <code>second_step</code>) in the same Snakefile (place the <code>first_step</code> rule first) and try to run the workflow <strong>without specifying an output</strong>. What happens?</p>
<details class="done">
<summary>Answer</summary>
<p>Execute the workflow without outputs: <code>snakemake --cores 1</code>.
When executed, Snakemake tries to generate a specific output called <strong>target</strong>, and resolves all dependencies based on this target. A target can be any output that can be generated by any rule in the workflow. When you do not specify a target, the default one is the output of the first rule in the Snakefile, here <code>results/first_step.txt</code>. If you had placed the <code>second_step</code> rule in first position, Snakemake would have crashed because the input for this rule does not exist. If you have enough time, feel free to try it!</p>
<p>Execute the workflow without output: <code>snakemake --cores 1</code>.
Only the first output, <code>results/first_step.txt</code>, is created. During its execution, Snakemake tries to generate a specific output called <strong>target</strong>, and resolves all dependencies based on this target. A target can be any output that can be generated by any rule in the workflow. When you do not specify a target, the default one is the output of the first rule in the Snakefile, here <code>results/first_step.txt</code> of <code>rule first_step</code>. If you had placed the <code>second_step</code> rule in first position, Snakemake would have crashed because the input for this rule does not exist. If you have enough time, feel free to try it!</p>
</details>
<p><strong>Exercise:</strong> With this in mind, use a space-separated list of targets (instead of one filename) in your command to generate multiple targets. Use the <code>-F</code> option to force the re run of the whole workflow or delete your <code>results/</code> folder beforehand.</p>
<p><strong>Exercise:</strong> With this in mind, instead of one target, use a space-separated list of targets in your command, to generate multiple targets. Use the <code>-F</code> option to force the re-run of the whole workflow or delete your <code>results/</code> folder beforehand.</p>
<details class="done">
<summary>Answer</summary>
<p>Execute the workflow with multiple targets: <code>snakemake --cores 1 -F results/first_step.txt results/second_step.txt</code>
<p>Delete the results folder: <code>rm -rf results/</code>
Execute the workflow with multiple targets: <code>snakemake --cores 1 results/first_step.txt results/second_step.txt</code>
You should now see Snakemake execute the 2 rules and produce both targets/outputs.</p>
</details>
<h3 id="chaining-rules">Chaining rules</h3>
<p>Once again, writing all the outputs in the <code>snakemake</code> command does not look like a good solution: it is very time-consuming, error-prone and annoying! Imagine what happens when your workflow generates tens of outputs?! Fortunately, there is a way to simplify this, which relies on rules dependency. The core principle of Snakemake&rsquo;s execution is to compute a Directed Acyclic Graph (DAG) that summarizes dependencies between all inputs and outputs required to generate the final desired output. For each job, starting from the jobs generating the final output, Snakemake checks if the required inputs exist. If they do not, the software looks for a rule that generates the input. This process is repeated until all dependencies are resolved. This is why Snakemake is said to have a &lsquo;bottom-up&rsquo; approach: it starts from the last outputs and goes back to the first inputs.</p>
<p>Once again, writing all the outputs in the <code>snakemake</code> command does not look like a good solution: it is very time-consuming, error-prone (and annoying)! Imagine what happens when your workflow generates tens of outputs?! Fortunately, there is a way to simplify this, which relies on rules dependency. The core principle of Snakemake&rsquo;s execution is to compute a Directed Acyclic Graph (DAG) that summarizes dependencies between all inputs and outputs required to generate the final desired output. For each job, starting from the jobs generating the final output, Snakemake checks if the required inputs exist. If they do not, the software looks for a rule that generates the input. This process is repeated until all dependencies are resolved. This is why Snakemake is said to have a &lsquo;bottom-up&rsquo; approach: it starts from the last outputs and goes back to the first inputs.</p>
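<p>The bottom-up resolution described above can be sketched in a few lines of plain Python. This is a toy model, not Snakemake&rsquo;s actual implementation: the <code>RULES</code> dictionary and the <code>plan_jobs</code> function are hypothetical names made up for illustration, and the sketch only checks whether a file exists (the real tool also considers timestamps and other re-run triggers).</p>

```python
# Toy description of the two rules used on this page.
# (Hypothetical data structure, not Snakemake's internal model.)
RULES = {
    "first_step": {"input": [], "output": ["results/first_step.txt"]},
    "second_step": {
        "input": ["results/first_step.txt"],
        "output": ["results/second_step.txt"],
    },
}


def plan_jobs(target, rules, existing_files):
    """Return the rule names, in execution order, needed to produce `target`."""
    if target in existing_files:
        return []  # the file is already there: nothing to run
    # Look for a rule whose output matches the missing file.
    producers = [name for name, rule in rules.items() if target in rule["output"]]
    if not producers:
        raise FileNotFoundError(f"No rule produces missing file: {target}")
    rule_name = producers[0]
    jobs = []
    # Resolve the inputs of that rule first (going back towards the first inputs).
    for required in rules[rule_name]["input"]:
        for job in plan_jobs(required, rules, existing_files):
            if job not in jobs:
                jobs.append(job)
    jobs.append(rule_name)
    return jobs


# Starting from the final output with an empty results/ folder:
print(plan_jobs("results/second_step.txt", RULES, existing_files=set()))
# → ['first_step', 'second_step']
```

<p>If <code>results/first_step.txt</code> is passed in <code>existing_files</code>, the sketch returns only <code>second_step</code>, mirroring the idea that a rule is looked up only when a required input is missing.</p>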
<div class="admonition hint">
<p class="admonition-title">Hint</p>
<p>Your Snakefile should look like this:</p>
@@ -795,13 +797,14 @@ <h3 id="chaining-rules">Chaining rules</h3>
<p><strong>Exercise:</strong> Delete the <code>results/</code> folder, identify your final output and execute the workflow <strong>specifying only this output</strong> in the command.</p>
<details class="done">
<summary>Answer</summary>
<p>Execute the workflow: <code>snakemake --cores 1 results/second_step.txt</code>
<p>Delete the results folder: <code>rm -rf results/</code>
Execute the workflow: <code>snakemake --cores 1 results/second_step.txt</code>
Visualise your directory content: <code>ls -alh results/</code>
You should now see Snakemake execute the 2 rules and produce both outputs. To generate the output <code>results/second_step.txt</code>, Snakemake requires the input <code>results/first_step.txt</code>. Before the workflow is executed, this file does not exist, therefore, Snakemake looks for a rule that generates <code>results/first_step.txt</code>, in this case the first defined rule <code>first_step</code>. The process is then repeated for <code>first_step</code>. In this case, the rule does not require an input, so all dependencies are resolved, and Snakemake can generate the DAG.</p>
You should now see Snakemake execute the 2 rules and produce both outputs. To generate the output <code>results/second_step.txt</code>, Snakemake requires the input <code>results/first_step.txt</code>. Before the workflow is executed, this file does not exist, therefore, Snakemake looks for a rule that generates <code>results/first_step.txt</code>, in this case the rule <code>first_step</code>. The process is then repeated for <code>first_step</code>. In this case, the rule does not require an input, so all dependencies are resolved, and Snakemake can generate the DAG.</p>
</details>
<h3 id="important-notes-on-chaining-rules">Important notes on chaining rules</h3>
<h4 id="rules-produce-unique-outputs">Rules produce unique outputs</h4>
<p>Because of the rules dependency process, by default, an output can only be generated by a single rule. Otherwise, Snakemake cannot decide which rule to use to generate this output, and the rules are considered <strong>ambiguous</strong>. In practice, there are ways to deal with ambiguous rules, but we will not cover them in this course (see <a href="https://snakemake.readthedocs.io/en/v7.32.3/snakefiles/rules.html#handling-ambiguous-rules">the relevant section in the official documentation</a>).</p>
<p>Because of the rules dependency process, by default, an output can only be generated by a single rule. Otherwise, Snakemake cannot decide which rule to use to generate this output, and the rules are considered <strong>ambiguous</strong>. In practice, there are ways to deal with ambiguous rules, but we will not cover them in this course (see <a href="https://snakemake.readthedocs.io/en/v7.32.3/snakefiles/rules.html#handling-ambiguous-rules">the relevant section in the official documentation</a> for more information).</p>
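<p>To make the notion of ambiguity concrete, here is a small plain-Python sketch that reports every output claimed by more than one rule. It is only an illustration under assumptions: the function <code>find_ambiguous_outputs</code>, the dictionary layout and the extra rule <code>copy_step</code> are hypothetical and do not come from Snakemake or from the course workflow.</p>

```python
from collections import defaultdict


def find_ambiguous_outputs(rules):
    """Map each output path to its producing rules, keeping only paths with several producers."""
    producers = defaultdict(list)
    for name, rule in rules.items():
        for path in rule["output"]:
            producers[path].append(name)
    return {path: names for path, names in producers.items() if len(names) > 1}


# Hypothetical workflow: `copy_step` is invented here to create a conflict.
rules = {
    "first_step": {"output": ["results/first_step.txt"]},
    "second_step": {"output": ["results/second_step.txt"]},
    "copy_step": {"output": ["results/second_step.txt"]},
}

print(find_ambiguous_outputs(rules))
# → {'results/second_step.txt': ['second_step', 'copy_step']}
```

<p>In this toy example, two rules could produce <code>results/second_step.txt</code>, so there is no single obvious rule to pick for that target.</p>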
<h4 id="rules-dependency-can-be-written-more-easily">Rules dependency can be written more easily</h4>
<p>It is possible to refer to the output of a rule directly in another rule with the syntax <code>rules.&lt;rule_name&gt;.output</code>. Note that you don&rsquo;t need quotes around this statement, because it is a Snakemake object. The following example implements this syntax for the two rules defined above:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">first_step</span><span class="p">:</span>
@@ -822,6 +825,7 @@ <h4 id="rules-dependency-can-be-written-more-easily">Rules dependency can be wri
* It limits the risk of error because you do not have to write the same filename at several locations
* A change in output name will be automatically propagated to rules that depend on it, <em>i.e.</em> the name only has to be changed once
* This makes the code much clearer and easier to understand: with this syntax, you instantly know the object type (<code>rule</code>), how it is created (<code>first_step</code>) and what it is (<code>output</code>)</p>
<p>For the next sessions of exercises, try to use this syntax as much as possible.</p>


