
Reproducible Research

Finlay Maguire edited this page Nov 4, 2020 · 3 revisions

The general principles of reproducible computational research are organised nicely in this concise 10 tips PLoS Computational Biology article: here

Similarly, papers such as this one are good examples of the gold standard in practice: here

Generally, if you have documentation (or a script) to retrieve all the raw data, a record of the dependency versions, and a workflow or notebook to regenerate all the paper figures, you are way ahead of the curve.
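As a rough illustration of the dependency-version part, the following minimal sketch records the interpreter version and the installed package versions using only the Python standard library. It assumes the analysis is run from a Python environment; the function name snapshot_environment is hypothetical, and in practice tools such as pip freeze or conda env export cover the same ground more thoroughly.

```python
import json
import platform
from importlib import metadata


def snapshot_environment():
    """Collect the Python version, platform string and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        version = meta.get("Version") if meta is not None else None
        if name and version:  # skip installs with incomplete metadata
            packages[name] = version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


snapshot = snapshot_environment()
# Serialise the snapshot; saving this text next to the analysis outputs
# gives a simple record of what the figures were generated with.
report = json.dumps(snapshot, indent=2)
```

Saving the resulting text alongside the figures is one simple way to keep a record of what they were generated with.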

Notebooks

Notebook approaches include Jupyter, Google Colab, and R Markdown notebooks.

Workflow Tools

There have been comparisons of the pros and cons of the different options, but generally Snakemake and Nextflow are the two workflow tools that seem to be growing fastest. That said, the Broad Institute pushes WDL, and CWL has been around for longer. There are also many niche workflow tools out there, such as Bpipe and Luigi.

  • Snakemake has the benefit of being Python-based, but it uses a "Makefile"-like mental model where you have to work back from the outputs you want and write rules to generate all the files you need. The best place to start understanding this, and Snakemake itself, is the provided tutorial. You can also find existing high-quality Snakemake workflows to use as a model.

  • Nextflow uses Groovy (a JVM-based language), so its syntax is a little more alien for people without experience outside Python/R. However, its mental model works forward from the inputs, much like the standard POSIX pipe syntax (i.e. the | operator in cut -f 1 a.tsv | sort). This is a little more intuitive for more complex operations like clustering, where the number of output files is data-dependent. Nextflow is also a bit more powerful and makes it harder to shoot yourself in the foot by overwriting your own files. As ever, the documentation and tutorial examples are a great place to start. Nextflow also curates a set of high-quality, validated workflows as nf-core, which act as great examples.
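The difference between the two mental models can be sketched in plain Python. This is a toy illustration only: it uses neither the Snakemake nor the Nextflow API, the file names (genome.fna, orfs.faa, db.faa, hits.tsv) are hypothetical, and each "step" just builds a string describing the command it would run.

```python
# "Pull" model (Makefile/Snakemake-like): each rule names an output and the
# inputs it needs. We ask for the final target and work backwards.
RULES = {
    "hits.tsv": (["orfs.faa", "db.faa"], lambda orfs, db: f"search({orfs}, {db})"),
    "orfs.faa": (["genome.fna"], lambda genome: f"predict_orfs({genome})"),
}


def build(target, sources):
    """Resolve a target by recursively building whatever it depends on."""
    if target in sources:  # a raw input: nothing to build
        return sources[target]
    inputs, action = RULES[target]
    return action(*(build(name, sources) for name in inputs))


# "Push" model (pipe/Nextflow-like): start from the inputs and send each
# item forward through the steps, one after the other.
def predict_orfs(genomes):
    for genome in genomes:
        yield f"predict_orfs({genome})"


def search(orf_sets, db):
    for orfs in orf_sets:
        yield f"search({orfs}, {db})"


sources = {"genome.fna": "genome.fna", "db.faa": "db.faa"}
pulled = build("hits.tsv", sources)                             # ask for the output
pushed = list(search(predict_orfs(["genome.fna"]), "db.faa"))   # feed in the inputs
```

Both routes describe the same two steps. In the pull version the number and names of the outputs must be known up front in order to request them, whereas the push version simply emits however many items flow through, which is why data-dependent outputs tend to feel more natural in the forward model.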

To give a concrete example of these tools side by side, here is a simple workflow, implemented in each, that runs ORF prediction on a set of bacterial genome assemblies and searches the predicted ORFs against a user-supplied database.