If you're tired of setting up the same directory and file structure for your new scientific projects again and again, then this might be for you ;-)
This repository provides a "template" of a directory structure for small to medium-sized scientific projects, making use of CookieCutter, a templating engine for project structures. Check out the links at the bottom of the page to create your own CookieCutter or use this one to start your project. Also, feel free to fork the repository and adjust it to your own needs.
- What is it good for? or How this can boost your productivity
- About this template
- Requirements
- Usage
- Project Structure
- Dummy files
- Sources of inspiration
- Contributing
CookieCutter is a templating engine for creating directory structures including pre-defined files based on a set of questions asked during the setup.
By running `cookiecutter` with this repository, a new directory will be created with a pre-defined structure and some default files, making you all set to start a new scientific Python project without having to manually create the same files & structure over and over again.
This includes
- code that is importable from every place in your environment
- automatically resolved paths to the project's root and the directories for data, plots, logs, etc. (see the sketch after this list)
- `make` commands to run automated unit tests, create documentation of your code, etc.
- creating a nice HTML representation of your Jupyter notebooks and of your docstrings
- and so on...
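To illustrate the first two points, usage might look roughly like the following. This is a hypothetical sketch; the actual variable names are defined in the `src/__init__.py` of your generated project and may differ:

# Hypothetical sketch -- check src/__init__.py of your generated project for the actual names
from src import BASE_DIR, DATA_DIR    # assumed names for the auto-resolved project paths

raw_file = DATA_DIR / "raw" / "my-measurements.csv"   # hypothetical data file
print(f"Project root: {BASE_DIR}")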
This template provides a boilerplate for small to medium-sized (scientific) data projects, e.g. a thesis, a group project, or similar. For an overview of the directory & file structure, have a look at the section further below. The redundant parts (mainly for demonstration purposes) are few and can be safely removed.
Note
Once set up, a Git repository is automatically initialized. If you want to connect it with a remote repository on GitHub (or any other hosted git service) you need to add the respective remote repository to your local repository.
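For example (the URL is a placeholder; use the address of your own remote repository):
$ git remote add origin [email protected]:<your-username>/<your-project-name>.git
$ git push -u origin main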
You need to have Python installed, as well as the Python package `cookiecutter`.
You can do this either via pip or conda:
$ pip install -U cookiecutter cruft
$ conda install -c conda-forge cookiecutter cruft
`cruft` is optional. It is a tool that helps you update your project to the latest version of the template. More details can be found at https://cruft.github.io/cruft/.
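If you want to benefit from this, create your project through cruft instead of calling cookiecutter directly; later, a single command (run inside your project) pulls in changes from the template:
$ cruft create https://github.com/markusritschel/cookiecutter-pysci-project
$ cruft update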
Besides that, there is no need to clone or download anything from this repository. Just follow the next step :-)
Tip
I recommend installing Mamba as a package manager. It is built on `conda` but offers much better performance.
If you plan to use Git as a version control system, ensure that you have it installed on your machine and that you specified the global Git configuration settings (this needs to be done only once):
$ git config --global user.name "John Doe"
$ git config --global user.email [email protected]
With `cookiecutter` installed, create a new project from this template by executing one of the following commands:
$ cookiecutter https://github.com/markusritschel/cookiecutter-pysci-project.git
$ cookiecutter gh:markusritschel/cookiecutter-pysci-project
$ cookiecutter git+https://github.com/markusritschel/cookiecutter-pysci-project
$ cookiecutter git+ssh://[email protected]/markusritschel/cookiecutter-pysci-project.git
Follow the instructions.
You will be asked some questions based on the entries in `cookiecutter.json`.
Once done, you have a fresh directory structure, populated with the information you just provided to CookieCutter.
You should now create a new, dedicated virtual environment, either with `conda` (for better performance, use `mamba`), `pipenv`, or something similar.
For a `conda` environment, you can simply execute
make conda-env
This should create a conda environment named after your project. Alternatively, create your environment manually.
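For the manual route, one option is to create the environment from the `environment.yml` shipped with the template (the environment name placeholder below is hypothetical; use the name defined in that file):
$ conda env create -f environment.yml
$ conda activate <your-project-name>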
Now install the required packages via
make install-requirements
Finally, to make the source code in `src` available for import, execute
make src-available
You are now all set to start with your new project.
Tip
For further information, see also the README of your new project.
You may also want to check out the Makefile commands (simply type `make` to get an overview of the available commands).
If you don't have Make available on your machine (this is often a problem, especially on Windows), check the contents of the Makefile for the respective commands, all of which you can also execute in their "raw" form.
For example, instead of `make src-available` you can also execute `pip install -e .`.
The Makefile in the project directory provides some default routines like cleanup, testing, installing requirements, etc.
Even though using `make` may seem a bit old-fashioned to many people, I recommend having a look at Make's great capability of dealing with dependencies.
This is particularly useful if, for example, the first step in your data-processing pipeline takes a long time to process your raw data and generate the interim product.
I usually structure my data-processing workflow such that I can run a single processing step via the command line (for example `python scripts/process-raw-data.py -o ./output_dir`).
(The Python packages click, fire, and docopt provide neat functionality for converting your scripts into command-line interfaces.)
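As an illustration, a minimal click-based version of such a script might look like this (a sketch; the option and default path are hypothetical):

# scripts/process-raw-data.py -- minimal sketch of a click-based CLI
import click

@click.command()
@click.option("-o", "--output-dir", default="./data/interim", help="Where to write the processed data.")
def main(output_dir):
    """Process the raw data and write the results to the given output directory."""
    click.echo(f"Writing processed data to {output_dir}")  # replace with your actual processing logic

if __name__ == "__main__":
    main()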
I can then set these commands as targets in the Makefile, for example:
## Process raw data and write the newly generated data into ./data/interim/
process_raw_data:
	python scripts/process-raw-data.py
I can now simply run `make process_raw_data` in the project's root directory.
Let's assume that the previous step (processing the raw data) generates new data inside `./data/interim/`. If I now have a second processing step that depends on the data generated by the previous step, I can set these data as dependencies for the new rule:
## Process interim data
process_interim_data: $(wildcard data/interim/**/*)
	python scripts/process-interim-data.py
This way, the second step is only executed if the data it depends on has changed. (Strictly speaking, Make compares timestamps between the target and its prerequisites, so this works best when the target is an actual output file; a target like `process_interim_data` that never exists on disk will be rebuilt every time.)
For further information, have a look at Make's documentation: https://www.gnu.org/software/make/manual/html_node/Rules.html
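If you want Make's timestamp logic to kick in, you can let the rule produce an actual file and use that file as the target. A sketch (the output filename and the `-o` option are hypothetical placeholders):

## Produce the processed data only if the interim data changed
data/processed/result.nc: $(shell find data/interim -type f)
	python scripts/process-interim-data.py -o data/processed/result.nc

Here, `$(shell find ...)` lists all files recursively, since GNU Make's `$(wildcard ...)` does not expand `**` across directory levels.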
Going one step further, in addition to or as an alternative to `make`, Snakemake provides even more extensive functionality.
Snakemake is pure Python, which makes it very convenient to work with and gives you all the functionality of Python in your workflows.
In Snakemake, you don't define dependencies via an "artificial" target; instead, you indicate the target file you want to create, and Snakemake takes care of producing the required dependencies (see the sketch below).
Another strength of Snakemake is that it is easily scalable:
Porting your Snakemake workflow from your local machine to a High-Performance Computing system is as straightforward as adding a few extra parameters to the executed command.
This way, Snakemake automatically generates bash scripts and submits them as jobs on the HPC system, distributing the tasks of the workflow.
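A minimal sketch of such a rule in a Snakefile (the file names are hypothetical):

rule process_interim_data:
    input:
        "data/interim/measurements.csv"
    output:
        "data/processed/measurements_clean.csv"
    shell:
        "python scripts/process-interim-data.py {input} {output}"

Running `snakemake -c1 data/processed/measurements_clean.csv` then triggers the rule only if the output is missing or older than its input.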
The documentation of your project should include two kinds of documentation:
- A documentation of your code, i.e. what it does, how to use it, and where to find the various functions (similar to what you would expect when opening the online documentation of a Python package). This should be considered best practice and always be shipped with your code. Most of this information you can and should put in the docstrings of your functions (you can even integrate short tests as part of the examples). The rest can be written in normal Markdown or even Jupyter notebooks (see also the section below). You can also find some examples in the documentation shipped with this CookieCutter template.
- A documentation of your project, i.e. a work log of what you did and why, what results you got, etc.
I personally recommend using Jupyter-Book (or Sphinx) for the project documentation; both are particularly suited for documenting Python code. Sphinx is super feature-rich but also has a steep learning curve. Jupyter-Book, on the other hand, is based on Sphinx and can therefore make use of all the complex features Sphinx provides, while being much easier to learn. This is not least because you can write your content in Markdown or even integrate your Jupyter notebooks. Also, with Sphinx's autoapi extension (which can also be integrated in Jupyter-Book), you can parse the docstrings of your code and render them as nice HTML output.
Tip
Describe as much of your code as possible in the docstrings of your functions, classes, and modules. As far as possible, things should be described in one place only and referenced at all other occurrences.
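For example, a docstring with a short doctest-style example might look like this (the function is purely illustrative):

def celsius_to_kelvin(temp_c):
    """Convert a temperature from degrees Celsius to Kelvin.

    Examples
    --------
    >>> celsius_to_kelvin(25.0)
    298.15
    """
    return temp_c + 273.15

Such examples can be run as tests, e.g. with `pytest --doctest-modules` or Python's built-in doctest module.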
Side note
Jupyter-Book can also be a nice way to share your collection of Jupyter notebooks online. There are plugins that allow people to comment on the rendered HTML representation of your notebooks.
In addition, you might want to write a scientific report or paper emerging from your project. The standard for this is LaTeX. However, there are also alternatives that integrate well with LaTeX and allow you to turn your documentation into either an interactive website or a static PDF. One such alternative is MyST-MD, which is more focused on scientific publications, whereas Jupyter-Book is more suited for technical documentation that includes Jupyter notebooks. To a certain extent, both are quite similar, though.
Important
Write your docstrings and code-related documentation alongside your code, i.e. in the same git branch in which you develop your code. For the scientific report/paper, consider writing in a dedicated branch.
Tip
Have a look at the "Guides to Better Science" from the British Ecological Society (see sources at the bottom).
Publish your documentation with GitHub Pages
GitHub allows you to host static websites on its platform.
In this project template, I integrated a workflow for the automatic deployment of your documentation (see .github/workflows/main.yml).
The only thing you need to prepare: go to your repository's settings on GitHub, open "Pages", and select "GitHub Actions" as the source. Save your changes. That's it.
Now, whenever you push something to the main branch, your documentation in `docs` will be automatically compiled and deployed. The result will be available at https://<your-github-username>.github.io/<your-project-name>. Magic...
Note
Keep in mind that the deployment may take a while. You can check the status of the workflow by clicking on "Actions" in the menu bar of your repository.
All high-level code (i.e. the code that the user directly interacts with) should reside in the `scripts/` and `notebooks/` directories.
High-level code is, for example, code that produces a figure, a report, or similar.
Name scripts and notebooks in a way that indicates their order of execution (examples can be found in the respective directories).
It is also recommended to have one script for each task, e.g. the creation of one figure or one table.
Code residing in `src/` is exclusively source code or low-level code; it is not meant to be run directly but rather used in your scripts and notebooks.
⚠️ A note on version controlling Jupyter notebooks
It is very ugly to keep Jupyter notebooks under version control, as they are in essence one very large JSON file containing lots of metadata, the output of your cells, etc. This also makes it quite hard to collaborate on them. However, there's help: Jupytext syncs your Jupyter notebooks with another file, for which you can choose from a variety of formats (e.g. Markdown, R Markdown, plain Python, etc.). These "paired" files, which can reside either alongside your Jupyter notebooks or in a separate directory, can then be easily version-controlled. Jupytext can be used either from the command line (`jupytext --sync notebooks/*.ipynb`) or as a Jupyter plugin. For more information, visit https://jupytext.readthedocs.io/.
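If you want pairing to happen automatically, one option is a jupytext.toml at the project root. A sketch, assuming you want the paired percent-format scripts in `notebooks/_paired` (check the Jupytext documentation for the exact pairing syntax and adjust the formats to your needs):

# jupytext.toml -- pair notebooks in notebooks/ with scripts in notebooks/_paired/
formats = "notebooks///ipynb,notebooks/_paired///py:percent"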
├── assets            <- A place for assets like shapefiles or config files
│
├── data              <- Contains all data used for the analyses in this project.
│   │                    The sub-directories can be links to the actual location of your data.
│   │                    However, they should never be under version control! (-> .gitignore)
│   ├── interim       <- Intermediate data that have been transformed from the raw data
│   ├── processed     <- The final, processed data used for the actual analyses
│   └── raw           <- The original, immutable(!) data
│
├── docs              <- The technical documentation (default engine: Sphinx; but feel free to
│                        use MkDocs, Jupyter-Book or anything similar).
│                        This should contain only documentation of the code and the assets.
│                        A report of the actual project should be placed in `reports/book`.
│
├── logs              <- Storage location for the log files being generated by scripts
│
├── notebooks         <- Jupyter notebooks. Follow a naming convention, such as a number (for
│   │                    ordering) and a short `-` or `_` delimited description, e.g. `01-initial-analyses`
│   ├── _paired       <- Optional location for your paired Jupyter notebook files
│   ├── exploratory   <- Notebooks for exploratory tasks
│   └── reports       <- Notebooks generating reports and figures
│
├── references        <- Data descriptions, manuals, and all other explanatory materials
│
├── reports           <- Generated reports (e.g. HTML, PDF, LaTeX, etc.)
│   ├── figures       <- Generated graphics and figures to be used in reporting
│   └── README.md     <- More information about Jupyter-Book and MyST-MD
│
├── scripts           <- High-level scripts that use (low-level) source code from `src/`
├── src               <- Source code (and only source code!) for use in this project
│   ├── core          <- Provides some core functionalities
│   ├── tests         <- Contains tests for the code in `src/`
│   └── __init__.py   <- Makes src a Python module and provides some standard variables
│
├── .env              <- In this file, specify all your custom environment variables.
│                        Keep this out of version control! (i.e. have it in your .gitignore)
├── .gitignore        <- Here, list all the files and folders (patterns allowed) that you want
│                        to keep out of git version control.
├── CHANGELOG.md      <- All major changes should go in there
├── CITATION.cff      <- The citation information for this project (update your ORCID ID!)
├── environment.yml   <- The conda environment file for reproducing the environment
├── LICENSE           <- The license used for this project
├── Makefile          <- A self-documenting Makefile for standard CLI tasks
├── pyproject.toml    <- Configuration file for the project
├── README.md         <- The top-level README of this project
└── requirements.txt  <- The requirements file for reproducing the analysis environment, e.g.
                         generated with `pip freeze > requirements.txt`
The following files are for demonstration purposes only and, if not needed, can be deleted safely:
├── notebooks/01-minimal-example.ipynb
├── docs/*
├── reports/book/*
├── scripts/01-test.py
└── src
    ├── tests/*
    └── submodule.py
Some great sources of inspiration and orientation when I created this template:
- A great article on how to structure your scientific data projects: https://drivendata.github.io/cookiecutter-data-science
- https://coderefinery.github.io/reproducible-research/
- https://github.com/drivendata/cookiecutter-data-science
- https://github.com/audreyfeldroy/cookiecutter-pypackage
- https://github.com/hackalog/easydata
- https://github.com/aubricus/cookiecutter-python-package
- Martin, R. C. (Ed.). (2009). Clean code: A handbook of agile software craftsmanship. Prentice Hall.
- Croucher, M., Graham, L., James, T., Krystalli, A., & Michonneau, F. (2019). Reproducible Code (Guides to Better Science). British Ecological Society. https://www.britishecologicalsociety.org/publications/guides-to/
Issues & pull requests accepted.
© Markus Ritschel, 2021–2024