Update the user interface and implementation
sphuber committed Dec 21, 2021
1 parent cfde93d commit 333536f
Showing 1 changed file with 97 additions and 68 deletions.
165 changes: 97 additions & 68 deletions xxx_shell_functions/readme.md
@@ -88,29 +88,27 @@ AiiDA will use `shutil.which` from the standard library to try and determine the

We can invoke the command by simply running the function as you would any other Python function:
```python
date()
results = date()
```
Note that this won't return anything nor print anything.
But this is not surprising, because the shellfunction didn't return any output nodes.
The shell command `date` merely printed the current date to the stdout file descriptor.
Note that the `results` will be an empty dictionary since the `date` function doesn't return anything.
Nor will the function print anything, because AiiDA automatically captures the output that the command writes to the `stdout` and `stderr` file descriptors and stores it in the file repository of the process node.

### Capturing output
By default, shellfunction's will add the output, that was written to the stderr and stdout file descriptors by the command, to the file repository of the node.
To retrieve them, we need to get the node that represents the shellfunction's execution as follows:
### Captured output
The captured `stdout` and `stderr` content can be retrieved from the process node that represents the `shellfunction`'s execution in the provenance graph.
This can be obtained by calling the `run_get_node` method of the `shellfunction` instead of calling it directly:
```python
results, node = date.run_get_node()
print(node.stdout)
print(node.stderr)
```
By calling the method `run_get_node` of the shellfunction, a tuple will be returned, where the second element is the node that represents its execution in the provenance graph.
From this node, we can get the content of the stdout:
```python
node.get_object_content('stdout')
```
which should print something like `Wed 15 Dec 2021 23:11:02 CET`.
This will now return the node in addition to the results.
The captured `stdout` and `stderr` can then be retrieved from the `node` through the corresponding properties.

### Capturing output
Sometimes, the output written to stdout by the shell command is the most important output and so it might be useful to attach it as an actual output node; this way it will be explicitly visible in the provenance graph.
To do so, simply specify the `output_filename` argument in the `shellfunction` decorator.
To do so, simply set `attach_stdout=True` in the `shellfunction` decorator.
```python
@shellfunction(command='date', output_filename='stdout')
@shellfunction(command='date', attach_stdout=True)
def date():
"""Run the ``date`` command."""
```
@@ -120,16 +118,17 @@ If we now run the function, we will notice that the `results` are no longer empt
In [1]: results = date()
In [2]: results
Out[2]: {'output': <SinglefileData: uuid: f18624f0-2c3f-4a98-8b82-dd2bf8c99b32 (pk: 1180)>}
Out[2]: {'stdout': <SinglefileData: uuid: f18624f0-2c3f-4a98-8b82-dd2bf8c99b32 (pk: 1180)>}
```
As we can see, the results now contain a `SinglefileData` node that is attached as an output node with the link label `output`.
As we can see, the results now contain a `SinglefileData` node that is attached as an output node with the link label `stdout`.
Printing its output should show us the output written by the `date` command to stdout:
```python
In [1]: results = date()
In [2]: results['output'].get_content()
In [2]: results['stdout'].get_content()
Out[2]: 'Thu 16 Dec 2021 09:53:42 CET\n'
```
Note that regardless of the value of `attach_stdout`, the output written to `stdout` can always be retrieved through the `stdout` property of the process node.

### Command line arguments

@@ -140,8 +139,8 @@ Arguments like this can be passed to the command through the `arguments` keyword
```python
In [1]: from aiida.orm import List
In [2]: @shellfunction(command='date', output_filename='stdout')
...: def date(**kwargs):
In [2]: @shellfunction(command='date', attach_stdout=True)
...: def date():
...: """Run the ``date`` command with optional arguments."""
...:
@@ -179,8 +178,8 @@ In [1]: import io
...: from aiida.engine import shellfunction
...: from aiida.orm import SinglefileData
In [2]: @shellfunction(command='cat', output_filename='stdout')
...: def cat(**kwargs):
In [2]: @shellfunction(command='cat', attach_stdout=True)
...: def cat():
...: """Run the ``cat`` command."""
In [3]: node_a = SinglefileData(io.StringIO('content_a'))
@@ -200,6 +199,7 @@ So in this case, the `cat` shellfunction is called with the `node_a` node assign
This will cause the `{file_a}` argument to be replaced with the content of the `node_a` node.

Since the placeholder needs to correspond to the keyword of the function call, which is used as the link label, the placeholder needs to be a valid link label.
A valid link label consists of alphanumeric characters and underscores, without consecutive underscores.
Besides that, there are no other restrictions on its format.

Positional and keyword arguments can be mixed without a problem.
@@ -210,7 +210,7 @@ In [1]: import io
...: from aiida.orm import SinglefileData
In [2]: @shellfunction(command='head', attach_stdout=True)
...: def head(**kwargs):
...: def head():
...: """Run the ``head`` command."""
In [3]: single_file = SinglefileData(io.StringIO('line 1\nline 2'))
@@ -226,6 +226,7 @@ Out[5]: 'line 1\n'
The previous section described how the output written to stdout by the command is captured.
However, certain commands can produce more output besides what is written to stdout.
Take for example the `split` command, which takes a file and splits it into multiple files.
By default, it will write the output files with filenames that follow the sequence `xaa`, `xab`, `xac`, etc.
When this is wrapped with a `shellfunction` one would like to attach each generated output file as an individual `SinglefileData` output node.

The following example shows how multiple output files can be attached as `SinglefileData` output nodes:
@@ -234,18 +235,9 @@ In [1]: import io
...: from aiida.engine import shellfunction
...: from aiida.orm import SinglefileData
In [2]: @shellfunction(command='split')
...: def split(**kwargs):
In [2]: @shellfunction(command='split', output_filenames=['x*'])
...: def split():
...: """Run the ``split`` command."""
...: results = {}
...:
...: # The ``cwd`` variable contains the path to the working directory where the shell command was executed.
...: for file in cwd.iterdir():
...: # `split` writes output files with the format `xaa, xab, xac...` etc.
...: if file.name.startswith('x'):
...: results[file.name] = SinglefileData(file)
...:
...: return results
...:
In [3]: single_file = SinglefileData(io.StringIO('line 1\nline 2\nline 3'))
@@ -256,14 +248,9 @@ Out[4]:
{'xab': <SinglefileData: uuid: 1788fa7e-8ed8-4207-a873-147d7e8d4821 (pk: 1218)>,
'xaa': <SinglefileData: uuid: e1481db5-09c1-4b49-b8c5-5127ed40c108 (pk: 1219)>}
```
In all the examples so far, the `shellfunction` implementation was actually empty; it merely consisted of `pass` or a docstring.
In this case, however, we actually need to implement some Python code.
The code inspects the current working directory (which is a temporary directory on the local file system where the shell command is executed), finds all output files generated by the `split` command and returns them as a dictionary of `SinglefileData` nodes.
As you can see, the filepath of the temporary working directory is provided by the `cwd` variable, which is injected into the scope of the function by AiiDA's engine.
It is an instance of the `pathlib.Path` class from the [`pathlib` standard library module](https://docs.python.org/3/library/pathlib.html), and so we can operate on it with all the standard Python tools.
Using [`iterdir`](https://docs.python.org/3/library/pathlib.html#pathlib.Path.iterdir) we iterate over the files in the directory.
We filter the files to only take those that start with `x`, which is the default prefix used by the `split` command, and then wrap the file in a `SinglefileData`.
This filtering is necessary, because the directory will also contain the `stdout` and `stderr` files, as well as the input file, which is written to the working directory using the node's UUID.
The `output_filenames` argument of the `shellfunction` decorator takes a list of strings which represent the output files that should be wrapped and attached as outputs.
If the exact name that will be produced is not known, one can use the wildcard operator `*` which will be expanded by globbing.
The filenames will be used as the link label of the attached output node, where illegal characters are automatically replaced by underscores.

### Workflow demonstration

@@ -283,25 +270,18 @@ from aiida.orm import List, SinglefileData
from aiida.engine.processes.functions import shellfunction, workfunction
@shellfunction(command='split')
def split(**kwargs):
@shellfunction(command='split', output_filenames=['x*'])
def split():
"""Run the ``split`` command."""
results = {}
for file in cwd.iterdir():
if file.name.startswith('x'):
results[file.name] = SinglefileData(file)
return results
@shellfunction(command='head', output_filename='stdout')
def head(**kwargs):
@shellfunction(command='head', attach_stdout=True)
def head():
"""Run the ``head`` command."""
@shellfunction(command='cat', output_filename='stdout')
def cat(**kwargs):
@shellfunction(command='cat', attach_stdout=True)
def cat():
"""Run the ``cat`` command."""
@@ -333,18 +313,67 @@ In addition, the calls to the `shellfunction`s are also explicitly represented w
![Workflow provenance graph](provenance.svg "Workflow provenance graph")


## Questions and design choices
## Design choices

This section records open questions on the design and behavior of the new functionality.
Once the questions have been answered in the process of the AEP evaluation, answers are added and serve to record the discussion surrounding the design choices.

* What process node class should be used for invoked shellfunctions?
The most intuitive solution would be to create a new class `ShellFunctionNode` that subclasses `CalculationNode`.
This mirrors the behavior of the `CalcFunctionNode` which is used by `calcfunctions` and allows to easily query for nodes representing `shellfunction` executions.
However, it is not clear if the current provenance graph grammar can easily be extended to support a new node type.
In its function, the `ShellFunctionNode` should behave similar to the `CalcFunctionNode` in that it gets incoming `INPUT_CALC` and outgoing `CREATE` link types.
* Should the specification of an invalid command (where `shutil.which(command)` returns `None`) be raised as exception and let it bubble up, or should we return fixed exit code?
* Should the `output_filename` argument to the `shellfunction` decorator support to define the relative filename with which the stdout will be written to the `SinglefileData` output node, or should it merely be a boolean flag to indicate that the stdout should be captured as an output node.
The filename within the `SinglefileData` is not really relevant since it is the only file and the `get_content` method allows to retrieve it without even having to know the filename.
But maybe there are circumstances imaginable where specifying the actual filename may be desirable?
This section details the design choice for various parts of the functionality and interface.
Where applicable, it provides other solutions that were considered and why they were rejected.

### Function signature
The function decorated by the `shellfunction` decorator should at the very least support the `**kwargs` parameters.
This is because it should allow the `arguments` input node and any `SinglefileData` input nodes to be passed when invoked, even though the function body most likely won't have to operate on them as everything is taken care of by the engine.
Even though it is not that much of a burden for the user to always add `**kwargs` to the function signature, since it practically should always be added, it may as well be added automatically.
That is why the `shellfunction` decorator will inspect the decorated function signature and automatically add the `**kwargs` parameter if not already specified.
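
The signature inspection described here could look roughly like the following sketch, which relies only on the standard library `inspect` module. The helper names `accepts_var_keyword` and `ensure_kwargs` are hypothetical and are not part of the proposed interface; the actual decorator may add the parameter in a different way.

```python
import functools
import inspect


def accepts_var_keyword(function):
    """Return whether ``function`` already declares a ``**kwargs`` parameter."""
    parameters = inspect.signature(function).parameters.values()
    return any(parameter.kind is inspect.Parameter.VAR_KEYWORD for parameter in parameters)


def ensure_kwargs(function):
    """Return a callable that accepts arbitrary keyword arguments.

    Hypothetical helper: if ``function`` lacks ``**kwargs``, it is wrapped so that
    keyword arguments it does not declare are accepted and silently dropped.
    """
    if accepts_var_keyword(function):
        return function

    declared = set(inspect.signature(function).parameters)

    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        # Forward only the keywords that the original function declares.
        accepted = {key: value for key, value in kwargs.items() if key in declared}
        return function(*args, **accepted)

    return wrapper


def date():
    """Run the ``date`` command."""


wrapped = ensure_kwargs(date)
# No ``TypeError`` is raised, although ``date`` itself declares no parameters.
result = wrapped(arguments=['--iso-8601'])
```

In this sketch the undeclared keywords are simply discarded by the wrapper; in the proposal they would instead be consumed by the engine as input nodes.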

### Recording output
When talking about output that is generated by shell commands, it can typically be divided into the following three categories:

* Output written to the `stdout` file descriptor
* Output written to the `stderr` file descriptor
* Output written to a file on disk

#### `stdout` and `stderr`
Since the `stdout` and `stderr` file descriptors are standard and a lot of shell commands are expected to write to them, it makes sense to automatically capture them.
The content should be easily accessible, but should not necessarily have to be recorded as an output node.
That is why in the implementation, by default, they are written to the file repository of the process node with the names `stdout` and `stderr`.
The `ShellFunctionNode` class provides the `stdout` and `stderr` properties as an easy way to retrieve the content from the node.
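
As a rough illustration of the capturing step, the following sketch runs a command with the standard library `subprocess` module and writes both streams to files named `stdout` and `stderr`. It is not AiiDA's implementation: the `run_and_capture` function is hypothetical, and a plain directory stands in for the file repository of the process node.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_capture(command, dirpath):
    """Run ``command`` and write its ``stdout`` and ``stderr`` to files in ``dirpath``."""
    completed = subprocess.run(command, capture_output=True, text=True, check=False)
    (dirpath / 'stdout').write_text(completed.stdout)
    (dirpath / 'stderr').write_text(completed.stderr)
    return completed.returncode


with tempfile.TemporaryDirectory() as tmp:
    dirpath = Path(tmp)
    # The Python interpreter stands in for a shell command to keep the sketch portable.
    command = [sys.executable, '-c', 'import sys; print("out"); print("err", file=sys.stderr)']
    exit_code = run_and_capture(command, dirpath)
    captured_stdout = (dirpath / 'stdout').read_text()
    captured_stderr = (dirpath / 'stderr').read_text()
```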

Having the `stderr` available as a file of the process node is all that is needed; for `stdout`, however, there is an asymmetry.
The output written to `stdout` by a shell command is often the "main" output and in workflows it is often piped as input to the next command.
In the context of AiiDA, this means it should rather be recorded as an actual output node of the process by wrapping it in a `SinglefileData` node.
If the `stdout` should be returned as an individual output node, the `shellfunction` should define `attach_stdout = True`.
In this case, the `stdout` will not also be written to the process node's repository, as that would be doubling the information.
The `stdout` property will account for this implementation detail, and will retrieve the content from one of these two options, such that the user does not need to know.

#### Output files
Besides `stdout` and `stderr`, a shell command can typically also write output to various files on the local file system.
The user needs to be able to specify to the `shellfunction` which output files should be automatically captured and attached as outputs through `SinglefileData` nodes.
Since there can be any number of output files, it should at the very least accept a list of filenames.
It is not always possible to declare exactly what files will be generated, so the `shellfunction` should support the use of the wildcard character `*`.
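
The wildcard expansion can be sketched with `pathlib.Path.glob` from the standard library. The function `resolve_output_files` is a hypothetical name used for illustration only, and the temporary directory mimics the working directory after a `split` call.

```python
import tempfile
from pathlib import Path


def resolve_output_files(dirpath, output_filenames):
    """Return the sorted names of files in ``dirpath`` that match any of the patterns."""
    matched = set()
    for pattern in output_filenames:
        # ``Path.glob`` expands ``*``; a name without a wildcard matches only if the file exists.
        matched.update(path.name for path in dirpath.glob(pattern) if path.is_file())
    return sorted(matched)


with tempfile.TemporaryDirectory() as tmp:
    dirpath = Path(tmp)
    # Mimic the working directory: two files written by ``split`` plus the captured streams.
    for name in ('xaa', 'xab', 'stdout', 'stderr'):
        (dirpath / name).write_text('content')
    resolved = resolve_output_files(dirpath, ['x*'])
```

With the pattern `x*`, only `xaa` and `xab` are selected, so the `stdout` and `stderr` files in the same directory are not attached by accident.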

The declared output files will be wrapped in `SinglefileData` nodes which will be returned as outputs of the process and so they require a link label.
The most logical solution is to take the filename; however, this is not always possible, because valid filenames are not necessarily valid link labels.
The validation rules on link labels are very strict and only alphanumeric characters and underscores are allowed.
In addition, it is not possible to use multiple consecutive underscores, which is reserved by AiiDA to indicate namespaces.
This means that typical filenames, which contain a `.` before the file extension, are not valid link labels.

Automatic substitution of illegal characters by underscores is not really a viable solution.
Take for example the filename `output_.txt`, which would be automatically converted to `output__txt`, which would also be illegal.
One could opt to replace with an arbitrary alphanumeric character, but this would result in unpredictable and weird link labels.
Probably the best solution is to allow users to explicitly define the desired link label in this case.
This requires `output_filenames` to allow tuples instead of strings as its list element, where the tuple consists of the expected output filename and the link label that is to be used.
This approach only really works though for explicit output filenames without wildcards, because for globbed filepaths the same problem as before applies.
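
The substitution problem discussed in this section can be reproduced with a short sketch. It encodes only the two rules stated here (alphanumeric characters and underscores, no consecutive underscores); AiiDA's actual validation may be stricter, and the names `is_valid_link_label` and `substitute_illegal_characters` are hypothetical.

```python
import re


def is_valid_link_label(label):
    """Check the two rules stated in the text; AiiDA's real validation may differ."""
    only_allowed_characters = re.fullmatch(r'[A-Za-z0-9_]+', label) is not None
    return only_allowed_characters and '__' not in label


def substitute_illegal_characters(filename):
    """Replace every character that is not alphanumeric or an underscore by an underscore."""
    return re.sub(r'[^A-Za-z0-9_]', '_', filename)


# A typical filename converts to a valid label.
simple_label = substitute_illegal_characters('output.txt')
# The problematic case: the substitution creates consecutive underscores.
problem_label = substitute_illegal_characters('output_.txt')

simple_is_valid = is_valid_link_label(simple_label)
problem_is_valid = is_valid_link_label(problem_label)
```

Here `output.txt` maps to the valid label `output_txt`, whereas `output_.txt` maps to `output__txt`, which the check rejects because of the double underscore.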

### Process node
The `shellfunction`, when executed, is represented by an instance of the `ShellFunctionNode` class.
This is a subclass of the `CalculationNode` class, just as the `CalcFunctionNode` which is the representation of `calcfunction` executions, since it creates new data nodes.
The choice to create a new subclass, instead of repurposing the `CalcFunctionNode`, was made not only for consistency and clarity, but also since this makes it possible to easily distinguish `shellfunction` executions from `calcfunction` executions in the provenance graph.
By having a separate ORM class, it is also easy to query directly for `ShellFunctionNode` instances.

AiiDA's provenance grammar rules are not complicated by this addition because they only apply to the `CalculationNode` level.
Since the `ShellFunctionNode` class is a subclass of this, all the rules are properly defined and the grammar does not need to be modified.


### Open Questions

This section records open questions on the design and behavior of the new functionality.
Once the questions have been answered in the process of the AEP evaluation, answers are added and serve to record the discussion surrounding the design choices.
