AEP: Infrastructure to easily run shell commands through AiiDA #32
Thanks @sphuber! Here are my comments from quickly browsing through:
Thanks for the review @ltalirz.

Me neither; it is not that common a use case in our field. But I have seen it time and time again in closely related fields and have seen people get scared away because of this limitation. Note that in the examples I use basic UNIX commands, but this could also be used for any executable that can be invoked locally through a CLI. I think this will allow users to quickly incorporate components for which it really doesn't pay to develop an entire plugin.
This is also one of the open questions I have at the end of the AEP. In the current implementation I don't create a new process node type but simply use the existing `CalcFunctionNode`.
Fully agree that it would be wise to look at other workflow systems. As I described in the introduction, this functionality is meant to replicate exactly what they provide and what AiiDA is severely lacking. I think it is fine to "duplicate" this in AiiDA because it makes it very easy to integrate these kinds of use cases in AiiDA. Another option would be to simply run these parts of workflows with other systems and then interface somehow with AiiDA for the parts that are more suited to it. But I think this approach is better because the entire workflow is fully captured by AiiDA's provenance graph in a single consistent way. A thing I have already thought about, but not yet fleshed out, is that, as you suggest, it might be possible to write translators that interpret a static markup file describing a workflow of shellfunctions and automatically convert it into the Python code that is then executed. This would open the door to making AiiDA a CWL engine with the full power of AiiDA's provenance. How realistic this is, is not yet clear to me, but it might be interesting to investigate as it could be very powerful.
I think it is always fine to get feature requests; whether we can actually implement them is a second question. I take it you don't mean that, if we think we wouldn't be able to extend the functionality to run on remote machines, that would be a deal breaker for accepting this?
Thanks for sharing your thoughts @sphuber, I don't have anything to add.
No, it's not a deal breaker.
This would be a "product selling feature" for us. I am doing bioinformatics for a pharmaceutical company. In our field there are a number of programs that are very simple in their specifications for input and output. A very nice example of this is the haddock PDB tools: each of these programs has at most two files as input (usually just one, provided through stdin) and outputs to stdout, allowing for the creation of longer pipelines, for example: `pdb_fetch 1brs | pdb_selchain -A,D | pdb_delhetatm | pdb_tidy > 1brs_AD_noHET.pdb`. Right now however, having to write a calcjob, code and parser for each and every item is just not worth the effort.
I would also add my full support for this. There are many reasons why it is nice and convenient to have this functionality. Strict provenance is not always the main goal and other concerns drive the development. Great addition and thanks.
I wanted to also comment that I don't think having the …

```python
from aiida.engine import workfunction
from aiida.orm import List, Str

# ``ShellParser`` and ``ShellJob`` refer to the classes proposed in this AEP.


# PDBData class would obviously have to be fleshed out a lot more.
class PDBData(Str):

    @classmethod
    def from_pdb(cls, inputfile):
        pass

    def to_pdb(self, outputfile):
        pass


class PDBParser(ShellParser):

    def parse(self):
        self.out("structure", PDBData.from_pdb(self.retrieved.open("stdout", "r")))


class SelectChain(ShellJob):

    command = "pdb_selchain"
    parser = PDBParser

    def define(self, spec):
        # This should define stdin as an input and stdout as an output.
        super().define(spec)
        spec.input("chains", valid_type=List)
        spec.output("structure", valid_type=PDBData)

    # Not too sure about this function name.
    def run(self, stdin):
        self.arguments = ["-" + ",".join(self.inputs.chains)]
        super().run(stdin=stdin)


class DelHetAtoms(ShellJob):

    command = "pdb_delhetatm"
    parser = PDBParser


class PDBTidy(ShellJob):

    command = "pdb_tidy"
    parser = PDBParser


@workfunction
def workflow(structure):
    """Simple workflow that takes a file, splits it, removes the last line of each and reconcatenates."""
    chains = SelectChain(chains=["A", "D"]).run(stdin=structure.to_pdb())
    without_hetatms = DelHetAtoms().run(stdin=chains.to_pdb())
    return PDBTidy().run(stdin=without_hetatms.to_pdb())
```

edit: I feel like I am not 100% using the …
Thanks a lot for the comments and suggestions @Drvanon. I will have a look at your interface ideas and see what I can incorporate.
So I think it's a good starting point, thanks. I would suggest something along the lines of:

```python
from pathlib import Path
from typing import Dict

from aiida.engine import shellfunction  # the decorator proposed in this AEP
from aiida.orm import Data, SinglefileData, Str


@shellfunction('abc')
def abc(output: Path, stdout: str, stderr: str) -> Dict[str, Data]:
    """Run the ``abc`` command."""
    return {
        "stdout": Str(stdout),
        "stderr": Str(stderr),
        "output": SinglefileData(output / "output.txt")
    }
```

This gives users full control over what they output, and allows for post-processing steps.
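A hypothetical usage note (my addition, not from the thread): assuming the decorator registers the command and the engine injects the working directory and the captured streams into the function body, calling the decorated function might look like this:

```python
# Hypothetical usage of the sketch above: the engine is assumed to run `abc`
# and then call the function body with the working directory and the captured
# stdout/stderr, so the caller only sees the returned output nodes.
results = abc()
print(results['stdout'].value)     # captured standard output as a Str node
print(results['output'].filename)  # 'output.txt' wrapped in a SinglefileData
```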
In the first design, I took the same approach that you suggest (more or less), and the current implementation still injects the working directory as a path into the function body, exactly such that the implementation can access and return any generated output. However, I think it is crucial that this is not a requirement: it should be possible to just attach outputs without having to implement any Python logic. That being said, having the option to manually control output for those cases where it is needed might still be a good idea.
Why?
Because one of the main reasons for this AEP was to make it easy to run external shell commands through AiiDA. Other workflow systems make this dead easy and people expect something similar in AiiDA. For this type of user, there is a big difference between just specifying the expected filename that should be captured and writing explicit Python to do it. It also makes it a lot easier to allow defining the shellfunctions through static markup, such as YAML. I actually already have a working proof of concept that takes a CWL (Common Workflow Language) spec and runs it through AiiDA as shellfunctions. This is relatively easy precisely because it is not necessary to write actual Python to do the file capture.
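To illustrate why declarative capture matters here, a sketch (my addition: the spec layout and the `run_from_spec` helper are assumptions; `shell_job` and its keywords follow the draft interface introduced later in this thread):

```python
# Sketch only: a static, CWL-like spec maps onto a single call precisely
# because output capture is declarative (no user-written Python needed).
from io import StringIO

from aiida import orm
from aiida.engine import shell_job  # proposed interface, not (yet) in aiida-core


def run_from_spec(spec, **nodes):
    """Translate a static command spec into one ``shell_job`` call."""
    return shell_job(
        spec['command'],
        spec['arguments'],
        files=nodes,
        outputs=spec.get('outputs', []),
    )


spec = {'command': 'split', 'arguments': ['-l', '2', '{single_file}'], 'outputs': ['x*']}
single_file = orm.SinglefileData(StringIO('some\ncontent\n'))
results = run_from_spec(spec, single_file=single_file)
```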
There is a big difference between a Python API (which this AEP is for) and a specification for a declarative workflow format. Basically, I don't think it inhibits implementing a (more restricted) declarative workflow format, whilst having a less restricted Python API. This is essentially what we do all the time when exposing verdi CLI commands for aspects of the Python API.
I am not suggesting to implement one here, I am just saying that the interface this AEP introduces would also make it easy to implement that. I marked it as a secondary argument though, so it is not the most important concern.

I stand by my point that the main target use case is the running of a very simple command line program with some basic file output capturing, and that should be as simple as possible. Providing some keyword arguments that automate the output capturing is a lot simpler than having to explain to users how to capture it themselves and return the results as output nodes. Given that for those who want or need the additional flexibility we can still provide it, I think that is the way to go. I am actually even thinking whether the concept of the …
Yep, I feel the API needs to allow for the flexibility, otherwise you are just going to end up with hundreds of keyword arguments trying to capture all possible user requirements.
Very interested to see where this AEP goes. For us at TREX-CoE this would also be a very useful feature, since we have developed a data format (and the associated API) to communicate wavefunctions between the codes. We have a CLI to convert outputs of some external codes into our format. At the moment I have to write a new AiiDA plugin for this CLI in order to convert files between different steps of the provenance. Thus, having a one-liner CLI functionality would be awesome.
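For this kind of converter CLI, the proposed interface would reduce the plugin to a one-liner. A sketch (my addition; `trexio_convert` and its flags are made-up placeholders, and `shell_job` follows the draft interface introduced below):

```python
# Hypothetical one-liner for the wavefunction-conversion use case above.
# `trexio_convert` and its flags are placeholders, not a real CLI.
from aiida.engine import shell_job  # proposed interface, not (yet) in aiida-core
from aiida.orm import load_node

wavefunction = load_node(1234)  # some previously stored output file node
results = shell_job(
    'trexio_convert',
    ['--input', '{wavefunction}', '--output', 'converted.h5'],
    files={'wavefunction': wavefunction},
    outputs=['converted.h5'],
)
```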
### Update

After the first round of review and comments, there are a few more observations that I think should guide the design:
### Design of alternative or additional interface

To accommodate the requirement of running the shell job on remote machines, I have drafted a new interface and accompanying implementation. From the user's perspective, it looks as follows:

```python
from io import StringIO

from aiida import orm
from aiida.engine import shell_job

single_file = orm.SinglefileData(StringIO('\n'.join([str(i) for i in range(10)])))
results = shell_job(
    'split',
    ['-l', '2', '{single_file}'],
    files={'single_file': single_file},
    outputs=['x*']
)
```

This would launch the command `split -l 2 single_file`, where the `{single_file}` placeholder is replaced with the filename with which the `SinglefileData` input is written to the working directory, and attach any output files matching the `x*` glob as outputs. That being said, because it is implemented as a `CalcJob`, the command can also be run on a remote computer:

```python
from io import StringIO

from aiida import orm
from aiida.engine import shell_job
from aiida.orm import load_computer

single_file = orm.SinglefileData(StringIO('\n'.join([str(i) for i in range(10)])))
results = shell_job(
    'split',
    ['-l', '2', '{single_file}'],
    files={'single_file': single_file},
    outputs=['x*'],
    metadata={
        'options': {
            'computer': load_computer('remote_machine')
        }
    }
)
```

This will once again create and run the same job, except that it is now executed on the computer named `remote_machine`. Now, if we take the same example workflow as described in the current AEP, but use the new `shell_job` interface, it looks like the following:

```python
from io import StringIO

from aiida import engine, orm


@engine.workfunction
def workflow(single_file):
    results_split = engine.shell_job(
        'split',
        ['-l', '2', '{single_file}'],
        files={'single_file': single_file},
        outputs=['x*']
    )
    results_head = {}
    for label, node in sorted(results_split.items()):
        results = engine.shell_job(
            'head',
            ['-n', '1', '{single_file}'],
            files={'single_file': node},
            filenames={'single_file': label}
        )
        results_head[label] = results['stdout']
    results_cat = engine.shell_job(
        'cat',
        [f'{{{key}}}' for key in results_head.keys()],
        files=results_head
    )
    return results_cat


single_file = orm.SinglefileData(StringIO('\n'.join([str(i) for i in range(10)])))
results, node = workflow.run_get_node(single_file)
print(f'Workflow<{node.pk}> finished: {results}')
```

The provenance graph looks slightly more complex because each command now has two additional output nodes (the `remote_folder` and `retrieved` nodes that each `CalcJob` produces). That being said, the interface doesn't look that much more complicated than that of the `shellfunction`.

### Next steps

I think at this point it is clear that we have two designs, each with its advantages and disadvantages. I would love to hear more feedback on the concept of the `shell_job`.
Great suggestion! One thing that I wasn't able to fully make out is whether you are planning to include easy ways to attach parsers to the shell jobs; "queryability" of the contents of the files would be amazing for us.
That's a good point. The difficulty here will lie in making it possible to write these parsers dynamically, that is to say, for example, to just write them in a notebook and run them through the daemon without first having to make sure the parser is written in a module that is importable. I am not sure how to accomplish this, since the parser will most likely have to be Python code and so it will have to be readable by the daemon somehow. Of course it would be possible to define the parser locally and then run it, instead of submitting it to the daemon, but this will not scale and may not be useful for all use cases. Will have to think about it some more.
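One conceivable route (my sketch, not a design from the thread): store the parser's source code alongside the job and rebuild the callable on the daemon side:

```python
# Sketch only: ship a dynamically defined parser to the daemon by storing its
# source text and recreating the callable with exec() at runtime. This is an
# assumption about a possible mechanism, not the AEP's final design.
import inspect
import textwrap


def parser(dirpath):
    from aiida.orm import Str
    return {'string': Str((dirpath / 'stdout').read_text().strip())}


# Client side: capture the source and store it, e.g. as a node attribute.
source = textwrap.dedent(inspect.getsource(parser))

# Daemon side: rebuild the callable from the stored source text.
namespace = {}
exec(source, namespace)
rebuilt_parser = namespace['parser']
```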
Just to clarify one thing: is this essentially a "convenience function" that wraps existing aiida-core functionality, or do you envisage that it requires any actual "structural" change to aiida-core itself?
A bit in between, I would say. For the … You can check the implementation in this branch and even give it a spin.
Yeh cheers, I ask because I feel initially this should indeed be a plugin.

Is this not something we should try to "abstract" in aiida-core, to allow for …?
Might be a good idea to initially provide this as a standalone package that people can install. The increased flexibility for development would indeed be worth something, and then when it is mature and worked out, we can move it into `aiida-core`.
Not sure. The reason we didn't allow this is because it directly touches the provenance graph syntax rules. There is quite a bit of internal logic that makes assumptions about these rules, which, when not respected, could have pretty grave consequences. Therefore it is not really safe to let users extend this willy-nilly. For the current implementation it is not really necessary to have new node subtypes, but I do think it is useful in this case for queryability, and I think it works without breaking our provenance grammar.
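To make the queryability argument concrete, a sketch (my addition; the import location of `ShellFunctionNode` is an assumption):

```python
# Sketch: a dedicated node subclass allows querying shellfunction runs
# directly, rather than filtering all CalcFunctionNodes by some attribute.
from aiida.orm import QueryBuilder
from aiida_shell import ShellFunctionNode  # hypothetical import location

qb = QueryBuilder()
qb.append(ShellFunctionNode)  # matches only shellfunction runs
print(f'Number of shellfunction runs: {qb.count()}')
```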
On second thought, there is another downside to having this developed initially as an external package, and that is that we would have to change the entry points of the calcjob and parser plugins once we move them to `aiida-core`.
If you are just talking about the name prefixes? I don't see this as being the case; there is no technical reason why you cannot initially name them as …
These changes reflect the current state of the user interface and implementation design as implemented by `aiida-shell`.
Hey all, …
@Drvanon I have updated this AEP with the most recent design. I have implemented it as a stand-alone plugin, `aiida-shell`.
@Drvanon I have implemented the custom parsing feature. I have added it to this branch on the `aiida-shell` repository:

```python
from aiida_shell import launch_shell_job


def parser(self, dirpath):
    """Custom parsing defined on-the-fly."""
    from aiida.orm import Str
    return {'string': Str((dirpath / 'stdout').read_text().strip())}


results, node = launch_shell_job(
    'echo',
    arguments=['some output'],
    parser=parser
)
print(results['string'].value)
```

This example is of course trivial, and there is little sense in converting the output of a file to a `Str`.
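A slightly less trivial parser in the same vein (my addition, following the `parser` interface of the example above):

```python
# Parse whitespace-separated integers from stdout into an AiiDA List node.
from aiida_shell import launch_shell_job


def parser(self, dirpath):
    """Parse the integers printed by ``seq`` into a ``List`` output."""
    from aiida.orm import List
    numbers = [int(x) for x in (dirpath / 'stdout').read_text().split()]
    return {'numbers': List(list=numbers)}


results, node = launch_shell_job(
    'seq',
    arguments=['5'],
    parser=parser,
)
print(results['numbers'].get_list())  # [1, 2, 3, 4, 5]
```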
@Drvanon I looked back in the history of this discussion and went back to your comment describing what a typical workflow would be in your field, specifically chaining various operations of the `pdb-tools` suite.
I converted this to a script using `aiida-shell`:

```python
#!/usr/bin/env runaiida
"""Simple ``aiida-shell`` script to manipulate a protein defined by a .pdb file.

In this example, we show how the following shell pipeline:

    pdb_fetch 1brs | pdb_selchain -A,D | pdb_delhetatm | pdb_tidy > 1brs_AD_noHET.pdb

can be represented using ``aiida-shell`` by chaining a number of ``launch_shell_job`` calls.
All that is required for this to work is a configured AiiDA profile and that ``pdb-tools`` is installed.
"""
from aiida_shell import launch_shell_job

results, node = launch_shell_job(
    'pdb_fetch',
    arguments=['1brs'],
)
results, node = launch_shell_job(
    'pdb_selchain',
    arguments=['-A,D', '{pdb}'],
    nodes={'pdb': results['stdout']}
)
results, node = launch_shell_job(
    'pdb_delhetatm',
    arguments=['{pdb}'],
    nodes={'pdb': results['stdout']}
)
results, node = launch_shell_job(
    'pdb_tidy',
    arguments=['{pdb}'],
    nodes={'pdb': results['stdout']}
)
print(f'Final pdb: {node}')
print(f'Show the content using `verdi node repo cat {node.pk} pdb`')
print(f'Generate the provenance graph with `verdi node graph generate {node.pk}`')
```

Simply make this script executable and run it. It just requires that you have installed `aiida-shell` and `pdb-tools`. Really interested to see what you think about this and whether it looks intuitive.
This is looking very cool! In December we are starting a new project; I'll use it to perform a little proof of concept.
This is a new AEP that I discussed during the last AiiDA meeting. I already have a working implementation. All the code examples that you see in this AEP have actually been run with that implementation, based on the current `develop` branch of `aiida-core`.

I think the functionality already covers a large number of use cases, but I am sure there is more to be added. For example, I am thinking that certain commands could natively work with directories as opposed to files. We might therefore want to add special support for `FolderData`, just as there already is for `SinglefileData` (a rough sketch of this idea follows at the end of this description).

The last example showcases a complete example of how easy it is to now define a workflow of multiple shell commands while the provenance is kept as with typical AiiDA workflows. I think this looks very powerful and may make AiiDA accessible to a lot of other fields whose workflow patterns are mostly composed of simple file-based operations with shell commands.

The final section contains some open questions that I have on the current design of the interface and implementation. I welcome any discussion and ideas here, as well as, of course, suggestions that aren't mentioned yet.

Pinging people that might be interested to have a look: @ltalirz @giovannipizzi @espenfl @greschd @ramirezfranciscof @chrisjsewell @yakutovicha
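As a rough illustration of the `FolderData` idea mentioned above (my sketch; whether directory inputs would be copied in whole, and the corresponding keyword usage, are assumptions):

```python
# Hypothetical sketch of FolderData support, by analogy with the existing
# SinglefileData handling: the directory would be copied into the working
# directory and the placeholder replaced with its relative path.
from aiida.orm import FolderData
from aiida_shell import launch_shell_job

folder = FolderData(tree='/path/to/some/directory')  # wrap a local directory
results, node = launch_shell_job(
    'tar',
    arguments=['-czf', 'archive.tar.gz', '{folder}'],
    nodes={'folder': folder},  # assumes FolderData inputs become directories
    outputs=['archive.tar.gz'],
)
```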
Open questions and points of discussion:
- Should the decorator provide a simple boolean flag to capture stdout, e.g., `capture_stdout = True`, or should it take a relative filename to be used for the `SinglefileData` that captures the output? The latter gives more flexibility, but the question is if it is really needed, since the filename used in the `SinglefileData` is rarely ever used again, so it could be anything.
  Update: the current implementation provides the `attach_stdout` flag, which, when `True`, will not write `stdout` to the process node's repository but will instead attach it as a `SinglefileData` output.
- Should both `stdout` and `stderr` be treated differently from output files that are generated by the command, by storing them directly in the file repository of the `ShellFunctionNode`, or should they be attached as separate `SinglefileData` output nodes? The advantage of having them as nodes is that typically the `stdout` is the main output of the command and one wants to use it as input for the next command, which makes it useful to have it as an output node instead of a file in the repository of the shellfunction.
  Update: `stdout` and `stderr` are by default written to the process node's repository. The `attach_stdout` flag can change this so that `stdout` is attached as an output node instead.
- Should the decorator reuse the `CalcFunctionNode` for its instances, in the interest of not overcomplicating the provenance graph grammar (which may have unforeseen consequences), or should we give it its own `ShellFunctionNode`, which makes it possible to distinguish calcfunctions from shellfunctions in querying?
  Update: the `ShellFunctionNode` has been implemented and does not require the provenance grammar to be updated.
- Should the specification of an invalid command (where `shutil.which(command)` returns `None`) raise an exception and let it bubble up, or should we return a fixed exit code?
  Update: an exit code will be returned.
- Is it possible to automatically add `**kwargs` to the signature of the decorated function such that the user no longer has to do so explicitly?
  Update: implemented.
- Is it possible to have the `shellfunction` capture multiple output files? The expected output filenames could be defined explicitly or with wildcards, and the engine automatically wraps them in `SinglefileData`, such that the user doesn't have to do this manually in the function body.
  Update: implemented.
- If output files that are supposed to be attached as `SinglefileData` output nodes are specified by their filename, the filename is also the most logical choice to use as the link label with which to attach the output node. However, valid filenames are not always valid link labels, which can only contain underscores and alphanumeric characters. For example, `output.txt` is a valid filename but not a valid link label. Now we could automatically convert periods into underscores, but this would lose information: it would be impossible to say whether `some_output_name` corresponded to the filename `some_output_name`, `some_output.name` or even `some.output.name`. The question is if there is a way around this or if this would even pose a problem, since the original filename would still be stored as the `filename` attribute of the `SinglefileData`. If this is not acceptable, maybe `output_filenames` should accept a list of tuples instead of a list of strings, where each tuple specifies both the filename as well as the output link label.
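To make the information-loss point concrete, a minimal sketch (my addition) of the automatic filename-to-link-label conversion discussed in the last item:

```python
# Minimal sketch: convert a filename into a valid link label by replacing
# invalid characters with underscores, illustrating that distinct filenames
# can collide on the same label.
import re


def filename_to_link_label(filename: str) -> str:
    """Replace characters that are invalid in link labels with underscores."""
    return re.sub(r'[^0-9a-zA-Z_]', '_', filename)


assert filename_to_link_label('output.txt') == 'output_txt'
# Collision: both filenames map to the same label, so the original name must
# be kept elsewhere, e.g. in the `filename` attribute of the SinglefileData.
assert filename_to_link_label('some_output.name') == filename_to_link_label('some.output_name')
```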