AEP: Infrastructure to easily run shell commands through AiiDA #32

Status: Open · wants to merge 2 commits into master

sphuber commented Dec 16, 2021

  • Used AEP template from AEP 0
  • Status is submitted
  • Added type & status labels to PR
  • Added AEP to README.md
  • Provided github handles for authors

This is a new AEP that I discussed during the last AiiDA meeting. I already have a working implementation. All the code examples that you see in this AEP have actually been run with that implementation based on the current develop branch of aiida-core.

I think the functionality already covers a large number of use cases, but I am sure there is more to be added. For example, I am thinking that certain commands could natively work with directories as opposed to files. We might therefore want to add special support for FolderData just as there already is for SinglefileData.
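
As a purely hypothetical sketch of what such FolderData support could look like (none of this is in the current implementation; the automatic copying of the folder contents into the working directory is an assumption), a command operating on a directory might be used as follows:

from aiida.orm import FolderData, List
from aiida.engine.processes.functions import shellfunction

@shellfunction(command='ls', attach_stdout=True)
def list_directory():
    """Run the ``ls`` command."""

# Hypothetical: a FolderData input would be copied into the working directory
# under the name of its keyword argument, so the command can reference it.
folder = FolderData()  # assume this node was filled with files elsewhere
results = list_directory(arguments=List(['{input_folder}']), input_folder=folder)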

The last example showcases how easy it is now to define a workflow of multiple shell commands while provenance is kept just as with typical AiiDA workflows. I think this looks very powerful and may make AiiDA accessible to many other fields whose workflow patterns are mostly composed of simple file-based operations with shell commands.

The final section contains some open questions that I have on the current design of the interface and implementation. I welcome any discussion and ideas here, as well of course suggestions that aren't mentioned yet.

Pinging people that might be interested to have a look @ltalirz @giovannipizzi @espenfl @greschd @ramirezfranciscof @chrisjsewell @yakutovicha

Open questions and points of discussion:

  • Should the decorator provide a simple boolean flag to capture stdout, e.g., capture_stdout = True, or should it take a relative filename to be used for the SinglefileData that captures the output? The latter gives more flexibility, but the question is whether it is really needed, since the filename used in the SinglefileData is rarely used again, so it could be anything. The current implementation provides the attach_stdout flag, which, when True, will not write stdout to the process node's repository but will instead attach it as a SinglefileData output.
  • Should stdout and stderr be treated differently from output files that are generated by the command, by storing them directly in the file repository of the ShellFunctionNode, or should they be attached as separate SinglefileData output nodes? The advantage of having them as nodes is that typically stdout is the main output of the command and one wants to use it as input for the next command, which makes it useful to have it as an output node instead of a file in the repository of the shellfunction. Both stdout and stderr are by default written to the process node's repository; the attach_stdout flag changes this for stdout, attaching it as an output node instead.
  • Should the decorator reuse the CalcFunctionNode for its instances, in the interest of not overcomplicating the provenance graph grammar (which may have unforeseen consequences), or should we give it its own ShellFunctionNode, which makes it possible to distinguish calcfunctions from shellfunctions in querying? The ShellFunctionNode has been implemented and does not require the provenance grammar to be updated.
  • Should the specification of an invalid command (where shutil.which(command) returns None) be raised as an exception and allowed to bubble up, or should a fixed exit code be returned? The current implementation returns an exit code.
  • Is it possible to automatically add **kwargs to the signature of the decorated function so that the user no longer has to do so explicitly? Implemented.
  • Is it possible to have the shellfunction capture multiple output files? The expected output filenames could be defined explicitly or with wildcards, and the engine automatically wraps them in SinglefileData so that the user doesn't have to do this manually in the function body. Implemented.
  • If output files that are supposed to be attached as SinglefileData output nodes are specified by their filename, the filename is also the most logical choice for the link label with which to attach the output node. However, valid filenames are not always valid link labels, which can only contain underscores and alphanumeric characters. For example, output.txt is a valid filename but not a valid link label. We could automatically convert periods into underscores, but this would lose information: it would be impossible to say whether some_output_name corresponded to the filename some_output_name, some_output.name or even some.output.name (see the sketch below). The question is whether there is a way around this, or whether this would even pose a problem. The original filename would still be stored as the filename attribute of the SinglefileData. If this is not acceptable, maybe output_filenames should accept a list of tuples instead of a list of strings, where each tuple specifies both the filename and the output link label.
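
To make the ambiguity in the last point concrete, here is a minimal sketch (purely illustrative, not part of the proposed implementation) of a naive filename-to-link-label conversion and the collision it produces, together with the tuple-based alternative:

import re

def filename_to_link_label(filename: str) -> str:
    """Naively convert a filename to a valid link label by replacing invalid characters."""
    return re.sub(r'[^0-9a-zA-Z_]', '_', filename)

# All three distinct filenames map to the same link label, so the original
# filename can no longer be recovered from the label alone.
assert filename_to_link_label('some_output_name') == 'some_output_name'
assert filename_to_link_label('some_output.name') == 'some_output_name'
assert filename_to_link_label('some.output.name') == 'some_output_name'

# The alternative mentioned above: explicit (filename, link_label) tuples.
output_filenames = [('output.txt', 'output_txt'), ('data.log', 'log')]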

sphuber commented Dec 16, 2021

For convenience, I copy part of the AEP here that shows the final result. With this AEP, it is now possible to write the following code:

from io import StringIO
from pathlib import Path

from aiida.orm import List, SinglefileData
from aiida.engine.processes.functions import shellfunction, workfunction


@shellfunction(command='split', output_filenames=['x*'])
def split():
    """Run the ``split`` command."""


@shellfunction(command='head', attach_stdout=True)
def head():
    """Run the ``head`` command."""


@shellfunction(command='cat', attach_stdout=True)
def cat():
    """Run the ``cat`` command."""


@workfunction
def workflow(single_file):
    """Simple workflow that takes a file, splits it, removes the last line of each and reconcatenates."""

    # Split the file into five files of two lines each.
    arguments = List(['-l', '2', '{single_file}'])
    file_split = split(arguments=arguments, single_file=single_file)

    # Remove the last line of each of the five files.
    arguments = List(['-n', '1', '{single_file}'])
    files_truncated = {key: head(arguments=arguments, single_file=single_file) for key, single_file in file_split.items()}

    # Concatenate the five files into a single file.
    files_concatenated = cat(arguments=List([f'{{{key}}}' for key in files_truncated.keys()]), **files_truncated)

    return files_concatenated


# Create input file with the numbers 0 through 9, one number per line.
single_file = SinglefileData(StringIO('\n'.join([str(i) for i in range(10)])))
workflow(single_file)

It defines a workfunction that invokes three shell commands in succession (split, head and cat) to split an input file into multiple files, take the first line of each, and concatenate the results. Full provenance is kept as usual and looks like the following:
Provenance graph

ltalirz commented Dec 16, 2021

Thanks @sphuber !
Although I've never had the use case myself, I can see how this could simplify some users' lives, which I'm certainly in favor of.

Here are my comments from quickly browsing through:

  • Since this affects the calculation types, I guess this can't easily be done as a plugin? I always thought the day would come when we want to also extend those... we should be careful, though - our provenance model is already very complex and every new concept will need to be explained to and understood by users.
  • If I understand correctly, this implementation would already be close to the renku run use case - the main difference is that we're telling AiiDA the output explicitly while renku looks at what changed in the working directory. However, with your implementation it seems straightforward to add e.g. a verdi track --output abc.out cat file1 file2 > abc.out command (not that straightforward after all, since you also need to tell it what inputs are options and what inputs are files)
    Here we're getting into a territory that is already occupied by many other simple workflow managers, so perhaps we should look at best practices (and, down the line, perhaps compatibility/an interface to other languages like CWL).
  • You explicitly mention that this AEP is limited to executing the command locally but I would bet once you give users this, the demand for having a @shelljob (or however you wanna call it) will soon follow.

sphuber commented Dec 16, 2021

Thanks for the review @ltalirz .

Thanks @sphuber ! Although I've never had the use case myself, I can see how this could simplify some users' lives, which I'm certainly in favor of.

Me neither; it is not that common a use case in our field. But I have seen it time and time again in closely related fields and have seen people get scared away because of this limitation. Note that in the examples I use basic UNIX commands, but this could also be used for any executable that can be invoked locally with a CLI. I think this will make it possible to quickly incorporate components for which it really doesn't pay to develop an entire CalcJob plugin.

Since this affects the calculation types, I guess this can't easily be done as a plugin? I always thought the day would come when we want to also extend those... we should be careful, though - our provenance model is already very complex and every new concept will need to be explained to and understood by users.

This is also one of the open questions I have at the end of the AEP. In the current implementation I don't create a new process node type but simply use the CalcFunctionNode. This works just fine and makes sense since, just like the calcfunction, it creates new data and so should get the same types of links. The only downside is that by not distinguishing, it is not easy to query for just the shellfunction. If we were to imagine that it would get a ShellFunctionNode instead, then we could query for them with QueryBuilder().append(ShellFunctionNode). It might still be possible to make ShellFunctionNode a subclass of CalcFunctionNode, which would make them distinguishable in querying without requiring any changes to the provenance graph "grammar". I will experiment a bit more to see what option makes most sense.
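
To illustrate the querying argument, assuming ShellFunctionNode were implemented as a subclass of CalcFunctionNode (the class and its import location below are hypothetical), one could query either for all calcfunction-like executions or just the shell ones:

from aiida.orm import QueryBuilder, CalcFunctionNode
from aiida.orm import ShellFunctionNode  # hypothetical class discussed above; import location is an assumption

# Matches both calcfunctions and shellfunctions, since the latter would be a subclass.
all_functions = QueryBuilder().append(CalcFunctionNode).all()

# Matches only the shellfunction executions.
shell_functions = QueryBuilder().append(ShellFunctionNode).all()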

If I understand correctly, this implementation would already be close to the renku run use case - the main difference is that we're telling AiiDA the output explicitly while renku looks at what changed in the working directory. However, with your implementation it seems straightforward to add e.g. a verdi track --output abc.out cat file1 file2 > abc.out command (not that straightforward after all, since you also need to tell it what inputs are options and what inputs are files). Here we're getting into a territory that is already occupied by many other simple workflow managers, so perhaps we should look at best practices (and, down the line, perhaps compatibility/an interface to other languages like CWL).

Fully agree that it would be wise to look at other workflow systems. As I described in the introduction, this functionality exactly replicates what they provide and where AiiDA is severely lacking. I think it is fine to "duplicate" this in AiiDA because it makes it very easy to integrate these kinds of use cases in AiiDA. Another option would be to simply run these parts of workflows with other systems and then interface somehow with AiiDA for the parts that are more suited to it. But I think this approach is better because the entire workflow is fully captured by AiiDA's provenance graph in a single consistent way.

A thing I have already thought about, but not yet fleshed out, is that, as you suggest, it might be possible to write translators that interpret a static markup file describing a workflow of shellfunctions and automatically convert it into the Python code that is then executed. This would open the door to making AiiDA a CWL engine with the full power of AiiDA's provenance. How realistic this is, is not yet clear to me, but it might be interesting to investigate as it could be very powerful.

You explicitly mention that this AEP is limited to executing the command locally but I would bet once you give users this, the demand for having a @shelljob (or however you wanna call it) will soon follow.

I think it is always fine to get feature requests; whether we can actually implement them is a second question. I take it you don't mean that, if we think we wouldn't be able to extend the functionality to run remotely, that would be a deal breaker for accepting this?

ltalirz commented Dec 19, 2021

Thanks for sharing your thoughts @sphuber, I don't have anything to add.

I take it you don't mean that, if we think we wouldn't be able to extend the functionality to run remotely, that would be a deal breaker for accepting this?

No, it's not a deal breaker.

Drvanon commented Jan 14, 2022

This would be a "product selling feature" for us. I am doing bioinformatics for a pharmaceutical company. In our field there are a number of programs that are very simple in their specifications for input and output. A very nice example of this are the haddock PDB tools. Each of these programs has at most two files as input (usually just one, provided through stdin) and outputs to stdout, allowing for the creation of longer pipelines. For example: $ pdb_selchain -A,D 1brs.pdb | pdb_delhetatm | pdb_tidy > 1brs_AD_noHET.pdb. For this scenario alone, using AiiDA might be overkill, but imagine that this feeds into some further structure prediction program and it becomes much more reasonable. If you then take into account that you often need to play around with different parts of your pipeline, it suddenly becomes very reasonable to have AiiDA as a supervising process.

Right now however, having to write a calcjob, code and parser for each and every item is just not worth the effort.

espenfl commented Jan 14, 2022

I would also add my full support for this. There are many reasons why it is nice and convenient to have this functionality. Strict provenance is not always the main goal and other concerns drive the development. Great addition and thanks.

Drvanon commented Jan 14, 2022

I wanted to also comment that I don't think having the shellfunction be a decorator makes much sense, since there is no intuitive role for the function body. Instead, I would argue that a class would fit the AiiDA architecture much better. I would suggest the following API wrapping around what HADDOCK would do. (I may have gone a little wild with this one).

# PDBData class would obviously have to be fleshed out a lot more.
class PDBData(Str):
    @classmethod
    def from_pdb(cls, inputfile):
        pass

    def to_pdb(self, outputfile):
        pass

class PDBParser(ShellParser):
    def parse(self):
        self.out("structure", PDBData.from_pdb(self.retrieved.open("stdout", 'r')))


class SelectChain(ShellJob):
    command = "pdb_selchain"
    parser = PDBParser

    def define(self, spec):
        # this should define stdin as an input and stdout as an output
        super().define(spec)
        spec.input("chains", valid_type=List)
        spec.output("structure", valid_type=PDBData)

    # Not too sure about this function name
    def run(self, stdin):
        self.arguments = ["-" + ",".join(self.inputs.chains)]
        super().run(stdin=stdin)


class DelHetAtoms(ShellJob):
    command = "pdb_delhetatm"
    parser = PDBParser


class PDBTidy(ShellJob):
    command = "pdb_tidy"
    parser = PDBParser


@workfunction
def workflow(structure):
    """Simple workflow that takes a file, splits it, removes the last line of each and reconcatenates."""

    chains = SelectChain(chains=["A", "D"]).run(stdin=structure.to_pdb())
    without_hetatms = DelHetAtoms().run(stdin=chains.to_pdb())
    return PDBTidy().run(stdin=without_hetatms.to_pdb())

edit: I feel like I am not 100% using the spec object as you guys do, and the way I use run is definitely off, but I hope you get the core of my idea.

sphuber commented Jan 14, 2022

Thanks a lot for the comments and suggestions @Drvanon . I will have a look at your interface ideas and see what I can incorporate.

chrisjsewell commented Jan 17, 2022

So I think it's a good starting point, thanks.
But, firstly, it feels unnecessarily restrictive, in how it generates outputs.

I would suggest something along the lines of:

from pathlib import Path
from typing import Dict

from aiida.orm import Data, SinglefileData, Str
from aiida.engine.processes.functions import shellfunction  # import path taken from the examples above

@shellfunction('abc')
def abc(output: Path, stdout: str, stderr: str) -> Dict[str, Data]:
    """Run the ``abc`` command."""
    return {
        "stdout": Str(stdout),
        "stderr": Str(stderr),
        "output": SinglefileData(output / "output.txt")
    }

This gives users full control over what they output, and allows for post-processing steps.

sphuber commented Jan 17, 2022

But, firstly, it feels unnecessarily restrictive, in how it generates outputs.

In the first design, I took the same approach that you suggest (more or less), and the current implementation still injects the working directory as a path into the function body, exactly so that the function body can access and return any generated output. However, I think that it is crucial that this is not a requirement and it should be possible to just attach outputs without having to implement any Python logic. That being said, having the option to manually control output for those cases where it is needed might still be a good idea.
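
As a rough sketch of what that optional manual control could look like (the name of the injected working-directory argument and the exact signature are assumptions on my part, not the actual interface):

from pathlib import Path

from aiida.orm import SinglefileData
from aiida.engine.processes.functions import shellfunction

@shellfunction(command='split')
def split(dirpath: Path, **kwargs):
    """Run ``split`` and manually attach one of the generated files as an output.

    ``dirpath`` is assumed to be the working directory injected by the decorator
    after the command has run, so the body can post-process any generated file.
    """
    return {'first_chunk': SinglefileData(dirpath / 'xaa')}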

chrisjsewell commented:

However, I think that it is crucial that this is not a requirement and it should be possible to just attach outputs without having to implement any Python logic.

Why?

sphuber commented Jan 17, 2022

Why?

Because one of the main reasons for this AEP was to make it easy to run external shell commands through AiiDA. Other workflow systems make this dead easy and people expect something similar in AiiDA. For this type of user there is a big difference between just specifying the expected filename that should be captured and writing explicit Python to do it.

It also makes it a lot easier to allow defining the shellfunctions through static markup, such as YAML. I actually already have a working proof of concept that takes a CWL (Common Workflow Language) spec and runs it through AiiDA as shellfunctions. This is relatively easy exactly because it is not necessary to write actual Python to do the file capture.
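
Just to sketch the idea (the YAML schema and the translator below are purely illustrative and not the actual proof of concept):

import yaml

from aiida.orm import List
from aiida.engine.processes.functions import shellfunction

SPEC = """
command: head
attach_stdout: true
arguments: ['-n', '1', '{single_file}']
"""

def build_shellfunction(spec_yaml):
    """Build a shellfunction and its arguments from a declarative spec, without user-written Python."""
    spec = yaml.safe_load(spec_yaml)

    @shellfunction(command=spec['command'], attach_stdout=spec.get('attach_stdout', False))
    def function():
        """Dynamically generated shellfunction."""

    return function, List(spec['arguments'])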

chrisjsewell commented Jan 17, 2022

It also makes it a lot easier to allow defining the shellfunctions through static markup, such as YAML.

There is a big difference between a Python API (which is what this AEP is for) and a specification for a declarative workflow format.
To use this decorator, users already have to write Python,
and what we will end up with is many keyword arguments that try to capture all possible output possibilities.
A declarative workflow should be a separate concern, and thus a separate AEP.

Basically, I don't think it inhibits implementing a (more restricted) declarative workflow whilst having a less restricted Python API. This is essentially what we do all the time when exposing verdi CLI commands for aspects of the Python API.

sphuber commented Jan 17, 2022

A declarative workflow should be a separate concern, and thus a separate AEP.

I am not suggesting implementing one here; I am just saying that the interface this AEP introduces would make it easy to implement that as well. I marked it as a secondary argument though, so it is not the most important concern.

To use this decorator, users already have to write Python,
and what we will end up with is many keyword arguments that try to capture all possible output possibilities.

I stand by my point that the main target use case is running a very simple command line program with some basic file output capturing, and that should be as simple as possible. Providing some keyword arguments that automate the output capturing is a lot simpler than having to explain to users how to capture it themselves and return the results as output nodes. Given that for those who want or need the additional flexibility we can still provide it, I think that is the way to go.

I am actually even wondering whether the concept of the shellfunction decorator as the interface is really the best way to go. I chose it primarily because there is a nice analogy with the calcfunction, so it is somewhat familiar to users of AiiDA, and it was easy to implement. That being said, it can be very counter-intuitive for people who are new to AiiDA. It is weird that you declare the command and expected outputs first, and only then call it with the input arguments. In one's head, one typically defines the inputs first and then what the expected outputs should be. The few non-AiiDA people I have spoken to so far confirm this.

chrisjsewell commented:

Providing some keyword arguments that automate the output capturing
Given that for those that want or need the additional flexibility we can still provide it, I think that is the way to go.

Yep, I feel the API needs to allow for the flexibility, otherwise you are just going to end up with hundreds of keyword arguments to try and capture all possible user requirements.
But it is fine to "default" to a set of outputs.
Honestly, I would get rid of at least output_filenames, because it detracts from the "simplicity" argument.

q-posev commented Jan 18, 2022

Very interested to see where this AEP goes. For us at TREX-CoE this would also be a very useful feature, since we have developed a data format (and the associated API) to communicate wavefunctions between the codes. We have a CLI to convert outputs of some external codes into our format. At the moment I have to write a new AiiDA plugin for this CLI in order to convert files between different steps of the provenance. Thus, having a one-liner CLI functionality would be awesome.

sphuber commented Jan 20, 2022

Update

After the first round of review and comments, there are a few more observations that I think should guide the design:

  1. The interface should provide as much flexibility in the Python API as possible but make it fully optional. That is to say, for the majority of users who do not need advanced behavior, it should be dead easy to use and the syntax should be as clean and readable as possible. A good rule-of-thumb I think is that it should be possible to declare it in a static markup language.
  2. There seems to be a need to be able to run shell commands on machines other than localhost. This would clearly make the interface more complex, as users will have to define a "computer" and so this will take additional configuration, but this might be worth the additional flexibility. Again, making sure this is optional and not requiring this complexity for users that don't need it is of utmost importance.
  3. Whenever possible, the running of shell commands should not require the implementation of custom Python code that needs to be registered through entry points or be made importable. This includes allowing the shell jobs to be runnable through the daemon without having to restart the interpreter that is launching them, nor the daemon itself.

Design of alternative or additional interface

To accommodate the requirement of running the shell job on remote machines, I have drafted a new interface and accompanying implementation. From the user's perspective, it looks as follows:

from aiida.engine import shell_job

single_file = orm.SinglefileData(StringIO('\n'.join([str(i) for i in range(10)])))
results = shell_job(
    'split',
    ['-l', '2', '{single_file}'],
    files={'single_file': single_file},
    outputs=['x*']
)

This would launch the command split -l 2 single_file on the localhost. It actually launches a ShellCalculation, which is an implementation of the CalcJob interface. The shell_job function is a simple wrapper that provides a simplified interface and launches the ShellJob. In addition, it will turn the command split into a Code instance (if one doesn't already exist), making sure to determine the full path of the binary and assigning the localhost as the Computer. If the localhost computer doesn't exist yet, it is automatically created and configured with optimal settings for efficiency (setting waiting intervals to zero). This interface ensures that it is as simple as possible for the great majority of simple use cases.
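
For illustration, this is roughly what the on-the-fly Code setup inside the wrapper could look like (a simplified sketch with assumed names, not the actual implementation; creating and configuring the localhost Computer is omitted):

import shutil

from aiida.orm import Code, QueryBuilder, load_computer

def get_or_create_code(command: str, computer_label: str = 'localhost') -> Code:
    """Return an existing ``Code`` for ``command``, or create one on the given computer."""
    existing = QueryBuilder().append(Code, filters={'label': command}).first()
    if existing:
        return existing[0]

    filepath = shutil.which(command)  # resolve the full path of the binary
    if filepath is None:
        raise ValueError(f'command `{command}` was not found on the PATH')

    computer = load_computer(computer_label)
    code = Code(remote_computer_exec=(computer, filepath))
    code.label = command
    return code.store()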

That being said, because it is implemented as a CalcJob, it is possible to launch this on a remote machine. In this case, one simply specifies the computer through the metadata.options just as one would with a normal CalcJob. Through the shell_job function this would look like:

from aiida.engine import shell_job

single_file = orm.SinglefileData(StringIO('\n'.join([str(i) for i in range(10)])))
results = shell_job(
    'split',
    ['-l', '2', '{single_file}'],
    files={'single_file': single_file},
    outputs=['x*'],
    metadata={
        'options': {
            'computer': load_computer('remote_machine')
        }
    }
)

This will once again create a Code on the fly for the given Computer, automatically expanding the simple command name into the full path, which is required by the Code class. For simplicity, we could consider allowing options to be specified directly as a keyword rather than nested within metadata, but then again, users might want to set other metadata inputs, such as label and description, through this keyword (both variants are sketched below).
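
To make the trade-off concrete, the two call styles would compare as follows (the flat options keyword is hypothetical; only the nested metadata form is part of the current draft):

from aiida.engine import shell_job
from aiida.orm import load_computer

# ``single_file`` is the SinglefileData defined in the examples above.

# Current draft: options nested inside ``metadata``.
results = shell_job(
    'split',
    ['-l', '2', '{single_file}'],
    files={'single_file': single_file},
    metadata={'options': {'computer': load_computer('remote_machine')}},
)

# Hypothetical alternative: ``options`` promoted to a top-level keyword.
results = shell_job(
    'split',
    ['-l', '2', '{single_file}'],
    files={'single_file': single_file},
    options={'computer': load_computer('remote_machine')},
)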

Now, if we take the same example workflow as described in the current AEP, but use the new ShellJob, it would look like the following:

from aiida import engine

@engine.workfunction
def workflow(single_file):
    results_split = engine.shell_job(
        'split',
        ['-l', '2', '{single_file}'],
        files={'single_file': single_file},
        outputs=['x*']
    )

    results_head = {}

    for label, node in sorted(results_split.items()):
        results = engine.shell_job(
            'head',
            ['-n', '1', '{single_file}'],
            files={'single_file': node},
            filenames={'single_file': label}
        )
        results_head[label] = results['stdout']

    results_cat = engine.shell_job(
        'cat',
        [f'{{{key}}}' for key in results_head.keys()],
        files=results_head
    )

    return results_cat


single_file = orm.SinglefileData(StringIO('\n'.join([str(i) for i in range(10)])))
results, node = workflow.run_get_node(single_file)
print(f'Workflow<{node.pk}> finished: {results}')

The provenance graph looks slightly more complex because each command now has two additional output nodes (the RemoteData and the retrieved FolderData nodes).

[Provenance graph]

That being said, the interface doesn't look that much more complicated than that of the shellfunction, and it no longer suffers from the counterintuitive design where the outputs are defined in the decorator before the input arguments are specified and passed to the function call. It feels more declarative and follows a more logical order. On top of that, it has the big added bonus that it can now be easily run on remote machines as well as on the localhost, which is a big plus. Still, it will have a non-negligibly larger overhead because the CalcJob implementation is more costly than the process function, but this may be acceptable in some cases. One question I have is whether it makes sense to keep the shellfunction around, default to it for executions on localhost (making it faster), and only delegate to the ShellJob when running on a remote machine. This complexity could be abstracted away from the user through the shell_job wrapper function (see the sketch below).
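
That last idea could be sketched roughly as follows (purely illustrative; the helper names and the localhost check are assumptions, not part of any implementation):

def shell_job(command, arguments, files=None, outputs=None, metadata=None):
    """Dispatch to the cheaper shellfunction on localhost, otherwise delegate to the ShellJob."""
    options = (metadata or {}).get('options', {})
    computer = options.get('computer')

    if computer is None or computer.label == 'localhost':
        # Hypothetical helper wrapping the shellfunction execution path.
        return _run_shellfunction(command, arguments, files=files, outputs=outputs)

    # Hypothetical helper wrapping the ShellJob (CalcJob) execution path.
    return _run_shelljob(command, arguments, files=files, outputs=outputs, metadata=metadata)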

Next steps

I think at this point it is clear that we have two designs, each with its advantages and disadvantages. I would love to hear more feedback on the concept of the ShellJob and the shell_job wrapper function as a simplified interface. If you want to give this a spin, let me know and I will make the implementation available so you can try it for your use case. I think that is the best way to flesh out the design and identify weak points in the functionality or unintuitive parts of the interface.

Drvanon commented Jan 21, 2022

Great suggestion! One thing that I wasn't able to fully make out is whether you are planning to include easy ways of defining parsers for the shell_jobs; "queryability" of the contents of the files would be amazing for us.

sphuber commented Jan 21, 2022

parsers for the shell_jobs; "queryability" of the contents of the files would be amazing for us.

That's a good point. The difficulty here will lie in making it possible to write these parsers dynamically. That is to say, for example, writing them in a notebook and running them through the daemon without first having to make sure the parser lives in an importable module. I am not sure how to accomplish this, since the parser will most likely have to be Python code and so it will have to be readable by the daemon somehow. Of course it would be possible to define the parser locally and then run the job instead of submitting it to the daemon, but this will not scale and may not be useful for all use cases. Will have to think about it some more.
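
One possible approach (just an idea, not something settled here) would be to serialize the parser's source code, store it in the node's attributes, and have the daemon reconstruct the function from that source, for example:

import inspect
import textwrap

def serialize_parser(parser):
    """Return the parser's name and dedented source so they can be stored as node attributes."""
    return parser.__name__, textwrap.dedent(inspect.getsource(parser))

def deserialize_parser(name, source):
    """Recreate the parser function from its stored source in a fresh namespace."""
    namespace = {}
    exec(source, namespace)  # re-executes the stored source to rebuild the function object
    return namespace[name]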

chrisjsewell commented:

Just to clarify one thing: is this essentially a "convenience function", which is wrapping existing aiida-core functionality, or do you envisage that it requires any actual "structural" change to aiida-core itself?

sphuber commented Jan 21, 2022

Just to clarify one thing: is this essentially a "convenience function", which is wrapping existing aiida-core functionality, or do you envisage that it requires any actual "structural" change to aiida-core itself?

A bit in between, I would say. For the shellfunction I add new functionality, but it still leverages the existing process_function decorator. For the shell jobs, it really is just an implementation of a CalcJob (ShellCalculation) with a Parser (ShellParser). The shell_job is a pure convenience function to simplify launching a ShellCalculation. So both concepts in essence reuse the normal engine and are ultimately instances of a Process that runs. This would essentially mean that it could also be just provided by a plugin. Except for the fact that I think it would be useful to have separate ProcessNode classes to represent the executions of the shellfunction and ShellCalculation for querying purposes, and this cannot be extended by a plugin, but has to be done in aiida-core.

You can check the implementation in this branch and even give it a spin.

chrisjsewell commented:

Yeh cheers, I ask because I feel initially this should indeed be a plugin.
This is not to say that eventually it would be part of aiida-core but, as a plugin, it would allow for a lot more flexible development (from user feedback), rather than being beholden to the release cadence of aiida-core.

Except for the fact that I think it would be useful to have separate ProcessNode classes to represent the executions of the shellfunction and ShellCalculation for querying purposes, and this cannot be extended by a plugin, but has to be done in aiida-core.

is this not something we should try to "abstract" in aiida-core, to allow for

sphuber commented Jan 21, 2022

Yeh cheers, I ask because I feel initially this should indeed be a plugin. This is not to say that eventually it would be part of aiida-core but, as a plugin, it would allow for a lot more flexible development (from user feedback), rather than being beholden to the release cadence of aiida-core.

Might be a good idea to initially provide this as a standalone package that people can install. The increased flexibility for development would indeed be worth something, and then, when it is mature and worked out, we can move it into aiida-core.

Except for the fact that I think it would be useful to have separate ProcessNode classes to represent the executions of the shellfunction and ShellCalculation for querying purposes, and this cannot be extended by a plugin, but has to be done in aiida-core.

is this not something we should try to "abstract" in aiida-core, to allow for

Not sure. The reason we didn't allow this is that it directly touches the provenance graph syntax rules. There is quite a bit of internal logic that makes assumptions about these rules, and not respecting them could have pretty grave consequences. Therefore it is not really safe to let users extend this willy-nilly. For the current implementation it is not strictly necessary to have new node subtypes, but I do think it is useful in this case for queryability, and I think it works without breaking our provenance grammar.

sphuber commented Jan 21, 2022

On second thought, there is another downside to having this developed initially as an external package: we would have to change the entry points of the calcjob and parser plugins. Once we move them to aiida-core, existing nodes will have the incorrect process type and it is not clear to me whether we can provide a data migration, even from aiida-core.

chrisjsewell commented:

we would have to change the entry points of the calcjob and parser plugins.

If you are just talking about the name prefixes, I don't see this as being the case: there is no technical reason why you cannot initially name them with the core. prefix, or have non-core prefixes within aiida-core. It's only a guideline.

Commit: "These changes reflect the current state of the user interface and implementation design as implemented by `aiida-shell`."
Drvanon commented Oct 15, 2022

Hey all,
what is the status of this project? In a few months we will be revisiting our pipeline software and I want to take AiiDA along in that discussion. Our situation hasn't really changed much on the AiiDA side, other than that the CLI functions we call have changed a bit.

sphuber commented Oct 17, 2022

@Drvanon I have updated this AEP with the most recent design. I have implemented it as a stand-alone plugin, aiida-shell. Instructions are in the README of the repository. Please give that a go and see if it allows you to easily run your pipeline. Let me know if you run into problems or have questions.

sphuber commented Oct 20, 2022

@Drvanon I have implemented the custom parsing feature. I have added it to this branch on the aiida-shell repository. See the README.md for a description of the functionality. In short, it allows you to do something like:

from aiida_shell import launch_shell_job

def parser(self, dirpath):
    """Custom parsing defined on-the-fly."""
    from aiida.orm import Str
    return {'string': Str((dirpath / 'stdout').read_text().strip())}

results, node = launch_shell_job(
    'echo',
    arguments=['some output'],
    parser=parser
)
print(results['string'].value)

This example is of course trivial and there is little sense in converting the output of a file to a Str node, but you can return any Data node, and any number of data nodes. Let me know if this would satisfy your use case (I think it should) and feel free to try it out.
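
A slightly less trivial variant against the same interface (the input node some_single_file is assumed to exist already) could turn the captured stdout into a queryable Dict:

from aiida_shell import launch_shell_job

def parser(self, dirpath):
    """Parse the line, word and byte counts printed by ``wc`` into a ``Dict`` node."""
    from aiida.orm import Dict
    lines, words, chars = (dirpath / 'stdout').read_text().split()[:3]
    return {'counts': Dict(dict={'lines': int(lines), 'words': int(words), 'chars': int(chars)})}

results, node = launch_shell_job(
    'wc',
    arguments=['{input_file}'],
    nodes={'input_file': some_single_file},  # assumed pre-existing ``SinglefileData``
    parser=parser,
)
print(results['counts'].get_dict())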

sphuber commented Oct 21, 2022

@Drvanon I looked back in the history of this discussion and went back to your comment describing what a typical workflow would be in your field: chaining various operations of the pdb-tools package to operate on PDB protein files. Basically this shell pipeline:

pdb_fetch 1brs | pdb_selchain -A,D | pdb_delhetatm | pdb_tidy > 1brs_AD_noHET.pdb

I converted this to a script using aiida-shell and I think it is starting to become very easy to use. Have a look for yourself:

#!/usr/bin/env runaiida
"""Simple ``aiida-shell`` script to manipulate a protein defined by a .pdb file.

In this example, we show how the following shell pipeline:

    pdb_fetch 1brs | pdb_selchain -A,D | pdb_delhetatm | pdb_tidy > 1brs_AD_noHET.pdb

can be represented using ``aiida-shell`` by chaining a number of ``launch_shell_job`` calls.
All that is required for this to work is a configured AiiDA profile and that ``pdb-tools`` is installed.
"""
from aiida_shell import launch_shell_job

results, node = launch_shell_job(
    'pdb_fetch',
    arguments=['1brs'],
)

results, node = launch_shell_job(
    'pdb_selchain',
    arguments=['-A,D', '{pdb}'],
    nodes={'pdb': results['stdout']}
)

results, node = launch_shell_job(
    'pdb_delhetatm',
    arguments=['{pdb}'],
    nodes={'pdb': results['stdout']}
)

results, node = launch_shell_job(
    'pdb_tidy',
    arguments=['{pdb}'],
    nodes={'pdb': results['stdout']}
)

print(f'Final pdb: {node}')
print(f'Show the content using `verdi node repo cat {node.pk} pdb`')
print(f'Generate the provenance graph with `verdi node graph generate {node.pk}`')

Simply make this script executable, and run it:

chmod +x example_pdb.py
./example_pdb.py

It just requires that you have installed aiida-shell and pdb-tools, and that an AiiDA profile is configured. It will produce the final desired PDB and the provenance looks as follows:
[Provenance graph]

Really interested to see what you think about this and whether it looks intuitive.

Drvanon commented Nov 6, 2022

This is looking very cool! In December we are starting a new project; I'll use it to perform a little proof of concept.
