AEP: Add data plugin to model remote persistent data #27

Open · wants to merge 4 commits into base: master

Conversation

@broeder-j (Member) commented Oct 4, 2021

  • Used AEP template from AEP 0
  • Status is submitted
  • Added type & status labels to PR
  • Added AEP to README.md
  • Provided GitHub handles for authors

@sphuber (Contributor) left a comment

Thanks @broeder-j. I think this is a very interesting idea with a lot of promise to be useful. I have left some initial questions and comments.

Comment on lines 14 to 18
Currently, AiiDA is not very good at keeping track of the provenance of large files and
of data which comes from the outside but is persistent.

For example: large files will be duplicated on disk for every minor change to them.
If a file should be part of the provenance, it also has to be stored in the repository.
@sphuber (Contributor)

Granted, when some piece of data that is stored in the repository is changed marginally, the content is currently duplicated (even under the new repository, which will at least not duplicate copies of the exact same file). However, the proposed data plugin doesn't really solve that issue, does it? It merely prevents having to store the large data in the first place. Still a valid point, but I find that this objection in the introduction is a bit misleading.

@broeder-j (Member, Author) Oct 5, 2021

Right, it would only solve the case of large external input data, of which I only need a little. I just wrote this to point that out, because if there were a 'general' solution for large files, one might not have a need for such a 'PIDData' class anymore.

Comment on lines +20 to +21
If one starts from external data sources, if it is an AiiDA database one would have to import
it and export the part which one needs and continue from that. Not so nice if one needs only a
@sphuber (Contributor)

I don't understand this sentence. If you need starting data which is part of an AiiDA database, why would you need to import it and then export the part of the nodes that you are interested in? Do you mean: one would have to export the part that one is interested in and import this into the working database?

If that is what you meant, then I also don't really see the problem here, because you can export and then import exactly just the parts you need.

@broeder-j (Member, Author)

Right, but to export the part that I am interested in, I first need to import the whole thing into a temporary database, do I not?
It would be nice to be able to do such a thing on AiiDA export files without importing, maybe. But that is something else.

@sphuber (Contributor)

Ah, I see now: you are describing the case where you already have an existing AiiDA archive from which you want to import only a part. If you state this specifically, the sentence will be clear.

Comment on lines 22 to 23
small part from a large database. There is also currently no AiiDA intrinsic way to keep the
provenance from other data on the internet which has persistent identifiers and links.
@sphuber (Contributor)

There actually is already some concept for this. The Data base class defines the source property, which returns the source attribute if it is defined. Granted, this is all there is, and the schema of this source attribute is not very rigorously defined or enforced, but the DbImporters do use this to record the source of external data. For example, CifData nodes imported with the CodDbImporter record the source URL, the external persistent data id, the version, etc. We might also want to investigate whether we can improve this concept and make it a bit more rigorous.
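A minimal sketch of what setting this `source` attribute looks like (assuming the key names accepted by recent aiida-core releases; the exact allowed key set may differ between versions, and the values below are made up for illustration):

```python
from aiida import load_profile
from aiida.orm import Data

load_profile()

# The Data base class exposes a `source` property that stores a dictionary
# describing where the content came from. The keys below mirror what e.g. the
# CodDbImporter records for CifData.
node = Data()
node.source = {
    'db_name': 'COD',                                         # external database name
    'db_uri': 'http://www.crystallography.net/cod',            # database location
    'id': '1000000',                                           # persistent identifier of the entry
    'version': '2021-10-04',                                   # version of the external entry
    'uri': 'http://www.crystallography.net/cod/1000000.cif',   # direct link to the content
}
node.store()
print(node.source)
```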

@broeder-j (Member, Author)

I was not aware of this, but it is an interesting fact, because maybe the solution is then a PIDData type with a basic importer that uses the current source property.

@sphuber (Contributor)

I just realized that this feature is actually broken for all data plugins (and has been since v1.0 😨), except for UpfData and CifData, as I describe in this issue. I have opened a PR to restore this functionality, though.

Comment on lines +26 to +27
A lot of work, which every user does differently and some not at all; this causes data nodes to appear out
of nowhere, with only 'goodwill' provenance information in the extras and other annotations set by the user.
@sphuber (Contributor)

How would the "appearing out of nowhere" be solved by the proposed data type? By definition, if you are importing external data, either by actually importing the content or by including some kind of soft link as this proposal suggests, the node is going to appear without preceding provenance, is it not?

@broeder-j (Member, Author)

See the use cases under the last comment.

Maybe add an option to also store this data in the repository, if wanted.

### Proposals
Maybe one could also think here about special subclasses, like specialized for data with PIDs and persistent URLs or AiiDA exports.
@sphuber (Contributor)

What would be specialized about the subclass for data with PIDs and persistent URLs? I thought the base concept would already require a piece of data with a PID/persistent URL?

@broeder-j (Member, Author)

As you pointed out above, the PIDData would be very basic and would provide the data 'as is' to you. If there are more common use cases for certain data, one might subclass PIDData and add special functions which can deal better with that particular kind of data. (So far I could not think of such a common, important use case.) In a first implementation I would not go for such things, but it is something which might be kept in mind.

* For external sources one could look at features of [git-annex](https://git-annex.branchable.com/) for example,
which stores a soft link for large files and only downloads them when they are needed.

## Proposed Enhancement
@sphuber (Contributor)

My main question regarding this part is how the remote data would be made available locally. I imagine that the remote data can be a wide variety of types: just a single file, a zip file, an AiiDA export archive, a plain directory, etc. We would have to think about how the content is made available. Does the plugin simply download the content into a temporary folder and give the absolute filepath of the temp folder to the caller? Should the content be mutable (I guess if you hand out the temp folder you cannot prevent it)? How do you detect changes, such that if it is reused again in another calcfunction the data is still the original? If you have a caching system, you might have to provide a clone of the data when requested and not the original, such that the client cannot mutate the original. These are just some questions that bubbled up when thinking about this. Just wondering if you already had a bit of an idea of how this would work. Some explicit example use cases might help in mapping this out.

@broeder-j (Member, Author) Oct 5, 2021

This is a good point.
I would just make the data available as it is to the caller, and within the calcfunction they can do whatever they want (and have to know what kind of data it is) without changing the data.

About ensuring no changes:

  1. Is it possible to make the downloaded data/temp folder read-only?
  2. One could check an md5 hash prior to each use... How long does it take to calculate an md5 hash? Maybe not a good idea, since one would like to use this for large files.
  3. The modification time of the data is not allowed to change?
  4. Of course these checks could break if the folder is moved (or on AiiDA export/import), but here one could add logic to download the data again if the path does not exist. (A rough sketch combining points 1, 2 and 4 follows below.)
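A rough sketch of how points 1, 2 and 4 could be combined. These are purely hypothetical helpers, not an existing AiiDA API; sha256 is used instead of md5, but the idea is the same:

```python
import hashlib
import urllib.request
from pathlib import Path


def _sha256(path: Path, chunk_size: int = 2 ** 20) -> str:
    """Stream the file so that even very large downloads do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open('rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def download_file(url: str, target: Path) -> None:
    """Naive download helper; a real plugin would choose the transport per data source."""
    target.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, target)


def ensure_local_copy(url: str, cache_path: Path, expected_checksum=None) -> Path:
    """Return a verified local copy, re-downloading if it is missing or was mutated (points 2 and 4)."""
    if not cache_path.exists() or (expected_checksum and _sha256(cache_path) != expected_checksum):
        if cache_path.exists():
            cache_path.chmod(0o644)  # make writable again before replacing a stale copy
        download_file(url, cache_path)
    if expected_checksum and _sha256(cache_path) != expected_checksum:
        raise ValueError(f'content at {url} no longer matches the recorded checksum')
    cache_path.chmod(0o444)  # best-effort read-only (point 1); this can of course be circumvented
    return cache_path
```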

Example use cases:

  1. Main use case: starting from a single persistent huge file on the web, I am interested in only a few data points to do further work. Currently, I would store this as a SinglefileData, write a calcfunction to extract that data and continue from there. Where I got the SinglefileData from I would store in the extras. The SinglefileData would be in any export unless I exclude it explicitly. If I do not want to store the SinglefileData, I would not run a calcfunction and would create the extracted node without a connection in the database (with more work, a dummy connection) and some information in the extras on how it came to be. This is not fully automated.

The new data type would replace this SinglefileData with a soft link in the database and repository, i.e. the first time I execute the calcfunction to extract the data, the file is downloaded and I extract the data points within the calcfunction with clear provenance (because the PIDData is the input), so the database records how they came to be without my doing anything 'manual' in addition. So fully automated and nothing specific to my AiiDA usage style. (A rough sketch of such a calcfunction follows after these use cases.)

If I use the same PIDData node again in some other calcfunction, it should use the local instance and not download again. If I give this script to somebody else, they can run the same thing and reproduce my work. If I provide an export, it contains just my work plus the PIDData node. People using this export could reuse my PIDData node (the download would be done again on their machine).

  2. One wants to run a simple post-processing data evaluation on some large data(bases) through AiiDA and record the provenance. Let's say this is one or several AiiDA databases. Currently I would have to import these AiiDA databases into my work database, or, smarter, as you pointed out, load them into a temporary database, export the nodes I need (do I take the full provenance with me here?), and import those into my work database. If I want the provenance and need all 'last results' child nodes, I end up with the full database in my work database anyway. After this import I would write a calcfunction taking all the needed child nodes to create the inputs for the post-processing I want to do. Here the provenance is kept really nicely, but again, if I publish this data, where do I cut the graph? If I cut it, I would again have to manually provide some extra information on where these 'starting' nodes came from and how (i.e. the link to the DOI would not be in the database by default; maybe with verdi import url it is?).

This use case is trickier and I am not yet sure what is best to do here (since I think it is not so nice yet).
With the PIDData, one could do the 'import, export, (delete)' of the AiiDA databases inside a calcfunction (with only the PIDData types as input) and return the result nodes. Now I can safely remove the imported database without losing provenance, or I could export only the small AiiDA graph with the PIDData, calcfunction and result nodes and import it into my work database.

Another way could be to extract the needed information from the export files by some other means inside a calcfunction, without doing an import (because the databases are maybe too large).

The same thing could also be done already now (without the PIDData), by writing a calcfunction which takes string nodes with the PIDs as inputs, does the whole thing and outputs my needed data nodes. But then it would not be clear that the 'string nodes' represent external data sources.

  3. Data which has certain access rights and which I do not want to end up in the AiiDA repository or database in the first place. In this case the PIDData (download) would only work for people whose machines have the rights to access the remote persistent data, but there would still be some provenance about it in the database (without cutting AiiDA graphs prior to publication). This may not hold completely, because the extraction logic will be in the calcfunction, which might contain more information than allowed. On the other hand, one can always 'mask' code in the provenance through imports from somewhere.
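The calcfunction sketch referred to in the main use case above would look roughly like this. PIDData and its get_local_path() method are the hypothetical plugin proposed in this AEP and do not exist in aiida-core; the parsed keys are made up purely for illustration:

```python
from aiida.engine import calcfunction
from aiida.orm import Dict


@calcfunction
def extract_datapoints(source):
    """Extract the few values of interest from the large remote file.

    ``source`` is meant to be a PIDData node (the hypothetical data plugin of this
    AEP). Because it is the input of the calcfunction, the provenance of the
    extracted values is recorded automatically, without the large file itself
    ever being stored in the repository.
    """
    filepath = source.get_local_path()  # hypothetical: downloads on first use, reuses the local copy afterwards
    values = {}
    with open(filepath) as handle:
        for line in handle:
            key, _, value = line.partition('=')
            if key.strip() in ('band_gap', 'total_energy'):  # made-up keys, just for illustration
                values[key.strip()] = float(value)
    return Dict(dict=values)


# pid_node = PIDData(uri='https://example.org/dataset/123')  # hypothetical constructor
# result = extract_datapoints(pid_node)
```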

@sphuber (Contributor)

Thanks for the additional details and information @broeder-j, they help a lot. I think these are indeed good use cases and will be very useful. The main question is the implementation of how to retrieve the data locally. I think it is smart to have a local caching system so as not to have to download the data multiple times. However, I also think using the data should be idempotent. This would mean that the data actually being made available to a caller is a clone of what is in the cache, such that the cache is guaranteed not to be mutated. I wouldn't go for an attempt to give the caller a direct link to the cache instance and try to make it non-mutable; there will always be some way to circumvent that. So I think the safest is to accept that each invocation will require a clone of the data locally, which is still not as bad as having to redownload the entire thing.
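A rough sketch of that clone-on-access idea (again hypothetical, not an existing API), just to make the trade-off explicit:

```python
import shutil
import tempfile
from pathlib import Path


def get_working_copy(cache_path: Path) -> Path:
    """Hand the caller a throw-away clone so that the cached original cannot be mutated.

    Cloning a local file or directory is much cheaper than re-downloading the
    remote content, which is the trade-off described above.
    """
    workdir = Path(tempfile.mkdtemp(prefix='piddata-'))
    clone = workdir / cache_path.name
    if cache_path.is_dir():
        shutil.copytree(cache_path, clone)
    else:
        shutil.copy2(cache_path, clone)
    return clone
```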

@broeder-j (Member, Author)

Shall I add the use cases to the proposal, together with some corrections of the other points?

@sphuber (Contributor) commented Oct 5, 2021

> Shall I add the use cases to the proposal, together with some corrections of the other points?

Yes, it would be great if you could update the text with some of the suggestions and add the use cases, which make it more concrete why this would be useful and what the implementation should look like.

@broeder-j (Member, Author)

Any updates on this? Do I still need to do something? @sphuber do you want to champion this?

@sphuber (Contributor) commented Nov 23, 2021

Apologies for the delay @broeder-j , have not had the time to sit down and finalize this. Would it be ok to discuss this during the upcoming coding week (6 - 10 December)? Not sure if you are going to be there, but maybe we could schedule a quick call to hash out any last things. Will also give the opportunity to discuss this with other developers that might be interested.

@sphuber changed the title from "Add first version of (web) PIDData proposal" to "AEP: Add data plugin to model remote persistent data" on Dec 16, 2021
@broeder-j (Member, Author)

> Apologies for the delay @broeder-j , have not had the time to sit down and finalize this. Would it be ok to discuss this during the upcoming coding week (6 - 10 December)? Not sure if you are going to be there, but maybe we could schedule a quick call to hash out any last things. Will also give the opportunity to discuss this with other developers that might be interested.

Hi, apologies, I was completely out of office back then and this flew under my radar. I would be happy to finalize this now and get it merged. Happy to schedule a call or whatever works.
