AEP: Add data plugin to model remote persistent data #27

broeder-j · 2021-10-04T08:15:29Z

sphuber

Thanks @broeder-j . Think this is a very interesting idea and has a lot of promise to be useful. I have left some initial questions and comments.

sphuber · 2021-10-04T21:41:59Z

006_web_pid_datatype/readme.md

+Currently, AiiDA is not so good in keeping track of the provenance of large files and
+of data which comes from the outside but is persistent.
+
+For example: Large files will be duplicated on disk, for every minor change to them. 
+If a file should be part of the provenance it has to be stored in the repository also.


Granted, when some piece of data that is stored in the repository is changed marginally, the content is currently duplicated (even under the new repository which will at least not duplicate copies of the exact same file). However, the proposed data plugin doesn't really solve that issue, does it? It merely prevents having to store the large data in the first place. Still a valid point, but find that this introduction objection is a bit misleading.

Right, it would only solve the part of large external input data, from which I need little. I just wrote this to point this out. Because if there is a 'general' solution for large files, one may not have need for such a 'PIDData' class anymore.

sphuber · 2021-10-04T21:45:54Z

006_web_pid_datatype/readme.md

+If one starts from external data sources, if it is an AiiDA database one would have to import
+it and export the part which one needs and continue from that. Not so nice if one needs only a


I don't understand this sentence. If you need starting data, which is part of an AiiDA database, why would you need to import it and then export the part of nodes that you are interested in? Do you mean: one would have to export the part that one is interested in and import this into the working database.

If that is what you meant, then I also don't really see the problem here because you can exactly export and then import just the parts you need.

Right, but to export he part that I am interested in, I first need to import the whole thing in a temporary database, or not?
It would be nice to be able to do such a thing on AiiDA export files without import maybe. But this is something else.

Ah I see now, you are describing the case when you already have an existing AiiDA archive from which you want to import only a part. If you add this specifically, the sentence should be clear.

sphuber · 2021-10-04T21:48:36Z

006_web_pid_datatype/readme.md

+small part from a large database. There is also currently no AiiDA intrinsic way to keep the
+provenance from other data on the internet which has persistent identifiers and links.


There actually is some concept for this. The Data base class defines the source property, which returns the source attribute if it is defined. Granted, this is all there is and the schema of this source attribute is not very rigorously defined or enforced, but the DbImporters do use this to record the source of external data. For example, CifData imported with the CodDbImporter record the source URL, external persistent data id and version etc. We might also want to investigate if we can improve this concept and make it a bit more rigorous.

I was not aware of this, but this is an interesting fact, because maybe the solution might be then an PIDData type with an basic importer and using the current source property.

I just realized that this feature is actually broken for all data plugins (and has been since v1.0 😨 ), except for UpfData and CifData as I describe in this issue. I have opened a PR to restore this functionality though.

sphuber · 2021-10-04T21:49:54Z

006_web_pid_datatype/readme.md

+A lot of work, which every user does different and some not at all, this causes data nodes to appear out
+of nowhere, with only 'goodwill' provenance information in the extras, and other annotations set by the user.


How would the "appearing out of nowhere" be solved by the proposed data type. By definition if you are importing external data, either by actually importing the content or by including some kind of soft link as this proposal is suggesting, the node is going to appear without preceding provenance, is it not?

See usecases under last comment

sphuber · 2021-10-04T21:53:08Z

006_web_pid_datatype/readme.md

+Maybe add an option to also add this data to the repository, if wanted.
+
+### Proposals
+Maybe one could also think here about special subclasses, like specialized for data with PIDs and persistent URLs or AiiDA exports. 


What would be specialized about the subclass for data with PIDs and persistent URL? I thought the base concept would require a piece of data with a PID/persistent URL?

As you pointed out above the PIDData would be very basic, provide the data 'as is' to you. If there are some more common usecases for certain data one might subclass the PIDData and add special functions which can deal better with the special given data. (So far I could not think here of such a common important usecase) and in a first implementation I would not go for such things, but it is something which might be kept in mind.

sphuber · 2021-10-04T22:02:32Z

006_web_pid_datatype/readme.md

+* For external sources one could look at features of [git-annex](https://git-annex.branchable.com/) for example,
+which stores a soft link for large files and only downloads them when they are needed. 
+
+## Proposed Enhancement


My main question regarding this part is how would the remote data be made available locally? I imagine that the remote data can be a wide variety of types: just a single file, a zip file, an AiiDA export archive, a plain directory etc.. We would have to think how the content is made available. Does the plugin simply download the content in a temporary folder and give the absolute filepath of the temp folder to the caller? Should the content be mutable (I guess if you give the temp folder you cannot prevent it)? How do you detect changes such that if it is reused again in another calcfunction, the data is still the original. If you have a caching system, you might have to provide a clone of the data when requested and not the original, such that the client cannot mutate the original data. These are just some questions that bubbled up when thinking about this. Just wondering if you already had a bit of an idea of how this would work. Some explicit example use cases might help in mapping this out.

This is a good point.
I would just make the data available as it is by the caller and within the calcfunction he can do whatever he wants (and has to know what kind of data it is) without changing the data.

About ensure no changes:

is is possible to make the downloaded data/temp folder 'read only'?

One could check an md5 hash prior each use... How long does it take to calculate an md5 hash? Maybe not a good idea, since one would like to use this for large files.

Modification time on this is not allowed to change?

Of course these could break if the folder is moved (or on AiiDA export import), but here one could write logic to if the path does not exists, download again.

Example usecases:

Main use case: Starting from a single persistent huge file on the web, I am interested in only a few data points to do further work. Currently, i would store this as a SinglefileData, write a calcfunction to extract that data and continue from there. Where I got the singlefileData from I would store in the extras. The single file data would be in any export unless I exclude it explicitly. if I do not want to store the single file data, I would not run a calcufunction and create the extracted node without connection in the database (with more work a dummy connection) and some information in the extras how they came to be. This is not fully automatized.

The new format would replace this 'SinglefileData' as a softlink in the database and repo, i.e the first time I execute the calcufunction to extract the data, it will be downloaded and I extract the datapoints within a calcfunction with clear provenance (because the 'PIDData is input) the data base how they came to be without doing something 'manual' in addition. So fully automated and nothing generic to my AiIDA usage style.

if I use the same PIDData node again in some other calcufunction it should use the local instance and not download again. If I give this script to somebody else, he can run the same thing and reproduce my work. If I provide an export, I just have my work, plus the PIDData node. People using this export could reuse my PIDData node (would do the download again on their machine).

One wants to run a simple post processing data evaluation on some large data(bases) through AiiDA and record the provenance. Lets say this is one or several AiiDA databases. Currently I would have to import these AiiDA databases into my work database, or smarter as you pointed out, load them in a temporary database export the nodes I need (do I take the full provenance with me here?), and import them into my work database. If I would like the provenance and need all 'last results' child nodes I have the full database in the end in my work database. After this import I would write a calcufunction taking all needed child nodes to create the inputs for the post process I like to do. Here the provenance is really nicely kept, but again if I publish this data, where do I cut the graph? If I cut it, I would again have to manual provide some extra information on where these 'starting' nodes came from and how (i.e the link to the DOI would not be per default in the database, maybe with verdi import url it is?).

This use case is more tricky and I am not sure what best to do here yet (since I think it is not so nice yet).
With the PIDdata, one could do the ‘import export, (delete)’ of the aiida databases inside a calcufunction (with only the PIDData types as input) and return the result nodes. Now I can savely ‘remove this imported database’ without loosing provenance, or I could export now only the small AiiDA graph with the PIDData, calcfuntion and results nodes to import it into my work database.

An other way could also be to extract the needed information from the export files by some other means inside a calcfunction, without doing an import (because the databases are to large maybe).

The same thing could be done also now already (without the PIDData), by writing a calcfunction which takes strings nodes with the PIDs as inputs, does the whole thing and outputs my needed data nodes. But here it would not be clear that the ‘string nodes’ represent external datasources.

use case would be: Data which has certain access rights and I do not want that data to end up in the AiiDA repo or database in the first place. In this case the PIDData (download) would only work for people which machines have the rights to access the remote persistent data, but there would still be some provenance about it in the Database (without cutting AiiDA graphs prior publication). This may not be true, because extraction information will be in the calcfunction, which might contain more information than allowed. On the other hand one can always 'mask' code in the provenance through imports from somewhere.

Thanks for the additional details and information @broeder-j , helps a lot. I think these are indeed good use-cases and will be very useful. Main question is the implementation of how to retrieve the data locally. I think that it is smart to have a local caching system as to not have to download the data multiple times. However, I also think it should be idempotent when using the data. This would mean that the data actually being made available to a caller is a clone of what is in the cache, such that the cache is guaranteed to not be mutated. I wouldn't go for an attempt to give the caller a direct link to the cache instance but try to make it non-mutable. There will always be some way to circumvent it. So I think the safest is to accept that each invocation will require a clone of the data locally, which is still not as bad as having to redownload the entire thing.

broeder-j · 2021-10-05T09:27:00Z

Shall I add the use cases to the proposal, with some correction of the other stuff?

sphuber · 2021-10-05T11:51:13Z

Shall I add the use cases to the proposal, with some correction of the other stuff?

Yes, it would be great if you could update the text with some of the suggestions and add the use cases which make it more concrete why this would be useful and what the implementation should look like.

broeder-j · 2021-11-17T07:59:42Z

Any updates on this? Do I still need to do something? @sphuber do you want to champion this?

sphuber · 2021-11-23T10:35:23Z

Apologies for the delay @broeder-j , have not had the time to sit down and finalize this. Would it be ok to discuss this during the upcoming coding week (6 - 10 December)? Not sure if you are going to be there, but maybe we could schedule a quick call to hash out any last things. Will also give the opportunity to discuss this with other developers that might be interested.

broeder-j · 2022-08-22T07:35:33Z

Apologies for the delay @broeder-j , have not had the time to sit down and finalize this. Would it be ok to discuss this during the upcoming coding week (6 - 10 December)? Not sure if you are going to be there, but maybe we could schedule a quick call to hash out any last things. Will also give the opportunity to discuss this with other developers that might be interested.

Hi, apologies I was completely out of office already back then and this flew under my radar. I would be happy to finalize this now and get it merged. Happy to schedule a call or what ever.

broeder-j added 3 commits October 4, 2021 10:14

Add first version of (web) PIDData proposal

809d1f0

Add proposal to README.md, status submitted and correct typo

f0e2ad5

Correct typo

10e3544

sphuber reviewed Oct 4, 2021

View reviewed changes

Add use cases to proposal

29a6625

sphuber changed the title ~~Add first version of (web) PIDData proposal~~ AEP: Add data plugin to model remote persistent data Dec 16, 2021

sphuber added status/submitted type/standard Standard Track AEP labels Dec 16, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AEP: Add data plugin to model remote persistent data #27

AEP: Add data plugin to model remote persistent data #27

broeder-j commented Oct 4, 2021 •

edited

Loading

sphuber left a comment

sphuber Oct 4, 2021

broeder-j Oct 5, 2021 •

edited

Loading

sphuber Oct 4, 2021

broeder-j Oct 5, 2021

sphuber Oct 5, 2021

sphuber Oct 4, 2021

broeder-j Oct 5, 2021

sphuber Oct 5, 2021

sphuber Oct 4, 2021

broeder-j Oct 5, 2021

sphuber Oct 4, 2021

broeder-j Oct 5, 2021

sphuber Oct 4, 2021

broeder-j Oct 5, 2021 •

edited

Loading

sphuber Oct 5, 2021

broeder-j commented Oct 5, 2021

sphuber commented Oct 5, 2021

broeder-j commented Nov 17, 2021

sphuber commented Nov 23, 2021

broeder-j commented Aug 22, 2022

		If one starts from external data sources, if it is an AiiDA database one would have to import
		it and export the part which one needs and continue from that. Not so nice if one needs only a

		small part from a large database. There is also currently no AiiDA intrinsic way to keep the
		provenance from other data on the internet which has persistent identifiers and links.

		A lot of work, which every user does different and some not at all, this causes data nodes to appear out
		of nowhere, with only 'goodwill' provenance information in the extras, and other annotations set by the user.

AEP: Add data plugin to model remote persistent data #27

Are you sure you want to change the base?

AEP: Add data plugin to model remote persistent data #27

Conversation

broeder-j commented Oct 4, 2021 • edited Loading

sphuber left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

broeder-j Oct 5, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

broeder-j Oct 5, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

broeder-j commented Oct 5, 2021

sphuber commented Oct 5, 2021

broeder-j commented Nov 17, 2021

sphuber commented Nov 23, 2021

broeder-j commented Aug 22, 2022

broeder-j commented Oct 4, 2021 •

edited

Loading

broeder-j Oct 5, 2021 •

edited

Loading

broeder-j Oct 5, 2021 •

edited

Loading