Add AEP: Add a schema to ORM classes #40
Conversation
Heya, cheers! My main question would be: what about the object repository of data classes? Also, the …
Good point, I had thought about this indeed, but kind of lost sight of it this last day while working towards this AEP. I will see if I can come up with a proposal to solve it.
True, and I have already corrected this for some. But I would say that this is a "mistake" of the ORM constructor. For example, the …
Alright, I have a potential solution. It is possible to represent repository content simply as a dictionary of bytes, with the key being the relative filepath in the repo. Given that only the … The field is defined as:

```python
repository_content: Optional[dict[str, bytes]] = MetadataField(
    None,
    description='Dictionary of file repository content',
    orm_to_model=lambda node: node.base.repository.serialize_content(),
)
```

The `serialize_content` method:

```python
def serialize_content(self) -> dict[str, bytes]:
    """Serialize the content of the repository into a dictionary.

    :return: dictionary with the content metadata.
    """
    serialized = {}

    for dirpath, _, filenames in self.walk():
        for filename in filenames:
            filepath = dirpath / filename
            serialized[str(filepath)] = self.get_object_content(filepath, mode='rb')

    return serialized
```

The final modification required was to override `from_model`:

```python
@classmethod
def from_model(cls, model: Model) -> 'Node':
    """Return an entity instance from an instance of its model."""
    fields = cls.model_to_orm_field_values(model)
    repository_content = fields.pop('repository_content', {})
    node = cls(**fields)

    for filepath, content in repository_content.items():
        node.base.repository.put_object_from_bytes(content, filepath)

    return node
```

With these changes, it is now possible to create data nodes using the simple REST API endpoint I implemented in this AEP.
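For illustration, a minimal round trip at the Python level might look like this (a sketch only, assuming the `to_model`/`from_model` pair described above; the use of a bare `Data` node here is just for demonstration):

```python
from aiida.orm import Data

# Create a node with a file in its repository.
node = Data()
node.base.repository.put_object_from_bytes(b'some file content', 'input.txt')

# ORM -> pydantic model: the repository content travels as a dict of bytes.
model = node.to_model()
assert model.repository_content == {'input.txt': b'some file content'}

# pydantic model -> new ORM instance, with the repository restored.
clone = Data.from_model(model)
assert clone.base.repository.get_object_content('input.txt', mode='rb') == b'some file content'
```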
Two big questions come to mind straight away: …
I think that we cannot go the road of the … Taking the following model as an example:

```python
class Model(Data.Model):
    content: bytes = MetadataField(
        description='The file content.', model_to_orm=lambda content: io.BytesIO(content)
    )
    filename: t.Optional[str] = MetadataField(None, description='The filename. Defaults to `file.txt`.')
```

This works just fine, except for a few modifications to the … Another point is that when serializing a …
This works for … As another test, I added support for `ArrayData`:

```python
class ArrayData(Data):

    class Model(Data.Model):
        model_config = ConfigDict(arbitrary_types_allowed=True)

        arrays: Optional[dict[str, ndarray]] = MetadataField(None, description='The dictionary of numpy arrays.')
```

Demonstration of functionality:

```python
In [1]: from aiida.orm import ArrayData

In [2]: import numpy as np

In [3]: array = ArrayData(arrays=np.array([1, 2, 3]))

In [4]: array.to_model()
Out[4]: Model(pk=None, uuid='47fb4799-e9f7-4a83-a9f5-580abb6250b7', node_type='data.core.array.ArrayData.', process_type=None, repository_metadata={}, ctime=datetime.datetime(2024, 1, 19, 15, 45, 45, 38034, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600), 'CET')), mtime=None, label='', description='', extras={}, computer=None, user=1, repository_content={'default.npy': b"\x93NUMPY\x01\x00v\x00{'descr': '<i8', 'fortran_order': False, 'shape': (3,), } \n\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00"}, source=None, arrays={'default': array([1, 2, 3])})
```

Note that here once again we would want to remove the inherited …
We can't change AiiDA plugins though (with custom data classes) 😬, and since this constraint was never stipulated, probably most of them will be broken. Plus, just in general …
A question would be: can plugin developers subclass the …
I think you are missing a step here, in interchanging binary data with strings without any explicit conversion?
I feel you should never be storing arbitrary things in attributes, because that kinda defeats the purpose of having data subclasses 😅 I think ideally …
We already do: https://github.com/aiidateam/aiida-restapi/blob/3f14ea2ac88e06ce9ce8d4cfb02ec16fea0463f7/aiida_restapi/routers/nodes.py#L68
Yeah, not to say that it can't be useful 👍, but it is certainly a major caveat that should be carefully considered.
Fair enough. First off, the number of data plugins out there is quite limited. Additionally, I am not saying that we are promising that this new functionality will work with all plugins out there, out of the box. I also don't think we have to, because all of this functionality is additive and existing functionality remains completely unaffected. Even if no external data plugins were compatible with the new functionality, just having this for all core ORM classes would be a huge win. That being said...
Yes, if their constructor does not accept all the fields defined by the model, and they cannot or don't want to change the constructor signature, they can simply override the `from_model` method. Still, I think that in most cases, the fields of the model not lining up with the constructor of the class is indicative of a design mistake. There are valid exceptions, for example, where certain fields are derived from other fields. Let's say field …
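To make that concrete, such an override in a plugin could look roughly like this (a hypothetical sketch: `MyData` and its derived `label_upper` field are made up for illustration; only the `from_model`/`model_to_orm_field_values` hooks are from the proposal, and the `MetadataField` import path is assumed):

```python
from typing import Optional

from aiida.common.pydantic import MetadataField  # assumed import path on the proposal branch
from aiida.orm import Data


class MyData(Data):
    """Hypothetical plugin class whose constructor does not accept every model field."""

    class Model(Data.Model):
        # Hypothetical derived field: computed from `label`, not accepted by the constructor.
        label_upper: Optional[str] = MetadataField(None, description='Uppercased label, derived from `label`.')

    @classmethod
    def from_model(cls, model: 'MyData.Model') -> 'MyData':
        fields = cls.model_to_orm_field_values(model)
        fields.pop('label_upper', None)  # drop the derived field before calling the constructor
        return cls(**fields)
```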
Not quite sure what you mean by this. Do you mean that it assumes the encoding? But you are right, the way I do it now is incorrect. It works for the examples I added, which add simple text files, but for binary data it is probably not correct. I changed the implementation as follows:

```python
class Node(Entity['BackendNode', NodeCollection], metaclass=AbstractNodeMeta):

    class Model(Entity.Model):

        repository_content: Optional[dict[str, bytes]] = MetadataField(
            None,
            description='Dictionary of file repository content. Keys are relative filepaths and values are binary '
            'file contents encoded as base64.',
            orm_to_model=lambda node: {
                key: base64.encodebytes(content)
                for key, content in node.base.repository.serialize_content().items()
            },
        )

    @classmethod
    def from_model(cls, model: Model) -> 'Node':
        """Return an entity instance from an instance of its model."""
        fields = cls.model_to_orm_field_values(model)
        repository_content = fields.pop('repository_content', {})
        node = cls(**fields)

        for filepath, encoded in repository_content.items():
            node.base.repository.put_object_from_bytes(base64.decodebytes(encoded), filepath)

        return node
```

So the byte content of the repo is base64-encoded when going from ORM to the pydantic model, which will allow it to be sent over the wire. The …
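As a quick sanity check on the encoding choice, the `base64` round trip is lossless for arbitrary binary content (plain standard library, no AiiDA-specific assumptions):

```python
import base64

payload = bytes(range(256))  # arbitrary binary content, not valid UTF-8
encoded = base64.encodebytes(payload)  # ASCII-safe, so it can travel inside JSON
assert base64.decodebytes(encoded) == payload
```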
Something similar can be implemented for `ArrayData`:

```python
class ArrayData(Data):

    class Model(Data.Model):
        model_config = ConfigDict(arbitrary_types_allowed=True)

        arrays: Optional[dict[str, bytes]] = MetadataField(
            None,
            description='The dictionary of numpy arrays.',
            orm_to_model=lambda node: ArrayData.save_arrays(node.arrays),
            model_to_orm=lambda value: ArrayData.load_arrays(value),
        )

    @staticmethod
    def save_arrays(arrays: dict[str, np.ndarray]) -> dict[str, bytes]:
        results = {}

        for key, array in arrays.items():
            stream = io.BytesIO()
            np.save(stream, array)
            stream.seek(0)
            results[key] = base64.encodebytes(stream.read())

        return results

    @staticmethod
    def load_arrays(arrays: dict[str, bytes]) -> dict[str, np.ndarray]:
        results = {}

        for key, encoded in arrays.items():
            stream = io.BytesIO(base64.decodebytes(encoded))
            stream.seek(0)
            results[key] = np.load(stream)

        return results
```

used as follows:

```python
In [1]: import numpy as np
   ...: from aiida.orm import ArrayData

In [2]: a = ArrayData(np.array([[1,2], [3,4]]))

In [3]: a.serialize()
Out[3]:
{'pk': None,
 'uuid': '820ed501-55b7-4440-9eea-f22681d1805e',
 'node_type': 'data.core.array.ArrayData.',
 'process_type': None,
 'repository_metadata': {},
 'ctime': datetime.datetime(2024, 1, 19, 20, 28, 36, 772718, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600), 'CET')),
 'mtime': None,
 'label': '',
 'description': '',
 'extras': {},
 'computer': None,
 'user': 1,
 'repository_content': None,
 'source': None,
 'arrays': {'default': b'k05VTVBZAQB2AHsnZGVzY3InOiAnPGk4JywgJ2ZvcnRyYW5fb3JkZXInOiBGYWxzZSwgJ3NoYXBl\nJzogKDIsIDIpLCB9ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg\nICAgICAgICAgICAgIAoBAAAAAAAAAAIAAAAAAAAAAwAAAAAAAAAEAAAAAAAAAA==\n'}}

In [4]: r = ArrayData.from_serialized(**a.serialize())

In [5]: r.get_array()
Out[5]:
array([[1, 2],
       [3, 4]])
```
Fair enough. I also kept that as-is and didn't add the …
I agree that it should not be possible to set this directly, because that would lead to inconsistencies. But I think it is very useful to be able to retrieve it from an existing node. However, in the current implementation, its field defines …
I guess the multi-part file upload is indeed the typical approach there. But here it is just implemented for an explicit endpoint for …
Do you have an idea of what such a full solution might look like? If there is a better solution, I would also prefer that.
Hi, quick comment, and maybe there are multiple use cases and not all can be covered with the same technical solution. From the API point of view, I would suggest not having the files base64-encoded in the responses, but just links to download them. This would allow direct streaming of files over HTTP, which can be downloaded in chunks etc. with standard clients. This could either be obtained by providing links instead of content for every file, or (if a standard URL is decided) one could just return the list of folders/files in the repo, and the client would use that info to get the file (e.g. something like …). If we need to serialise to file instead (i.e. not for the REST API), in this case too I would think of an alternative format to JSON, e.g. a .zip wherein a JSON file is still used for the attributes, but files are stored as files, similar to the .aiida files. Maybe there is however another use case I'm missing?
Fair point about memory usage; for large files it certainly won't be a viable option, and a solution is needed that allows streaming the file when downloading. For retrieving node files this won't be that much of a problem: we can always add additional endpoints for this purpose. What is trickier, I think, is the node creation endpoint. There the files have to be sent with the rest of the data, because the call needs to create the node in one go. It cannot first create the node in one call and then add the files, because the node would already have been stored and be immutable. Here, having the option of passing the file content as serialized strings works quite nicely, as long as the files are not too big. I am not sure yet how to approach the case where files are too big for memory.
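One possible direction for the large-file case is a multipart endpoint that accepts the JSON-encoded model and the files as separately streamed parts. The following is only a sketch: it assumes a FastAPI-based REST layer (as in aiida-restapi), and the route path, parameter names, and the commented-out node construction are hypothetical, not an existing API:

```python
import json
from typing import List

from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()


@app.post('/nodes/multipart')  # hypothetical route
async def create_node(model: str = Form(...), files: List[UploadFile] = File(default=[])):
    """Create a node from a JSON-encoded model plus separately streamed files."""
    payload = json.loads(model)  # the model fields, minus the file content

    # In a real implementation, the node would be built from the model and the
    # uploaded files streamed into its repository before storing, e.g.:
    # node = Node.from_model(Node.Model(**payload))
    # for upload in files:
    #     node.base.repository.put_object_from_filelike(upload.file, upload.filename)
    # node.store()
    return {'received_fields': list(payload), 'received_files': [f.filename for f in files]}
```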
AiiDA's Python API provides an object relational mapper (ORM) that abstracts the various entities that can be stored inside the provenance graph and the relationships between them. In most use cases, users use this ORM directly in Python to construct new instances of entities and retrieve existing ones, in order to get access to their data and manipulate it.
A current shortcoming of the ORM is that it is not possible to programmatically introspect the schema of each entity: that is to say, what data each entity stores. This makes it difficult for external applications to provide interfaces to create and/or retrieve entity instances. It also makes it difficult to take the data outside of the Python environment, since the data would have to be serialized. However, without a well-defined schema, doing this without an ad-hoc solution is practically impossible.
Clear data schemas for all ORM entities would enable the creation of external applications to work with the data stored in AiiDA's provenance graph. A typical example use case would be a web API whose interface, to create and return ORM entities from an AiiDA profile, would be dynamically generated by programmatically introspecting the schema of all ORM entities stored within it. Currently, the interface has to be manually generated for each ORM entity, and data (de)serialization has to be implemented.
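To make the introspection use case concrete: with a pydantic model attached to each ORM class as proposed, an external application could generate a JSON schema for any entity using standard pydantic machinery (a sketch; the `Model` attribute is what this AEP adds, while `model_json_schema` is plain pydantic v2):

```python
from aiida.orm import Node

# Each ORM class carries a pydantic model, so a standard JSON schema can be
# generated without any AiiDA-specific serialization code:
schema = Node.Model.model_json_schema()
print(sorted(schema['properties']))  # e.g. ['ctime', 'extras', 'label', 'node_type', 'pk', 'uuid', ...]
```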
I have already implemented the proposal, which can be found here: https://github.com/sphuber/aiida-core/tree/feature/orm-pydantic-models
All tests pass, so it seems that these changes would be perfectly backwards compatible. There were a few changes in the existing API that were necessary, but these should also be backwards compatible. Essentially, these were cases where ORM class constructors did not allow specifying all fields directly, such as extras, or where the frontend/backend ORM classes did not implement properties for all fields.
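As an illustration of the constructor changes meant here (a sketch; passing `extras` directly to the constructor is the kind of addition made on the linked branch, and the exact signature may differ):

```python
from aiida.orm import Data

# With the branch's changes, a field like `extras` can be passed straight to
# the constructor instead of only being settable after construction:
node = Data(label='example', extras={'source': 'rest-api'})
assert node.base.extras.get('source') == 'rest-api'
```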