Switch from pickled blobs to JSON data #1786
base: master
Conversation
If we are moving from BLOBs to JSON then we should really use the new format. See PR #800. The main benefit of the new format is that it is easier to maintain and debug. Instead of lists we use dictionaries, so, for example, we refer to the field "parent_family_list" instead of field number 9. Upgrades are no problem; we just read and write the raw data. When I have more time I'll update you on the discussion whilst you have been away.
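A minimal illustration of the difference (field positions and names here are illustrative, not the exact Gramps layout):

```python
import json

# Old blob style: a positional list, where readers must know by convention
# that index 9 holds the parent family list.
old_person = ["a1b2c3", "I0001", 1, [], [], [], [], [], [], ["F0001"]]
parents = old_person[9]                      # magic field number

# New JSON style: a dictionary with named fields.
new_person = {
    "handle": "a1b2c3",
    "gramps_id": "I0001",
    "parent_family_list": ["F0001"],
}
parents = new_person["parent_family_list"]   # self-describing

print(json.dumps(new_person))  # human-readable, easy to maintain and debug
```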
Oh, that sounds like a great idea! I'll take a look at the JSON format and switch to that. It should work even better with the SQL JSON_EXTRACT().
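For instance, a hedged sqlite3 sketch of what that enables (the json_data column name follows this PR's discussion; the table layout here is simplified, not Gramps' actual schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (handle TEXT, json_data TEXT)")
con.execute(
    "INSERT INTO person VALUES (?, ?)",
    ("a1b2c3", '{"gramps_id": "I0001", "parent_family_list": ["F0001"]}'),
)

# With JSON in the data column, SQL can filter on fields directly,
# instead of unpickling every blob in Python first.
row = con.execute(
    "SELECT handle FROM person "
    "WHERE json_extract(json_data, '$.gramps_id') = ?",
    ("I0001",),
).fetchone()
print(row)  # ('a1b2c3',)
```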
There are a few places where the new format is used, so we will get some bonus performance improvements. Feel free to make changes to my existing code if you see a benefit. You may also want to have a quick look at how we serialize
Making some progress. It turns out the serialized format had leaked into many other places, probably for speed. Those are probably good candidates for moving into the business logic.
I added a
@Nick-Hall, I will probably need your assistance regarding the complete save/load of the to_json and from_json functions. I looked at your PR, but as it touches 590 files, there is a lot there. In this PR, I can now upgrade a database and load the people views (except for the name functions, which I have to figure out).
Thanks @Nick-Hall, that was very useful. I think that I will cherry-pick some of the changes (like attribute name changes and the elimination of private attributes). You'll see that I made many of the same changes you did. But one thing I found is that if we want to allow upgrades from previous versions, then we need to be able to read in blob_data and write out json_data. I think my version has that covered. I'll continue to make progress.
@dsblank Why are you removing the properties? The validation in the setters will no longer be called. |
@Nick-Hall, I thought that was what @prculley did for optimization, and I thought it was needed. I can put those back :)
Perhaps we could consider a solution similar to that provided by the pickle module: a default method builds the object state, and a matching method restores it. I expect that only a handful of classes would need to override the default methods.
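A sketch of that idea (the names are hypothetical, mirroring pickle's `__getstate__`/`__setstate__` hooks rather than copying any actual Gramps code):

```python
class BaseObject:
    """Default state methods that most classes can inherit unchanged."""

    def get_object_state(self):
        state = {"_object": type(self).__name__}
        state.update(self.__dict__)
        return state

    def set_object_state(self, state):
        self.__dict__.update(
            {key: value for key, value in state.items() if key != "_object"}
        )


class Surname(BaseObject):
    def __init__(self):
        self.surname = ""
        self.prefix = ""

    # Only a handful of classes would override the default, e.g. to
    # normalize a field on the way out.
    def get_object_state(self):
        state = super().get_object_state()
        state["surname"] = self.surname.strip()
        return state
```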
Co-authored-by: stevenyoungs <[email protected]>
I think that the way you are upgrading the database will cause problems in the future. Suppose a user is upgrading from schema version 20 to some future version, say 24. In the upgrade from 20 to 21, an object serialized in version 20 will be unserialized to create a version 24 object, which may result in an error. Then the object will be converted to version 24 JSON, which will be used in the upgrade from 21 to 22. If a field is added in version 24, then it will appear in the upgrade path before it is expected. Renamed and deleted object attributes will also cause problems.
We don't instantiate objects in the database upgrades for this reason as well as for better upgrade performance.
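For illustration, a minimal sketch of that stepwise raw-data pattern (version numbers and field names are hypothetical):

```python
# Each step knows only about its own schema change and works on raw dicts,
# so no current-version object is ever instantiated mid-upgrade.
def upgrade_20_to_21(raw):
    raw["new_field"] = []                        # field introduced in version 21
    return raw

def upgrade_21_to_22(raw):
    raw["new_name"] = raw.pop("old_name", None)  # renamed in version 22
    return raw

UPGRADES = {20: upgrade_20_to_21, 21: upgrade_21_to_22}

def upgrade(raw, version, target):
    while version < target:
        raw = UPGRADES[version](raw)
        version += 1
    return raw
```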
Apart from this one issue everything in the PR looks good.
I see what you mean. Yes, I will change it. I'll leave the serializer able to read the blob data, but you are right: I'll have to manually convert the unpickled blob data to the current JSON format.
@Nick-Hall, that took a bit of work to convert the blob array directly to JSON, but it is the correct way to do it. I actually found a bug in the to_json() infrastructure: Surname was missing the object_state. One tricky thing was to make sure to include the defaults that are set in the constructors. I think this is correct: I went through all of the data and checked the conversions, and they match on all values. Now that we aren't creating objects in the upgrade code, we can remove all of the serialize/unserialize code as it isn't used. But I'll do that after this PR.
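Roughly, the conversion has this shape (a hedged sketch; the real field layout, class names, and defaults differ):

```python
import pickle

def blob_to_json_data(blob):
    """Convert one pickled positional record straight to the JSON dict form."""
    fields = pickle.loads(blob)           # old positional list
    return {
        "_class": "Person",               # object_state-style class tag
        "handle": fields[0],
        "gramps_id": fields[1],
        "parent_family_list": fields[9],
        # Fields the old format never stored get the defaults the
        # constructor would have set, so the output matches to_json().
        "private": False,
    }
```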
The code under review:

    # First, do metadata:
    self._txn_begin()
    self.set_serializer("blob")
    self.upgrade_table_for_json_data("metadata")

Suggested change:

    self.set_serializer("blob")
    self._txn_begin()
    # First, do metadata:
    self.upgrade_table_for_json_data("metadata")
Is there any reason to have the call to set_serializer inside the transaction?
Also, move the comment closer to the code. The comment may be superfluous, unless you want to expand on why you do the metadata first?
> Is there any reason to have the call to set_serializer inside the transaction?

No, not really.
Is it worth writing a test to round-trip all objects into and back out of the JSON representation, to catch any other, or future, problems?
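Something like the following, perhaps (the import path for to_json/from_json is a guess at this PR's layout, not a confirmed module):

```python
from gramps.gen.lib import Person, Family, Event, Surname
from gramps.gen.serialize import to_json, from_json  # hypothetical module path

def test_json_roundtrip():
    for cls in (Person, Family, Event, Surname):
        original = cls()
        restored = from_json(to_json(original))
        # serialize() still gives the canonical tuple form, which makes
        # a convenient equality check while both formats coexist.
        assert original.serialize() == restored.serialize()
```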
👏 that was a tedious bit of code to write 😄
Sorta... once the pickled blobs are gone, there won't be anything to test. But it is probably worth adding a script for people to run now to verify that this works. I'll do that.
@Nick-Hall, there is one part of the conversion we may have to deal with in the future: we used to store a pickled Researcher instance in the metadata. If the Researcher class changes, pickle might not be able to load it, and we would lose the info. We don't have to do anything right now, but we may have to deal with that someday. Other than that, I think this PR is done.
@dsblank OK. Thanks. I'll have a look at it tomorrow.
After the comment in your other PR #1791 by @Call-Me-Dave about using the MessagePack format, I've been thinking more about the serialization format. My thoughts about the next release are that it should perhaps be a major 6.0 release. Although most people won't be using the "raw" database API methods, they will change significantly. We should also probably limit the release to your PRs on the backend, but otherwise allow code quality changes (such as David's typing PRs) and front-end updates only. I'll post to the gramps-devel mailing list to get feedback.
This PR converts the database interface to use JSON data rather than the pickled blobs used since the early days.
It adds a db.serializer object, which:
a. abstracts the data column name
b. contains the serialize/unserialize functions

It also supports upgrading existing databases, reading the old blob data and writing out JSON data; it does this by switching between serializers.
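A sketch of how such a serializer abstraction could look (hypothetical beyond the column names and the set_serializer call quoted above; not the PR's actual classes):

```python
import json
import pickle

class BlobSerializer:
    data_field = "blob_data"              # abstracts the data column name

    @staticmethod
    def serialize(data):
        return pickle.dumps(data)

    @staticmethod
    def unserialize(stored):
        return pickle.loads(stored)

class JSONSerializer:
    data_field = "json_data"

    @staticmethod
    def serialize(data):
        return json.dumps(data)

    @staticmethod
    def unserialize(stored):
        return json.loads(stored)

# During an upgrade the database switches serializers: read each row with
# the blob serializer, then write it back with the JSON serializer.
```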