Merging some pythonutils parts up to RDFlib #92
Comments
@nicholascar Yes, I would love to get some of these into core. The DeterministicTurtleSerializer has a lot to offer over the default serializer. It would be by far the most useful to the community and is currently implemented in CustomTurtleSerializer pyontutils/ttlser/ttlser/serializers.py Line 155 in f57b769
There is at least one known error that must be fixed before we merge. Some malformed nt graphs can induce an infinite loop here pyontutils/ttlser/ttlser/serializers.py Line 105 in f57b769
rdflib/plugins/serializers/dturtle.py if we don't want to put them directly in rdflib/plugins/serializers/turtle.py or something like that? The only reason I hesitate is because of the helper functions and classes that are used to implement the deterministic sort.
There are some other considerations that will need to be taken, such as the default value for predicateOrder. The other one that would probably be of use is the HttpTurtleSerializer pyontutils/ttlser/ttlser/serializers.py Line 808 in f57b769
newline and space are abstracted. CompactTurtleSerializer probably should not go in. It is a completely non-standard implementation of a bad compression algorithm, which I built on top of the serializer to produce a realistic worst case when I was testing the performance of the (then new) trie-based namespace system. Using a sane serialization with gzip is the right thing to do. The SubClassOfTurtleSerializer is broken and not implemented correctly, so it is not ready to go in, even though it might be of use to people who would like a bit more semantic ordering than the purely syntactic ordering currently implemented by the DeterministicTurtleSerializer. |
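As a rough illustration of the "sane serialization with gzip" suggestion above, here is a minimal sketch using only the Python standard library. The Turtle text, the `ex:` prefix, and the term names are made up for the example; in practice the string would come from whatever Turtle serializer is in use.

```python
import gzip

# Hypothetical Turtle document; the prefix and terms are invented for illustration.
turtle = (
    "@prefix ex: <http://example.org/> .\n"
    "\n"
    + "".join(f"ex:s{i} ex:p ex:o{i} .\n" for i in range(200))
)

raw = turtle.encode("utf-8")

# mtime=0 keeps the gzip header stable, so identical input yields identical bytes.
packed = gzip.compress(raw, mtime=0)

# Decompressing restores the original, standard Turtle byte for byte.
restored = gzip.decompress(packed)
```

The file on disk stays ordinary Turtle that any parser can read after decompression, which is the point of preferring this over a custom compact syntax.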
@tgbugs Great. We (my company) have an interest here beyond just "make RDFlib better": we too use lots of version-controlled Turtle files as feedstock for RDF graphs and would love to see this deterministic serialiser implemented to help with that. In about 2013, I created a reasonably deterministic Turtle serialiser on top of RDFlib that did some out-of-band (i.e. outside the graph) counting to order BNs etc., but it wasn't heavily tested and I've lost it! I'm happier to help with your implementation, which looks miles ahead of where I got to! Q1: can we shift from the older Q2: any idea how far off an N3 serialiser this is? I just wonder two things: Q3: as above but for Trig - how close are these serializers to being able to handle Trig? |
The more people that can make use of it the better! The only way I was able to get it to work at all was to come up with the most pathological test cases I could imagine, and I'm sure there are still some lurking out there. Q1: I'm fairly certain that Q2: No idea. https://www.w3.org/TeamSubmission/n3/#subsets for reference. If I had to guess, literal subjects, RDF paths, rules, and formulae are all not supported. No idea what it would take to get there, since I haven't interacted with N3 much at all. Where I have interacted with it, however, is in https://github.com/RDFLib/rdflib/blob/master/test/n3/example-lots_of_graphs.n3. That definitely won't serialize. Minimally it would require an additional rule for ranking whole serialized graphs as well as expressions. Q3: Off the top of my head, no idea here either. The named graphs will definitely give it trouble. https://www.w3.org/TR/trig/#grammar-ebnf |
OK, so we add a new serialization option, See the ticket Issue #1207 that my engineer @jamiefeiss is working on. I think this would go nicely with the deterministic serializer. You can think of RDF (dturtle) files in Git and then a Graph being made from that for use. Haven't worked out all the little bits yet but, at the very least, the new file |
The options would seem to be |
Yes, I think both We might bypass the infinite loop issue you know about by an input requirement check? Since this will be a secondary and optional Turtle serialiser, it can have some conditions of use (i.e. not handle absolutely all, edge case, RDF)? |
See https://www.w3.org/TR/turtle/#h3_sec-iri Currently RDFlib 5.0.0 supports both in parsing but serializes to
I think serialising to Trig will be much simpler than catering for N3. I think the serializer just has to be context aware, i.e. if being called on a |
@tgbugs how does the deterministic serializer relate/not relate to RDF normalization algorithms like https://json-ld.github.io/normalization/spec/? I always assumed that if a graph could be converted into normalized form then deterministic serialization, in Turtle or JSON-LD or RDF/XML etc., would be "easy", i.e. the serialiser could just "start at the start and go to the end". |
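The "start at the start and go to the end" idea can be sketched for the simplest case. This is a toy, not any library's serializer: it assumes the graph has no blank nodes (assigning stable labels to blank nodes is the hard part that normalization algorithms address), and `serialize_sorted` and the example terms are hypothetical names.

```python
def serialize_sorted(triples):
    """Emit one N-Triples-style line per triple, in sorted order."""
    return "".join(f"{s} {p} {o} .\n" for s, p, o in sorted(set(triples)))


# Hypothetical triples, already written as N-Triples terms.
triples = [
    ("<http://example.org/b>", "<http://example.org/p>", '"2"'),
    ("<http://example.org/a>", "<http://example.org/p>", '"1"'),
]

# Insertion order does not matter once the triples are totally ordered.
out_forward = serialize_sorted(triples)
out_reversed = serialize_sorted(list(reversed(triples)))
```

Both calls produce the same text, so a diff between two such files shows only real changes to the graph.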
@nicholascar The deterministic ttl serializer was written before I was thinking about normalization or graph identity at all. I also didn't become aware of the normalization algorithm until fairly recently, by which point I had already implemented https://github.com/tgbugs/pyontutils/blob/master/pyontutils/identity_bnode.py which provides mostly stable identities for any bnode or graph. When I was first implementing ibnode I tried to reuse the deterministic serialization code, but it was very difficult to adapt, so I wrote the identity from scratch. I haven't had an opportunity to review and compare ibnode and the deterministic serializer against the RDF normalization spec, as I only became aware of it when I started to dig into json-ld in the last year. It is likely that the ibnode implementation is quite similar in some respects, but I can't say more than that without comparing the algorithms more closely. What I can say is that just having some normalized form is no guarantee that you can serialize the graph in the way you want. The deterministic serializer works with any deterministic total ordering on the triples in the graph, which is part of why I was working on the subClassOf serializer. The issue with starting from the normalized form of a graph is that it provides only one total ordering, which may or may not be the one that you want. The normalization approach only deals with the normalized form of the graph and, as far as I can tell, doesn't provide any guidance about what to do about the compacted form. The deterministic serializer was intentionally designed to handle the compacted form, since that is what humans end up seeing in version control systems, and having the compact identifiers subject to expanded ordering is extremely confusing. In terms of start at the start and go to the end: yes, if you already have everything in order. But if it is not in the order you want, you will have to reorder it.
In my original use case I was accounting for the fact that any change to the prefixes could (and often does) result in a reordering. If you don't care what total order your triples are in, you could store them sorted by the normalization and then serialization would incidentally be deterministic. The issue is what happens if someone changes something in the implementation that has a side effect of changing the serialization ordering. |
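To make the point about compacted versus expanded ordering concrete, here is a small sketch. It is not the ttlser algorithm; `compact`, the prefix tables, and the example IRIs are all hypothetical, and the sort is a plain string sort on whichever form is chosen.

```python
def compact(iri, prefixes):
    """Return a prefixed name for iri using the longest matching namespace."""
    best = None
    for prefix, ns in prefixes.items():
        if iri.startswith(ns) and (best is None or len(ns) > len(best[1])):
            best = (prefix, ns)
    if best is None:
        return f"<{iri}>"
    return f"{best[0]}:{iri[len(best[1]):]}"


subjects = [
    "http://example.org/zoo/cat",
    "http://example.org/apple/pie",
]

# Two hypothetical prefix tables for the same namespaces, with the names swapped.
prefixes_v1 = {"a": "http://example.org/zoo/", "z": "http://example.org/apple/"}
prefixes_v2 = {"z": "http://example.org/zoo/", "a": "http://example.org/apple/"}

by_expanded = sorted(subjects)
by_compact_v1 = sorted(subjects, key=lambda s: compact(s, prefixes_v1))
by_compact_v2 = sorted(subjects, key=lambda s: compact(s, prefixes_v2))
```

Sorting on the compact form puts `a:cat` before `z:pie` under the first table, which is the reverse of the expanded order, and renaming the prefixes flips it again. That is the reordering on prefix changes described above, and also why sorting by expanded IRI looks arbitrary to someone reading the prefixed file.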
@nicholascar I finally got around to fixing the issues with the deterministic serializer and cut a release https://github.com/tgbugs/pyontutils/releases/tag/ttlser-1.1.4. I think we can revisit the integration question now. |
@tgbugs: @aucampia is dead keen to improve, perhaps even replace, serializers, so please communicate with him on this one. Also, I have a student working on Turtle-star processing right now (work in https://github.com/RDFLib/rdflib-rdfstar/), and there he's using a Lark-based parser for Turtle-star which then converts Turtle-star to Turtle. It's probably sensible to move to a Lark-based parser for all Turtle as well as Turtle-star. |
Hi @tgbugs,
In the README for ttlser you state "ttlser cannot produce deterministic results without the changes added in RDFLib/rdflib#649". Well, 649's been merged for some time now.
Are you interested in getting this family of serializers into RDFlib core? If they are stable and have value over the default Turtle serializer, surely it would be good to have them as serializer options in core?
Cheers, Nick