Garbling of reserved characters when reading in/updating a dataset #315
The output of `doc$dataset$abstract$para` shows the "&" as "&amp;" when it should just show "&" instead. This results in the "&" showing as "&amp;" on the dataset's web page once updated.
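A minimal sketch of how that symptom could be reproduced (the file name and workflow here are assumptions for illustration; `as_emld()` and `as_xml()` are emld's read/write functions):

```r
library(emld)

# Hypothetical EML record whose abstract contains an ampersand,
# stored in the XML source as "&amp;".
doc <- as_emld("dataset.xml")

# Reported behaviour: the parsed value still contains the entity
# "&amp;" rather than a plain "&".
doc$dataset$abstract$para

# Writing the document back out escapes the already-escaped string again,
# which is presumably how "&amp;" ends up displayed on the dataset's page.
as_xml(doc, "dataset-updated.xml")
```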
@cboettig I took a quick peek at this with @eeerika and noticed that emld is treating […]
@mbjones been a while so I could be mistaken, but... most of EML translates fine into key-value pairs for JSON, lists, etc., but I think we could wrap it in something like:

```r
unescape_xml <- function(str) {
  # Wrap the value in a dummy element and let xml2 decode the entities
  xml2::xml_text(xml2::read_xml(paste0("<x>", str, "</x>")))
}
```

to repair the text? (untested, not sure if that would round-trip safely either...)
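For illustration (the example string here is mine, not one from the thread), that helper would turn an entity-escaped value back into plain text:

```r
unescape_xml("Meadow &amp; forest plots, 5 &lt; 10")
#> [1] "Meadow & forest plots, 5 < 10"
```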
I see. Seems you are trying to preserve the docbook elements in the text value, even though they are just XML nodes like all of the other elements. In emld, it might be cleaner to not differentiate […]
to this: […]
@mbjones one of the goals in […]

[…] is not particularly readable to a human, nor does it make semantic sense if serialized from JSON-LD to RDF. (And an R user would be very confused by seeing this parsed into the list structure of […].)

We also have an ordering problem: key-value pairs are an unordered representation. (And semantic data practices tell us that information should be explicit, not encoded in order, right?)

I think it is much better to treat text strings separately from other nodes. I think we can xml_unescape them, and we probably need to then also re-xml_escape them when going the other way. In this way we provide the 'raw markup', which happens to be docbook-esque in this case but also matches the raw markup you'd get using another markup language for text strings, like the new markdown format. There are plenty of examples of JSON-LD out there that have HTML markup in the value strings, as you know, and that seems to be the pattern we should follow.
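As a rough sketch of that "going the other way" step (my sketch, not code from the thread or the package), the naive counterpart would turn reserved characters back into entities:

```r
# Naive re-escaping of XML reserved characters. "&" has to be handled
# first so the entities produced for "<" and ">" are not double-escaped.
escape_xml <- function(str) {
  str <- gsub("&", "&amp;", str, fixed = TRUE)
  str <- gsub("<", "&lt;", str, fixed = TRUE)
  gsub(">", "&gt;", str, fixed = TRUE)
}
```

The catch, which the next comment works through in detail, is that a blanket replacement like this also escapes markup that should survive as real elements (the `<strong>` tags below, for instance), so it only round-trips cleanly for values with no embedded markup.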
@cboettig I'm entirely convinced that your way is best, even if it requires some convoluted escaping. The ordering problem is particularly compelling on the JSON side, even though it's not an issue in XML, and complicates the translation considerably. Thanks for clarifying.

I suspect the escaping is going to be more difficult than it seems on the surface. In particular, in this scenario, some disallowed characters would be escaped in the XML, and some would not. TL;DR: a whole bunch of messed-up examples follow...

Here's an example that illustrates the problem we'll have to deal with, starting with a valid EML snippet we might find in the wild:

```xml
<purpose>
  <para>So long as weak &lt;strong, and we <strong>properly</strong> treat ampersand (&amp;) characters</para>
</purpose>
```

When emld reads this according to your approach, I think […]

So note in that case that we did unescape the XML entities, making them look like the […]

In addition, all of the following are equivalent from XML's perspective:

```xml
<para>So long as weak &lt;strong, and we <strong>properly</strong> treat ampersand (&amp;) characters</para>
<para>So long as weak &lt;strong, and we <strong > properly</strong > treat ampersand (&amp;) characters</para>
<para>So long as weak &lt;strong, and we <strong
>
properly</strong
> treat ampersand (&amp;) characters</para>
```

Things like that last one with newlines inside the element tag and the previous one with spaces give me pause for using plain string parsing to determine whether or not to escape it.

After we unescape it into R, we might see something like this in R:
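Presumably something along these lines (reconstructed from the snippet above, not taken from the original post):

```r
x <- "So long as weak <strong, and we <strong>properly</strong> treat ampersand (&) characters"
```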
When escaping that string, we'd have to escape the first […].

We should certainly include some encoding/decoding tests that exercise these funky formatted XML snippets. Here are some R text strings I think we should test with: […]
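A round-trip check along these lines (a sketch with made-up strings, assuming the `unescape_xml()` helper from earlier and an `escape_xml()` counterpart like the one sketched above) would exercise both the easy and the hard cases:

```r
library(testthat)

# Strings with no embedded markup should survive unescape followed by
# re-escape unchanged.
plain_strings <- c(
  "no special characters at all",
  "ampersand (&amp;) characters",
  "weak &lt;strong comparisons"
)

test_that("escape/unescape round-trips markup-free strings", {
  for (s in plain_strings) {
    expect_identical(escape_xml(unescape_xml(s)), s)
  }
})

# Strings that contain real markup are the hard case: the naive helpers
# strip the tags and re-escape everything, so they do not round-trip.
test_that("naive helpers do not round-trip embedded markup", {
  s <- "we <strong>properly</strong> treat ampersand (&amp;) characters"
  expect_false(identical(escape_xml(unescape_xml(s)), s))
})
```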