NFC normalize strings? #1379
Comments
That section of the NUG only applies to netcdf classic, not HDF5. Plus, I read that as meaning that the library does that for you (so the python layer doesn't need to).
Hmm -- I'm pretty sure that all variable and dimension names are supposed to be NFC normalized. The section of the NUG does talk about the header, so yes, probably only vital for netcdf classic. But still a good idea, and CF will be requiring it anyway. The search on the NUG is broken, so I'm having a hard time finding what I'm looking for :-(
I doubt it -- but worth a look. It would be great if it did. I'll try to poke into it.
OK -- I've poked into it, and you are completely correct -- the netCDF C lib is NFC normalizing variable names. Here's an experiment with netCDF4:
And when run:
So indeed, the C lib is doing it for you -- nothing to be done here. Except maybe a note in the docs ...
Another potential issue -- not sure if this is something that should be built in to the lib: The next version of CF will specify that attributes should be NFC normalized. This is because a number of CF attributes reference variable names, so they really need to be exact / compare equally. I just tested, and string attributes are not being normalized. So the netCDF4 lib could normalize attributes too. (so could the C lib, but I'm guessing they won't want to go there -- it's not critical to netcdf itself)
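To make the attribute concern concrete, here is a minimal stdlib-only sketch. The variable name and the space-separated attribute value are made up for illustration, and no netCDF file is involved; it only shows how a decomposed (NFD) attribute string can fail to match an NFC-normalized name.

```python
import unicodedata

# A variable name in NFC form: "é" is the single code point U+00E9.
stored_name = unicodedata.normalize("NFC", "temp\u00e9rature")

# A hypothetical attribute value that lists variable names, but in NFD form:
# "é" becomes "e" followed by the combining acute accent U+0301.
attr_value = unicodedata.normalize("NFD", "temp\u00e9rature lat lon")

# The two look identical on screen, yet the reference does not match.
referenced = attr_value.split()
print(stored_name in referenced)      # False

# Normalizing the attribute value to NFC makes the reference compare equal.
referenced_nfc = unicodedata.normalize("NFC", attr_value).split()
print(stored_name in referenced_nfc)  # True
```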
The NUG indicates that strings (dimension and variable names, anyway) should be NFC normalized.
"""
... names are normalized according to Unicode NFC normalization rules during encoding as UTF-8 for storing in the file header. This is necessary to ensure that gratuitous differences in the representation of Unicode names do not cause anomalies in comparing files and querying data objects by name.
"""
(and next CF release will specify NFC normalization for all text)
But as far as I can tell, netCDF4 isn't doing that. It probably should.
I think it may be as easy as adding an NFC normalization call to _strencode()
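A rough sketch of what that could look like, using only the standard library. Note that `strencode_nfc` is a hypothetical stand-in written for this comment, not the actual netCDF4 `_strencode()` source, and the UTF-8 default is an assumption:

```python
import unicodedata

def strencode_nfc(pystr, encoding="utf-8"):
    """Hypothetical encode helper: NFC-normalize a str, then encode it."""
    if isinstance(pystr, bytes):
        # Already encoded; leave it untouched.
        return pystr
    return unicodedata.normalize("NFC", pystr).encode(encoding)

# "e" followed by a combining acute accent (decomposed form).
decomposed = "cafe\u0301"
print(strencode_nfc(decomposed))  # b'caf\xc3\xa9'
```

Both the decomposed and the precomposed spelling end up as the same bytes, which is the point of normalizing before encoding.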
Granted -- this does mean that users may get something slightly different back when they round-trip a name through netcdf.
If that's a concern, then you could call unicodedata.is_normalized, and raise an error instead.
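For the stricter alternative, something along these lines might work. `require_nfc` is a hypothetical name, the choice of `ValueError` is an assumption, and `unicodedata.is_normalized` needs Python 3.8 or later:

```python
import unicodedata

def require_nfc(name):
    """Hypothetical check: reject strings that are not already NFC-normalized."""
    # unicodedata.is_normalized was added in Python 3.8.
    if not unicodedata.is_normalized("NFC", name):
        raise ValueError(f"{name!r} is not NFC-normalized")
    return name

require_nfc("caf\u00e9")        # precomposed form: passes through unchanged

try:
    require_nfc("cafe\u0301")   # decomposed form: rejected
except ValueError as err:
    print("rejected:", err)
```

This keeps round-trips exact, at the cost of pushing the normalization step onto the user.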