You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Refactor suggest_dataset_basename (which has fairly parallel implementations
in ssap and sia on top) to produce correct file names across platforms. Discuss whether for security reasons we'd want to limit the charset to ASCII printables. See related #557
The text was updated successfully, but these errors were encountered:
On Tue, Jul 02, 2024 at 06:18:23PM +0000, Adrian wrote:
Refactor `suggest_dataset_basename` (which has fairly parallel implementations
in ssap and sia on top) to produce correct file names across
platforms. Discuss whether for security reasons we'd want to limit
the charset to ASCII printables. See related #557
When someone touches this code, I'd also like it if we re-considered
the design where you can pass in directories; this can lead to
surprising error message if the directory does not exist, and if we
have no strong reasons to automatically write files outside the
current directory, I'd prefer if we moved away from the corresponding
functionality.
As to limiting things to ASCII printables... well, I suppose in pyVO
we have to assume that the service operators are well-meaning;
fending off possible attacks from services (of which there are many)
is beyond our means. For instance, I would not forbid file names
with leading dots (that will produce hidden files on unix-like
machines) and not even files called "-r *" (which may have disastrous
consequences for unwary users).
Still, allowing arbitrary characters in opens up a large trove of
problems. For instance, you may produce file names that our users
cannot type, or that will mess up directory displays, in particular
when the names don't happen to be good utf-8. Also, different names may
look the same, depending on the choices of glyphs, but even worse
when the terminal font is missing glyphs and all you see is a series
of non-informational fallbacks.
In short: I'm sure we should map the titles we get to ASCII. The
trouble is to decide how. Just replacing all non-ASCII with an
underscore is fast and foolproof but will lead to many collisions
(which we already handle, so it's annoyance more than a problem, but
still).
What worked reasonably well for me in other settings: Replacing all
non-ASCII c with
html.entities.codepoint2name.get(c, "_")[0]
For instance, an ä will become an a, and so will å or á or even
α.
Refactor
suggest_dataset_basename
(which has fairly parallel implementationsin ssap and sia on top) to produce correct file names across platforms. Discuss whether for security reasons we'd want to limit the charset to ASCII printables. See related #557
The text was updated successfully, but these errors were encountered: