Refactor the code that produces the name of the files returned by services #573

andamian · 2024-07-02T18:18:00Z

Refactor `suggest_dataset_basename` (which has fairly parallel implementations in ssap and sia on top) to produce correct file names across platforms. Discuss whether, for security reasons, we'd want to limit the charset to ASCII printables. See related #557.


msdemlei · 2024-07-03T09:59:00Z

On Tue, Jul 02, 2024 at 06:18:23PM +0000, Adrian wrote:

> Refactor `suggest_dataset_basename` (which has fairly parallel implementations in ssap and sia on top) to produce correct file names across platforms. Discuss whether for security reasons we'd want to limit the charset to ASCII printables. See related #557

When someone touches this code, I'd also like us to reconsider the design where you can pass in directories; this can lead to surprising error messages if the directory does not exist, and unless we have strong reasons to automatically write files outside the current directory, I'd prefer that we move away from that functionality.

As to limiting things to ASCII printables... well, I suppose in pyVO we have to assume that the service operators are well-meaning; fending off possible attacks from services (of which there are many) is beyond our means. For instance, I would not forbid file names with leading dots (which will produce hidden files on unix-like machines), and not even files called "-r *" (which may have disastrous consequences for unwary users).

Still, allowing arbitrary characters opens up a large trove of problems. For instance, you may produce file names that our users cannot type, or that will mess up directory displays, in particular when the names don't happen to be good utf-8. Also, different names may look the same depending on the choice of glyphs, and it gets even worse when the terminal font is missing glyphs and all you see is a series of uninformative fallbacks.

In short: I'm sure we should map the titles we get to ASCII. The trouble is deciding how. Just replacing all non-ASCII characters with an underscore is fast and foolproof but will lead to many collisions (which we already handle, so it's more an annoyance than a problem, but still).

What worked reasonably well for me in other settings: replacing each non-ASCII character c with `html.entities.codepoint2name.get(ord(c), "_")[0]`. For instance, an ä will become an a, and so will å or á or even α.
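To make the entity-name idea concrete, here is a minimal sketch of such a mapping. The function name `asciify_title` is hypothetical and not part of pyVO; the sketch also assumes that anything outside the printable ASCII range (including control characters) should go through the same lookup, which the comment above does not spell out.

```python
from html.entities import codepoint2name


def asciify_title(title):
    """Map a title to printable ASCII (hypothetical helper, not pyVO API).

    Printable ASCII characters are kept as they are.  Every other
    character is replaced by the first letter of its HTML entity name
    (e.g. "auml" gives "a"), or by an underscore when no entity name
    is defined for its codepoint.
    """
    result = []
    for c in title:
        if " " <= c <= "~":
            # Printable ASCII (0x20..0x7e): keep unchanged.
            result.append(c)
        else:
            # codepoint2name is keyed by integer codepoints, hence ord().
            result.append(codepoint2name.get(ord(c), "_")[0])
    return "".join(result)
```

One caveat with this approach is that symbols are mapped by entity name too, so a multiplication sign (entity name "times") becomes a "t". Whether that is acceptable, or whether only letters should go through the lookup, would be a design decision for the refactoring.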

andamian added the bug label Jul 2, 2024

andamian added this to the Future milestone Jul 2, 2024

andamian mentioned this issue Jul 2, 2024

Defuse sia title characters #557

Merged

bsipocz removed this from the Future milestone Jul 2, 2024
