Prefer to open zip files with unzip instead of 7z #5
I can confirm the issue |
So that we have a record on this: The problem you describe isn't a problem of 7z, it's a problem of whatever you used to create that zip file, and my humble interpretation of the sample file you showed on #mate @ freenode proves it. If you extract the files with unzip and then zip them with 7z, everything works fine... which leads me to believe that the zip wasn't created on a machine which supports UTF-8, but instead on one using some sort of iso-western encoding (if you want I can tell you exactly which). In other words, the problem is with whatever piece of trash that file was zipped with. There's no real problem and the performance of zip is worse than that of p7zip, so we're making a lot of users pay for a problem that was generated probably on a windows machine :) 7z does support UTF-8 properly. |
And why 'ideally' the priority to unzip/zip instead of 7z? Can you mention a single technical reason? Are you aware this will affect far more people than you? |
We will give only the option to choose the desired backend, nothing more |
@ketheriel, if the only thing that you want to do is to blame someone, then feel free to blame anyone you like. Yes, it is the fault of Windows not supporting utf8 properly. But the fact remains that 99% of zip files are created on the Windows platform. *nix users prefer tarballs instead. And the fact remains that all zip files that were created on Windows and that include non-latin filenames display as garbage in linux as it is. I don't care about blaming anyone, I care about fixing this. Of course I am aware that this will affect more people than me. It will fix this nightmare for other people too, not just me. This issue has been fixed in unzip 6.10b and I would like this fix to be carried into mate-file-archiver. Ideally, it should prefer unzip over p7zip, because newer versions of unzip behave properly. The technical reason is that otherwise filenames show as garbage. If that's not enough for you, then I don't know what else could be. Also, performance of unzip is better than p7zip, so your point doesn't stand there. And p7zip doesn't support utf8 charsets the same way that older versions of unzip didn't. Here's a bug report that says the same: Also, here are bug reports from debian, ubuntu and gentoo about the utf8 mess with unzip: |
Dude go dig some facts and leave the FUD; you are wrong since moment one, and one easy way to prove it is to make an archive in a UTF-8 environment and then if you extract it with p7z it works :) You know, before I answered I tested all potential scenarios; that's the difference between you and me! At least I test stuff out before coming big and bad and trying to move people to ugly hacks that actually don't fix anything, instead they only sweep the trash under the rug. How can that help users? :) Let me guess... now we tell others what software they should use? :) |
Hi dude. Troll much? Sorry, I won't bite. |
When I tried unzip 6.0, it wasn't even able to list file names in Unicode-enabled ZIP archives correctly. The beta version of unzip may be better, but its developer said it has known issues and the time of the next release is unknown. For non-Unicode-enabled ZIP archives, lsar/unar is the best tool to use, as far as I can tell. lsar/unar has built-in support for automatic encoding detection and encoding conversion. |
What about adding option to engrampa settings which tool to use? By default it would be 7z, but you may switch easily to unzip |
Newer p7zip (at least version 15.14.1) autodetects the correct encoding and converts it to utf8 properly. So, I think this can be closed now. Probably also fixes #102 |
Ubuntu 16.04 uses 9.20.1~dfsg.1-4.2, where did you manage to find this version?) By the way, will engrampa provide a backend choice or not? |
@Yanpas But I realize that maybe I was too quick to close this. It works with the file I posted here and similar ones, but not others. Also, the issue had turned from "make unzip the default" to "provide the option to choose which unpacker to use". I'm reopening... |
.debs for p7zip and p7zip-full 16.02 are here: x32 x64 it works ok with the file attached above, but fails on my attached example. unzip 6.0 works ok for me with both files. btw, does anyone know how to file a bug to p7zip about this? |
corresponding p7zip's issue: Btw, windows version of 7za 16.03 under wine handles my case2.zip correctly (but fails with 23-10-2012-b-fasi-eaep.zip, corresponding wine's issue: https://bugs.winehq.org/show_bug.cgi?id=41411). Same with 16.02 under wine. So the problem is seen only in linux build. Upd: got an answer from p7zip:
But even modern windows versions write .zip files with OEM-encoded filenames! |
How difficult would it be to put 7z at the end of the priority list for opening archives? Engrampa is extremely slow at opening regular archives when I have p7zip installed. If it's just a few lines of if-elses, I can try to look into it and submit a PR. |
There are some people here claiming that this isn't an actual issue and that people should instead create proper utf-8 based .zip files etc. I open a fully updated Greek Windows 10 installation. I right click on the desktop, select "New text file", and I get this filename: Νέο έγγραφο κειμένου.txt Then I try to uncompress it with unzip/7za, and I get the following results: Obviously, only the last try is correct. And since we're not able to fix Windows, or train millions of Windows users not to use the embedded zipping tools, or replace the millions of .zip files out there that already use the OEM charset, we should fix it on our side. So to make engrampa correctly open .zip files, I'm doing the following workarounds in my installations:
It would be most useful if engrampa had some option that would allow me to more properly apply my workaround without having to |
I would like to use unzip rather than 7zip. It would be nice if the user had the ability to change this via a dconf setting. |
Hi, this is still an issue with MATE 1.24, p7zip 16.02, and Windows 10 v2004 right click > create zip. I submitted a pull request that allows configuring engrampa to prefer unzip, by specifying the |
Why can't engrampa detect the appropriate DOS encoding from the system locale settings and set the environment variables automatically? |
That would be a patch for unzip, not for engrampa. There have been various efforts for that for more than 10 years, but none of them became mainstream enough to reach the distributions. But the "-O charset" environment variables are supported. We only ask from engrampa to allow us to prefer unzip, at least temporarily; engrampa code shouldn't bother with encodings at all. |
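For readers who want to try the environment-variable route mentioned above, here is a minimal Python sketch. It assumes an unzip build that carries the `-O charset` option (the distribution patch or a newer unzip), and it relies on unzip and zipinfo reading default options from the `UNZIP` and `ZIPINFO` environment variables. The helper names `unzip_env` and `list_archive` are made up for illustration and are not part of engrampa.

```python
import os
import subprocess


def unzip_env(charset, base_env=None):
    """Return a copy of the environment that passes `-O charset` to unzip/zipinfo.

    Assumption: the installed unzip understands `-O` (a distro patch or newer unzip).
    Note that any options already stored in UNZIP/ZIPINFO are overwritten.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["UNZIP"] = f"-O {charset}"
    env["ZIPINFO"] = f"-O {charset}"
    return env


def list_archive(path, charset):
    """List `path` with unzip, decoding legacy file names with `charset`."""
    result = subprocess.run(
        ["unzip", "-l", path],
        env=unzip_env(charset),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

For a Greek archive one would call something like `list_archive("some.zip", "cp737")`. Whether the names come out right still depends on the installed unzip.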
@alkisg thanks, maybe you can tell where unzip development is happening? Not ubuntu package, but unzip source code itself. |
Note that the Ubuntu link was just one of many; see a similar example for Arch Linux here, and I think most distros have something similar. unxed, if you're planning to add support for codepage autodetection to unzip, then it would probably be best to add it to p7zip instead. One way to quickly find the upstream for packages is to visit their debian pages and click on the "homepage" link to the right: https://tracker.debian.org/pkg/unzip => http://www.info-zip.org/UnZip.html (currently seems down) |
@alkisg thanks again, but info-zip.org seems to be down. As for p7zip, https://sourceforge.net/p/p7zip/bugs/187/
Should I hope the patch will be accepted if the developer is not even interested in the implementation? |
To me, a decompressing solution that has a 57% or even 95% success rate feels completely unacceptable. It's a bomb; users would decompress .zip files and at some point they'd discover that some or all of their filenames have been damaged; and they'd have to start looking all over their disk for such possibly damaged filenames from weeks or months ago, because no one notified them that the software they use only works sometimes. Facial recognition or OCR software is allowed to work "sometimes" by its nature, but not normal programs like archivers, file managers, calculators, editors etc. Users expect these to work consistently. |
To sum up:
Does everyone agree on these 2? The (1) part can be easily implemented by adding the following lines to unxed's existing patch:
    oemcp = getenv("OEMCP");
    if (!oemcp) {
        oemcp = "CP437";
        ...(rest of the locale-to-oemcp code here)
I tested that and it works fine:
    $ OEMCP=Shift_JIS 7za l nenngajyou-data.zip
    ...
    2014-10-28 09:43:06 ....A 21976 20162 nenngajyou-data/年賀状住所印刷テンプレート.odt
    2014-10-28 09:26:48 ....A 80823 77876 nenngajyou-data/年賀状住所録.ods |
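The lookup order proposed in that comment (an explicit `OEMCP` override first, then a locale-based guess, then CP437) can be sketched in Python as follows. This is only an illustration of the idea, not the p7zip patch itself. The three-entry `LOCALE_TO_OEMCP` table and the `resolve_oemcp` name are hypothetical, and the real patch carries a much larger table.

```python
import os

# Hypothetical, deliberately tiny locale-to-OEM-code-page table (illustration only).
LOCALE_TO_OEMCP = {
    "el_GR": "cp737",
    "ru_RU": "cp866",
    "ja_JP": "shift_jis",
}


def resolve_oemcp(environ=None):
    """Pick the OEM code page: OEMCP override, else locale guess, else CP437."""
    environ = os.environ if environ is None else environ

    # 1. An explicit user override always wins.
    override = environ.get("OEMCP")
    if override:
        return override

    # 2. Otherwise guess from the locale, e.g. "el_GR.UTF-8" -> "el_GR".
    lang = environ.get("LC_ALL") or environ.get("LC_CTYPE") or environ.get("LANG") or ""
    locale_name = lang.split(".")[0].split("@")[0]

    # 3. Fall back to CP437 when the locale is unknown.
    return LOCALE_TO_OEMCP.get(locale_name, "cp437")
```

Keeping the override first means a user with an unusual archive can always force the code page by hand instead of relying on the guess.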
Can you please post the full patch for p7zip? Thanks! (I see that the updated version uses |
Let's wait for @unxed to upload whatever he thinks is best for a final version, as he's the one that developed the patch in the first place. I used oemcp without the dash because that's what Windows calls it. E.g. google for "oemcp" including the quotes. |
Sorry, guys, overloaded with tasks now. One important thing: our p7zip patch was buggy. Instead of
we should use
The problem is MacOS X. It writes file names in UTF8, but does not set the UTF8 header flag. But we can assume that file names in an archive are written in the OEM code page by DOS and Windows archivers only, so we check the platform flag in the zip header and do not do code page conversion for archives created on OS X. Sample attached. Folder in archive should be named "Штабик 2020", not "╨и╤В╨░╨▒╨╕╨║ 2020". |
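A rough Python sketch of that rule, for illustration only: trust the UTF-8 flag when it is set, apply the OEM code page only when the "made by" host is DOS/Windows, and otherwise assume UTF-8. The `decode_name` helper and the choice of host codes 0 (FAT) and 10 (NTFS) as "DOS-like" are assumptions, not taken from the patch.

```python
UTF8_FLAG = 0x800  # general purpose bit 11: file name is UTF-8

# Assumption: host-system codes treated as DOS/Windows (0 = FAT, 10 = NTFS).
DOS_LIKE_HOSTS = {0, 10}


def decode_name(raw, flag_bits, host_system, oemcp="cp866"):
    """Decode raw zip file-name bytes following the rule described above."""
    if flag_bits & UTF8_FLAG:
        # The archiver declared UTF-8 explicitly.
        return raw.decode("utf-8")
    if host_system in DOS_LIKE_HOSTS:
        # Only DOS/Windows archivers are assumed to write OEM-encoded names.
        return raw.decode(oemcp, errors="replace")
    # Anything else (e.g. an archive made on OS X) is assumed to be UTF-8 already.
    return raw.decode("utf-8", errors="replace")


raw = "Штабик 2020".encode("utf-8")
print(decode_name(raw, flag_bits=0, host_system=3))  # name kept intact
print(raw.decode("cp866"))  # what a blind OEM conversion would produce
```

With Python's `zipfile`, the inputs would come from `ZipInfo.flag_bits` and `ZipInfo.create_system`. When the UTF-8 flag is unset, the raw bytes can usually be recovered with `info.filename.encode("cp437")`, since that is the codec `zipfile` applies in that case.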
Let's consider this p7zip patch version as 'relatively final': Of course we still need to fix last two issues from here #5 (comment). |
@unxed, great, thank you,
I think we should use the most common one, and the users can override it with the OEMCP environment variable. For example, for Greece, there were actually many different encodings, not just cp737. But if some user managed to create a .zip with cp869, I wouldn't want p7zip to try to autodetect that (with a 50% success rate!), as it's very rare. I would want to manually set OEMCP=cp869.
I think we should NOT try to workaround iconv shortcomings. Affected users should try to send a patch to iconv, which will make all programs that use iconv work, not just p7zip. Let's keep the code clean and patches where they belong.
Could you please attach it to the upstream bug report? I think we should close this engrampa issue now, and we should focus on the upstream bug reports, and on distribution-specific bug reports. |
Not sure how to find out the most common ones, so let's wait for reports from people who are actually using those locales.
Done :) UPD: Updated unzip patch for OEMCP env variable support also: UPD#2: UPD#3: libnatspec was last updated in 2010. It has some pros and cons compared to my approach. Pros: 1) it distinguishes uz_UZ and uz_UZ@cyrillic 2) it better supports Ukrainian; cons: it does not support Greek. There may be other differences as well. Seems libnatspec's data is not a raw Wine dump, but is manually tuned somehow. Request for code page info update from recent Wine: Etersoft/libnatspec#3 UPD#4: Another interesting issue is how to create .zip files on linux with filenames readable both by windows and *nixes and preserve all UTF8 characters when possible. It's not that easy, but still possible. We should write "0" to "HostOS" flag, "0" to UTF8 flag, write file names in OEM code page and also write UTF8 version of file names in additional 0x7075 field. This is exactly what winzip and winrar do. Unfortunately p7zip and info-zip do not mimic this behavior currently, so zip files generated by them have ruined filenames when opened on windows. Sample of correct multi-platform .zip with both OEM and UTF8 filenames versions (files inside should be named "абвгде" and "жзийкл"): |
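To make the 0x7075 part concrete, here is a small Python sketch that builds such an extra field by hand. It assumes the Info-ZIP "Unicode Path" layout: header ID 0x7075, data size, a version byte of 1, the CRC-32 of the name stored in the regular header field, then the UTF-8 name. The `unicode_path_extra` name is hypothetical, and the sketch only produces the extra-field bytes, not a complete archive.

```python
import struct
import zlib


def unicode_path_extra(oem_name: bytes, unicode_name: str) -> bytes:
    """Build a 0x7075 extra field carrying the UTF-8 version of a file name.

    oem_name is the byte string stored in the normal file-name field; its
    CRC-32 lets a reader check that the two names still belong together.
    """
    utf8 = unicode_name.encode("utf-8")
    # version (1 byte) + CRC-32 of the header name (4 bytes) + UTF-8 name
    payload = struct.pack("<BI", 1, zlib.crc32(oem_name) & 0xFFFFFFFF) + utf8
    # header ID (2 bytes) + payload size (2 bytes) + payload
    return struct.pack("<HH", 0x7075, len(payload)) + payload


oem = "абвгде".encode("cp866")  # name as a Russian Windows archiver would store it
field = unicode_path_extra(oem, "абвгде")
print(field.hex())
```

A reader that understands the field can prefer the UTF-8 name, while older tools simply fall back to the OEM-encoded one.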
@unxed, thank you for filing https://bugs.debian.org/965126! Regarding szcnick's p7zip fork, that would be another bug report for Debian to switch to that, as it's more actively maintained. |
I think the p7zip fork must change its name, and then people can propose it for packaging in distros; if not, I think it will never succeed. |
@sc0w Sounds logical. Maybe suggest it in their bug tracker? https://github.com/szcnick/p7zip/issues |
One more linux tool now has smart .zip oem charset detection implemented. Btw, there is another table of locale to oem code page translation. Not sure if it is better than mine. Also found one more bug in unzip. UPD: and one more. |
Python's ZipFile also suffers from charset problems. Filed two issues for it: |
Grand unified algorithm to read filenames from zip files correctly:
Patched p7zip uses exactly this method, and is able to process all zip files in my test set correctly (my test set contains several zips generated by different packers on windows, macos, linux, and by online services). The same algorithm should be used in any zip unpacker wishing to process non-latin filenames as gently as possible. Single line solution for Engrampa on Ubuntu 20.04+ amd64: |
As distros do not seem to be in a hurry to fix unzip, I found a workaround that can be used to mitigate the problem here and now. Introducing zipwrapper, a perl script wrapping around zip/unzip and solving all charset problems seamlessly. Now we can just replace all "zip" and "unzip" strings in fr-command-zip.c with "zipwrapper zip" and "zipwrapper unzip", select zip/unzip by default for zip archives as said in this ticket's title, and have all charset problems solved out of the box without waiting for distro maintainers or asking users to install some .deb, add a ppa and/or set some environment variables. Feedback is appreciated :) |
@jloqfjgk can you file an issue here: https://github.com/unxed/oemcp/issues |
Wrote a Perl script demonstrating this logic. Reads .zip, shows files in it, for each file detects correct filename encoding, shows decoded filename, suggests command line switches for popular archivers (unzip, unar, bsdtar) to extract this file correctly (zip format allows different files inside archive to have names in different charsets). An essential tool for everyone who wants to figure out how to work correctly with file name encodings in zip archives. https://github.com/unxed/oemcp/blob/master/ziplist Usage: ziplist filename.zip [-p] -p option is kind of self-testing: it will invoke all suggested archivers in "list files" mode to check if charset options were suggested correctly. Needs to have all three (unzip, unar, bsdtar) archivers installed for this mode to work right. PS: Perl's Archive::Zip itself can't do this job right out of the box, but fortunately can be easily extended for that. |
@alkisg can you please help with troubleshooting this issue: |
@unxed, I tried to reproduce it but I wasn't able to. I tested in Ubuntu 20.04 with LANG=en_US.UTF-8 and the last p7zip version in https://github.com/jinfeihan57/p7zip/releases. Maybe it's related to the build options in Arch, but I don't have Arch... |
Native 7zip linux port released, wow. Unfortunately, with the same charset bug |
Or libnatspec can be used instead as it was relatively recently updated. |
7z and also versions of unzip older than 6.10b, don't support utf8 properly. Zip archives created on windows that have files with non-latin filenames show up as garbage.
Here's a sample zip archive that displays this problem:
http://pnboy.pinguix.com/gapan/23-10-2012-b-fasi-eaep.zip
Same problem is with cbz files, which are actually zip files.
Removing/commenting out lines 527 and 530 from src/fr-command-7z.c fixes this and opens zip files with unzip, but it removes the mimetypes from 7z completely.
Ideally, priority should be given to unzip first and if unzip is not present, fall back to 7z, if 7z is installed.
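The requested priority order can be pictured with a few lines of Python. This is a sketch of the idea only, not engrampa's actual C code: the `pick_zip_backend` name is made up, and the tool names are simply looked up on `PATH`.

```python
import shutil


def pick_zip_backend(preferred=("unzip", "7z"), which=shutil.which):
    """Return the first installed tool from `preferred`, or None if none is found."""
    for tool in preferred:
        if which(tool):
            return tool
    return None


print(pick_zip_backend())  # result depends on what is installed on this machine
```

Making the order a parameter would also cover the later request in this thread to let users choose the backend themselves.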