Prefer to open zip files with unzip instead of 7z #5

Open
gapan opened this issue Oct 24, 2012 · 77 comments
Comments

@gapan

gapan commented Oct 24, 2012

7z, and also versions of unzip older than 6.10b, don't support UTF-8 properly. Zip archives created on Windows that contain files with non-Latin filenames show up as garbage.

Here's a sample zip archive that displays this problem:
http://pnboy.pinguix.com/gapan/23-10-2012-b-fasi-eaep.zip

The same problem occurs with cbz files, which are actually zip files.

Removing/commenting out lines 527 and 530 from src/fr-command-7z.c fixes this and opens zip files with unzip, but it removes the mimetypes from 7z completely.

Ideally, priority should be given to unzip first and if unzip is not present, fall back to 7z, if 7z is installed.

@stefano-k
Collaborator

I can confirm the issue

@ghost

ghost commented Oct 24, 2012

So that we have a record on this:

The problem you describe isn't a problem of 7z, it's a problem of whatever you used to create that zip file, and my humble interpretation of the sample file you showed on #mate @ freenode proves it.

If you extract the files with unzip and then zip them with 7z, everything works fine... which leads me to believe that the zip wasn't created on a machine which supports UTF-8, but instead on one using some sort of ISO Western encoding (if you want, I can tell you exactly which).

In other words, the problem is with whatever piece of trash that file was zipped with. There's no real problem, and the performance of zip is worse than that of p7zip, so we're making a lot of users pay for a problem that was probably generated on a Windows machine :)

7z does support UTF-8 properly.

@ghost

ghost commented Oct 24, 2012

And why 'ideally' the priority to unzip/zip instead of 7z? Can you mention a single technical reason? Are you aware this will affect far more people than you?

@stefano-k
Collaborator

We will give only the option to choose the desired backend, nothing more

@gapan
Author

gapan commented Oct 25, 2012

@ketheriel, if the only thing that you want to do is to blame someone, then feel free to blame anyone you like.

Yes, it is the fault of Windows not supporting utf8 properly. But the fact remains that 99% of zip files are created on the Windows platform. *nix users prefer tarballs instead. And the fact remains that all zip files that were created on Windows and that include non-latin filenames display as garbage in linux as it is. I don't care about blaming anyone, I care about fixing this. Of course I am aware that this will affect more people than me. It will fix this nightmare for other people too, not just me.

This issue has been fixed in unzip 6.10b and I would like this fix to be carried into mate-file-archiver.

Ideally, it should prefer unzip over p7zip, because newer versions of unzip behave properly. The technical reason is that otherwise filenames show as garbage. If that's not enough for you, then I don't know what else could be.

Also, performance of unzip is better than p7zip, so your point doesn't stand there.

And p7zip doesn't support utf8 charsets the same way that older versions of unzip didn't. Here's a bug report that says the same:
https://bugs.archlinux.org/task/18691

Also, here are bug reports from debian, ubuntu and gentoo about the utf8 mess with unzip:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=197427
https://bugs.launchpad.net/debian/+source/unzip/+bug/10979
https://bugs.launchpad.net/ubuntu/+source/unzip/+bug/203609
https://bugs.gentoo.org/show_bug.cgi?id=69945

@ghost

ghost commented Oct 25, 2012

Dude go dig some facts and leave the FUD; you are wrong since moment one, and one easy way to prove it is to make an archive in a UTF-8 environment and then if you extract it with p7z it works :)

You know, before I answered I tested all potential scenarios; that's the difference between you and me! At least I test stuff out before coming in big and bad and trying to move people to ugly hacks that don't actually fix anything and only sweep the trash under the rug. How can that help users? :) Let me guess... now we tell others what software they should use? :)

@gapan
Author

gapan commented Oct 25, 2012

Hi dude. Troll much? Sorry, I won't bite.

@mahiuchun

I tried unzip 6.0, and it isn't even able to list file names in Unicode-enabled ZIP archives correctly.
But 7z can handle Unicode enabled ZIP archives with no problem.

Beta version of unzip may be better, but its developer said it has known issues and the time of next release is unknown.

For non-Unicode enabled ZIP archives, lsar/unar is the best tool to use, as far as I can tell. lsar/unar has built-in support of auto encoding detection and encoding conversion.
http://code.google.com/p/theunarchiver/

@Yanpas

Yanpas commented May 1, 2015

What about adding an option to the engrampa settings for which tool to use? By default it would be 7z, but you could easily switch to unzip.

@gapan
Author

gapan commented Jun 8, 2016

Newer p7zip (at least version 15.14.1) autodetects the correct encoding and converts it to utf8 properly. So, I think this can be closed now.

Probably also fixes #102

@gapan gapan closed this as completed Jun 8, 2016
@Yanpas

Yanpas commented Jun 8, 2016

Ubuntu 16.04 uses 9.20.1~dfsg.1-4.2, where did you manage to find this version?)

By the way, will engrampa provide a choice of backend or not?

@gapan
Author

gapan commented Jun 9, 2016

@Yanpas
I compiled it myself. Sorry, I don't use ubuntu and I cannot give instructions for it.

But I realize that maybe I was too quick to close this. It works with the file I posted here and similar ones, but not with others. Also, the issue had turned from "make unzip the default" to "provide the option to choose which unpacker to use". I'm reopening...

@gapan gapan reopened this Jun 9, 2016
@unxed

unxed commented Sep 29, 2016

.debs for p7zip and p7zip-full 16.02 are here:

x32
https://packages.debian.org/sid/i386/p7zip/download
https://packages.debian.org/sid/i386/p7zip-full/download

x64
https://packages.debian.org/sid/amd64/p7zip/download
https://packages.debian.org/sid/amd64/p7zip-full/download

it works ok with the file attached above, but fails on my attached example.

unzip 6.0 works ok for me with both files.
case2.zip

btw, does anyone know how to file a bug against p7zip about this?

@Yanpas

Yanpas commented Sep 29, 2016

@unxed

unxed commented Sep 29, 2016

corresponding p7zip's issue:
https://sourceforge.net/p/p7zip/bugs/187/

Btw, windows version of 7za 16.03 under wine handles my case2.zip correctly (but fails with 23-10-2012-b-fasi-eaep.zip, corresponding wine's issue: https://bugs.winehq.org/show_bug.cgi?id=41411). Same with 16.02 under wine. So the problem is seen only in linux build.

Upd: got an answer from p7zip:

It uses OEM (DOS) encoding.
p7zip doesn't support it.

But even modern windows versions write .zip files with OEM-encoded filenames!

@txtsd

txtsd commented Jan 6, 2017

How difficult would it be to put 7z at the end of the priority list for opening archives? Engrampa is extremely slow at opening regular archives when I have p7zip installed. If it's just a few lines of if-elses, I can try to look into it and submit a PR.

@alkisg

alkisg commented May 22, 2017

Some people here claim that this isn't an actual issue and that people should instead create proper UTF-8 based .zip files etc.
Here is an example that should prove how valid this issue is, i.e. just one of the ways users still create .zip files with OEM encoding.

I open a fully updated Greek Windows 10 installation. I right click on the desktop, select "New text file", and I get this filename: Νέο έγγραφο κειμένου.txt
Then I right click on that file, select "Send to zip", and name the zip file "win10test.zip". I'm attaching it here:
win10test.zip

Then I try to uncompress it with unzip/7za, and I get the following results:
7za l win10test.zip:
2017-05-22 08:03:38 ....A 0 0 �⦠â��¨�­¦ ¡� £â¤¦¬.txt
unzip -l win10test.zip:
0 2017-05-22 08:03 Мтж тЪЪиШнж бЬагтджм.txt
unzip -l -O cp737 win10test.zip:
0 2017-05-22 08:03 Νέο έγγραφο κειμένου.txt

Obviously, only the last try is correct. And since we're not able to fix Windows, or train millions of Windows users not to use the embedded zipping tools, or replace the millions of .zip files out there that already use the OEM charset, we should fix it on our side.
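To make the three listings above concrete, here is a minimal Python sketch. It assumes the Windows zipper stored the name as raw cp737 bytes, which is what the working `-O cp737` run suggests; the helper name `decode_as` is made up for this illustration.

```python
# Sketch: one stored byte string, three interpretations.
name = "Νέο έγγραφο κειμένου.txt"

# Assumption: the archive holds the name in the Greek OEM code page (cp737).
raw = name.encode("cp737")


def decode_as(raw_bytes: bytes, codepage: str) -> str:
    """Decode stored filename bytes with the given code page."""
    return raw_bytes.decode(codepage, errors="replace")


for codepage in ("cp737", "cp866", "cp437"):
    print(f"{codepage}: {decode_as(raw, codepage)}")
```

Decoding with cp866 gives the same garbage that plain `unzip -l` printed above, so in that run unzip apparently guessed the Russian OEM code page. The bytes in the archive are fine; only the guess about their charset differs.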

So to make engrampa correctly open .zip files, I'm doing the following workarounds in my installations:

  1. sudo chmod -x /usr/bin/7z /usr/bin/7za
  2. Create a wrapper in /usr/local/bin/unzip, that runs export UNZIP="-O $charset"; export ZIPINFO="-O $charset" before exec'ing /usr/bin/unzip.

It would be most useful if engrampa had some option that would allow me to more properly apply my workaround without having to chmod -x system files, for example an option to define the preferred order for the various zip tools that are installed.
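For illustration, the wrapper from step 2 could look roughly like the Python sketch below. It relies on the `-O` option and the `UNZIP`/`ZIPINFO` environment variables exactly as used in this comment (the option comes from distribution patches, so it may be missing from other unzip builds); the function names are hypothetical.

```python
import os

REAL_UNZIP = "/usr/bin/unzip"  # path taken from the workaround above


def unzip_environment(charset: str, base_env) -> dict:
    """Return a copy of base_env with -O preset for unzip and zipinfo."""
    env = dict(base_env)
    env["UNZIP"] = f"-O {charset}"
    env["ZIPINFO"] = f"-O {charset}"
    return env


def run_unzip(argv: list, charset: str) -> None:
    """Replace the current process with the real unzip, charset preset."""
    env = unzip_environment(charset, os.environ)
    os.execve(REAL_UNZIP, [REAL_UNZIP, *argv], env)


# Example call (not executed here):
# run_unzip(["-l", "win10test.zip"], "cp737")
```

Keeping the environment construction separate from the `execve` call makes the charset logic easy to check without touching the system binary.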

@TomaszGasior

I would like to use unzip rather than 7zip. It would be nice if the user had the ability to change this via a dconf setting.
If Engrampa uses unzip while creating or unpacking an archive, the progress bar in the GUI is more accurate: it shows the actual number of files instead of a "please wait" message.

@alkisg

alkisg commented Jun 21, 2020

Hi, this is still an issue with the MATE 1.24, p7zip 16.02, and Windows 10 v2004 right click > create zip.
The p7zip developer has replied "use 7zip via wine instead", which I believe is a strong reason to prefer unzip over p7zip, but anyway, ...

I submitted a pull request that allows to configure engrampa to prefer unzip, by specifying the UNZIP="-O cp737" environment variable.

@unxed

unxed commented Jun 21, 2020

Why can't engrampa detect the appropriate DOS encoding from the system locale settings and set the environment variables automatically?

@alkisg

alkisg commented Jun 21, 2020

That would be a patch for unzip, not for engrampa. There have been various efforts for that for more than 10 years, but none of them became mainstream enough to reach the distributions. But the "-O charset" environment variables are supported.

We only ask engrampa to allow us to prefer unzip, at least temporarily; engrampa code shouldn't bother with encodings at all.

@unxed

unxed commented Jun 21, 2020

@alkisg thanks, maybe you can tell where unzip development is happening? Not ubuntu package, but unzip source code itself.

@alkisg

alkisg commented Jun 21, 2020

Note that the Ubuntu link was just one of many; see a similar example for Arch Linux here, and I think most distros have something similar.

unxed, if you're planning to add support for codepage autodetection to unzip, then it would probably be best to add it to p7zip instead.
Then engrampa wouldn't need to be fixed at all, it could just continue to prefer p7zip.

One way to quickly find the upstream for packages, is to visit their debian pages, and click on the "homepage" link to the right:

https://tracker.debian.org/pkg/unzip => http://www.info-zip.org/UnZip.html (currently seems down)
https://tracker.debian.org/pkg/p7zip => http://p7zip.sourceforge.ne

@unxed

unxed commented Jun 21, 2020

@alkisg thanks again, but info-zip.org seems to be down.

As for p7zip, https://sourceforge.net/p/p7zip/bugs/187/

Probably the p7zip developer doesn't think this feature is important enough, or it may be difficult to implement.

Should I hope the patch will be accepted if the developer isn't even interested in the implementation?

@alkisg

alkisg commented Jun 29, 2020

Encoding: Shift_JIS (57% confidence)

To me, a decompressing solution that has a 57% or even 95% success rate feels completely unacceptable. It's a bomb; users would decompress .zip files and at some point they'd discover that some or all of their filenames have been damaged, and they'd have to start looking across their whole disk for possibly damaged filenames from weeks or months ago, because no one notified them that the software they use only works sometimes.

Facial recognition or OCR software is allowed to work "sometimes" by its nature, but not normal programs like archivers, file managers, calculators, editors etc. Users expect these to work consistently.

@alkisg

alkisg commented Jun 29, 2020

To sum up:

  1. If OEMCP is defined, we use it. This allows using any iconv encoding without relying on LANG. It does require manual configuration from the user or a UI option, and that's fine.
  2. If not, we try to map the current locale to OEMCP. This allows users to unzip files from their locale without configuring anything, and it's the most important part. The requirement to "have the locale generated" doesn't matter in this case, as we're using the user's locale, so it's already generated.

Does everyone agree on these 2?

The (1) part can be easily implemented by adding the following lines to unxed's existing patch:

    oemcp = getenv("OEMCP");

    if (!oemcp) {
        oemcp = "CP437";
        ...(rest of the locale-to-oemcp code here)

I tested that and it works fine:

$ OEMCP=Shift_JIS 7za l nenngajyou-data.zip 
...
2014-10-28 09:43:06 ....A        21976        20162  nenngajyou-data/年賀状住所印刷テンプレート.odt
2014-10-28 09:26:48 ....A        80823        77876  nenngajyou-data/年賀状住所録.ods
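The two rules summed up in this comment can be sketched in Python as follows. This is not the patch itself: the table is a tiny illustrative subset (the real locale-to-code-page table in the patch is much longer), and the names `LOCALE_TO_OEMCP` and `pick_oem_codepage` are made up for this example.

```python
# Illustrative subset only; assumed entries, not the table from the patch.
LOCALE_TO_OEMCP = {
    "el_GR": "CP737",  # Greek, as used earlier in this thread
    "ru_RU": "CP866",  # Russian
    "ja_JP": "CP932",  # Japanese (Shift_JIS family)
}
DEFAULT_OEMCP = "CP437"  # same fallback as in the C snippet above


def pick_oem_codepage(environ, locale_name):
    """Rule 1: honor OEMCP if set. Rule 2: map the locale to a code page."""
    override = environ.get("OEMCP")
    if override:
        return override
    # Strip encoding and modifier: "el_GR.UTF-8@euro" -> "el_GR"
    base = (locale_name or "").split(".")[0].split("@")[0]
    return LOCALE_TO_OEMCP.get(base, DEFAULT_OEMCP)
```

With `OEMCP=Shift_JIS` in the environment the locale is never consulted, which matches the `7za l` run shown above.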

@ghost

ghost commented Jun 29, 2020

Can you please post the full patch for p7zip? Thanks!

(I see that the updated version uses oemcp while the previous version uses oem_cp, so I am not sure whether I have patched correctly. Sorry for the inconvenience caused.)

@alkisg

alkisg commented Jun 29, 2020

Let's wait for @unxed to upload whatever he thinks is best for a final version, as he's the one that developed the patch in the first place.

I used oemcp without the underscore because that's what Windows calls it. E.g. google for "oemcp" including the quotes.

@unxed

unxed commented Jul 15, 2020

Sorry, guys, overloaded with tasks now.

One important thing: our p7zip patch was buggy.

Instead of

+  if (!isUtf8) {
+    const char *lc_to_cp_table[] = {

we should use

+  Byte hostOS = GetHostOS();
+  if (!isUtf8 && ((hostOS == NFileHeader::NHostOS::kFAT) || (hostOS == NFileHeader::NHostOS::kNTFS))) {
+    const char *lc_to_cp_table[] = {

The problem is Mac OS X. It writes file names in UTF-8 but does not set the UTF-8 header flag. But we can assume that file names in an archive are written in an OEM code page by DOS and Windows archivers only, so we check the platform flag in the zip header and do not do code page conversion for archives created on OS X.

Sample attached. Folder in archive should be named "Штабик 2020", not "╨и╤В╨░╨▒╨╕╨║ 2020".
1.zip
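The corrected condition can be tried out from Python, whose `zipfile` module exposes the same two header fields as `ZipInfo.flag_bits` and `ZipInfo.create_system`. This is only a sketch of the check, not the p7zip code; the host OS values 0 (FAT) and 11 (NTFS) are the ones from the zip format, and the function names are hypothetical.

```python
import io
import zipfile

HOST_FAT = 0      # MS-DOS / FAT
HOST_NTFS = 11    # Windows NTFS
UTF8_FLAG = 0x800  # general purpose bit 11


def needs_oem_conversion(info: zipfile.ZipInfo) -> bool:
    """Mirror the patched condition: no UTF-8 flag and a DOS/Windows host."""
    is_utf8 = bool(info.flag_bits & UTF8_FLAG)
    return not is_utf8 and info.create_system in (HOST_FAT, HOST_NTFS)


def report(zip_file) -> list:
    """List (name, host OS, convert?) for every entry of an archive."""
    with zipfile.ZipFile(zip_file) as archive:
        return [(i.filename, i.create_system, needs_oem_conversion(i))
                for i in archive.infolist()]
```

Running `report("1.zip")` on the attached sample should show a non-Windows host value and therefore no conversion, assuming the archive really was written on OS X as described.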

@unxed

unxed commented Jul 15, 2020

Let's consider this p7zip patch version as 'relatively final':
https://github.com/unxed/oemcp/blob/master/p7zip_oemcp_ZipItem.cpp.patch

Of course we still need to fix last two issues from here #5 (comment).

@alkisg

alkisg commented Jul 16, 2020

@unxed, great, thank you,

There are two CPs for "az_AZ" in our table. Not sure how to distinguish...

I think we should use the most common one, and the users can override it with the OEMCP environment variable. For example, for Greece, there were actually many different encodings, not just cp737. But if some user managed to create a .zip with cp869, I wouldn't want p7zip to try to autodetect that (with a 50% success rate!), as it's very rare. I would want to manually set OEMCP=cp869.
(edit: note that this isn't related to the Linux environment where p7zip runs, but to the Windows environment where the .zip was created, that's why I'm suggesting we shouldn't try to autodetect that)

iconv does not support CP720...

I think we should NOT try to workaround iconv shortcomings. Affected users should try to send a patch to iconv, which will make all programs that use iconv work, not just p7zip. Let's keep the code clean and patches where they belong.

Let's consider this p7zip patch version as 'relatively final'...

Could you please attach it to the upstream bug report?

I think we should close this engrampa issue now, and we should focus on the upstream bug reports, and on distribution-specific bug reports.
I will file a bug report in Debian for this, and I think I should link mainly to the upstream bug, not to an engrampa issue, even though here's where most of the chat happened... So let's continue the chat there in the p7zip bug tracker. Many thanks to the MATE developers that hosted all this chat here.

@unxed

unxed commented Jul 16, 2020

we should use the most common one

Not sure how to find out which ones are most common, so let's wait for reports from people who actually use those locales.

Could you please attach it to the upstream bug report?

Done :)

UPD: Updated unzip patch for OEMCP env variable support also:
https://sourceforge.net/p/infozip/patches/29/

UPD#2:
unzip patch in Debian https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=197427#70
p7zip patch in Debian https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=965126

UPD#3:
Found older similar approach:
https://github.com/vitlav/libnatspec
It also uses CP data from Wine: https://github.com/vitlav/libnatspec/blob/master/lib/data/get_charset_data.h
There are patches for unzip and p7zip using this lib:
https://github.com/zip-i18n
https://aur.archlinux.org/packages/unzip-natspec/
https://aur.archlinux.org/packages/p7zip-natspec/
...and even ppa with builds for 18.04 https://launchpad.net/~spvkgn/+archive/ubuntu/p7zip-natspec

libnatspec was last updated in 2010. It has some pros and cons compared with my approach. Pros: 1) it distinguishes uz_UZ from uz_UZ@cyrillic; 2) it supports Ukrainian better. Cons: it does not support Greek. There may be other differences as well. It seems libnatspec's data is not a raw Wine dump but has been manually tuned somehow. Request for a code page info update from recent Wine: Etersoft/libnatspec#3

UPD#4: Another interesting issue is how to create .zip files on Linux with filenames readable by both Windows and *nixes while preserving all UTF-8 characters when possible. It's not that easy, but still possible. We should write "0" to the "HostOS" flag and "0" to the UTF-8 flag, write file names in the OEM code page, and also write the UTF-8 version of the file names in the additional 0x7075 field. This is exactly what WinZip and WinRAR do. Unfortunately p7zip and Info-ZIP do not currently mimic this behavior, so zip files generated by them have ruined filenames when opened on Windows. Sample of a correct multi-platform .zip with both OEM and UTF-8 versions of the filenames (files inside should be named "абвгде" and "жзийкл"):
winzip.zip
Suggested zip-i18n developer to implement this logic: zip-i18n/p7zip#1 zip-i18n/zip#1
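As a rough illustration of the 0x7075 field mentioned in UPD#4, the Python sketch below builds one such extra field following the layout in the zip format specification (header ID, data size, version 1, CRC-32 of the stored name, UTF-8 name). The function name is hypothetical, and cp866 is only assumed here as the OEM code page for the Cyrillic sample names.

```python
import struct
import zlib

UNICODE_PATH_ID = 0x7075  # Info-ZIP Unicode Path Extra Field


def unicode_path_extra(stored_name: bytes, unicode_name: str) -> bytes:
    """Build a Unicode Path Extra Field for one zip entry.

    stored_name is the OEM-encoded name exactly as written to the header;
    its CRC-32 lets readers notice when the extra field has gone stale.
    """
    utf8_name = unicode_name.encode("utf-8")
    body = struct.pack("<BI", 1, zlib.crc32(stored_name)) + utf8_name
    return struct.pack("<HH", UNICODE_PATH_ID, len(body)) + body


name = "абвгде"
extra = unicode_path_extra(name.encode("cp866"), name)
```

A writer following the recipe above would put `name.encode("cp866")` in the regular filename field, leave the UTF-8 flag at 0, and append `extra` to the entry's extra data.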

@unxed

unxed commented Jul 20, 2020

Patch accepted to szcnick's p7zip fork, hooray! @alkisg what about switching to this repo as source for your ppa builds?

p7zip-project/p7zip@e56ea97

@alkisg

alkisg commented Jul 23, 2020

@unxed, thank you for filing https://bugs.debian.org/965126!
All affected users in Debian-based distributions should comment there, so that the patch gets accepted faster.
For other distributions, similar bug reports should be filed.

Regarding szcnick's p7zip fork, that would be another bug report for Debian to switch to that, as it's more actively maintained.
The PPA builds are a temporary solution until the patch is accepted in Debian; it's not a good idea to switch to a different source than Debian as then this will need to be maintained indefinitely.

@sc0w
Member

sc0w commented Jul 27, 2020

I think the p7zip fork must change its name, and then people can propose it for packaging in distros; otherwise I think it will never succeed.

@unxed

unxed commented Jul 28, 2020

@sc0w Sounds logical. Maybe suggest it in their bug tracker? https://github.com/szcnick/p7zip/issues

@unxed

unxed commented Oct 3, 2020

One more Linux tool now has smart .zip OEM charset detection implemented. Btw, there is another table for locale to OEM code page translation. Not sure if it is better than mine.

Also found one more bug in unzip. UPD: and one more.

@unxed

unxed commented Oct 4, 2020

Python's ZipFile also suffers from charset problems. Filed two issues for it:
https://bugs.python.org/issue41928 (about supporting unicode name extra field 0x7075)
https://bugs.python.org/issue41929 (about using system locale for oem code page selection)
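Until something like that lands, a common workaround on the Python side is to undo the cp437 decoding that `zipfile` applies to names without the UTF-8 flag. The sketch below shows the idea; `recover_name` is a hypothetical helper, and the caller still has to know (or guess) the right OEM code page.

```python
import zipfile


def recover_name(info: zipfile.ZipInfo, oem_codepage: str) -> str:
    """Undo zipfile's cp437 guess for entries without the UTF-8 flag."""
    if info.flag_bits & 0x800:
        return info.filename  # already decoded as UTF-8
    raw = info.filename.encode("cp437")  # back to the stored bytes
    return raw.decode(oem_codepage)
```

For the Greek sample from earlier in the thread this would be called as `recover_name(info, "cp737")` for each entry of `ZipFile.infolist()`.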

@unxed

unxed commented Oct 4, 2020

Grand unified algorithm to read filenames from zip files correctly:

  1. Does the zip entry have a «Unicode Path Extra Field» (0x7075)? Use it for the file name.
  2. Is the Unicode flag (0x800) set in the «Flags» field of the zip entry? Assume the «Filename» field is in UTF-8.
  3. Does the «HostOS» field of the zip entry have a value of 0 (FAT) or 11 (NTFS)? Assume the «Filename» field is in the OEM charset corresponding to the system locale.
  4. Otherwise, assume the «Filename» field is in UTF-8.

Patched p7zip uses exactly this method, and is able to process all zip files in my test set correctly (my test set contains several zips generated by different packers on windows, macos, linux, and by online services). The same algorithm should be used in any zip unpacker wishing to process non-latin filenames as gently as possible.
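The four steps listed above can be sketched in Python over the raw header fields. This is a minimal illustration and not the p7zip patch: the function names are hypothetical, the OEM code page is passed in by the caller instead of being derived from the locale, and the CRC check on the extra field comes from the zip specification rather than from the list itself.

```python
import struct
import zlib


def unicode_path_from_extra(extra: bytes, raw_name: bytes):
    """Return the UTF-8 name from a 0x7075 extra field, or None."""
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        data = extra[pos + 4:pos + 4 + size]
        pos += 4 + size
        if header_id != 0x7075 or len(data) < 5:
            continue
        version, name_crc = struct.unpack_from("<BI", data)
        # Assumption from the spec: ignore the field if the CRC is stale.
        if version == 1 and name_crc == zlib.crc32(raw_name):
            return data[5:].decode("utf-8")
    return None


def decode_zip_filename(raw_name, flags, host_os, extra, oem_codepage):
    """Apply steps 1 to 4 to one entry's raw filename bytes."""
    name = unicode_path_from_extra(extra, raw_name)
    if name is not None:             # step 1: Unicode Path Extra Field
        return name
    if flags & 0x800:                # step 2: UTF-8 flag
        return raw_name.decode("utf-8")
    if host_os in (0, 11):           # step 3: FAT or NTFS host
        return raw_name.decode(oem_codepage)
    return raw_name.decode("utf-8")  # step 4: everything else
```

A name that is not valid in the assumed charset raises `UnicodeDecodeError` here; a real unpacker would more likely fall back to a lossy decoding than abort.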

Single line solution for Engrampa on Ubuntu 20.04+ amd64:
wget https://github.com/unxed/oemcp/raw/master/p7zip-oemcp.deb && sudo dpkg -i p7zip-oemcp.deb

@unxed

unxed commented Oct 5, 2020

As distros do not seem to be in a hurry to fix unzip, I found a workaround that can be used to mitigate the problem here and now. Introducing zipwrapper, a Perl script that wraps around zip/unzip and solves all charset problems seamlessly.

Now we can just replace all "zip" and "unzip" strings in fr-command-zip.c with "zipwrapper zip" and "zipwrapper unzip", select zip/unzip by default for zip archives as requested in this ticket's title, and have all charset problems solved out of the box without waiting for distro maintainers or asking users to install some .deb, add a PPA and/or set some environment variables.

Feedback is appreciated :)

@unxed

unxed commented Oct 8, 2020

@jloqfjgk can you file an issue here: https://github.com/unxed/oemcp/issues

@unxed

unxed commented Oct 9, 2020

Grand unified algorithm to read filenames from zip files correctly:

1. Does the zip entry have a «[Unicode Path Extra Field](https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT)» (0x7075)? Use it for the file name.

2. Is the Unicode flag (0x800) set in the «Flags» field of the zip entry? Assume the «Filename» field is in UTF-8.

3. Does the «HostOS» field of the zip entry have a value of 0 (FAT) or 11 (NTFS)? Assume the «Filename» field is in the OEM charset [corresponding to system locale](https://github.com/unxed/oemcp/blob/master/oemcp.txt).

4. Otherwise, assume the «Filename» field is in UTF-8.

Wrote a Perl script demonstrating this logic. Reads .zip, shows files in it, for each file detects correct filename encoding, shows decoded filename, suggests command line switches for popular archivers (unzip, unar, bsdtar) to extract this file correctly (zip format allows different files inside archive to have names in different charsets). An essential tool for everyone who wants to figure out how to work correctly with file name encodings in zip archives.

https://github.com/unxed/oemcp/blob/master/ziplist

Usage: ziplist filename.zip [-p]

-p option is kind of self-testing: it will invoke all suggested archivers in "list files" mode to check if charset options were suggested correctly. Needs to have all three (unzip, unar, bsdtar) archivers installed for this mode to work right.

PS: Perl's Archive::Zip itself can't do this job right out of the box, but fortunately can be easily extended for that.

@unxed

unxed commented Jan 10, 2021

@alkisg can you please help with troubleshooting this issue:
https://github.com/jinfeihan57/p7zip/issues/112

@alkisg

alkisg commented Jan 10, 2021

@unxed, I tried to reproduce it but I wasn't able to. I tested in Ubuntu 20.04 with LANG=en_US.UTF-8 and the last p7zip version in https://github.com/jinfeihan57/p7zip/releases. Maybe it's related to the build options in Arch, but I don't have Arch...

@unxed

unxed commented Mar 12, 2021

Native 7zip linux port released, wow. Unfortunately, with the same charset bug
https://sourceforge.net/p/sevenzip/discussion/45797/thread/cec5e63147/?page=1&limit=25#eaa7

@unxed

unxed commented Aug 15, 2023

Let's consider this p7zip patch version as 'relatively final': https://github.com/unxed/oemcp/blob/master/p7zip_oemcp_ZipItem.cpp.patch

Of course we still need to fix last two issues from here #5 (comment).

Or libnatspec can be used instead as it was relatively recently updated.


9 participants