ZIM for ab_all_maxi has different sizes between 1.13 and 1.14 #2071
ZIM file links:
Here is a .tsv for every entry in the ZIMs. It has the format: path, June size, July size. Doing some analysis in pandas, we see that 681 webps are larger in July, out of 2969 total webps. The mean size difference is +10,661 bytes for the webps that are larger. However, the total difference, including webps that are smaller in July, is only 6.77 MB.
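The comparison described above can be sketched roughly as follows. This is not the notebook's pandas code; it is a minimal standard-library version that assumes a header-less, tab-separated file with columns path, June size, July size. The sample rows and the `webp_size_stats` helper are hypothetical and only illustrate the calculation.

```python
import csv
import io

# Hypothetical sample rows: path <TAB> June size <TAB> July size (bytes).
SAMPLE_TSV = (
    "I/a.webp\t1000\t3000\n"
    "I/b.webp\t5000\t4000\n"
    "I/c.webp\t2000\t2000\n"
    "A/Main_Page\t700\t900\n"
)


def webp_size_stats(tsv_text):
    """Summarise June-to-July size changes for .webp entries only."""
    diffs = []
    for path, june, july in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        # Skip everything that is not a webp (text entries, indexes, ...).
        if path.lower().endswith(".webp"):
            diffs.append(int(july) - int(june))
    larger = [d for d in diffs if d > 0]
    return {
        "total_webps": len(diffs),
        "larger_in_july": len(larger),
        # Mean is taken over the webps that grew, not over all webps.
        "mean_increase": sum(larger) / len(larger) if larger else 0.0,
        # Net change also counts webps that shrank.
        "net_difference": sum(diffs),
    }


stats = webp_size_stats(SAMPLE_TSV)
print(stats)
```

Keeping the mean over the larger files separate from the net difference matters here, because webps that shrank in July offset part of the growth.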
Clearly I'm doing something wrong, because the total sums of July sizes and June sizes are only 107 MB and 110 MB: I tried to iterate over all entries in the ZIM using the
Your tsv is not filtered on WEBP files; it contains all entries, including compressed ones (text) and indexes.
July WEBP files total 24129838 bytes (23.01 MiB).
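Since part of this thread hinges on units, here is a small sketch of the conversion behind the figure above. The helper names `to_mib` and `to_mb` are made up for illustration; only the byte count comes from the comment.

```python
def to_mib(num_bytes):
    """Convert bytes to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 ** 2)


def to_mb(num_bytes):
    """Convert bytes to decimal megabytes (1 MB = 1,000,000 bytes)."""
    return num_bytes / 1_000_000


july_webp_bytes = 24129838  # figure quoted in the comment above

print(f"{to_mib(july_webp_bytes):.2f} MiB")  # 23.01 MiB
print(f"{to_mb(july_webp_bytes):.2f} MB")    # 24.13 MB
```

The same byte count reads as about 23 MiB or about 24 MB depending on the unit, so it helps to state which one a total uses when comparing sums.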
The pandas code limits it to webp:
The larger question is why my total size is 118 MB. Edit: I misread the original ZIM sizes as 26/36 GB instead of MB. So actually, uncompressed, 110/118 MB makes sense.
Here is my Jupyter notebook with analysis: https://github.com/audiodude/zim-investigation/blob/main/compare.ipynb
Doing a more "apples to apples" comparison of the wiki scraped right now with 1.13 versus 1.14, the discrepancy is much less:
Dumping some of the webps from the respective ZIMs, we see that the 1.14 ones are much bigger:
Confirmed manually that the 1.14 images have much bigger dimensions.
Confirmed this is due to larger images.
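For anyone wanting to repeat the dimension check on dumped files without an image library, a rough sketch follows. It is not what was used in this thread (the check there was manual); it reads the width and height straight from the WebP container header, and the `webp_dimensions` helper and the synthetic header are hypothetical.

```python
import struct


def webp_dimensions(data):
    """Return (width, height) from the first 30 bytes of a WebP file."""
    if len(data) < 30 or data[0:4] != b"RIFF" or data[8:12] != b"WEBP":
        raise ValueError("not a WebP file (or header too short)")
    chunk = data[12:16]
    if chunk == b"VP8X":
        # Extended format: 24-bit little-endian (width - 1), (height - 1).
        width = int.from_bytes(data[24:27], "little") + 1
        height = int.from_bytes(data[27:30], "little") + 1
    elif chunk == b"VP8L":
        # Lossless: signature byte at offset 20, then two packed 14-bit
        # fields holding (width - 1) and (height - 1).
        bits = int.from_bytes(data[21:25], "little")
        width = (bits & 0x3FFF) + 1
        height = ((bits >> 14) & 0x3FFF) + 1
    elif chunk == b"VP8 ":
        # Lossy: 3-byte frame tag, 3-byte start code, then 14-bit
        # width and height (upper 2 bits of each are a scale factor).
        width = struct.unpack_from("<H", data, 26)[0] & 0x3FFF
        height = struct.unpack_from("<H", data, 28)[0] & 0x3FFF
    else:
        raise ValueError("unrecognised first chunk")
    return width, height


# Synthetic VP8X header for an 800x600 canvas (illustration only).
header = (
    b"RIFF" + (22).to_bytes(4, "little") + b"WEBP"
    + b"VP8X" + (10).to_bytes(4, "little")
    + b"\x00" + b"\x00\x00\x00"
    + (799).to_bytes(3, "little") + (599).to_bytes(3, "little")
)
dims = webp_dimensions(header)
print(dims)  # (800, 600)
```

On real files, something like `webp_dimensions(open(path, "rb").read(30))` applied to the same image dumped from each ZIM would show whether the pixel dimensions changed between 1.13 and 1.14.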
@audiodude What does that mean concretely in terms of resolution and quality? Are they all impacted in the same manner?
@audiodude I'm not against downscaling images, but:
I noticed this with https://download.kiwix.org/zim/wikivoyage/wikivoyage_en_all_maxi_2024-08.zim, which was scraped with 1.13 from the new endpoint. It has larger images, at least in terms of display dimensions, but this hardly increases the ZIM size compared to ZIMs scraped from the old endpoint. I actually rather like the larger display size for images, at least in that Wikivoyage version (which I've just released as a packaged app). If we could hit that sweet spot in terms of display size vs compression, it would be a good solution IMHO. What is 1.13 doing right here?
The ZIM scraped in July 2024 by 1.14 has a different size than the one scraped in June 2024 by 1.13:
Oddly, this is the opposite problem to #2070. We don't yet know what the issue might be.