Images from page/mobile-html endpoint are too big #1925

Closed
VadimKovalenkoSNF opened this issue Oct 10, 2023 · 11 comments · Fixed by #2043 or #2101

@VadimKovalenkoSNF
Collaborator

The WikimediaMobile API in #1903 relies on the page/mobile-html endpoint when scraping Wikipedia articles. Most of the images that come from mobile-html are 640px in width, which is not appropriate for the scrape process because it drastically increases the size of the final ZIM file. See this article's images for an example: https://bm.wikipedia.org/api/rest_v1/page/mobile-html/Bamak%C9%94
Note the width value in the src attribute of each image, e.g. https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Bamako_bridge2.jpg/640px-Bamako_bridge2.jpg has /640px placed there by the MediaWiki mobileapps service.
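
As a minimal sketch of how the width could be read from such a URL (the helper name `thumb_width` and the regular expression are illustrative assumptions, not code from mwoffliner or the mobileapps service):

```python
import re
from typing import Optional

# Assumption: thumbnail URLs end in "/<width>px-<file name>", as in the example above.
THUMB_WIDTH_RE = re.compile(r"/(\d+)px-[^/]+$")


def thumb_width(url: str) -> Optional[int]:
    """Return the pixel width encoded in a thumbnail URL, or None if there is none."""
    match = THUMB_WIDTH_RE.search(url)
    return int(match.group(1)) if match else None


url = (
    "https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/"
    "Bamako_bridge2.jpg/640px-Bamako_bridge2.jpg"
)
print(thumb_width(url))  # 640
```

A URL that points at the original file rather than a thumbnail has no such prefix, so the helper returns None for it.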

Related ticket in Phabricator: https://phabricator.wikimedia.org/T348529

@VadimKovalenkoSNF changed the title from "Images from page/mobile-html endpoind are to big" to "Images from page/mobile-html endpoint are to big" on Oct 10, 2023
@kelson42 added this to the 1.15.0 milestone on Oct 13, 2023
@kelson42
Collaborator

kelson42 commented Oct 13, 2023

Most of the images that come from mobile-html are 640px in width which is not appropriate for the scrape process,

No: "It's not appropriate for a mobile endpoint", AFAIK. See https://www.browserstack.com/guide/ideal-screen-sizes-for-responsive-design#toc2. We should report this as a bug, not as a feature request.

@kelson42 changed the title from "Images from page/mobile-html endpoint are to big" to "Images from page/mobile-html endpoint are too big" on Oct 13, 2023
@kelson42
Collaborator

Bug report done at https://phabricator.wikimedia.org/T349972

@kelson42 self-assigned this on Oct 29, 2023
@cscott
Contributor

cscott commented Dec 19, 2023

Be careful that you're looking at the raw HTML served to the browser, not the HTML as modified by the lazy-loading mechanism. Fetching that HTML gives:

<td align="center" colspan="2" style="background:#f9f9f9;"><span class="mw-default-size"><a href="./Fichier:Mali-Bamako.png" class="mw-file-description" title="Bamako Mali kɔnɔ"><span class="mw-file-element pcs-lazy-load-placeholder pcs-lazy-load-placeholder-pending" style="width: 200px;" data-class="mw-file-element" data-src="//upload.wikimedia.org/wikipedia/commons/4/46/Mali-Bamako.png" data-width="200" data-height="187" data-alt="Bamako Mali kɔnɔ" data-data-file-width="200" data-data-file-height="187"><span style="padding-top: 93.5%;"></span></span></a></span></td></tr>
</tbody></table>

<figure class="pcs-widen-image-ancestor"><a href="./Fichier:Bamako_et_fleuve_Niger.jpg" class="mw-file-description pcs-widen-image-ancestor"><span class="mw-file-element pcs-widen-image-override pcs-lazy-load-placeholder pcs-lazy-load-placeholder-pending" style="width: 320px;" data-class="mw-file-element pcs-widen-image-override" data-src="//upload.wikimedia.org/wikipedia/commons/thumb/8/8f/Bamako_et_fleuve_Niger.jpg/320px-Bamako_et_fleuve_Niger.jpg" data-srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/8f/Bamako_et_fleuve_Niger.jpg/480px-Bamako_et_fleuve_Niger.jpg 1.5x" data-width="320" data-height="241" data-data-file-width="600" data-data-file-height="450"><span style="padding-top: 75.3125%;"></span></span></a><figcaption>Bamako</figcaption></figure>

There's no actual <img> tag there, it's all lazy loaded. Given that you are already presumably processing these lazy-load attributes in order to fetch the underlying resource, you can substitute any size preference you like, right?
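
To illustrate what processing these lazy-load attributes might involve, here is a rough sketch using only the Python standard library. The class name `PlaceholderCollector` and the helper `to_int` are hypothetical; only the class `pcs-lazy-load-placeholder` and the `data-*` attribute names come from the HTML quoted above.

```python
from html.parser import HTMLParser
from typing import Optional


def to_int(value: Optional[str]) -> Optional[int]:
    """Convert an attribute value to int, tolerating missing or non-numeric values."""
    return int(value) if value is not None and value.isdigit() else None


class PlaceholderCollector(HTMLParser):
    """Collect the image data carried by lazy-load placeholder spans."""

    def __init__(self) -> None:
        super().__init__()
        self.images: list = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag != "span" or "pcs-lazy-load-placeholder" not in classes:
            return
        self.images.append({
            "src": attr.get("data-src"),
            "width": to_int(attr.get("data-width")),
            "file_width": to_int(attr.get("data-data-file-width")),
        })


# Shortened copy of the second placeholder quoted above.
sample = (
    '<span class="mw-file-element pcs-widen-image-override pcs-lazy-load-placeholder '
    'pcs-lazy-load-placeholder-pending" style="width: 320px;" '
    'data-src="//upload.wikimedia.org/wikipedia/commons/thumb/8/8f/'
    'Bamako_et_fleuve_Niger.jpg/320px-Bamako_et_fleuve_Niger.jpg" '
    'data-width="320" data-height="241" '
    'data-data-file-width="600" data-data-file-height="450">'
    '<span style="padding-top: 75.3125%;"></span></span>'
)

collector = PlaceholderCollector()
collector.feed(sample)
print(collector.images)
```

The inner padding span carries no class, so it is skipped and only the placeholder itself is collected.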

The more fundamental question is whether or not kiwix wants to be fetching the "mobile HTML" in the first place, as opposed to the HTML used for desktop browsing.

@kelson42
Collaborator

kelson42 commented Dec 19, 2023

Be careful that you're looking at the raw HTML served to the browser, not the HTML as modified by the lazy-loading mechanism. Fetching that HTML gives:

Yes, it is really a pain to have a public API that does not deliver proper HTML! But:

  • This is not the problem I reported
  • We have always had to make this kind of transformation around thumbnails, and I guess we will have to live with it and handle the transformation ourselves...

There's no actual <img> tag there, it's all lazy loaded. Given that you are already presumably processing these lazy-load attributes in order to fetch the underlying resource, you can substitute any size preference you like, right?

Not sure what I should answer here... basically you ask us to (for each picture):

  • Discover on our own the size (width) which is specified in the wiki code
  • Rebuild the upstream picture URL (and download the picture)
  • Modify the whole thumbnail HTML/CSS code to get it right

Do I get that right? Because that sounds very difficult and error-prone to do... In particular, why could Wikimedia not generate proper HTML (respecting both the given wiki code AND the mobile constraints) in the first place? More or less like before?
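
For a sense of the HTML rewriting step being discussed, a very rough sketch follows. It assumes the placeholder's `data-*` attributes have already been extracted into a dict; the function name `placeholder_to_img` and the attribute mapping are illustrative assumptions, not what mwoffliner actually does.

```python
from html import escape

# Assumed mapping from <img> attribute names to the placeholder's data-* attributes.
ATTRIBUTE_MAP = [
    ("class", "data-class"),
    ("src", "data-src"),
    ("width", "data-width"),
    ("height", "data-height"),
    ("alt", "data-alt"),
]


def placeholder_to_img(attrs: dict) -> str:
    """Build an <img> tag from the data-* attributes of a lazy-load placeholder."""
    parts = [
        f'{name}="{escape(attrs[key], quote=True)}"'
        for name, key in ATTRIBUTE_MAP
        if key in attrs
    ]
    return "<img " + " ".join(parts) + ">"


# Attribute values taken from the first placeholder quoted earlier in the thread.
attrs = {
    "data-class": "mw-file-element",
    "data-src": "//upload.wikimedia.org/wikipedia/commons/4/46/Mali-Bamako.png",
    "data-width": "200",
    "data-height": "187",
    "data-alt": "Bamako Mali kɔnɔ",
}
print(placeholder_to_img(attrs))
```

This only covers the tag itself; the surrounding wrapper styles (such as the fixed-width placeholder span) would still need separate handling, which is part of why the step is error-prone.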

The more fundamental question is whether or not kiwix wants to be fetching the "mobile HTML" in the first place, as opposed to the HTML used for desktop browsing.

We want to serve our users properly... and the world goes mobile... Wikimedia goes mobile... Why should Kiwix do differently?

@cscott
Contributor

cscott commented Dec 21, 2023

Well, in this case it's because the "mobile HTML" is optimized for an entirely different use case than yours -- one where we want to save bandwidth by deferring the loading of images as long as possible.

All the things you are asking for are present in the standard ("non-mobile") Parsoid HTML.

@kelson42
Collaborator

kelson42 commented Dec 21, 2023

@cscott The lazy loading is a non-topic here. Can we focus on the bad sizes of the images, which IS the problem?

@audiodude
Member

Re-opening.

The code added in #2043 to fix this issue doesn't "work", in the sense that the retrieved images (in 1.14) end up being larger than those from the previous mobile-section API in 1.13. This leads directly to #2071.

Here is a summary of my understanding of the issue:

  1. The previous mobile-section API downsized images to provide "mobile friendly" versions, which were used by mwoffliner 1.13.
  2. mwoffliner 1.14 switches to mobile-html API.
  3. The mobile-html API actually upsizes images when possible (my guess is so they fit nicely on a phone screen). So the definition of "mobile friendly" has changed.
  4. In https://phabricator.wikimedia.org/T349972, mediawiki developers provided us with an API for data-data-file-original-src, which is the file size on the wiki without any upsizing.
  5. However, this original file size is, in many cases, still larger than what we were getting from mobile-section.

@audiodude reopened this on Aug 4, 2024
@audiodude
Member

@kelson42 what do you think?

The only solution I can think of is to dig up the code for mobile-section and duplicate it in mwoffliner, to request appropriately sized images. Remember, we don't need to process image data locally, we can request the resize via URL hacking.
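
A minimal sketch of what such URL hacking could look like, assuming thumbnail URLs end in "/<width>px-<file name>" as in the examples earlier in this thread. The function name `cap_thumb_width` and the chosen maximum width are hypothetical, and whether the thumbnail server accepts any given width is not confirmed here.

```python
import re

# Assumption: the last path segment of a thumbnail URL is "<width>px-<file name>".
THUMB_RE = re.compile(r"/(\d+)px-([^/]+)$")


def cap_thumb_width(url: str, max_width: int) -> str:
    """Rewrite a thumbnail URL so that it requests at most max_width pixels."""
    match = THUMB_RE.search(url)
    if match is None or int(match.group(1)) <= max_width:
        # Not a thumbnail URL, or already small enough: leave it untouched.
        return url
    return url[: match.start()] + f"/{max_width}px-{match.group(2)}"


big = (
    "https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/"
    "Bamako_bridge2.jpg/640px-Bamako_bridge2.jpg"
)
print(cap_thumb_width(big, 320))
```

URLs that point at the original file are returned unchanged in this sketch; turning them into thumbnail requests would need additional rules.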

@audiodude
Member

@kelson42 what do you think?

The only solution I can think of is to dig up the code for mobile-section and duplicate it in mwoffliner, to request appropriately sized images. Remember, we don't need to process image data locally, we can request the resize via URL hacking.

I see that @VadimKovalenkoSNF had the same idea a while ago: #1903 (comment)

There is another approach which might be better - instead of page/mobile-html, use page/html and process sections using scripts from deprecated MCS but inside mwoffliner.

@Jaifroid
Collaborator

Jaifroid commented Nov 2, 2024

A third option is to use page-html, which supposedly has the capability to request appropriately sized images, and merely transform the HTML to mobile style internally. Long ago, I implemented this inside the PWA reader: it can switch a Wikimedia page (including Wikivoyage) from desktop to mobile style and vice versa. Admittedly, it's a long time since I've had a desktop-formatted ZIM to test this on, but switching the other way (mobile to desktop) works fine even on ZIMs currently produced with dev 1.14.

I hesitate to suggest this (and it's somewhat out of scope of this issue), but this could actually be an opportunity to provide an API in mwOffliner ZIMs to use any of the Wikimedia-provided styles: Vector, Vector legacy, MinervaNeue, MonoBook and Timeless. It's really just a bunch of CSS transforms in a bundled stylesheet applied over neutral HTML. I mean, of course there are gotchas, and I'm not saying it would be quick to provide the full functionality, but if designed right, the API could start with a simple transformation to a generic mobile style, possibly Vector, and then add the others as CSS packs later. See also #2086.

Just a thought. It's probably too difficult and/or too expensive in dev time to do something like this for 1.14, but if we're really stuck, it's always worth having alternatives to consider...😉

@kelson42
Collaborator

kelson42 commented Dec 2, 2024

See #2107 for the follow-up issue.
