Scraping dies with "Input buffer contains unsupported image format" if logo returns 301 #2028

Open
benoit74 opened this issue May 22, 2024 · 6 comments

Comments

@benoit74
Contributor

Zimfarm recipe: https://farm.openzim.org/recipes/encyclopediaofmath.org_en_all

Zim-request details: openzim/zim-requests#964 (comment)

Log:

[error] [2024-05-22T12:56:44.462Z] Failed to run mwoffliner after [18s]: {
	"stack": "Error: Input buffer contains unsupported image format",
	"message": "Input buffer contains unsupported image format"
}
[error] [2024-05-22T12:56:44.462Z] 

**********

Input buffer contains unsupported image format

**********

The log does not say which image was fetched and could not be read, which makes the failure hard to diagnose and the fix harder to pin down.

@audiodude
Member

From googling, it looks like that's an error message from the sharp library.

Probably happening here?

.buffer(await sharp(resp.data).toColorspace('srgb').toBuffer(), imageminOptions.get('webp').get(resp.headers['content-type']))

Might just be a bad URL that's getting erroneously read as image data.

@audiodude
Member

The proximate cause of the error is that the URL:

https://encyclopediaofmath.org/common/spr_logo.gif

returns a 301 redirect to https://encyclopediaofmath.org/wiki/Main_Page

Presumably, this "logo" link is somewhere in the initial metadata that mwoffliner gathers about the wiki. So the bug is that mwoffliner is hardcoded to download this as image data and doesn't consider 301 to be an error status. It then crashes early on in the scraping process.
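To illustrate the kind of guard described above, here is a minimal sketch of a check that could run before the downloaded bytes are handed to an image library. The `FetchedResource` shape and the `assertImageResponse` name are hypothetical and are not taken from mwoffliner's code. Because many HTTP clients follow redirects by default, the final response may well be a 200 carrying HTML rather than a 301, so the sketch checks both the status and the content type, and it names the URL in the error so the failing image is easy to identify.

```typescript
// Hypothetical minimal shape of an HTTP response; not mwoffliner's actual type.
interface FetchedResource {
  url: string; // URL that was requested
  status: number; // final HTTP status code
  contentType?: string; // value of the Content-Type header, if any
}

// Throws a descriptive error unless the response plausibly carries image data.
function assertImageResponse(res: FetchedResource): void {
  // Anything outside 2xx (including an unfollowed 301) is not usable image data.
  if (res.status < 200 || res.status >= 300) {
    throw new Error(`Expected an image at ${res.url} but got HTTP ${res.status}`);
  }
  // A followed redirect typically ends in a 200 with an HTML content type.
  const type = (res.contentType ?? '').trim().toLowerCase();
  if (!type.startsWith('image/')) {
    throw new Error(
      `Expected an image at ${res.url} but got content type "${type || 'unknown'}"`,
    );
  }
}
```

Called right after the download, a check like this would turn the opaque "Input buffer contains unsupported image format" message into one that points at the offending URL.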

@audiodude
Member

If the folks putting in the request would like to fix their problem without waiting for mwoffliner, I would suggest putting an image (even a 1x1 PNG) at that URL.

@benoit74
Contributor Author

Thank you!

Would specifying a custom favicon with --customZimFavicon allow us to bypass the code that tries to download the "logo"?

@audiodude
Member

Yes, I believe that would work

@audiodude changed the title from "Input buffer contains unsupported image format" to "Scraping dies with "Input buffer contains unsupported image format" if logo returns 301" on May 24, 2024

@kelson42
Collaborator

IMHO the solution here is to validate the image format early, at the same time as the other ZIM metadata is checked.
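One possible way to do such an early format test is to look at the first bytes of the downloaded data before any conversion is attempted. The sketch below is only an illustration: `sniffImageFormat` and `hasSignature` are hypothetical names, the list of formats is an assumption, and mwoffliner may prefer to rely on its image library for this instead.

```typescript
type ImageFormat = 'png' | 'gif' | 'jpeg' | 'webp';

// True if `bytes` contains `signature` starting at `offset`.
function hasSignature(bytes: Uint8Array, signature: number[], offset = 0): boolean {
  if (bytes.length < offset + signature.length) return false;
  return signature.every((b, i) => bytes[offset + i] === b);
}

// Returns the detected format from well-known magic numbers, or null if none match.
function sniffImageFormat(bytes: Uint8Array): ImageFormat | null {
  // PNG: 89 'P' 'N' 'G' CR LF 1A LF
  if (hasSignature(bytes, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return 'png';
  // GIF: "GIF8" (covers both GIF87a and GIF89a)
  if (hasSignature(bytes, [0x47, 0x49, 0x46, 0x38])) return 'gif';
  // JPEG: FF D8 FF
  if (hasSignature(bytes, [0xff, 0xd8, 0xff])) return 'jpeg';
  // WebP: "RIFF" at offset 0 and "WEBP" at offset 8
  if (
    hasSignature(bytes, [0x52, 0x49, 0x46, 0x46]) &&
    hasSignature(bytes, [0x57, 0x45, 0x42, 0x50], 8)
  ) {
    return 'webp';
  }
  return null;
}
```

An HTML page served in place of the logo would yield `null` here, so the scraper could fail during metadata validation with a clear message rather than crashing later inside the image conversion.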
