Failure of www.wikidoc.org due to missing CSS dependency #2091
I tried to create a ZIM of https://www.wikidoc.org/ with:

docker run --rm --name mwoffliner_test ghcr.io/openzim/mwoffliner:dev mwoffliner --adminEmail="[email protected]" --customZimDescription="Desc" --format="novid:maxi" --mwUrl="https://www.wikidoc.org/" --mwWikiPath "index.php" --mwActionApiPath "api.php" --mwRestApiPath "rest.php" --publisher="openZIM" --webp --customZimTitle="Custom title" --verbose

It fails with the following error:

Does it mean we cannot ZIM this wiki just because one badly configured CSS dependency is missing? Is there a way to ignore it (it is probably not used on the live website anyway, if it does not exist)?
I tried running this locally, and got the same error, except for a different article:
Looking at the source website, I can't find articles for
For 1), my impression was that the general approach of mwoffliner is to fail if an article cannot be retrieved, except in the narrow case where it was deleted between the time the article list was built and when the data was requested. @kelson42 what are your thoughts? For 2), I would need to dig more into the way the article list is built, because I'm not immediately familiar with it.
Thank you! Regarding the fact that our attempts stop at a different article, this is not a surprise to me. From my experience, the order of the article list seems to be "random".
They're not "random" really, just highly asynchronous, as you pointed out in #2092.
So I tried to start a PR that would ignore the
So I think the problem is definitely in the methodology for figuring out which articles to scrape.
Okay, so mwoffliner fetches API responses from URLs like:
And uses the result to get a list of article titles to download later. This endpoint is returning the following:
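(The actual JSON is not reproduced above, but for orientation, the relevant parts of an Action API `prop=revisions` query result can be sketched as TypeScript types roughly like this; field names follow standard MediaWiki conventions, only a subset is modelled, and the key point for this discussion is that `revisions` may be absent from a page entry.)

```typescript
// Minimal sketch of the relevant parts of an Action API
// `action=query&prop=revisions` response (formatversion=2 style).
// Only a subset of fields is modelled; names follow MediaWiki conventions.
interface QueryRevision {
  revid: number;
  parentid?: number;
  timestamp?: string;
}

interface QueryPage {
  pageid: number;
  ns: number;
  title: string;
  // Normally present when `prop=revisions` is requested, but the wikidoc
  // responses discussed here come back without it for some pages.
  revisions?: QueryRevision[];
}

interface QueryResponse {
  continue?: Record<string, string>;
  query: {
    pages: QueryPage[];
  };
}
```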
So I hate to say it, but I think the wiki is misconfigured. It's returning an
This page recommends running
If we really need to, we can also filter pages with no revisions from the scrape.
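For illustration, filtering those entries out of the article list could look roughly like this; it is only a sketch against the illustrative `QueryPage` shape above, not mwoffliner's actual data structures:

```typescript
// Sketch: drop pages that came back without any revision, keeping a note of
// what was skipped so it can be reported at the end of the scrape.
// `QueryPage` is the illustrative type from the previous sketch.
function filterPagesWithRevisions(pages: QueryPage[]): {
  keep: QueryPage[];
  skipped: string[];
} {
  const keep: QueryPage[] = [];
  const skipped: string[] = [];
  for (const page of pages) {
    if (page.revisions && page.revisions.length > 0) {
      keep.push(page);
    } else {
      skipped.push(page.title);
    }
  }
  return { keep, skipped };
}
```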
Now I'm getting this:
for this URL:
Is it a wiki misconfiguration or just slightly broken database content? I have already managed to break database content on multiple occasions due to MediaWiki bugs. Anyway, I really think the scraper should be capable of continuing past such errors, and only stop if too many errors occur. From my PoV it is sad not to put content offline just because the website has some small issues and the website maintainer is no longer around / able to fix them. This happens just way too often. And we cannot expect the scraper user to manually list, one by one, all the pages which turn out to have an issue. Or at least we should report all articles which have an issue at once and fail the scrape, letting the user decide whether it is OK for them to ignore these articles (by adding them to the ignore list).
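As a rough sketch, a "continue on broken articles, but fail if there are too many" policy could look like the following; the class name, the 2% default threshold, and the reporting are illustrative assumptions, not mwoffliner's actual design:

```typescript
// Sketch of a "continue on broken articles, but fail if there are too many"
// policy. All names and the default threshold are illustrative.
class ArticleErrorBudget {
  private failedTitles: string[] = [];

  constructor(
    private readonly totalArticles: number,
    private readonly maxFailureRatio = 0.02,
  ) {}

  recordFailure(title: string, err: unknown): void {
    this.failedTitles.push(title);
    console.warn(`Skipping "${title}": ${String(err)}`);
    if (this.failedTitles.length > this.totalArticles * this.maxFailureRatio) {
      throw new Error(
        `Too many broken articles (${this.failedTitles.length} of ${this.totalArticles}); ` +
          `first failures: ${this.failedTitles.slice(0, 10).join(', ')}`,
      );
    }
  }

  // Could be written to a log / shown to the user so they can build an
  // ignore list, as suggested above.
  report(): string[] {
    return [...this.failedTitles];
  }
}
```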
It is tolerated that a whole article is missing (so HTTP 404). This scenario can happen at any time because a user can delete an article at any time. What is not tolerated is that the backend does not deliver what it should (for example with timeouts or HTTP 5xx errors). But here the situation seems different and not that easy to assess.
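To illustrate that distinction, a fetch failure could be classified roughly like this (a sketch only; the names and structure are assumptions, not mwoffliner's actual code):

```typescript
// Sketch: deleted articles (HTTP 404) are tolerated and simply skipped;
// backend failures (timeouts, HTTP 5xx) are treated as fatal because the
// wiki should have been able to answer. Names are illustrative only.
type FetchOutcome = 'skip-article' | 'fatal-error';

function classifyFetchError(status: number | undefined): FetchOutcome {
  if (status === 404) {
    // The article was probably deleted after the article list was built.
    return 'skip-article';
  }
  // No status (timeout / network error) or a 5xx means the backend did not
  // deliver as it should; other unexpected statuses are treated the same way.
  return 'fatal-error';
}
```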
Here are some more examples from
None of these entries have a
So even if we had a configurable limit of missing/broken articles, in this case we would likely exceed it anyway. I don't think mwoffliner can do much when the wiki in question is very broken.
We should definitely not skip articles without a revision. At this stage we should understand why there is no revid. If this is a feature, then we will have to handle it; AFAIK, from a technical POV we could make all requests without giving a revid... and then it will take the latest version. Again, this has to be confirmed. If this is somehow a bug, we should stop the scraping process properly, with a proper error.
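To make the "with or without a revid" point concrete, here is a minimal sketch using the Action API's parse module; the base URL and helper names are illustrative, and this is not how mwoffliner actually builds its requests:

```typescript
// Sketch: with `action=parse`, passing `page` and omitting a revision id
// fetches the latest revision, while `oldid` pins a specific revision.
// Base URL and helper names are illustrative only.
const API = 'https://www.wikidoc.org/api.php';

function parseLatestUrl(title: string): string {
  const params = new URLSearchParams({
    action: 'parse',
    page: title,
    format: 'json',
  });
  return `${API}?${params}`;
}

function parseRevisionUrl(revid: number): string {
  const params = new URLSearchParams({
    action: 'parse',
    oldid: String(revid),
    format: 'json',
  });
  return `${API}?${params}`;
}
```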
Looking at https://www.mediawiki.org/wiki/Manual:RevisionDelete, it seems totally possible to completely hide/remove all revisions of a page. To be tested on a MediaWiki instance to confirm, of course.
We do not currently use the revision ID in the request, and we do not extract it from the article list scrape either. The fact that it's missing is a symptom of a more fundamental problem, probably the one that @benoit74 pointed out. I assume the practical reason is that they have articles that they don't want to be public facing, but that they don't want to delete either. See this URL:
The error is:
So with that said, I believe we should skip articles with no revisions. They have no public-facing pages and are not a tangible part of the wiki. They are essentially "hidden".
This makes sense to me, and it is not that different from a page returning a 404. The more complex question is "what should we do when we encounter a link to a page with no revid?". But this could be tracked in a distinct issue, and it is maybe even already handled by the scraper.
I'm a bit surprised if the revid is not used at all; that's hard to believe for me. The reason why the revid should not be totally ignored is that ultimately I want to be able to deal with it, see #982 or #2072 for example. Here we need to confirm whether this is the consequence of manually deleting a revision; I have some doubts about that. My question would be: why, when we retrieve the whole list of article titles of the wiki, do we get these articles listed... although they are not available? If we are facing a kind of feature here, we should probably fix the problem there. If at this stage MediaWiki is not able to deliver a revid for an article to scrape later, then we should maybe skip it... but we clearly need to understand why this could happen.
So someone (probably me) has to spin up a MediaWiki instance, install the plugin for hiding revisions, and confirm that the JSON I posted above is what is returned in that case? |
So I did it: I installed a local MediaWiki instance and enabled revision deletion. I tried to delete all revisions of a page and got this:
So, as we expected, the MediaWiki software appears to require every article to have at least one revision. These wikidoc pages have been altered in some other way.
@audiodude Thx for the effort, even if I'm not surprised about the conclusion. I should really have a look IMHO.