Some article texts are not fully downloaded. #950

Jimchoo91 · 2022-08-31T15:11:57Z

Hi, I have only seen this on one website so far, but when I try to download the full text of an article on the BBC, it only returns a snippet.

Here is an example website:

https://www.bbc.co.uk/news/world-48810070

Any idea why? Thanks.

bstivers · 2022-09-17T08:36:14Z

While it's more than a snippet, the full text of Politico articles doesn't get pulled either.

I believe the root issue is that the code used to parse these websites is very old (the last commit to the main code is 4+ years old), so it no longer handles the HTML source properly after website updates. Big-name websites change their layouts far more often than once every 4 years.

I am sure this library was great in its heyday, but it's nearly unusable now except on smaller websites that haven't changed in the last 5 years. That doesn't leave many, given that even WordPress-based websites have changed quite a bit.

johnbumgarner · 2022-12-30T18:09:55Z

The library has lots of limitations, because the code base is old. You can parse the BBC site text with some additional code. Here is a document that I wrote on using the library. I will update it in the coming days with the code to extract the BBC text.
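In the meantime, here is a rough sketch of the kind of additional code that can help when the built-in extraction only returns a snippet. It is not the code from the document mentioned above, and it is not part of the library: it uses only the Python standard library and assumes the story body sits in `<p>` tags inside an `<article>` element, which may not hold for every site or layout. The names `ParagraphCollector`, `extract_article_text`, and `sample_html` are made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element nested inside an <article> element.

    Assumption: the article body lives in <p> tags under <article>.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.article_depth = 0   # how many <article> elements are currently open
        self.in_paragraph = False
        self.current = []        # text fragments of the paragraph being read
        self.paragraphs = []     # finished paragraphs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "p" and self.article_depth > 0:
            self.in_paragraph = True
            self.current = []

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Join the fragments and collapse runs of whitespace.
            text = " ".join("".join(self.current).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.current.append(data)


def extract_article_text(html):
    """Return the article paragraphs joined by blank lines."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)


# Hypothetical page used only to show the behaviour; no network access needed.
sample_html = """
<html><body>
<p>Cookie banner text that should be ignored.</p>
<article>
  <h1>Headline</h1>
  <p>First paragraph of the <b>story</b>.</p>
  <p>Second   paragraph,
     split across lines.</p>
</article>
</body></html>
"""

print(extract_article_text(sample_html))
```

The idea is to feed it the raw HTML of the page, however you fetched it, and compare the result with the text the library returned. If the site changes its markup again, the tag assumption above is the part that will need adjusting.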

AndyTheFactory mentioned this issue Oct 24, 2023

Some article texts are not fully downloaded. AndyTheFactory/newspaper4k#563

Closed
