Some article texts are not fully downloaded. #950

Open
Jimchoo91 opened this issue Aug 31, 2022 · 2 comments

Comments

@Jimchoo91

Hi, I have only seen this on one website so far, but when I try to download the full text of a BBC article, only a snippet is returned.

Here is an example article:

https://www.bbc.co.uk/news/world-48810070

Any idea why? Thanks.

@bstivers

bstivers commented Sep 17, 2022

While it's more than a snippet, the full text of articles from Politico doesn't get pulled either.

I believe the main issue is that the code used to parse these websites is so old (the last commit to the main code is 4+ years old) that it no longer handles the HTML source properly after website updates. Big-name websites change their layouts far more often than once every 4 years.

I am sure this library was great in its heyday, but it's nearly unusable now except on smaller websites that haven't changed a thing in the last 5 years. That doesn't leave many, given that even WordPress-based websites have changed quite a bit.

@johnbumgarner

The library has lots of limitations, because the code base is old. You can parse the BBC site text with some additional code. Here is a document that I wrote on using the library. I will update it in the coming days with the code to extract the BBC text.
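
As a rough illustration of what "additional code" for a single site could look like (this is not the code from the document mentioned above, and it does not use the library discussed in this issue), here is a minimal sketch using only Python's standard `html.parser`. It assumes the article body sits in `<p>` elements inside an `<article>` element; the real BBC markup may differ and can change at any time. `ArticleParagraphParser`, `extract_article_text`, and `SAMPLE_HTML` are hypothetical names for this example, and fetching the page is left out.

```python
from html.parser import HTMLParser


class ArticleParagraphParser(HTMLParser):
    """Collect the text of <p> elements nested inside an <article> element.

    Assumption: the site wraps the story body in <article> ... </article>.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.article_depth = 0    # > 0 while inside an <article>
        self.in_paragraph = False
        self.current = []         # text chunks of the paragraph being read
        self.paragraphs = []      # finished paragraphs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "p" and self.article_depth > 0:
            self.in_paragraph = True
            self.current = []

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self.current).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.current.append(data)


def extract_article_text(html):
    """Return the article paragraphs joined by blank lines."""
    parser = ArticleParagraphParser()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)


# Hypothetical stand-in for a downloaded page.
SAMPLE_HTML = """
<html><body>
<nav><p>Menu</p></nav>
<article>
  <h1>Headline</h1>
  <p>First <b>paragraph</b> of the story.</p>
  <p>Second paragraph.</p>
</article>
<footer><p>Footer text</p></footer>
</body></html>
"""

print(extract_article_text(SAMPLE_HTML))
```

Paragraphs outside `<article>` (the navigation and footer in the sample) are ignored, which is the point of a site-specific rule: it only keeps working for as long as the site's layout stays the same.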

3 participants