While it's more than a snippet, the full text of articles from Politico doesn't get pulled either.
I believe the root issue is that the code used to parse these websites is old (the last commit to the main code is 4+ years old), so it no longer handles the HTML source properly after website updates. Big-name websites change their layouts far more often than once every 4 years.
I am sure this library was great in its heyday, but it's nearly unusable now except on smaller websites that haven't changed in the last 5 years. That doesn't leave many, given that even WordPress-based websites have changed quite a bit.
The library has lots of limitations, because the code base is old. You can parse the BBC site text with some additional code. Here is a document that I wrote on using the library. I will update it in the coming days with the code to extract the BBC text.
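To give a rough idea of what "additional code" could look like, here is a minimal sketch that pulls paragraph text straight from a page's HTML using only the Python standard library. It is not the library's own parser and not the code from the document mentioned above. The names `ParagraphExtractor` and `extract_paragraph_text` are made up for illustration, and it assumes the article body sits in plain `<p>` elements, which may not hold for every BBC page.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Hypothetical helper: collects the text of every <p> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # finished paragraph strings
        self._in_p = False     # True while inside a <p> element
        self._chunks = []      # text pieces of the current paragraph

    def _flush(self):
        # Join the collected pieces and collapse runs of whitespace.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()      # covers a previous <p> that was never closed
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        # Text outside <p> (menus, scripts, footers) is ignored.
        if self._in_p:
            self._chunks.append(data)


def extract_paragraph_text(html):
    """Return all paragraph text in `html`, separated by blank lines."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)


sample = (
    "<html><body><nav>Menu</nav>"
    "<p>First <b>bold</b> para.</p>"
    "<div><p>Second\n para.</p></div>"
    "<script>var x=1;</script></body></html>"
)
print(extract_paragraph_text(sample))
```

For a live article you would first download the page source yourself (for example with `urllib.request`) and pass the decoded HTML string to `extract_paragraph_text`. Expect some noise, since captions and related-link blurbs are often wrapped in `<p>` tags too.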
Hi, I have only found this on one website so far, but when I try to download the full text from an article on the BBC, it only returns a snippet.
Here is an example website:
https://www.bbc.co.uk/news/world-48810070
Any idea why? Thanks.