Add La Vanguardia
#637
# Conflicts: # src/fundus/publishers/es/__init__.py
Thanks a lot for adding 👍
src/fundus/publishers/es/__init__.py
Outdated
NewsMap("https://www.lavanguardia.com/newsml/home.xml"),
RSSFeed("https://www.lavanguardia.com/rss/home.xml"),
RSSFeed("https://www.lavanguardia.com/rss/internacional.xml"),
There also seem to be sitemaps, e.g. https://www.lavanguardia.com/sitemap-noticias-202102.xml.gz, plus two other NewsMaps:
https://www.lavanguardia.com/sitemap-google-news.xml
https://www.lavanguardia.com/sitemap-news-agencias.xml
Ah perfect, I missed them
@attribute
def authors(self) -> List[str]:
    return generic_author_parsing(self.precomputed.ld.bf_search("author"))
There are some encoding errors in the author field when parsing this article.
It turns out that whenever this ZWSP character is present, it seems to be followed by information unrelated to the author, so it can safely be removed.
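For illustration, here is a minimal sketch of one way to read that cleanup: keep only the text before the first ZWSP (U+200B) and drop the rest. The helper names `strip_after_zwsp` and `clean_authors` are hypothetical and not part of Fundus, and the sample author string in the usage note is made up. Whether the actual fix removes only the character or also the trailing text is an assumption here.

```python
from typing import Iterable, List

ZWSP = "\u200b"  # zero-width space


def strip_after_zwsp(author: str) -> str:
    """Keep only the text before the first ZWSP, trimmed of whitespace.

    Assumption: everything after the ZWSP is unrelated to the author.
    Strings without a ZWSP are returned unchanged apart from trimming.
    """
    head, _, _ = author.partition(ZWSP)
    return head.strip()


def clean_authors(authors: Iterable[str]) -> List[str]:
    """Apply strip_after_zwsp to every entry and drop entries that end up empty."""
    cleaned = [strip_after_zwsp(author) for author in authors]
    return [author for author in cleaned if author]
```

Called on an invented value such as `"Jane Doe\u200b | Barcelona"`, `strip_after_zwsp` would keep only the part before the ZWSP. Something like this could be applied to the result of the author parsing, if that turns out to be the right place for it.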
_summary_selector = XPath("//h2[@class='epigraph']")

@attribute
def body(self) -> Optional[ArticleBody]:
The selector seems to have trouble parsing this article
It seems to me that there is nothing we can do about it, since the content is loaded by a script. The HTML we get in Fundus seems to consist mostly of scripts.
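A rough way to check that impression on a downloaded page is to measure how much of the text content sits inside `<script>` tags. The sketch below uses only the standard library. The names `_ScriptTextCounter` and `script_share` are hypothetical helpers, not part of Fundus, and any threshold for "mostly scripts" would be a judgment call.

```python
from html.parser import HTMLParser


class _ScriptTextCounter(HTMLParser):
    """Counts non-whitespace-trimmed characters inside and outside <script> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._in_script = False
        self.script_chars = 0
        self.other_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Ignore leading/trailing whitespace so indentation does not skew the counts.
        length = len(data.strip())
        if self._in_script:
            self.script_chars += length
        else:
            self.other_chars += length


def script_share(html: str) -> float:
    """Return the fraction of text characters located inside <script> elements."""
    parser = _ScriptTextCounter()
    parser.feed(html)
    parser.close()
    total = parser.script_chars + parser.other_chars
    return parser.script_chars / total if total else 0.0
```

A value close to 1.0 for the affected article would support the assumption that the body is rendered client-side and is simply not present in the HTML that gets parsed.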
def topics(self) -> List[str]:
    return generic_topic_parsing(self.precomputed.meta.get("Keywords"))
One could argue that the topics at the page's bottom are more descriptive. What do you think?