TIPS FOR FAST IMPROVEMENT #577

Closed
AndyTheFactory opened this issue Oct 24, 2023 · 1 comment

@AndyTheFactory
Owner

Issue by aleksandar-devedzic
Tue May 9 09:34:18 2023
Originally opened as codelucas/newspaper#969


I have extracted some meta tag names. You can try to identify the title, text, description and date by substituting each of the names listed below into these selectors:

meta[property='{}']
meta[name='{}']
meta[itemprop='{}']
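As a minimal sketch of how the three selector templates above could be applied, the snippet below collects `<meta>` tags with Python's standard-library `html.parser` and looks up a list of candidate names under `property`, `name` and `itemprop`. The names `MetaCollector` and `find_meta` and the sample HTML are hypothetical, made up for illustration; this is not the library's own implementation.

```python
from html.parser import HTMLParser

# The three attributes targeted by the selector templates above.
META_ATTRS = ("property", "name", "itemprop")


class MetaCollector(HTMLParser):
    """Collects (attribute, key) -> content for every <meta> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        content = attrs.get("content")
        if not content:
            return
        for attr in META_ATTRS:
            key = attrs.get(attr)
            if key:
                # Keep the first occurrence of each key.
                self.found.setdefault((attr, key), content.strip())


def find_meta(html, candidates):
    """Return the content of the first candidate found under any of the three attributes."""
    collector = MetaCollector()
    collector.feed(html)
    for key in candidates:
        for attr in META_ATTRS:
            value = collector.found.get((attr, key))
            if value:
                return value
    return None


# Made-up sample page for illustration only.
sample = """
<html><head>
<meta property="og:title" content="Example headline">
<meta name="description" content="Short summary.">
<meta itemprop="datePublished" content="2023-05-09T09:34:18+00:00">
</head></html>
"""

print(find_meta(sample, ["dc.title", "og:title"]))
```

The candidate list is tried in order, so putting the more reliable names first acts as a simple priority scheme.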

Meta tags for publication and modification date:

published_date
published_time
cXenseParse:publishtime
pubdate
publish_date
PublishDate
dcterms.created
rnews:datePublished
article:published_time
prism.publicationDate
displaydate
OriginalPublicationDate
og:published_time
datePublished
article_date_original
article.published
published_time_telegram
sailthru.date
date
Date
original-publish-date
DC.date.issued
dc.date
DC.Date
parsely-pub-date
publishtime
publication_date
uploadDate
coverageEndTime
publishdate
publish-date
publishedAtDate
dcterms.date
publishedDate
creationDateTime
pub_date
updated_time
og:updated_time
datemodified
last-modified
Last-Modified
DC.date.modified
article:modified_time
modified_time
modifiedDateTime
dc.dcterms.modified
lastmod
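Once a date tag has been found, its content still has to be parsed. The sketch below assumes the values are ISO 8601 style timestamps, which is common but not guaranteed. It tries a few of the names above in order and ignores case, since the list contains variants such as `date` and `Date`. The function names, the short candidate lists and their ordering are assumptions made for illustration.

```python
from datetime import datetime


def parse_meta_date(value):
    """Parse an ISO 8601 style timestamp; return None if it does not parse."""
    text = value.strip()
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on.
    if text.endswith(("Z", "z")):
        text = text[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        return None


def pick_date(meta, candidates):
    """Return the first parseable date among the candidate keys, ignoring case."""
    lowered = {key.lower(): value for key, value in meta.items()}
    for key in candidates:
        value = lowered.get(key.lower())
        if value:
            parsed = parse_meta_date(value)
            if parsed is not None:
                return parsed
    return None


# Assumed priority order, using a few names from the list above.
PUBLISHED = ["date", "article:published_time", "datePublished", "pubdate"]
MODIFIED = ["article:modified_time", "og:updated_time", "lastmod"]

# Made-up meta values for illustration only.
meta = {
    "Date": "not a date",
    "article:published_time": "2023-05-09T09:34:18Z",
    "og:updated_time": "2023-05-10T08:00:00+02:00",
}

print(pick_date(meta, PUBLISHED))
print(pick_date(meta, MODIFIED))
```

Values that do not parse are skipped rather than raising, so one malformed tag does not hide a usable one further down the list.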

Meta tags for title:

dc.title
og:title
headline
articletitle
article-title
parsely-title
title

Meta tags for description:

description
og:description

Meta tags for body:

articleBody
articleText

FYI
It would be good if the code could be fixed, improved or adapted so that it extracts full information from the websites below, since they are among the most popular news websites in the world.
By "full information" I mean title, publication date and article body.

CNN - https://edition.cnn.com/
BBC News - https://www.bbc.com/news
Reuters - https://www.reuters.com/
The New York Times - https://www.nytimes.com/
The Guardian - https://www.theguardian.com/international
Al Jazeera - https://www.aljazeera.com/
Associated Press (AP) News - https://apnews.com/
NBC News - https://www.nbcnews.com/
Fox News - https://www.foxnews.com/
USA Today - https://www.usatoday.com/
ABC News - https://abcnews.go.com/
CBS News - https://www.cbsnews.com/
The Washington Post - https://www.washingtonpost.com/
Time - https://time.com/
Forbes - https://www.forbes.com/
Bloomberg - https://www.bloomberg.com/
The Wall Street Journal - https://www.wsj.com/
The Huffington Post - https://www.huffpost.com/
The Independent - https://www.independent.co.uk/
The Sydney Morning Herald - https://www.smh.com.au/
The Economist - https://www.economist.com/
The Times of India - https://timesofindia.indiatimes.com/
The Daily Mail - https://www.dailymail.co.uk/home/index.html
The Telegraph - https://www.telegraph.co.uk/
The Sun - https://www.thesun.co.uk/
The Mirror - https://www.mirror.co.uk/
The Daily Beast - https://www.thedailybeast.com/
The Atlantic - https://www.theatlantic.com/
National Geographic - https://www.nationalgeographic.com/
Science Daily - https://www.sciencedaily.com/
The Verge - https://www.theverge.com/
Wired - https://www.wired.com/
TechCrunch - https://techcrunch.com/
Engadget - https://www.engadget.com/
Mashable - https://mashable.com/
Forbes India - https://www.forbesindia.com/
Hindustan Times - https://www.hindustantimes.com/
CNN Business - https://www.cnn.com/business
Financial Times - https://www.ft.com/
CNBC - https://www.cnbc.com/
Business Insider - https://www.businessinsider.com/
Politico - https://www.politico.eu/
The Hill - https://thehill.com/
The Washington Times - https://www.washingtontimes.com/
The Boston Globe - https://www.bostonglobe.com/
The LA Times - https://www.latimes.com/
The Chicago Tribune - https://www.chicagotribune.com/
The Globe and Mail - https://www.theglobeandmail.com/
The Toronto Star - https://www.thestar.com/
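One way to track progress against a site list like the one above is a small coverage check that reports which of the three requested fields (title, publication date, article body) came back empty for each site. This is only a sketch: the attribute names `title`, `publish_date` and `text` are assumed to match the parsed article object, and the stand-in objects below are invented data, not real extraction results.

```python
from types import SimpleNamespace

# Assumed attribute names for title, publication date and article body.
REQUIRED_FIELDS = ("title", "publish_date", "text")


def missing_fields(article):
    """Return the names of required fields that are absent or empty."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = getattr(article, field, None)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


def coverage_report(results):
    """Map each site name to its list of missing fields."""
    return {site: missing_fields(article) for site, article in results.items()}


# Stand-in objects; in practice these would be parsed articles, one per site.
results = {
    "Example Site A": SimpleNamespace(
        title="Headline", publish_date="2023-05-09", text="Body text."
    ),
    "Example Site B": SimpleNamespace(title="Headline", publish_date=None, text="  "),
}

for site, missing in coverage_report(results).items():
    print(site, "OK" if not missing else "missing: " + ", ".join(missing))
```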

@AndyTheFactory AndyTheFactory added the enhancement New feature or request label Oct 25, 2023
@AndyTheFactory AndyTheFactory added this to the Release 0.9.2 milestone Nov 12, 2023
@AndyTheFactory
Owner Author

Added in 0.9.2.
