news-crawler

A news crawler for BBC News, Reuters and The New York Times.

Requirements

python3
configobj
dateutil
requests
bs4
goose3

pip install -r requirements.txt

Architecture

xxx_crawler.py: the entry-point script that runs the crawl for one source (xxx is bbc, reuters or nytimes).
xxx.cfg: configuration for the crawler, including the API, time range and storage path.
xxx_link.py: fetches the links of the articles to download.
xxx_article: extracts the content and some metadata of a single news article.
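
The division of labour between these pieces can be sketched as below. This is only an illustration under stated assumptions, not the repository's actual code: `Article`, `run_crawler`, `fake_links` and `fake_article` are hypothetical names, and the two stubs return canned data instead of contacting any news site.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Article:
    """Content and metadata of one news article (hypothetical shape)."""
    url: str
    title: str
    text: str


def run_crawler(
    settings: Dict[str, str],
    fetch_links: Callable[[Dict[str, str]], Iterable[str]],
    extract_article: Callable[[str], Article],
) -> List[Article]:
    """Play the role of xxx_crawler.py: wire settings, links and articles together."""
    articles = []
    for url in fetch_links(settings):          # role of xxx_link.py
        articles.append(extract_article(url))  # role of xxx_article
    return articles


# Stubs standing in for a real source; they perform no network access.
def fake_links(settings: Dict[str, str]) -> List[str]:
    return [f"https://example.com/{settings['start_date']}/story-{i}" for i in (1, 2)]


def fake_article(url: str) -> Article:
    return Article(url=url, title=url.rsplit("/", 1)[-1], text="(body omitted)")


if __name__ == "__main__":
    for article in run_crawler({"start_date": "2019-01-01"}, fake_links, fake_article):
        print(article.title, article.url)
```

Passing the link and article steps in as callables keeps the driver loop identical for every source; only the two source-specific functions change.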

Usage

BBC News

python bbc_crawler.py

Reuters

python reuters_crawler.py

New York Times

python nytimes_crawler.py

Configuration

Edit reuters.cfg, nytimes.cfg and bbc.cfg in the settings folder. The items you are most likely to change are start_date, end_date and path.
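
As a rough sketch of how start_date and end_date might drive a crawl, the snippet below expands the two settings into one date per day. Several things here are assumptions rather than facts about this project: the `YYYY-MM-DD` date format, the example values, the helper name `days_in_range`, and the idea that the keys sit at the top level of the .cfg file. A plain dict stands in for the loaded configuration, since a ConfigObj instance is read like a dictionary.

```python
from datetime import date, datetime, timedelta
from typing import Iterator, Mapping


def days_in_range(settings: Mapping[str, str], fmt: str = "%Y-%m-%d") -> Iterator[date]:
    """Yield each day from start_date to end_date, inclusive."""
    start = datetime.strptime(settings["start_date"], fmt).date()
    end = datetime.strptime(settings["end_date"], fmt).date()
    if end < start:
        raise ValueError("end_date must not be earlier than start_date")
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


# Assumed example values. With configobj installed, something like
# ConfigObj("settings/bbc.cfg") could be passed instead of this dict,
# provided the file really uses these keys at its top level.
example_settings = {
    "start_date": "2019-01-01",
    "end_date": "2019-01-03",
    "path": "./data",
}

if __name__ == "__main__":
    for day in days_in_range(example_settings):
        print(day.isoformat())
```

Validating the range up front means a swapped start_date and end_date fails loudly instead of silently crawling nothing.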

Notes

To add another news source, add files that follow the architecture above and extend the base class in each folder. Some methods may need to be overridden.
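
The extension pattern might look like the sketch below. The class and method names (`BaseLink`, `ExampleLink`, `build_query_url`, `query_urls`) and the example.com URL are hypothetical and chosen only for illustration; check the actual base classes in the link and article folders for the real names and signatures.

```python
from abc import ABC, abstractmethod
from typing import List


class BaseLink(ABC):
    """Hypothetical stand-in for a base class in the link folder."""

    def __init__(self, start_date: str, end_date: str) -> None:
        self.start_date = start_date
        self.end_date = end_date

    @abstractmethod
    def build_query_url(self, day: str) -> str:
        """Return the URL that lists the articles published on `day`."""

    def query_urls(self, days: List[str]) -> List[str]:
        """Shared logic that every source inherits unchanged."""
        return [self.build_query_url(day) for day in days]


class ExampleLink(BaseLink):
    """A new source overrides only what differs from the base behaviour."""

    def build_query_url(self, day: str) -> str:
        return f"https://news.example.com/archive/{day}"


if __name__ == "__main__":
    link = ExampleLink("2019-01-01", "2019-01-02")
    print(link.query_urls(["2019-01-01", "2019-01-02"]))
```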

g-jing/news-crawler