A news crawler for BBC News, Reuters and the New York Times.
- python3
- configobj
- dateutil
- requests
- bs4
- goose3
pip install -r requirements.txt
- xxx_crawler.py: the executable script that crawls news.
- xxx.cfg: configuration for the crawler, including the API, time range and storage path.
- xxx_link.py: fetches download links.
- xxx_article: extracts the content and some metadata of one news article.
python bbc_crawler.py
python reuters_crawler.py
python nytimes_crawler.py
Modify reuters.cfg, nytimes.cfg and bbc.cfg in the settings folder. The main configuration items are start_date, end_date and path.
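To illustrate how start_date and end_date bound a crawl, here is a minimal sketch that expands a date range into individual days. It assumes the dates are ISO strings (YYYY-MM-DD); the actual format expected by the .cfg files may differ. The `settings` dict and the `iter_days` helper are hypothetical stand-ins, not part of this repository.

```python
from datetime import date, timedelta

# Hypothetical values, as they might look after loading a .cfg file.
# The ISO date format (YYYY-MM-DD) is an assumption.
settings = {
    "start_date": "2019-01-01",
    "end_date": "2019-01-03",
    "path": "data/bbc",
}


def iter_days(start, end):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


start = date.fromisoformat(settings["start_date"])
end = date.fromisoformat(settings["end_date"])
days = list(iter_days(start, end))

for day in days:
    print(day.isoformat())
```

A wider range simply yields more days, so a small range is a convenient way to try out a configuration before a long crawl.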
To add another news source, add files that follow the same architecture and extend the base class in each folder. Some methods may need to be rewritten.
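The extension pattern could look roughly like the sketch below. The class names, the `build_url` and `urls` methods, and the example.com URL are all hypothetical; the real base classes in this repository may expose different names and signatures.

```python
class BaseLinkFetcher:
    """Hypothetical stand-in for the base link-fetching class."""

    def __init__(self, start_date, end_date):
        self.start_date = start_date
        self.end_date = end_date

    def build_url(self, day):
        # Each news source defines its own URL scheme.
        raise NotImplementedError("override build_url in a subclass")

    def urls(self, days):
        """Build one archive URL per day using the subclass's scheme."""
        return [self.build_url(day) for day in days]


class ExampleLinkFetcher(BaseLinkFetcher):
    """Hypothetical new source; only the source-specific method is overridden."""

    def build_url(self, day):
        # example.com is a placeholder, not a real news archive.
        return "https://example.com/archive/" + day


fetcher = ExampleLinkFetcher("2019-01-01", "2019-01-02")
links = fetcher.urls(["2019-01-01", "2019-01-02"])
print(links)
```

Keeping the shared logic in the base class means a new source only needs to override the parts that differ, such as URL construction or page parsing.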