
# scrapy-german-news

A collection of spiders for the Scrapy application framework that crawl articles on major German news sites. The spiders are usable as is; crawled data can be written to a PostgreSQL database using the Python psycopg2 package or to a file as JSON.

## Table of Contents

* [Introduction](#introduction)
* [Installation](#installation)
* [Configuration](#configuration)
* [Usage](#usage)

## Introduction

This is a collection of spiders for the Scrapy application framework. The spiders cover major German news sites. The repository includes a middleware for rotating user agents and two output pipelines: writing to a PostgreSQL database using the psycopg2 package and writing to a file in a JSON-like structure.

### Covered Sites

The following German news sites have spiders:

* [Frankfurter Allgemeine Zeitung](http://www.faz.net)
* [Focus Online](http://www.focus.de)
* [Spiegel Online](http://www.spiegel.de)
* [Stern](http://www.stern.de)
* [Sueddeutsche Zeitung](http://www.sueddeutsche.de)
* [Die Welt](http://www.welt.de)
* [Zeit Online](http://www.zeit.de)
* [n-tv](http://www.n-tv.de)
* [Handelsblatt](http://www.handelsblatt.com)
* [RP Online](http://www.rp-online.de)

The full sites are not covered out of the box, only certain categories; see the Categories section below for how to modify them.

### Retrieved Data

The following information is extracted from these sites:

* URL
* Author
* Keywords
* Date published
* Date crawled
* Short description
* Full text

## Installation

The `requirements.txt` file contains a list of all required packages and can be installed using `pip`:

```bash
pip install -r requirements.txt
```

Alternatively the packages can be installed by hand:

```bash
pip install scrapy psycopg2
```

Scrapy is the framework for the spiders. Psycopg2 is used by a pipeline to write the results to a PostgreSQL database and can be skipped if that pipeline is commented out in `crawler/pipelines.py`.

### Requirements

* Python 2.7 (Scrapy is not yet released for Python 3)
* Scrapy 1.0 ([website](http://scrapy.org/), [github](https://github.com/scrapy/scrapy), [doc](http://doc.scrapy.org/en/1.0/))

If the database pipeline is used (otherwise it can be removed from `crawler/pipelines.py`):

* psycopg2

## Configuration

The configuration is in `crawler/settings.py`.

### Throttling

To avoid overloading the web servers with requests, a download delay of 3 seconds is set. Scrapy sends requests in an interval around this delay, so the crawler is not as easily recognizable as such. To remove the delay at your own risk, comment out the line:

```python
DOWNLOAD_DELAY=3
```

### User Agent

A middleware is included that rotates user agents to make the crawler harder to identify as such. The middleware is implemented in `crawler/middlewares.py` and can be enabled on a spider-by-spider basis. To activate the rotation, uncomment the middleware:

```python
DOWNLOADER_MIDDLEWARES = {
    # Enable for rotating user agents
    'crawler.middlewares.RotateUserAgentMiddleware': 110,
}
```

The user agents used are defined below as the most commonly used agents:

```python
USER_AGENT_CHOICES = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.71 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11) AppleWebKit/601.1.56 (KHTML, like Gecko) Version/9.0 Safari/601.1.56',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/601.2.7 (KHTML, like Gecko) Version/9.0.1 Safari/601.2.7',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko'
]
```

Change them or add your own as needed.

### Output

Two different output methods are provided: writing to a PostgreSQL database and writing to a file as JSON. The methods are implemented in `crawler/pipelines.py`. To choose a method, comment/uncomment the lines in `crawler/settings.py` as needed:

```python
ITEM_PIPELINES = {
    'crawler.pipelines.PostgresPipeline': 300,
    # 'crawler.pipelines.JsonWriterPipeline': 800,
}
```

#### Database

The settings for the database are defined in `crawler/settings.py`:

```python
DATABASE = {
    'drivername': 'postgres',
    'host': 'localhost',
    'port': '5432',
    'username': 'postgres',
    'password': 'password',
    'database': 'crawler'
}
```

The pipeline writes to a table with the name of the spider (it creates one if it does not already exist). Pages that were already crawled are recognized (by the URL) and ignored.
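The `PostgresPipeline` itself lives in `crawler/pipelines.py`; the following is only a minimal sketch of how such a pipeline could be wired together with psycopg2 and the `DATABASE` settings above. The class name, table columns, and item field names (`url`, `description`, `text`) are assumptions for illustration, not the project's actual schema:

```python
# Hypothetical sketch of a Postgres item pipeline, not the repository's actual code.
import psycopg2
from scrapy.exceptions import DropItem


class PostgresPipelineSketch(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Read the DATABASE dict defined in crawler/settings.py
        return cls(crawler.settings.get('DATABASE'))

    def __init__(self, db_settings):
        self.db_settings = db_settings

    def open_spider(self, spider):
        s = self.db_settings
        self.conn = psycopg2.connect(
            host=s['host'], port=s['port'], user=s['username'],
            password=s['password'], dbname=s['database'])
        self.cur = self.conn.cursor()
        # One table per spider, created on demand (columns here are simplified assumptions).
        self.table = spider.name
        self.cur.execute(
            "CREATE TABLE IF NOT EXISTS %s (url TEXT PRIMARY KEY, description TEXT, text TEXT)"
            % self.table)
        self.conn.commit()

    def process_item(self, item, spider):
        # Skip pages that were already crawled, recognized by their URL.
        self.cur.execute("SELECT 1 FROM %s WHERE url = %%s" % self.table, (item['url'],))
        if self.cur.fetchone():
            raise DropItem("Already crawled: %s" % item['url'])
        self.cur.execute(
            "INSERT INTO %s (url, description, text) VALUES (%%s, %%s, %%s)" % self.table,
            (item['url'], item.get('description'), item.get('text')))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
```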

#### JSON File

This pipeline writes the crawled items to a file called `items.json` in a notation similar to JSON. The filename can be changed in `crawler/pipelines.py`.
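For reference, a minimal pipeline of this kind, modelled on the standard Scrapy `JsonWriterPipeline` example rather than copied from `crawler/pipelines.py`, could look like this:

```python
# Hypothetical sketch of a JSON-writing pipeline, not the repository's actual code.
import json


class JsonWriterPipelineSketch(object):

    def open_spider(self, spider):
        # Filename hard-coded for the sketch; the repository lets you change it in crawler/pipelines.py.
        self.file = open('items.json', 'w')

    def process_item(self, item, spider):
        # One JSON object per item, written line by line, which makes the file
        # "JSON-like" rather than a single valid JSON array.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()
```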

### Categories

Not the full sites are crawled but only certain categories (mostly politics, economy, and general news). To add categories or change them, modify the spider for the site: open the implementation of the spider in `crawler/spiders/` and look for the `rules` variable. New categories can be added to the alternation in the regular expressions, as shown with `$new_category` below.

```python
rules = (
    Rule(
        LinkExtractor(
            allow=(
                '(politik|gesellschaft|wirtschaft|$new_category).*\/index',
                'thema\/',
            )
        ),
        follow=True
    ),
    Rule(
        LinkExtractor(
            allow=('(politik|gesellschaft|wirtschaft|$new_category)(\/.+)*\/\d{4}-\d{1,2}\/.+'),
            deny=('-fs')
        ),
        callback='parse_page',
    ),
)
```

## Usage

### CLI

The spiders can be run individually from the command line:

```bash
scrapy crawl $name
```

With `$name` being one of the following (an example invocation follows the table):
| Site | URL | `$name` |
|------|-----|---------|
| Frankfurter Allgemeine Zeitung | http://www.faz.net | faz |
| Focus Online | http://www.focus.de | focus |
| Spiegel Online | http://www.spiegel.de | spiegel |
| Stern | http://www.stern.de | stern |
| Sueddeutsche Zeitung | http://www.sueddeutsche.de | sz |
| Die Welt | http://www.welt.de | welt |
| Zeit Online | http://www.zeit.de | zeit |
| n-tv | http://www.n-tv.de | ntv |
| Handelsblatt | http://www.handelsblatt.com | hb |
| RP Online | http://www.rp-online.de | rp |
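For example, to crawl only Spiegel Online, or to run every spider one after another (names taken from the table above), an invocation from the project directory could look like this:

```bash
# Crawl a single site
scrapy crawl spiegel

# Run every spider sequentially (simple shell loop; adjust the list as needed)
for name in faz focus spiegel stern sz welt zeit ntv hb rp; do
    scrapy crawl "$name"
done
```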
### Daemon

For running multiple spiders at defined intervals as a daemon, refer to [scrapyd](https://scrapyd.readthedocs.org/en/latest/overview.html).
