Collection of spiders for the Scrapy application framework that crawl articles on major German news sites. The spiders are usable as is; crawled data can be written to a PostgreSQL database using the Python psycopg2 package or to a file in a JSON-like format.
## Introduction

This is a collection of spiders for the Scrapy application framework, covering major German news sites. The repository includes a middleware for rotating user agents and two output pipelines: one writes to a PostgreSQL database using the psycopg2 package, the other writes to a file in a JSON-like format. (A sketch of such a user agent middleware follows the site list below.)

### Covered Sites

Spiders exist for the following German news sites:

* [Frankfurter Allgemeine Zeitung](http://www.faz.net)
* [Focus Online](http://www.focus.de)
* [Spiegel Online](http://www.spiegel.de)
* [Stern](http://www.stern.de)
* [Sueddeutsche Zeitung](http://www.sueddeutsche.de)
* [Die Welt](http://www.welt.de)
* [Zeit Online](http://www.zeit.de)
* [n-tv](http://www.n-tv.de)
* [Handelsblatt](http://www.handelsblatt.com)
* [RP Online](http://www.rp-online.de)

The sites are not covered in full out of the box, only certain categories; see the section Categories on how to modify them.
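User agent rotation in Scrapy is typically implemented as a downloader middleware. The repository's actual class and settings names are not reproduced here, so the following is only a minimal sketch of the pattern:

```python
# Sketch of a rotating user agent middleware; class name and the
# USER_AGENTS list are assumptions, not the repository's actual code.
import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10) AppleWebKit/600.1.25',
]


class RotateUserAgentMiddleware(object):
    """Pick a random user agent for every outgoing request."""

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
```

A middleware like this is enabled through the `DOWNLOADER_MIDDLEWARES` setting in `crawler/settings.py`.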
### Retrieved Data

The following information is extracted from these sites:

* URL
* Author
* Keywords
* Date published
* Date crawled
* Short description
* Full text

## Installation

The `requirements.txt` file contains a list of all required packages and can be installed using `pip`:

```bash
pip install -r requirements.txt
```

Alternatively, the packages can be installed by hand:

```bash
pip install scrapy psycopg2
```

Scrapy is the framework for the spiders. Psycopg2 is used by a pipeline to write the results to a PostgreSQL database and can be skipped if that pipeline is commented out in `crawler/pipelines.py`.

### Requirements

* Python 2.7 (Scrapy is not yet released for Python 3)
* Scrapy 1.0 ([website](http://scrapy.org/), [github](https://github.com/scrapy/scrapy), [doc](http://doc.scrapy.org/en/1.0/))

If the database pipeline is used (else you can remove it from `crawler/pipelines.py`):

* psycopg2
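The fields listed under Retrieved Data correspond to a Scrapy item definition. A minimal sketch of what this might look like in `crawler/items.py`; the class and field names here are assumptions, not the repository's actual identifiers:

```python
# Sketch of an item holding the retrieved data (names are assumptions)
import scrapy


class ArticleItem(scrapy.Item):
    url = scrapy.Field()
    author = scrapy.Field()
    keywords = scrapy.Field()
    date_published = scrapy.Field()
    date_crawled = scrapy.Field()
    description = scrapy.Field()   # short description
    text = scrapy.Field()          # full article text
```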
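Using item fields like those above, a database pipeline based on psycopg2 might look roughly like the sketch below; the table name, columns, and connection parameters are assumptions, not the repository's actual schema:

```python
# Sketch of a PostgreSQL pipeline (schema and credentials are assumptions)
import psycopg2


class PostgresPipeline(object):

    def open_spider(self, spider):
        # Open one connection per crawl.
        self.conn = psycopg2.connect(
            host='localhost', dbname='news', user='crawler', password='secret')
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        self.cur.execute(
            "INSERT INTO articles (url, author, date_published, text) "
            "VALUES (%s, %s, %s, %s)",
            (item['url'], item['author'], item['date_published'], item['text']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
```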
The second output pipeline, the JSON file writer, writes the crawled items to a file called `items.json` in a notation similar to JSON. The filename can be changed in `crawler/pipelines.py`.
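One common way to get such JSON-like output is to serialize each item on its own line, in which case the file as a whole is not one valid JSON document. A minimal sketch of this approach, with the class name assumed:

```python
# Sketch of a JSON-lines writer pipeline (class name is an assumption)
import json


class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.json', 'w')  # filename as described above

    def process_item(self, item, spider):
        # One JSON object per line; the file as a whole is only JSON-like.
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()
```

Pipelines are enabled through the `ITEM_PIPELINES` setting in `crawler/settings.py`, so either pipeline can also be disabled there.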
Each site has its own spider; `$name` in the table below is the spider's name as used by Scrapy (e.g. with `scrapy crawl $name`):

Site | URL | `$name`
---|---|---
Frankfurter Allgemeine Zeitung | http://www.faz.net | faz
Focus Online | http://www.focus.de | focus
Spiegel Online | http://www.spiegel.de | spiegel
Stern | http://www.stern.de | stern
Sueddeutsche Zeitung | http://www.sueddeutsche.de | sz
Die Welt | http://www.welt.de | welt
Zeit Online | http://www.zeit.de | zeit
n-tv | http://www.n-tv.de | ntv
Handelsblatt | http://www.handelsblatt.com | hb
RP Online | http://www.rp-online.de | rp
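Spiders can also be started from a Python script instead of the `scrapy crawl` command. This is standard Scrapy and not specific to this repository:

```python
# Run one of the spiders programmatically (standard Scrapy 1.0 API).
# Must be run from the project directory so the settings are found.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('faz')  # or 'spiegel', 'sz', ... from the table above
process.start()       # blocks until the crawl is finished
```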