
# scrapy-german-news

A collection of spiders for the Scrapy application framework that crawl articles on major German news sites. The spiders are usable as is; crawled data can be written to a PostgreSQL database using the Python psycopg2 package or to a file as JSON.

## Table of Contents

* [Introduction](#introduction)
* [Installation](#installation)
* [Configuration](#configuration)
* [Usage](#usage)

## Introduction

This is a collection of spiders for the Scrapy application framework. The spiders cover major German news sites. The repository includes a middleware for rotating user agents and two output pipelines: writing to a PostgreSQL database using the psycopg2 package and writing to a file in a JSON-like structure.

### Covered Sites

The following German news sites have spiders:

* [Frankfurter Allgemeine Zeitung](http://www.faz.net)
* [Focus Online](http://www.focus.de)
* [Spiegel Online](http://www.spiegel.de)
* [Stern](http://www.stern.de)
* [Sueddeutsche Zeitung](http://www.sueddeutsche.de)
* [Die Welt](http://www.welt.de)
* [Zeit Online](http://www.zeit.de)
* [n-tv](http://www.n-tv.de)
* [Handelsblatt](http://www.handelsblatt.com)
* [RP Online](http://www.rp-online.de)

The full sites are not covered out of the box, only certain categories; see the Categories section below for how to modify them.

### Retrieved Data

The following information is extracted from these sites:

* URL
* Author
* Keywords
* Date published
* Date crawled
* Short description
* Full text

## Installation

The `requirements.txt` file contains a list of all required packages and can be installed using `pip`:

```bash
pip install -r requirements.txt
```

Alternatively the packages can be installed by hand:

```bash
pip install scrapy psycopg2
```

Scrapy is the framework for the spiders. Psycopg2 is used by a pipeline to write the results to a PostgreSQL database and can be skipped if that pipeline is commented out in `crawler/pipelines.py`.

### Requirements

* Python 2.7 (Scrapy is not yet released for Python 3)
* Scrapy 1.0 ([website](http://scrapy.org/), [github](https://github.com/scrapy/scrapy), [doc](http://doc.scrapy.org/en/1.0/))

If the database pipeline is used (otherwise it can be removed from `crawler/pipelines.py`):

* psycopg2

## Configuration

The configuration is in `crawler/settings.py`.

### Throttling

To avoid overloading the web servers with requests, a download delay of 3 seconds is set. Scrapy sends requests in an interval around this delay, so the crawler is not as easily recognizable as such. To remove the delay at your own risk, comment out the line:

```python
DOWNLOAD_DELAY=3
```

### User Agent

A middleware is included that rotates user agents to make the crawler harder to identify as such. The middleware is implemented in `crawler/middlewares.py` and can be enabled on a spider-by-spider basis. To activate the rotation, uncomment the middleware:

```python
DOWNLOADER_MIDDLEWARES = {
    # Enable for rotating user agents
    'crawler.middlewares.RotateUserAgentMiddleware': 110,
}
```

The user agents used are defined below as the most commonly used agents:

```python
USER_AGENT_CHOICES = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.71 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11) AppleWebKit/601.1.56 (KHTML, like Gecko) Version/9.0 Safari/601.1.56',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/601.2.7 (KHTML, like Gecko) Version/9.0.1 Safari/601.2.7',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko'
]
```

Change them or add your own as needed.

### Output

Two different output methods are provided: writing to a PostgreSQL database and writing to a file as JSON. The methods are implemented in `crawler/pipelines.py`. To choose a method, comment/uncomment the lines in `crawler/settings.py` as needed:

```python
ITEM_PIPELINES = {
    'crawler.pipelines.PostgresPipeline': 300,
    # 'crawler.pipelines.JsonWriterPipeline': 800,
}
```

#### Database

The settings for the database are defined in `crawler/settings.py`:

```python
DATABASE = {
    'drivername': 'postgres',
    'host': 'localhost',
    'port': '5432',
    'username': 'postgres',
    'password': 'password',
    'database': 'crawler'
}
```

The pipeline writes to a table with the name of the spider (it creates one if it does not already exist). Pages that were already crawled are recognized (by the URL) and ignored.
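The `PostgresPipeline` itself lives in `crawler/pipelines.py`; the following is only a minimal sketch of how such a pipeline could be wired together with psycopg2 and the `DATABASE` settings above. The class name, table columns, and item field names (`url`, `description`, `text`) are assumptions for illustration, not the project's actual schema:

```python
# Hypothetical sketch of a Postgres item pipeline, not the repository's actual code.
import psycopg2
from scrapy.exceptions import DropItem


class PostgresPipelineSketch(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Read the DATABASE dict defined in crawler/settings.py
        return cls(crawler.settings.get('DATABASE'))

    def __init__(self, db_settings):
        self.db_settings = db_settings

    def open_spider(self, spider):
        s = self.db_settings
        self.conn = psycopg2.connect(
            host=s['host'], port=s['port'], user=s['username'],
            password=s['password'], dbname=s['database'])
        self.cur = self.conn.cursor()
        # One table per spider, created on demand (columns here are simplified assumptions).
        self.table = spider.name
        self.cur.execute(
            "CREATE TABLE IF NOT EXISTS %s (url TEXT PRIMARY KEY, description TEXT, text TEXT)"
            % self.table)
        self.conn.commit()

    def process_item(self, item, spider):
        # Skip pages that were already crawled, recognized by their URL.
        self.cur.execute("SELECT 1 FROM %s WHERE url = %%s" % self.table, (item['url'],))
        if self.cur.fetchone():
            raise DropItem("Already crawled: %s" % item['url'])
        self.cur.execute(
            "INSERT INTO %s (url, description, text) VALUES (%%s, %%s, %%s)" % self.table,
            (item['url'], item.get('description'), item.get('text')))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
```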

#### JSON File

This pipeline writes the crawled items to a file called `items.json` in a notation similar to JSON. The filename can be changed in `crawler/pipelines.py`.
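For reference, a minimal pipeline of this kind, modelled on the standard Scrapy `JsonWriterPipeline` example rather than copied from `crawler/pipelines.py`, could look like this:

```python
# Hypothetical sketch of a JSON-writing pipeline, not the repository's actual code.
import json


class JsonWriterPipelineSketch(object):

    def open_spider(self, spider):
        # Filename hard-coded for the sketch; the repository lets you change it in crawler/pipelines.py.
        self.file = open('items.json', 'w')

    def process_item(self, item, spider):
        # One JSON object per item, written line by line, which makes the file
        # "JSON-like" rather than a single valid JSON array.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()
```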

### Categories

Not the full sites are crawled but only certain categories (mostly politics, economy, and general news). To add categories or change them, modify the spider for the site: open the implementation of the spider in `crawler/spiders/` and look for the `rules` variable. New categories can be added to the alternation in the regular expressions, as shown with `$new_category` below.

```python
rules = (
    Rule(
        LinkExtractor(
            allow=(
                '(politik|gesellschaft|wirtschaft|$new_category).*\/index',
                'thema\/',
            )
        ),
        follow=True
    ),
    Rule(
        LinkExtractor(
            allow=('(politik|gesellschaft|wirtschaft|$new_category)(\/.+)*\/\d{4}-\d{1,2}\/.+'),
            deny=('-fs')
        ),
        callback='parse_page',
    ),
)
```

## Usage

### CLI

The spiders can be run individually from the command line:

```bash
scrapy crawl $name
```

With `$name` being one of the following (an example invocation follows the table):
| Site | URL | `$name` |
|------|-----|---------|
| Frankfurter Allgemeine Zeitung | http://www.faz.net | faz |
| Focus Online | http://www.focus.de | focus |
| Spiegel Online | http://www.spiegel.de | spiegel |
| Stern | http://www.stern.de | stern |
| Sueddeutsche Zeitung | http://www.sueddeutsche.de | sz |
| Die Welt | http://www.welt.de | welt |
| Zeit Online | http://www.zeit.de | zeit |
| n-tv | http://www.n-tv.de | ntv |
| Handelsblatt | http://www.handelsblatt.com | hb |
| RP Online | http://www.rp-online.de | rp |
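For example, to crawl only Spiegel Online, or to run every spider one after another (names taken from the table above), an invocation from the project directory could look like this:

```bash
# Crawl a single site
scrapy crawl spiegel

# Run every spider sequentially (simple shell loop; adjust the list as needed)
for name in faz focus spiegel stern sz welt zeit ntv hb rp; do
    scrapy crawl "$name"
done
```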
### Daemon

For running multiple spiders at defined intervals as a daemon, refer to [scrapyd](https://scrapyd.readthedocs.org/en/latest/overview.html).
