Using scrape in artoo.spiders

suntong001 edited this page Jan 20, 2015 · 4 revisions

This page explains how to use scrape with artoo.spiders, i.e., how to pass a configuration object for the scrape method, which artoo.spiders then calls automatically on the data it retrieves.

A detailed implementation example can be found here; it uses the same scraper configuration object to scrape both the first page and the following pages.

In the typical case of two-step scraping:

  1. A list of URLs is scraped from an index page.
  2. artoo.spiders uses that URL list to trigger a series of HTTP requests that collect data from those URLs (the content pages).
    • After successfully retrieving the data from each URL, artoo.spiders calls the scrape method with the configuration object to extract only the needed data from the content page.

For this case, simply define a second scraper configuration object, different from the one that scrapes the index page, and pass it to artoo.spiders in the same way as in the example linked above.
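The two-step flow described above can be sketched roughly as follows. This is only an illustration under several assumptions: that the spider in question is `artoo.ajaxSpider`, that it accepts a `scrape` key holding an `{iterator, data}` configuration object, and that `artoo.scrape(iterator, data)` returns the scraped items. The selectors, the two configuration objects, and the `scrapeTwoSteps` helper are hypothetical names invented for this example. The `artoo` object is passed in as an argument so the sketch does not depend on a global.

```javascript
// Hypothetical configuration for the index page:
// one item per link, keeping only its href attribute.
const indexScraper = {
  iterator: 'a.post-link', // assumed selector, adapt to the real index page
  data: 'href'
};

// Hypothetical configuration for the content pages:
// a different object, extracting only the fields we need.
const contentScraper = {
  iterator: '.post', // assumed selector, adapt to the real content pages
  data: {
    title: {sel: 'h1'},
    body: {sel: '.content', method: 'text'}
  }
};

// Runs both steps with the given artoo instance and hands the
// spider's results to the `done` callback.
function scrapeTwoSteps(artoo, done) {
  // Step 1: scrape the list of URLs from the current (index) page.
  const urls = artoo.scrape(indexScraper.iterator, indexScraper.data);

  // Step 2: request every URL; the `scrape` key is assumed to make the
  // spider apply contentScraper to each retrieved page.
  artoo.ajaxSpider(urls, {scrape: contentScraper}, done);
}
```

In a page where artoo is loaded, this might be invoked as `scrapeTwoSteps(artoo, function(data) { console.log(data); });`. Keeping the two configuration objects separate makes it easy to adjust the index selectors without touching the content-page ones.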