Responsible Crawling

Responsible crawling means following the laws and established conventions of web crawling, so as to minimize the costs that crawling imposes on the sites being collected.

Key practices include:

  • respecting robots.txt except when you've been given explicit permission to do otherwise (see the metadata sketch after this list)
  • providing contact information in your User-Agent, and responding promptly to all contacts (also covered by the metadata sketch)
  • using politeness-delay and other settings that affect the frequency of hits to a single site, to ensure most of the site's serving capacity remains available for other visitors (see the disposition sketch below)
  • regularly monitoring crawler logs for evidence of unproductive/endless paths (traps), and actively adjusting the crawler to stop such activity when observed (see the scope-rule sketch below)
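
In Heritrix 3, the first two practices map onto the CrawlMetadata bean in a job's crawler-beans.cxml. A minimal sketch, with a placeholder contact URL; the property names are Heritrix's own, but the values here are illustrative:

```xml
<!-- crawler-beans.cxml: robots handling and operator contact info -->
<bean id="metadata" class="org.archive.modules.CrawlMetadata" autowire="byName">
  <!-- "obey" honors robots.txt; only switch this off with the
       site owner's explicit permission -->
  <property name="robotsPolicyName" value="obey"/>
  <!-- placeholder contact page; it is interpolated into the User-Agent
       below so webmasters can reach the operator -->
  <property name="operatorContactUrl" value="http://example.org/crawl-contact"/>
  <property name="userAgentTemplate"
            value="Mozilla/5.0 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"/>
</bean>
```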
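The frequency of hits to a single host is governed by the disposition processor, which computes the earliest time the frontier may fetch from that host again. A sketch using Heritrix 3's stock defaults, which an operator would tune per crawl:

```xml
<!-- crawler-beans.cxml: per-host politeness settings -->
<bean id="disposition" class="org.archive.crawler.postprocessor.DispositionProcessor">
  <!-- wait at least delayFactor times the duration of the last fetch
       before hitting the same host again -->
  <property name="delayFactor" value="5.0"/>
  <!-- clamp the computed delay to the range [3s, 30s] -->
  <property name="minDelayMs" value="3000"/>
  <property name="maxDelayMs" value="30000"/>
  <!-- optional hard cap on per-host bandwidth, in KB/sec (0 = unlimited) -->
  <property name="maxPerHostBandwidthUsageKbSec" value="0"/>
</bean>
```

With delayFactor at 5.0, a fetch that took one second is followed by at least five seconds of quiet on that host, keeping the crawler to a small fraction of the host's serving capacity (subject to the min/max clamps).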
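When log monitoring does reveal a trap, one way to stop it is to add a REJECT rule to the scope's DecideRuleSequence so that matching URIs are no longer enqueued. The regex below is hypothetical, standing in for whatever pattern the logs reveal; Heritrix also ships PathologicalPathDecideRule and TooManyPathSegmentsDecideRule, which catch many common traps out of the box:

```xml
<!-- crawler-beans.cxml: reject an observed trap, added to the
     scope's list of decide rules -->
<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
  <property name="decision" value="REJECT"/>
  <property name="listLogicalOr" value="true"/>
  <property name="regexList">
    <list>
      <!-- hypothetical pattern for an endless calendar -->
      <value>.*/calendar/.*</value>
    </list>
  </property>
</bean>
```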
