Multiple Machine Crawling
Multi-machine crawls can be performed with the hashCrawlMapper processor, which assigns each URI to a crawler based on a hash of its host.
First, add the following bean to the job's crawler-beans.cxml:
<bean id="hashCrawlMapper">
<property name="enabled" value="true" />
<property name="localName" value="0" />
<property name="diversionDir" value="diversions" />
<property name="checkUri" value="true" />
<property name="checkOutlinks" value="false"/>
<property name="rotationDigits" value="10" />
<!-- Number of crawlers being used in the multi-machine setup-->
<property name="crawlerCount" value="7" />
</bean>
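Conceptually, the mapper hashes each candidate URI's assignment key (by default its host) into a bucket numbered 0 through crawlerCount - 1; the URI is crawled locally only when the bucket matches localName, and is diverted otherwise. The following Java sketch illustrates that decision; the plain string hash is an illustrative stand-in for the fingerprint function Heritrix actually uses:

import java.net.URI;

public class CrawlMapperSketch {
    // Mirrors the bean properties above (illustrative values).
    static final int CRAWLER_COUNT = 7; // total crawlers in the setup
    static final int LOCAL_NAME = 0;    // this machine's index

    /** Map a host to a crawler index in [0, CRAWLER_COUNT). */
    static int assignedCrawler(String host) {
        // Heritrix uses its own fingerprint hash; String.hashCode()
        // stands in here purely to illustrate the modulo assignment.
        return Math.floorMod(host.hashCode(), CRAWLER_COUNT);
    }

    public static void main(String[] args) {
        String[] candidates = {
            "http://example.com/page", "http://example.org/other"
        };
        for (String u : candidates) {
            String host = URI.create(u).getHost();
            int target = assignedCrawler(host);
            if (target == LOCAL_NAME) {
                System.out.println(u + " -> crawl locally");
            } else {
                System.out.println(u + " -> divert to crawler " + target);
            }
        }
    }
}

Because the assignment depends only on the hash, every crawler computes the same answer for a given host, so no coordination is needed at crawl time.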
Next, reference the bean in the candidate processors chain:
<bean id="candidateProcessors">
<property name="processors">
<list>
<!-- apply scoping rules to each individual candidate URI... -->
<ref bean="candidateScoper"/>
<!-- Check URIs for crawler assignment -->
<ref bean="hashCrawlMapper"/>
All crawlers receive the same initial list of seeds. The only value that should differ between crawlers is "localName".
When the crawl starts, the "diversions" directory will start to fill with .divert files. Each .divert file will be named $timestamp-$localname_currentmachine-to-$localname_assignedmachine.
Crawl operators must set up a process by which the URIs contained in .divert files are copied from each crawler to their assigned crawlers and queued into the active crawl. For example, converting each .divert file into the format expected by the action directory as a .schedule file would be sufficient.
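One way to automate this hand-off is a small job, run on each crawler, that takes the .divert files addressed to it (copied over beforehand by rsync, scp, or similar) and rewrites them as .schedule files in the running job's action directory, where Heritrix picks them up and enqueues the URIs. The sketch below is a minimal version of that step; the diversions-inbox and jobs/myjob/action paths are illustrative assumptions, and it assumes the diverted lines are already in a form the action directory accepts:

import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

public class DivertToSchedule {
    public static void main(String[] args) throws IOException {
        // Where diverted files from other crawlers have been copied (assumed path).
        Path inbox = Paths.get("diversions-inbox");
        // The running job's action directory, watched by Heritrix (assumed path).
        Path actionDir = Paths.get("jobs/myjob/action");

        try (Stream<Path> diverts = Files.list(inbox)) {
            diverts.filter(p -> p.toString().endsWith(".divert"))
                   .forEach(p -> convert(p, actionDir));
        }
    }

    /** Move a .divert file into the action dir, renamed as a .schedule file. */
    static void convert(Path divert, Path actionDir) {
        String name = divert.getFileName().toString()
                .replaceFirst("\\.divert$", ".schedule");
        try {
            // Copy to a temporary name first so Heritrix never sees a
            // half-written file, then rename atomically into place.
            Path tmp = actionDir.resolve(name + ".tmp");
            Files.copy(divert, tmp, StandardCopyOption.REPLACE_EXISTING);
            Files.move(tmp, actionDir.resolve(name),
                    StandardCopyOption.ATOMIC_MOVE);
            Files.delete(divert);
        } catch (IOException e) {
            System.err.println("failed to convert " + divert + ": " + e);
        }
    }
}

Run on a schedule (e.g. from cron), this keeps diverted URIs flowing into each crawler's frontier with no manual intervention.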