Adding URIs mid crawl
(This is written for Heritrix 1.14.x; procedures for H2/H3 will vary.)
There are two primary ways to use the Web UI to add URIs to a crawl after it has begun: changing the seed source file (seeds.txt) and triggering a rescan, or using the 'import URIs' option from the 'View/Edit Frontier' page (only available in paused crawls).
The seeds file is by default reread after any settings change (even small unrelated changes); you can change this via the scope's 'reread-seeds-on-reconfig' setting.
(You can edit smallish seed files in the textarea at the bottom of all other editable settings. Large seed files are awkward or impossible to edit via the Web UI's text area; in such a case you can still edit the file on disk via other methods.)
Rescanning the seeds file triggers an attempted rescheduling of all the URIs it contains. However, URIs that have already been scheduled will not be rescheduled, so only URIs new to the seeds file will be newly scheduled. (And if, for some reason, those new URIs were already scheduled via some other discovery path, they won't be rescheduled either.)
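The rescan behavior described above can be modelled in a few lines. This is only an illustrative sketch, not Heritrix code: `rescan_seeds` and the `already_scheduled` set are hypothetical stand-ins for the frontier's record of URIs it has scheduled before, and the handling of blank and '#' comment lines is an assumption.

```python
def rescan_seeds(seed_lines, already_scheduled):
    """Return the seed URIs that a rescan would newly schedule.

    Hypothetical model: `already_scheduled` is a set of URIs the crawler
    has scheduled before, by any discovery path. It is updated in place.
    """
    newly_scheduled = []
    for line in seed_lines:
        uri = line.strip()
        # Skip blank lines and comment lines (assumed to start with '#').
        if not uri or uri.startswith("#"):
            continue
        if uri in already_scheduled:
            continue  # previously scheduled: the rescan is a no-op for it
        already_scheduled.add(uri)
        newly_scheduled.append(uri)
    return newly_scheduled


scheduled = {"http://example.com/", "http://example.org/found-via-link"}
seeds = [
    "http://example.com/",                # original seed, already scheduled
    "http://example.org/found-via-link",  # new to the file, but already discovered
    "http://example.net/",                # genuinely new
]
print(rescan_seeds(seeds, scheduled))  # ['http://example.net/']
```

Only the third URI is scheduled; the first two are skipped regardless of how they were originally discovered.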
Scopes which are defined in terms of seeds are also by default rebuilt from seeds on settings changes. (So if using such a scope, you would generally not want to remove original seeds even though they are already scheduled -- otherwise your scope could be redefined to exclude the corresponding sites.)
When the crawler is paused, a 'View or Edit Frontier' link will appear in the console. The corresponding page offers an 'import URIs' form.
This form expects a file path local to the crawler; the file may be in any of the three formats listed (a URI per line, a crawl.log, or an uncompressed recovery log). If the supplied format includes 'hops-path' and 'via' information, the imported URIs will share that information. If the 'force revisit' box is checked, the supplied URIs will be force-scheduled, even if they were previously scheduled.
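To illustrate how 'hops-path' and 'via' information can travel with an imported URI, here is a rough sketch of reading two of the listed formats. It is not the crawler's own parser: `parse_import_line`, `ImportedUri` and the style labels are hypothetical, the sample log line is made up, and the code assumes that a crawl.log line is whitespace-separated with the URI, hops-path and via in the 4th, 5th and 6th fields and '-' standing for an empty value.

```python
from typing import NamedTuple, Optional


class ImportedUri(NamedTuple):
    uri: str
    hops_path: Optional[str]
    via: Optional[str]


def parse_import_line(line, style="default"):
    """Extract the URI (and hops-path/via, if present) from one input line.

    style="default": one URI per line, no extra information.
    style="crawlLog": whitespace-separated crawl.log line; the URI,
    hops-path and via are assumed to be the 4th, 5th and 6th fields.
    """
    fields = line.split()
    if not fields:
        return None  # blank line
    if style == "default":
        return ImportedUri(fields[0], None, None)
    if style == "crawlLog":
        if len(fields) < 6:
            return None  # not a complete crawl.log line
        uri, hops, via = fields[3], fields[4], fields[5]
        # Assumption: '-' marks an empty value (e.g. a seed has no via).
        return ImportedUri(uri,
                           None if hops == "-" else hops,
                           None if via == "-" else via)
    raise ValueError("unknown style: " + style)


# Made-up sample line in the assumed crawl.log layout.
sample = ("20080101120000000 200 1024 http://example.com/a.html L "
          "http://example.com/ text/html #001 20080101120000000+15 sha1:ABC - -")
print(parse_import_line(sample, style="crawlLog"))
print(parse_import_line("http://example.com/b.html"))
```

The plain one-URI-per-line form yields no hops-path or via, which matches the text above: that information is shared only when the supplied format includes it.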
None of the import-URI options cause URIs to be treated as seeds or change the scope in any way. (For example, you could force-schedule an out-of-scope URI -- but if you still have the default 'Preselector' processor that rechecks scope, it would still be rejected from crawling when its turn comes up.)
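The split between scheduling and scope checking can be pictured with a toy model. This is a simplified sketch under stated assumptions, not the real frontier or Preselector: `ToyFrontier`, `import_uri`, `next_to_fetch` and the `in_scope` predicate are all hypothetical names.

```python
from collections import deque


class ToyFrontier:
    """Toy model: a queue plus a set of URIs that were ever scheduled."""

    def __init__(self, in_scope):
        self.in_scope = in_scope  # hypothetical scope predicate
        self.seen = set()
        self.queue = deque()

    def import_uri(self, uri, force_revisit=False):
        # Importing never consults or changes the scope.
        if uri in self.seen and not force_revisit:
            return False
        self.seen.add(uri)
        self.queue.append(uri)
        return True

    def next_to_fetch(self):
        # Scope is rechecked when a URI's turn comes up (the Preselector's role).
        while self.queue:
            uri = self.queue.popleft()
            if self.in_scope(uri):
                return uri
        return None


frontier = ToyFrontier(in_scope=lambda uri: uri.startswith("http://example.com/"))
frontier.import_uri("http://other.example/page", force_revisit=True)  # queued anyway
frontier.import_uri("http://example.com/page")
print(frontier.next_to_fetch())  # http://example.com/page
```

The out-of-scope URI is accepted at import time but discarded when it reaches the front of the queue, so only the in-scope URI is handed out for fetching.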
The JMX remote-control interface includes importUri and importUris operations on the CrawlJob bean that mimic the Web UI's 'import URIs' function.