Only Store Successful HTML Pages
Suppose you want to capture only the first 50 pages encountered from a set of seeds, and to archive only those pages that return a 200 response code with a text/html MIME type. Additionally, you want Heritrix to look for links only in HTML resources.
To examine only HTML documents for links, remove the following extractors, which tell Heritrix to look for links in style sheets, JavaScript, and Flash files:
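The archiving half of this goal is typically expressed as a decide-rule sequence attached to the writer processor. The fragment below is a minimal sketch, not a definitive configuration: it assumes the default profile's warcWriter bean and the decide-rule classes in org.archive.modules.deciderules, so check the class and property names against your Heritrix version before relying on it.

```xml
<!-- Sketch only: bean id and class names are assumed from the default profile. -->
<bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
  <property name="shouldProcessRule">
    <bean class="org.archive.modules.deciderules.DecideRuleSequence">
      <property name="rules">
        <list>
          <!-- Start by rejecting everything. -->
          <bean class="org.archive.modules.deciderules.RejectDecideRule" />
          <!-- Accept responses whose content type is text/html. -->
          <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
            <property name="decision" value="ACCEPT" />
            <property name="regex" value="^text/html.*" />
          </bean>
          <!-- Reject anything whose fetch status is not 200. -->
          <bean class="org.archive.modules.deciderules.FetchStatusNotMatchesRegexDecideRule">
            <property name="decision" value="REJECT" />
            <property name="regex" value="200" />
          </bean>
        </list>
      </property>
    </bean>
  </property>
</bean>
```

In a rule sequence the last rule that expresses a decision wins, so a URI is written only if it is text/html and was fetched with status 200.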
- ExtractorCss
- ExtractorJs
- ExtractorSwf
Leave ExtractorHttp in place, since it is useful for locating resources that can only be found by following a redirect (301 or 302 response code).
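As an illustration, the fetch chain after removing the three extractors might look like the fragment below. This is a sketch that assumes the bean ids used in the default crawler-beans.cxml profile (fetchProcessors, extractorHttp, extractorHtml, and so on); your profile may list additional processors, which should be left as they are.

```xml
<!-- Sketch only: bean ids are assumed from the default profile. -->
<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list>
      <ref bean="preselector"/>
      <ref bean="preconditions"/>
      <ref bean="fetchDns"/>
      <ref bean="fetchHttp"/>
      <!-- Kept: finds URIs in HTTP headers, such as redirect targets. -->
      <ref bean="extractorHttp"/>
      <!-- Kept: link extraction from HTML documents. -->
      <ref bean="extractorHtml"/>
      <!-- Removed: extractorCss, extractorJs, extractorSwf. -->
    </list>
  </property>
</bean>
```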
You can limit the number of URIs downloaded by setting the maxDocumentsDownload property on the crawlLimiter bean. Setting the value to 50 will probably not give the intended result: each DNS response and robots.txt file counts toward the limit, so you should set the value to 50 * number of seeds * 2.
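For example, with a hypothetical seed list of 3 seeds, the formula gives 50 * 3 * 2 = 300. A minimal sketch of the corresponding bean, assuming the crawlLimiter definition from the default profile (the other two properties are shown at 0, which is assumed here to mean no limit):

```xml
<!-- Sketch only: 3 seeds is an assumed example, so 50 * 3 * 2 = 300. -->
<bean id="crawlLimiter" class="org.archive.crawler.framework.CrawlLimitEnforcer">
  <property name="maxBytesDownload" value="0" />
  <property name="maxDocumentsDownload" value="300" />
  <property name="maxTimeSeconds" value="0" />
</bean>
```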