Development Notes
Alex Osborne edited this page Jul 4, 2018 · 2 revisions
Most development effort is now directed at the Heritrix3 line. General information about 3.x development is available at Heritrix3. A major theme of the 3.x releases will be enabling adaptive and continuous revisit crawling at large scale. Upcoming work towards that goal will include:
- refactoring and possibly combining the internal uri-history and already-included data structures
- improving the flexibility of frontier queues (offering the possibility of long-lived timing queues and multiple queues per host/exclusion-grouping)
- enabling revisit of discovered URIs according to a swappable policy, which may take into account desired revisit intervals and detected URI change rates on previous visits
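To make the "swappable policy" idea concrete, here is a minimal sketch of what such a policy could look like. Every name in it (`RevisitPolicy`, `VisitHistory`, `FixedIntervalPolicy`, `ChangeRatePolicy`) is hypothetical and invented for illustration; none of them is an existing Heritrix class, and the linear interpolation is only one of many ways a detected change rate might be mapped to a revisit interval.

```java
import java.time.Duration;

// All types below are hypothetical illustrations, not Heritrix APIs.

/** What a crawler might remember about one URI between visits. */
final class VisitHistory {
    final int visits;   // completed fetches of this URI
    final int changes;  // fetches whose content differed from the previous fetch

    VisitHistory(int visits, int changes) {
        if (visits < 0 || changes < 0 || changes > visits) {
            throw new IllegalArgumentException("need 0 <= changes <= visits");
        }
        this.visits = visits;
        this.changes = changes;
    }
}

/** The swappable part: decides how long to wait before revisiting a URI. */
interface RevisitPolicy {
    Duration nextRevisitDelay(VisitHistory history);
}

/** Ignores history and always returns the operator's desired interval. */
final class FixedIntervalPolicy implements RevisitPolicy {
    private final Duration interval;

    FixedIntervalPolicy(Duration interval) {
        this.interval = interval;
    }

    @Override
    public Duration nextRevisitDelay(VisitHistory history) {
        return interval;
    }
}

/** Revisits frequently changing URIs sooner and stable URIs later. */
final class ChangeRatePolicy implements RevisitPolicy {
    private final Duration min;
    private final Duration max;

    ChangeRatePolicy(Duration min, Duration max) {
        if (min.compareTo(max) > 0) {
            throw new IllegalArgumentException("min must not exceed max");
        }
        this.min = min;
        this.max = max;
    }

    @Override
    public Duration nextRevisitDelay(VisitHistory history) {
        if (history.visits == 0) {
            return min; // no evidence yet, so revisit soon
        }
        double changeRate = (double) history.changes / history.visits;
        long span = max.toMillis() - min.toMillis();
        // Linear interpolation: change rate 1.0 maps to min, 0.0 maps to max.
        return Duration.ofMillis(max.toMillis() - Math.round(changeRate * span));
    }
}
```

Under this sketch, a frontier would hold a single `RevisitPolicy` reference and ask it for a delay after each fetch, so an operator could switch from fixed intervals to change-rate-driven intervals through configuration alone.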
Other areas of upcoming attention, though not yet scheduled for specific target releases, include:
- improving the usability and documentation of recently added features (duplication suppression, tunable prioritization) in typical operator workflows
- improving the automated test coverage with simulated crawling, especially for non-default feature configurations and longer/performance-intensive test runs
- better crawling of web video content with default configurations
- a web-services interface as an alternative to JMX for remote-control of the crawler
- new heuristics and knowledge-sharing for trap/spam reduction
- improved during-crawl queue-oriented reporting, including better predictions of completion times
- improved options for crawling access-controlled (password/other-auth) sites