Fetch Chain Processors
Processor Name | Description | Class Name
---|---|---
preparer | This processor prepares ACCEPTed URIs for enqueuing in the Frontier. It is run again to recheck the scope of URIs before fetching begins. |
preconditions | This processor verifies or triggers the fetching of prerequisite URIs. |
fetchDns | This processor fetches DNS URIs. |
fetchHttp | This processor fetches HTTP URIs. As of Heritrix 3.1, the crawler properly decodes the 'chunked' Transfer-Encoding, even when it is encountered where it should not be used, such as in a response to an HTTP/1.0 request. The fetchHttp processor also provides a 'useHTTP11' parameter which, if true, causes Heritrix to report its requests as 'HTTP/1.1', allowing sites to use the 'chunked' Transfer-Encoding. (The default for this parameter is currently false, and Heritrix still does not reuse a persistent connection for more than one request to a site.) See the configuration sketch after this table. |
extractorHttp | This processor extracts outlinks from HTTP headers. As of Heritrix 3.1, extractorHttp considers any URI on a hostname to imply that '/favicon.ico' should be fetched from the same host. Also as of Heritrix 3.1, the extractorHttp bean has an "inferRootPage" property; if it is true, Heritrix infers the '/' root page from any other URI on the same hostname. The default is false, which preserves the pre-3.1 behavior of fetching the root page only if it is a seed or is otherwise discovered and in scope. Discovery via these heuristics is a new 'I' (inferred) hop type and is treated the same as an 'E' (embed) in scoping/transclusion decisions. See the configuration sketch after this table. | org.archive.modules.extractor.ExtractorHTTP
extractorHtml | This processor extracts outlinks from HTML content. | org.archive.modules.extractor.ExtractorHTML
extractorCss | This processor extracts outlinks from CSS content. | org.archive.modules.extractor.ExtractorCSS
extractorJs | This processor extracts outlinks from JavaScript content. | org.archive.modules.extractor.ExtractorJS
extractorSwf | This processor extracts outlinks from Flash content. | org.archive.modules.extractor.ExtractorSWF
extractorPdf | This processor extracts outlinks from PDF content. | org.archive.modules.extractor.ExtractorPDF
extractorXml | This processor extracts outlinks from XML content. | org.archive.modules.extractor.ExtractorXML
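As a quick illustration of the two Heritrix 3.1 options described in the table, the sketch below shows how they might be set in a job's crawler-beans.cxml. It assumes the stock fetchHttp and extractorHttp bean ids and the class names commonly used in Heritrix 3 jobs; verify the bean ids and classes against your own configuration before copying.

```xml
<!-- Sketch only: report requests as HTTP/1.1 so sites may use the
     'chunked' Transfer-Encoding (defaults to false, per the table above). -->
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
  <property name="useHTTP11" value="true"/>
</bean>

<!-- Sketch only: infer the '/' root page from any URI seen on a host
     (defaults to false, preserving the pre-3.1 behavior). -->
<bean id="extractorHttp" class="org.archive.modules.extractor.ExtractorHTTP">
  <property name="inferRootPage" value="true"/>
</bean>
```

Both properties default to false, so omitting them keeps the pre-3.1 behavior described in the table.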
Most extractor processors are preconfigured in a job's crawler-beans.cxml configuration file under the "fetchProcessors" bean. To add a new extractor, such as an XML/RSS extractor, define the bean and then link it to the "fetchProcessors" bean. An example for the extractorXml bean is shown below.
Define the bean for the XML extractor:

```xml
<bean id="extractorXml" class="org.archive.modules.extractor.ExtractorXML"></bean>
```
Then link the "extractorXml" bean to the "fetchProcessors" bean:

```xml
<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list>
      <!-- re-check scope, if so enabled... -->
      <ref bean="preselector"/>
      <!-- ...then verify or trigger prerequisite URIs fetched, allow crawling... -->
      <ref bean="preconditions"/>
      <!-- ...fetch if DNS URI... -->
      <ref bean="fetchDns"/>
      <!-- <ref bean="fetchWhois"/> -->
      <!-- ...fetch if HTTP URI... -->
      <ref bean="fetchHttp"/>
      <!-- ...extract outlinks from HTTP headers... -->
      <ref bean="extractorHttp"/>
      <!-- ...extract outlinks from HTML content... -->
      <ref bean="extractorHtml"/>
      <!-- ************ ...extract outlinks from XML/RSS content.. ********** -->
      <ref bean="extractorXml"/>
      <!-- ...extract outlinks from CSS content... -->
      <ref bean="extractorCss"/>
      <!-- ...extract outlinks from Javascript content... -->
      <ref bean="extractorJs"/>
      <!-- ...extract outlinks from Flash content... -->
      <ref bean="extractorSwf"/>
    </list>
  </property>
</bean>
```