This repository contains a stand-alone library for parsing robots.txt files and applying the robots exclusion protocol.
The robots.txt file format is part of the Robots Exclusion Standard.
The protocol is a convention rather than a formally agreed standard, and as such implementations vary considerably. To formalise our approach, we decided to adopt Google's interpretation of the
protocol, which is publicly available here: Google Robots.txt Specifications.
Example robots.txt file
User-Agent: *
Disallow: /harming/humans
Disallow: /ignoring/human/orders
Disallow: /harm/to/self
(from http://www.last.fm/robots.txt)
This module handles the entire robots.txt processing pipeline: from identifying and downloading the robots.txt URI, to applying path exclusion directives. The following sub-sections offer a brief description of all tasks handled by the module.
Given the URI of some resource we wish to crawl, identify the URI of the associated robots.txt file. For example, http://example.com/path/to/resource.html is associated with http://example.com:80/robots.txt. Every resource has at most one associated robots.txt, found at the root path for the resource's scheme, host, and port. See Google Robots.txt Specifications for further details.
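The mapping can be pictured with java.net.URI (a minimal sketch, not the library's internal code): resolving the absolute path /robots.txt against the resource URI yields the robots.txt URI for that resource's scheme, host, and port.

URI resource = URI.create("http://example.com/path/to/resource.html");
URI robotsUri = resource.resolve("/robots.txt");
// robotsUri is now http://example.com/robots.txt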
Once a robots.txt has been identified, the file is downloaded using an appropriate protocol. If an I/O error occurs, we interpret that as an allow directive for the resource being queried. We also constrain the maximum download size.
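As a rough illustration of that policy (a sketch only, not the library's API; downloadRobots and MAX_FILE_SIZE_BYTES are made-up names), the read can be capped with Guava's ByteStreams and an I/O failure mapped to a null result that the caller treats as "allow":

// Requires com.google.common.io.ByteStreams, java.io.*, and java.net.URL.
private static final int MAX_FILE_SIZE_BYTES = 192 * 1024;

byte[] downloadRobots(URL robotsUrl) {
    InputStream in = null;
    try {
        in = robotsUrl.openStream();
        // Read at most MAX_FILE_SIZE_BYTES bytes of the robots.txt payload.
        return ByteStreams.toByteArray(ByteStreams.limit(in, MAX_FILE_SIZE_BYTES));
    } catch (IOException e) {
        return null; // an I/O failure is treated as "allow" for the queried resource
    } finally {
        if (in != null) {
            try { in.close(); } catch (IOException ignored) { }
        }
    }
}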
Resources are parsed using a fast LL(1) top-down parser, built with JavaCC. The process is relatively forgiving, allowing undefined field names, though it can still fail if something entirely unexpected happens.
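For example, a file containing a field name the parser does not recognise is still accepted; the unfamiliar line (Unknown-directive below is made up for illustration) is tolerated rather than causing the whole parse to fail:

User-Agent: *
Unknown-directive: some value
Disallow: /private/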
For obvious reasons, we don't want to re-acquire the robots.txt file for every query, so we cache results for a pre-defined period (currently 2 days). The cache is also size-limited, so memory usage is bounded.
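The caching behaviour can be sketched with Guava's CacheBuilder (illustrative only; the library's internal cache may differ, though the numbers mirror the configuration example shown later):

// Requires com.google.common.cache.Cache, CacheBuilder, and java.util.concurrent.TimeUnit.
Cache<URI, String> robotsCache = CacheBuilder.newBuilder()
        .expireAfterWrite(48, TimeUnit.HOURS) // 2-day expiry
        .maximumSize(10000)                   // bound memory usage
        .build();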
The robots.txt file can contain user-agent and path expressions, which may include wildcards and other non-literal syntax. The module translates these expressions into efficient regular expressions.
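A simplified translation might look like the sketch below (illustrative only, not the library's actual translator): '*' becomes '.*', a trailing '$' anchors the end of the path, and everything else is quoted literally.

// Requires java.util.regex.Pattern.
static Pattern toPattern(String pathExpression) {
    String expr = pathExpression;
    boolean anchored = expr.endsWith("$");
    if (anchored) {
        expr = expr.substring(0, expr.length() - 1);
    }
    StringBuilder regex = new StringBuilder("^"); // match from the start of the path
    for (String literal : expr.split("\\*", -1)) {
        regex.append(Pattern.quote(literal)).append(".*");
    }
    regex.setLength(regex.length() - 2); // drop the trailing ".*"
    if (anchored) {
        regex.append("$");
    }
    return Pattern.compile(regex.toString());
}

// e.g. toPattern("/harming/*").matcher("/harming/humans").lookingAt() returns true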
The robots file can contain multiple agent groups, defined by one or more user-agent directives. The module finds the group that best matches our crawler's user-agent string. If multiple groups match, we choose the most precise match (the longest matching expression).
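The group selection step can be illustrated with a small sketch (not the library's real matcher; a simple case-insensitive substring test stands in for user-agent expression matching):

// Requires java.util.Arrays and java.util.List.
List<String> agentExpressions = Arrays.asList("*", "magpie-crawler", "magpie-crawler/1.1");
String crawlerAgent = "magpie-crawler/1.1 (U; Linux x86_64; en-GB)";
String bestMatch = "";
for (String expression : agentExpressions) {
    boolean matches = "*".equals(expression)
            || crawlerAgent.toLowerCase().contains(expression.toLowerCase());
    if (matches && expression.length() > bestMatch.length()) {
        bestMatch = expression; // prefer the longest (most precise) matching expression
    }
}
// bestMatch is now "magpie-crawler/1.1"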
Once we have a group, we match all of its path expressions against the queried resource URI. The matching happens in the order presented in the file, and stops on the first match. If the first match is an allow directive, the queried resource is allowed; otherwise, the resource is disallowed. If no path expressions match, then the resource is allowed.
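The evaluation order can be sketched like this (illustrative only; simple path prefixes paired with an allow flag stand in for the library's internal directive model):

// Requires java.util.List and java.util.Map.
boolean isAllowed(List<Map.Entry<String, Boolean>> directivesInFileOrder, String path) {
    for (Map.Entry<String, Boolean> directive : directivesInFileOrder) {
        if (path.startsWith(directive.getKey())) {
            return directive.getValue(); // first matching directive wins
        }
    }
    return true; // no directive matched: the resource is allowed
}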
In addition to the standard exclusion rules, robots.txt files sometimes contain non-standard directives. These include sitemap URIs and crawl-delay instructions. The module parses these directives and holds them in the internal model, but does nothing more. If we wish to make use of these directives in the future, it will be trivial to extend the module's functionality to do so.
- Oracle JDK 1.6
- Maven 3
Production Dependencies:
- Google Guava 14
- SLF4J 1.7
Testing Dependencies:
- JUnit 4.11
- Mockito 1.9.5
- Hamcrest 1.3
To build the artifact, simply run Maven:
mvn clean install
Note that the library contains a JavaCC parser, which must be generated before normal compilation. Maven will handle this auto-magically, but your chosen IDE will most likely flounder. Consult Google.
Include the library dependency in your Maven pom.xml:
<dependency>
    <groupId>com.brandwatch</groupId>
    <artifactId>robots-core</artifactId>
    <version>1.1.0</version>
</dependency>
Configure the service using an instance of RobotsConfig, then create a RobotsService instance from the config object. RobotsService has exactly one method, isAllowed, which takes a crawler agent string and a resource URI, and returns true if the given crawler is allowed to access the given resource.
RobotsConfig config = new RobotsConfig();
config.setCachedExpiresHours(48);
config.setCacheMaxSizeRecords(10000);
config.setMaxFileSizeBytes(192 * 1024);
RobotsFactory factory = new RobotsFactory(config);
RobotsService service = factory.createService();
String crawlerAgent = "magpie-crawler/1.1 (U; Linux x86_64; en-GB; +http://www.brandwatch.net)";
...
URI resource = URI.create("https://example.com/path/to/some/file.html");
if(service.isAllowed(crawlerAgent, resource)) {
// crawl away!
}
The codebase includes a command-line interface for querying resource exclusions. The CLI lives in a separate module (robots-cli), which is not part of the core library.
Currently there is no CLI distribution, so you'll need to build it from source yourself.
git clone [email protected]:BrandwatchLtd/robots.git
cd robots
mvn clean package
cd cli/target
tar xvfz robots-cli-[version]-bin-with-deps.tar.gz
cd robots-cli-[version]
From the extracted directory, simply run the robots script followed by a resource URI:
./robots http://last.fm/harming/humans
http://last.fm/harming/humans: disallowed
You can query multiple resources at once:
./robots http://www.brandwatch.com/index.html https://app.brandwatch.com/index.html http://www.brandwatch.com/wp-admin/
http://www.brandwatch.com/index.html: allowed
https://app.brandwatch.com/index.html: disallowed
http://www.brandwatch.com/wp-admin/: disallowed
There are also a bunch of parameters you can tweak:
./robots --agent "iisbot/1.0 (+http://www.iis.net/iisbot.html)" http://www.brandwatch.com/wp-admin/
http://www.brandwatch.com/wp-admin/: allowed
For a full list of parameters, see the helpful help:
$ ./robots --help
Usage: robots [options] RESOURCES
Options:
--agent, -a
User agent identifier. Sent to the host on retrieval of robots.txt, and
also used for directive group matching.
Default: <unnamed-agent>
--defaultCharset, --charset, -c
Preferred character encoding for reading robots.txt. Used when server
doesn't specify encoding.
Default: UTF-8
--help, -h
Display this helpful message.
Default: false
--maxFileSizeBytes, -s
Download size limit. robots.txt retrieval will give up beyond this point.
Default: 196608
--maxRedirectHops, -r
Number of HTTP 3XX (redirection) responses to follow before giving up.
Default: 5
--readTimeout, -t
Time in millis before the client times out while downloading a response payload.
Default: 30000