Manual: Crawler crawl configuration

This page documents the configuration file for the unobtanium-crawler crawl subcommand, which is passed with the --config option.

Examples of working configuration files can be found in the unobtanium.rocks index configuration.

Available settings are:

name (optional)
Human readable name of the index for documentation purposes
description (optional)
Human readable description of the index for documentation purposes. When writing rich text here, format it as Markdown.
database_file (optional)
If given, the crawler database file's path must end with the same path segments as given here; otherwise the crawler will throw an error. This is part of a safety mechanism that prevents mixing up multiple databases.
The --ignore-db-name-from-config option can be used to ignore this setting.
default_delay_ms
The default crawl delay in milliseconds between requests to the same site, used if the site's robots.txt doesn't request a different one. A value between 1000 and 2000 milliseconds is a sane default. Higher values are more polite, but crawling will take longer.
max_commands_per_run
The number of crawling commands the crawler will execute before exiting automatically. A crawling command is roughly equivalent to one request. This stops the crawler from running for too long. Setting this to 60 to 100 commands per indexed site is a sane default.
The --max-commands option can be used to override this setting.
recrawl_interval
How long the crawler waits before scheduling a successfully crawled site to be crawled again in case it was updated. A sane default is 4 weeks, but it can be more or less depending on the use case. Guaranteed supported suffixes are second, minute, hour, day, week, month, year. Other suffixes might work but are not guaranteed to be supported long term.
The --force-recrawl option can be used to ignore this setting.
user_agent (optional, but highly recommended)
Configure the user agent that will be used while crawling. See the Crawler User-Agent and robots.txt page for more information.
Can be overridden using the --user-agent option.
seeds
List of URLs that are the entry points for the crawler. The crawler starts at these and follows links from there, provided they point to an origin that is part of one of these seed URLs. See the Crawling Loop algorithm page for more information.
Can be overridden using the --schedule option.
do_not_crawl
List of policies that limit which pages are crawled, on top of a site's own robots.txt.
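
Putting these settings together, a minimal configuration file could look like the following sketch. All names, values, and URLs are illustrative, and the exact duration syntax for recrawl_interval is an assumption; adapt everything to the index being built.

    # Illustrative values only; adjust to your own index.
    name = "Example index"
    description = "Small demo index covering two documentation sites."
    database_file = "example-index.db"
    default_delay_ms = 1500
    max_commands_per_run = 200
    recrawl_interval = "4 weeks"
    user_agent = "example-crawler/0.1 (+https://example.org/about-the-crawler)"
    seeds = [
        "https://example.org/",
        "https://example.com/docs/",
    ]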

Policies

Policies are lists of configuration objects. The canonical way to write them is to use the Array of Tables TOML syntax at the end of the file.
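
In TOML, Array of Tables syntax repeats a double-bracketed header once per list entry, so appending policies at the end of the file looks roughly like this (values are placeholders):

    [[do_not_crawl]]
    reason = "..."
    # further settings of the first entry

    [[do_not_crawl]]
    reason = "..."
    # a second entry of the same policy list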

See the Crawling Loop algorithm page on the impact of crawl policies.

do_not_crawl

The do_not_crawl Policy can inhibit crawling when a Crawl Policy Criterium matches.

Settings for the do not crawl policy are:

reason
The reason this policy exists, usually a human readable version of the rule.
if
The serialised criterium chain carrying crawl policy criteria that will prevent crawling if matched.
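
As a sketch, a single do_not_crawl entry could look like this. The keys reason and if are taken from the list above; the shape of the criterium chain under if is an assumption and only hinted at with a placeholder, see the Crawl Policy Criterium section below.

    [[do_not_crawl]]
    reason = "Calendar month views lead to an effectively endless number of pages"
    # Assumed shape: a criterium chain using the url variant with a placeholder
    # URL criterium value.
    if = { url = "..." }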

url_query_parameters

This crawl policy can be used to allow a combination of URL query parameters when a Crawl Policy Criterium matches.

By default, URLs with query parameters are ignored, as they usually point to page variants that are not useful for search applications.

Not all of the allowed parameters have to be present. If additional parameters are present, that URL won't be crawled unless another rule allows that combination.

Rules do not combine their allow lists. For example, if one rule for a page only allows foo and another only allows bar, a URL containing both the foo and bar parameters is still not allowed (see the sketch after the settings below).

Settings for the URL query parameters policy are:

reason
The reason this policy exists, usually why this combination of query parameters is allowed.
allow
List of URL query parameters to allow. (Example: allow = ["page"])
if
The serialised criterium chain carrying crawl policy criteria that will activate this policy when matched.
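
The sketch below shows two url_query_parameters entries for a hypothetical site. As described above, their allow lists do not combine: a URL carrying both page and lang would not be allowed by these two rules alone. The shape of the if value is again only a placeholder.

    [[url_query_parameters]]
    reason = "The page parameter selects further result pages worth indexing"
    allow = ["page"]
    if = { url = "..." }  # placeholder criterium chain

    [[url_query_parameters]]
    reason = "The lang parameter selects the language variant of a page"
    allow = ["lang"]
    if = { url = "..." }  # placeholder criterium chain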

Crawl Policy Criterium

The crawl policy criterium can match page metadata that is available at the scheduling stage, before the request happens.

The crawl policy criterium has the following variants:

url
Takes a URL Criterium as argument.
Matches if the URL of the page to be crawled matches the given URL criterium.
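
For illustration, and assuming the variant name is used as the key in the serialised criterium chain, an if setting using the url variant might be written like this; the URL Criterium value itself is documented separately and only shown as a placeholder here.

    # Hypothetical serialisation; consult the URL Criterium documentation for
    # the actual value syntax.
    if = { url = "..." }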