This page documents the configuration file for the unobtanium-crawler crawl subcommand, which is passed with the --config option.
Examples of working configuration files can be found in the unobtanium.rocks index configuration.
Available settings are:
name (optional)
- Human readable name of the index for documentation purposes.
description (optional)
- Human readable description of the index for documentation purposes. When writing rich text here, format it as Markdown.
database_file (optional)
- If given, the path of the crawler database file must end with the same path segments as given here; otherwise the crawler will throw an error. This is part of a safety mechanism to prevent mixing up multiple databases.
- The --ignore-db-name-from-config option can be used to ignore this setting.
default_delay_ms
- The default crawl delay in milliseconds between requests to the same site, if the site's robots.txt didn't request a different one. Setting this to 1000 to 2000 milliseconds is a sane default. A higher value is more polite, but crawling will take longer.
max_commands_per_run
- The number of crawling commands the crawler will execute before exiting automatically. A crawling command is roughly equivalent to one request. This stops the crawler from running too long. Setting this to 60 to 100 per indexed site is a sane default.
- The --max-commands option can be used to override this setting.
recrawl_interval
- How long the crawler will wait before scheduling a site that has already been crawled successfully to be crawled again, in case it was updated. A sane default would be 4 weeks, but it can be more or less depending on the use case. Guaranteed supported suffixes are second, minute, hour, day, week, month and year. Other suffixes might work but are not guaranteed to be supported long term.
- The --force-recrawl option can be used to ignore this setting.
user_agent (optional, but highly recommended)
- Configure the user agent that will be used while crawling. See the Crawler User-Agent and robots.txt page for more information.
- Can be overridden using the --user-agent option.
seeds
- List of URLs that are the entry points for the crawler. The crawler will start at these and follow links from there, provided they point to an origin that is part of one of these seed URLs. See the Crawling Loop algorithm page for more information.
- Can be overridden using the --schedule option.
do_not_crawl
- List of policies to limit which pages are crawled, on top of a site's own robots.txt.
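Putting the settings above together, a minimal configuration could look like the following sketch. It assumes the settings map directly to top-level TOML keys; all values, URLs and the exact spelling of the duration are illustrative assumptions, not recommendations. The policy tables described in the next section would be appended at the end of the file.

```toml
# Illustrative sketch of a crawl configuration; all values are examples.
name = "Example docs index"
description = "Index of the example.org documentation site."

# The crawler database file's path must end with these path segments.
database_file = "example-docs.db"

# 1000 to 2000 ms is a sane default delay between requests to the same site.
default_delay_ms = 1500

# Roughly 60 to 100 commands per indexed site; one site is indexed here.
max_commands_per_run = 100

# Duration value; the exact spelling ("4 weeks" vs "4week") may differ,
# see the list of supported suffixes above.
recrawl_interval = "4 weeks"

# See the Crawler User-Agent and robots.txt page for how to pick this.
user_agent = "example-crawler/0.1 (+https://example.org/about-the-crawler)"

# Entry points; only links within these origins are followed.
seeds = ["https://docs.example.org/"]

# [[do_not_crawl]] and other policy tables follow at the end of the file,
# see the Policies section below.
```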
Policies
Policies are lists of configuration objects. The canonical way to write them is to use the Array of Tables TOML syntax at the end of the file.
See the Crawling Loop algorithm page on the impact of crawl policies.
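As a sketch of that layout (entry fields mostly elided), each repeated table header appends one object to the corresponding policy list:

```toml
# Policy lists are written as repeated [[...]] tables after the top-level
# keys; each header below appends one entry to the do_not_crawl list.
# The remaining fields of an entry are omitted here for brevity.
[[do_not_crawl]]
reason = "First rule"

[[do_not_crawl]]
reason = "Second rule"
```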
do_not_crawl
The do_not_crawl policy can inhibit crawling when a Crawl Policy Criterium matches.
Settings for the do not crawl policy are:
reason
- The reason this policy exists, usually a human readable version of the rule.
if
- The serialised criterium chain carrying crawl policy criteria that will prevent crawling if matched.
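A single entry might look like the sketch below. The nested shape of the if value (a url table with a prefix field) and the URL itself are assumptions for illustration only; the actual serialisation of criterium chains is described in the Crawl Policy Criterium section and the pages linked from it.

```toml
# Hypothetical do_not_crawl entry. The shape of the `if` criterium chain,
# in particular the `prefix` field, is an assumption; see the
# Crawl Policy Criterium section for the real serialisation.
[[do_not_crawl]]
reason = "Shopping cart pages are not useful search results"
if = { url = { prefix = "https://shop.example.org/cart" } }
```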
url_query_parameters
This crawl policy can be used to allow a combination of URL query parameters when a Crawl Policy Criterium matches.
By default, URLs with query parameters are ignored, as they usually point to page variants that are not useful for search applications.
Not all of the allowed parameters have to be present. If additional parameters are present, that URL won't be crawled unless another rule allows that combination.
Rules do not combine their allow lists. Example: if one rule for a page only allows foo and another only allows bar, a URL containing both foo and bar parameters is still not allowed (see the sketch after the settings list below).
Settings for the URL query parameters policy are:
reason
- The reason this policy exists, usually an explanation of why this combination of query parameters is allowed.
allow
- List of URL query parameters to allow. (Example: allow = ["page"])
if
- The serialised criterium chain carrying crawl policy criteria that will activate this policy when matched.
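The sketch below shows two hypothetical entries for the same site and illustrates that allow lists do not combine. As before, the parameter names, URLs and the shape of the if criterium are assumptions for illustration only.

```toml
# Hypothetical entries; the `if` criterium shape and all values are assumptions.
# A URL carrying both `page` and `sort` parameters matches neither entry
# and is therefore still not crawled.
[[url_query_parameters]]
reason = "Pagination is needed to reach older articles"
allow = ["page"]
if = { url = { prefix = "https://blog.example.org/articles" } }

[[url_query_parameters]]
reason = "Sorting the article list is harmless"
allow = ["sort"]
if = { url = { prefix = "https://blog.example.org/articles" } }
```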
Crawl Policy Criterium
The crawl policy criterium can match page metadata that is available at the scheduling stage, before the request has happened.
The crawl policy criterium has the following variants:
url
- Takes a URL Criterium as argument.
- Matches if the URL of the page to be crawled matches the given URL criterium.
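For illustration, a criterium chain using the url variant might be written as in the fragment below. The field inside the URL criterium (prefix here) is an assumed placeholder; see the URL Criterium documentation for the real serialisation.

```toml
# Fragment of a policy entry: an `if` criterium chain using the `url`
# variant. The `prefix` field is an assumption standing in for whatever
# the URL Criterium actually supports.
if = { url = { prefix = "https://example.org/private/" } }
```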