This page documents the configuration file for the `unobtanium-crawler crawl` subcommand, which is given with the `--config` option.
Examples of working configuration files can be found in the unobtanium.rocks index configuration.
Available settings are:
- `name` (optional) - Human-readable name of the index, for documentation purposes.
- `description` (optional) - Human-readable description of the index, for documentation purposes. When writing rich text here, format it as Markdown.
- `database_file` (optional) - If given, the path of the crawler database file must end with the same path segments as given here, otherwise the crawler will throw an error. This is part of a safety mechanism to prevent mixing up multiple databases.
  - The `--ignore-db-name-from-config` option can be used to ignore this setting.
- `default_delay_ms` - The default crawl delay in milliseconds between requests to the same site, used if the site's `robots.txt` didn't request a different one. A value of `1000` to `2000` milliseconds is a sane default. A higher value is more polite, but crawling will take longer.
- `max_commands_per_run` - The number of crawling commands the crawler will execute before exiting automatically. A crawling command is roughly equivalent to one request. This stops the crawler from running too long. `60` to `100` per indexed site is a sane default.
  - The `--max-commands` option can be used to override this setting.
- `recrawl_interval` - How long the crawler will wait until it schedules a site that has already been crawled successfully to be crawled again, in case it was updated. A sane default is `4 weeks`, but it can be more or less depending on the use case. Guaranteed supported suffixes are `second`, `minute`, `hour`, `day`, `week`, `month` and `year`. Other suffixes might work but are not guaranteed to be supported long term.
  - The `--force-recrawl` option can be used to ignore this setting.
- `user_agent` (optional, but highly recommended) - Configures the user agent that will be used while crawling. See the Crawler User-Agent and robots.txt page for more information.
  - Can be overridden using the `--user-agent` option.
- `seeds` - List of URLs that are the entry points for the crawler. The crawler will start at these and follow links from there, provided they point to an origin that is part of one of these seed URLs. See the Crawling Loop algorithm page for more information.
  - Can be overridden using the `--schedule` option.
- `do_not_crawl` - List of policies to limit which pages are crawled, on top of a site's own `robots.txt`.
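For illustration only, a minimal configuration using these settings could look like the following sketch. All values are placeholders, and the exact value formats (for example, whether `recrawl_interval` is written as a string with a unit suffix) should be checked against the linked example configurations.

```toml
name = "Example index"
description = "Index of the *Example project* documentation sites."
database_file = "example-index.db"

# Politeness and run limits
default_delay_ms = 2000
max_commands_per_run = 200
recrawl_interval = "4 weeks"

# See the Crawler User-Agent and robots.txt page
user_agent = "example-crawler/0.1 (+https://example.org/crawler-info)"

# Entry points; links are only followed within the origins of these URLs
seeds = [
    "https://docs.example.org/",
    "https://blog.example.org/",
]

# do_not_crawl policies are written as [[do_not_crawl]] tables, see Policies below.
```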
Policies
Policies are lists of configuration objects. The canonical way to write them is to use the Array of Tables TOML syntax at the end of the file.
See the Crawling Loop algorithm page on the impact of crawl policies.
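As a structural sketch, each policy object gets its own `[[...]]` table appended after the top-level settings; the fields inside each table depend on the policy type and are described below.

```toml
# Two entries in the do_not_crawl policy list, written as a TOML Array of Tables.
[[do_not_crawl]]
# fields of the first policy object ...

[[do_not_crawl]]
# fields of the second policy object ...
```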
do_not_crawl
The `do_not_crawl` policy can inhibit crawling when a Crawl Policy Criterium matches.
Settings for the `do_not_crawl` policy are:
- `reason` - Documents the reason this policy exists, usually a human-readable version of the rule.
- `if` - The serialised criterium chain carrying crawl policy criteria that will prevent crawling if matched. (Example: `if.url.path.has_prefix = "login"`)
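Putting these together, a single `do_not_crawl` entry might look like the following sketch; the `reason` text is a placeholder, while the `if.url.path.has_prefix` criterium is the example given above.

```toml
[[do_not_crawl]]
# Human-readable version of the rule, for documentation purposes.
reason = "Login pages are not useful in search results"
# Criterium chain: matches when the URL path of the page to be crawled
# starts with "login".
if.url.path.has_prefix = "login"
```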
url_query_parameters
This crawl policy can be used to allow a combination of URL query parameters when a Crawl Policy Criterium matches.
By default, URLs with query parameters are ignored, as they usually point to page variants that are not useful for search applications.
Not all of the allowed parameters have to be present. If additional parameters are present, that URL won't be crawled unless another rule allows that combination.
Rules do not combine their allow lists. Example: if one rule for a page only allows `foo` and another only allows `bar`, a URL containing both the `foo` and `bar` parameters is still not allowed.
Settings for the URL query parameters policy are:
- `reason` - The reason this policy exists, usually a short explanation of why this combination of query parameters is allowed.
- `allow` - List of URL query parameters to allow. (Example: `allow = ["page"]`)
- `if` - The serialised criterium chain carrying crawl policy criteria that will activate this policy when matched.
Crawl Policy Criterium
The crawl policy criterium can match page metadata that is available before the request happens, i.e. at the scheduling stage.
The crawl policy criterium has the following variants:
- `url` - Takes a URL Criterium as argument.
  - Matches if the URL of the page to be crawled matches the given URL criterium.
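For illustration, the serialised criterium chain under a policy's `if` key can be written either with TOML dotted keys, as in the examples above, or as nested inline tables; both are standard TOML spellings of the same nested structure. The `path.has_prefix` URL criterium is the form shown earlier on this page.

```toml
# Dotted-key form of the url variant, as used in the examples above.
[[do_not_crawl]]
reason = "Example written with dotted keys"
if.url.path.has_prefix = "login"

# Equivalent nested inline-table form of the same criterium chain.
[[do_not_crawl]]
reason = "Example written with an inline table"
if = { url = { path = { has_prefix = "login" } } }
```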