This document describes how the unobtanium Crawler works: how it discovers new pages and deals with pages it already knows about.
Inputs:
- Crawler configuration
- Seed URLs
- Crawl Policies: `do_not_crawl`
- From Crawler Database:
  - Found links and recrawl information (Crawl Candidates)

Outputs:
- To Crawler Database:
  - Found links and recrawl information (Crawl Candidates)
  - Redirect information
  - File metadata
  - File content
The crawler splits itself up into one crawl loop per configured URL origin (OriginCrawler). This keeps the internals simpler, since the data for one site stays neatly together. Which origins are crawled is derived from the seed URLs: if a found link has the same origin as a configured seed, it is considered for crawling.
Crawling stops after a preconfigured number of crawling actions have happened or if there are no more URLs that need crawling.
Example: If the crawler is configured for `example.org` and finds a link to `example.com`, it won't crawl that link with its `example.org` crawler. If `example.com` also happens to be configured, the link will be crawled as part of that crawler; otherwise the link will not be used.
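For illustration, this origin check can be expressed with the `url` crate's `Origin` type. The function name and the surrounding setup are made up for this sketch and are not unobtanium's actual code:

```rust
use url::Url; // the `url` crate

/// Returns true if `link` shares an origin (scheme + host + port)
/// with at least one of the configured seed URLs.
fn is_crawlable_origin(seeds: &[Url], link: &Url) -> bool {
    seeds.iter().any(|seed| seed.origin() == link.origin())
}

fn main() {
    let seeds = vec![Url::parse("https://example.org/").unwrap()];
    let same = Url::parse("https://example.org/about").unwrap();
    let other = Url::parse("https://example.com/").unwrap();
    assert!(is_crawlable_origin(&seeds, &same));
    assert!(!is_crawlable_origin(&seeds, &other));
}
```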
Crawl Candidates
Crawl Candidates are all URLs that are considered for crawling, along with optional recrawl information; they are kept in the Crawler Database.
Initial crawl candidates are the seed URLs and candidates from the previous crawls already in the database.
Redirects
The crawler never follows redirects; it records that a URL redirected and marks the target as a Crawl Candidate.
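The redirect handling could look roughly like the following sketch. It uses the reqwest crate (with its blocking feature) purely as an illustration; unobtanium's actual HTTP client and types are not shown here:

```rust
use reqwest::blocking::{Client, Response};
use reqwest::header::LOCATION;
use reqwest::redirect::Policy;

/// Outcome of fetching a URL without following redirects.
enum FetchOutcome {
    /// The URL redirected; the target becomes a new Crawl Candidate.
    Redirect { target: String },
    /// A normal response that can be handed on to the scraper.
    Fetched(Response),
}

fn fetch_without_following(url: &str) -> Result<FetchOutcome, reqwest::Error> {
    // Policy::none() makes reqwest report redirects instead of following them.
    let client = Client::builder().redirect(Policy::none()).build()?;
    let resp = client.get(url).send()?;

    if resp.status().is_redirection() {
        // The Location header may be relative; a real crawler would resolve
        // it against the request URL before storing it.
        let target = resp
            .headers()
            .get(LOCATION)
            .and_then(|v| v.to_str().ok())
            .unwrap_or_default()
            .to_string();
        return Ok(FetchOutcome::Redirect { target });
    }
    Ok(FetchOutcome::Fetched(resp))
}
```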
Crawl inhibitors
A URL might be discarded from crawling for a number of reasons:
- The URL doesn't share an origin with one of the configured seeds.
- The URL is excluded by a `do_not_crawl` policy in the crawler configuration.
- The URL has a combination of query parameters not explicitly allowed by a `url_query_parameters` policy.
- The `robots.txt` file denies crawling.
- The URL has already been crawled and is not due for a recrawl.
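As a rough illustration of the two configuration policies, the following sketch models `do_not_crawl` as path prefixes and `url_query_parameters` as allowed sets of parameter names. Both interpretations are assumptions of this sketch, not the actual policy format:

```rust
use std::collections::HashSet;
use url::Url;

/// Deliberately simplified stand-in for the policy configuration;
/// unobtanium's real policy types and matching rules differ.
struct LocalPolicies {
    /// `do_not_crawl`: path prefixes that must never be crawled
    /// (prefix matching is an assumption of this sketch).
    do_not_crawl_prefixes: Vec<String>,
    /// `url_query_parameters`: the only parameter-name combinations
    /// that are explicitly allowed.
    allowed_query_combinations: Vec<HashSet<String>>,
}

impl LocalPolicies {
    fn denies(&self, url: &Url) -> bool {
        if self.do_not_crawl_prefixes.iter().any(|p| url.path().starts_with(p)) {
            return true;
        }
        if url.query().is_some() {
            // The *combination* of parameter names must match an allowed set.
            let params: HashSet<String> =
                url.query_pairs().map(|(k, _)| k.into_owned()).collect();
            if !self.allowed_query_combinations.contains(&params) {
                return true;
            }
        }
        false
    }
}
```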
Crawl results will be discarded if:
- The crawled resource marks itself as non-canonical; the canonical URL will be marked as a crawl candidate.
- The crawled resource marks itself as not wanting to be indexed (using `meta robots noindex`); the unobtanium crawler respects that.
- The page marks itself as not containing links that a crawler should follow (using `meta robots nofollow`); in this case links aren't saved to the database (see the sketch below).
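A crawl-time check for these `meta robots` directives could look like the following sketch, which uses the `scraper` crate for illustration; it is not unobtanium's actual scraping code, and the case-sensitive selector is a simplification:

```rust
use scraper::{Html, Selector};

/// Directives from `<meta name="robots" content="...">` that decide
/// whether a crawl result is kept and whether its links are saved.
struct RobotsMeta {
    noindex: bool,
    nofollow: bool,
}

fn robots_meta(html: &str) -> RobotsMeta {
    let doc = Html::parse_document(html);
    // The selector is a constant and known to be valid, so unwrap is fine.
    let sel = Selector::parse(r#"meta[name="robots"]"#).unwrap();
    let content: String = doc
        .select(&sel)
        .filter_map(|m| m.value().attr("content"))
        .collect::<Vec<_>>()
        .join(",")
        .to_ascii_lowercase();
    RobotsMeta {
        noindex: content.contains("noindex"),
        nofollow: content.contains("nofollow"),
    }
}

fn main() {
    let page = r#"<html><head><meta name="robots" content="noindex, nofollow"></head></html>"#;
    let meta = robots_meta(page);
    assert!(meta.noindex && meta.nofollow);
}
```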
The Crawl loop
All of the origin crawlers take turns at crawling. This way the crawler can go as fast as possible without concurrency while not hammering the crawled sites: the scheduler ensures a minimum delay between requests to a given origin.
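A minimal sketch of such a scheduler, assuming a made-up `TakesTurns` interface for the origin crawlers (unobtanium's real scheduler and types are not shown here):

```rust
use std::collections::VecDeque;
use std::thread;
use std::time::{Duration, Instant};

/// What the scheduler needs from each origin crawler; an illustrative
/// trait, not unobtanium's real interface.
trait TakesTurns {
    /// Perform one crawling action; return false once the crawler is finished.
    fn take_turn(&mut self) -> bool;
    /// Minimum delay before this origin may be contacted again.
    fn delay(&self) -> Duration;
}

/// Round-robin over all origin crawlers without concurrency: each origin
/// is only contacted again after its own delay, and the whole crawl stops
/// after `max_actions` actions or when every crawler has finished.
fn run_crawl(crawlers: Vec<Box<dyn TakesTurns>>, max_actions: usize) {
    let mut queue: VecDeque<(Instant, Box<dyn TakesTurns>)> =
        crawlers.into_iter().map(|c| (Instant::now(), c)).collect();
    let mut actions = 0;

    while actions < max_actions && !queue.is_empty() {
        let (next_allowed, mut crawler) = queue.pop_front().unwrap();
        // Simplified: sleep out the remaining per-origin delay; a smarter
        // scheduler would pick whichever origin is ready first.
        let now = Instant::now();
        if next_allowed > now {
            thread::sleep(next_allowed - now);
        }
        actions += 1;
        if crawler.take_turn() {
            // Not finished: queue the next turn after this origin's delay.
            queue.push_back((Instant::now() + crawler.delay(), crawler));
        }
    }
}
```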
The Origin Crawlers have the following state relevant for crawling:
- Their `robots.txt` information
- An expiry time for the `robots.txt` information
- A todo-list - initialized to the seed URLs
- A temporary ignore list in the database
- The time to wait between requests
- Patience counter - initialized to `5`
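Put together, that state might look roughly like the following struct; the field names and types are assumptions for this sketch, not the project's actual definitions:

```rust
use std::time::{Duration, Instant};
use url::Url;

/// Placeholder for whatever parsed robots.txt representation is used.
struct RobotsRules;

/// Per-origin crawler state, mirroring the list above.
struct OriginCrawlerState {
    /// Parsed robots.txt rules; None until fetched or if none exists.
    robots: Option<RobotsRules>,
    /// When the robots.txt information expires and must be refetched.
    robots_expiry: Instant,
    /// URLs queued for crawling; starts out as the seed URLs.
    todo: Vec<Url>,
    /// Minimum time to wait between requests to this origin.
    request_delay: Duration,
    /// Remaining tolerance for persistent errors; starts at 5.
    patience: u8,
    // The temporary ignore list is not held here: it lives in the
    // crawler database, as described above.
}
```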
Each turn the crawler does one of three things:
[1] Fetch the `robots.txt` file
If there is no `robots.txt` information yet, or the old information has expired (after 30 minutes, currently hard-coded), the origin crawler tries to fetch the `/robots.txt` file, parse it, and store it. If no `robots.txt` is found, the crawler assumes that it is okay to crawl.
If the `robots.txt` contains a crawl delay, the time to wait between requests is set to that value.
Note: This is implemented by the `DomainInformationLibrary` struct.
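A simplified version of this step, using reqwest (blocking feature) and a very rough crawl-delay extraction instead of a full robots.txt parser; it is a sketch, not the `DomainInformationLibrary` implementation:

```rust
use std::time::Duration;

/// Simplified result of a robots.txt fetch: the raw rules (if any) and
/// an optional crawl delay.
struct RobotsInfo {
    body: Option<String>,
    crawl_delay: Option<Duration>,
}

fn fetch_robots_txt(origin: &str) -> Result<RobotsInfo, reqwest::Error> {
    let resp = reqwest::blocking::get(format!("{origin}/robots.txt"))?;
    if !resp.status().is_success() {
        // No robots.txt (or an error page): assume crawling is allowed.
        return Ok(RobotsInfo { body: None, crawl_delay: None });
    }
    let body = resp.text()?;
    // Very rough, case-sensitive Crawl-delay extraction; a real parser
    // also has to respect user-agent sections and Allow/Disallow rules.
    let crawl_delay = body
        .lines()
        .filter_map(|l| l.trim().strip_prefix("Crawl-delay:"))
        .filter_map(|v| v.trim().parse::<u64>().ok())
        .map(Duration::from_secs)
        .next();
    Ok(RobotsInfo { body: Some(body), crawl_delay })
}
```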
[2] Fill the todo list
If the todo-list is empty the crawler will query crawl-candidates from the crawl database that:
- match the crawler's assigned origin
- have either never been crawled or are due for a recrawl
- are not on the temporary ignore list
All URLs that have passed through this crawling stage are saved to a temporary ignore-list handled by the database, to ensure no URL gets considered twice. This means that even if the recrawl interval is set very low, the crawler "finishes" a site and only recrawls it on its next invocation; it also allows the crawler to recognize when it is finished.
If the database doesn't have any URLs left for crawling, the origin crawler signals that it has finished and gets removed from the scheduler.
It applies the `do_not_crawl` policies; if a URL is denied by such a policy, the crawler logs it as a crawl with the `BLOCKED_URL_BY_LOCAL_POLICY` exit code.
If the URL has a query part, the `url_query_parameters` policies are evaluated. If no applicable policy allows the combination of URL parameters, the crawler also logs the crawl as `BLOCKED_URL_BY_LOCAL_POLICY`.
It then checks the `robots.txt` rules; if the URL shouldn't be crawled, it gets logged as a crawl with the `BLOCKED_BY_ROBOTS_TXT` exit code.
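The whole refill step could be sketched like this. The database and robots interfaces are stand-ins invented for the sketch, the policy check stands in for the `do_not_crawl` / `url_query_parameters` evaluation above, and only the exit-code names come from this document:

```rust
use url::{Origin, Url};

/// Exit codes named in this document; wrapping them in a Rust enum is
/// just for the sketch.
enum CrawlExitCode {
    BlockedUrlByLocalPolicy, // BLOCKED_URL_BY_LOCAL_POLICY
    BlockedByRobotsTxt,      // BLOCKED_BY_ROBOTS_TXT
}

/// Stand-in interface for the crawl database.
trait CrawlDatabase {
    /// Candidates for `origin` that were never crawled or are due for a
    /// recrawl, excluding everything on the temporary ignore list.
    fn due_candidates_for(&mut self, origin: &Origin) -> Vec<Url>;
    /// Put URLs on the temporary ignore list for the rest of this run.
    fn temporarily_ignore(&mut self, urls: &[Url]);
    /// Record a crawl attempt and its exit code.
    fn log_crawl(&mut self, url: &Url, code: CrawlExitCode);
}

/// Stand-in interface for the parsed robots.txt rules.
trait Robots {
    fn allows(&self, url: &Url) -> bool;
}

fn fill_todo_list(
    db: &mut dyn CrawlDatabase,
    robots: &dyn Robots,
    origin: &Origin,
    policy_denies: impl Fn(&Url) -> bool,
) -> Vec<Url> {
    let candidates = db.due_candidates_for(origin);
    // Ignoring every candidate up front ensures no URL is considered
    // twice in one run, even with a very short recrawl interval.
    db.temporarily_ignore(&candidates);

    let mut todo = Vec::new();
    for url in candidates {
        if policy_denies(&url) {
            db.log_crawl(&url, CrawlExitCode::BlockedUrlByLocalPolicy);
        } else if !robots.allows(&url) {
            db.log_crawl(&url, CrawlExitCode::BlockedByRobotsTxt);
        } else {
            todo.push(url);
        }
    }
    todo
}
```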
[3] Crawl a URL
If there is a URL on the todo-list, it is taken off and handed to the scraper for fetching and crawl-time scraping. This extracts URLs relevant for the crawler, such as from links (HTML `a` elements), redirects, and canonical URLs (in case a site marks itself as not canonical), and marks them as crawl candidates.
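Link extraction at crawl time might look like this sketch, again using the `scraper` and `url` crates for illustration only:

```rust
use scraper::{Html, Selector};
use url::Url;

/// Collect absolute link targets from `a href` elements; relative links
/// are resolved against the page's own URL. (Sketch only: the real
/// crawl-time scraper also picks up redirects and canonical URLs.)
fn extract_links(page_url: &Url, html: &str) -> Vec<Url> {
    let doc = Html::parse_document(html);
    let sel = Selector::parse("a[href]").unwrap();
    doc.select(&sel)
        .filter_map(|a| a.value().attr("href"))
        .filter_map(|href| page_url.join(href).ok())
        .collect()
}
```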
After that has happened, the `CrawlCandidate` is updated with the information needed for a later recrawl to recognize whether the resource changed since the last crawl.
If the exit code indicates that the crawl has been rate-limited, the time to wait between requests is increased and the URL is un-ignored so that it can be recrawled in the same run.
If the error seems like a temporary condition that might have been caused by the network, the URL is un-ignored so that it can be recrawled in the same run.
If the condition seems like it could persist, the patience counter is decreased by one. When the counter reaches zero, the origin crawler signals that it is "finished", because crawling with a broken connection, an otherwise offline site, or a server that speaks gibberish makes little sense.
If a database error is encountered the patience counter is set to zero immediately.
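The handling of one crawl outcome could be sketched as follows. The outcome categories, the surrounding struct, and the doubling of the delay are assumptions of this sketch; the document only states that the delay is increased:

```rust
use std::time::Duration;

/// Rough categories of crawl outcomes as described above.
enum CrawlOutcome {
    Success,
    RateLimited,
    TemporaryError,  // e.g. a transient network problem
    PersistentError, // e.g. broken connection or a server speaking gibberish
    DatabaseError,
}

struct OriginCrawler {
    request_delay: Duration,
    patience: u8,
    finished: bool,
}

impl OriginCrawler {
    /// `un_ignore` stands in for removing the URL from the temporary
    /// ignore list in the database so it can be retried in the same run.
    fn handle_outcome(&mut self, outcome: CrawlOutcome, mut un_ignore: impl FnMut()) {
        match outcome {
            CrawlOutcome::Success => {}
            CrawlOutcome::RateLimited => {
                // Back off (doubling is arbitrary here) and retry later this run.
                self.request_delay *= 2;
                un_ignore();
            }
            CrawlOutcome::TemporaryError => un_ignore(),
            CrawlOutcome::PersistentError => {
                self.patience = self.patience.saturating_sub(1);
                if self.patience == 0 {
                    self.finished = true; // give up on this origin
                }
            }
            CrawlOutcome::DatabaseError => {
                self.patience = 0;
                self.finished = true;
            }
        }
    }
}
```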