Algorithm: Crawl Loop

This document describes how the unobtanium crawler works: how it discovers new pages and how it deals with pages it already knows about.

Inputs:

Outputs:

The crawler splits itself up into one crawl loop per configured URL origin (OriginCrawler). This keeps the internals simpler, as the data for one site is neatly kept together. Which origins are crawled is derived from the seed URLs: if a found link has the same origin as a configured seed, it will be considered for crawling.

Crawling stops after a preconfigured number of crawl actions have been performed or when there are no more URLs that need crawling.

Example: If the crawler is configured for example.org and finds a link to example.com, it won't crawl that link with its example.org crawler. If example.com also happens to be configured, the link will be crawled as part of that origin; otherwise it will not be used.
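As a rough sketch of that origin check in Rust (the function name and the use of the url crate are assumptions made for this illustration, not taken from the actual code base):

    use url::Url;

    /// A found link is only considered if its origin (scheme + host + port)
    /// matches the origin of one of the configured seed URLs.
    fn has_configured_origin(link: &Url, seeds: &[Url]) -> bool {
        seeds.iter().any(|seed| seed.origin() == link.origin())
    }

    fn main() {
        let seeds = vec![Url::parse("https://example.org/").unwrap()];
        let found = Url::parse("https://example.com/page").unwrap();
        // example.com does not match any configured seed origin, so the link is not used.
        assert!(!has_configured_origin(&found, &seeds));
    }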

Crawl Candidates

Crawl Candidates are all URLs that are considered for crawling, along with optional recrawl information. They are kept in the Crawler Database.

Initial crawl candidates are the seed URLs and the candidates from previous crawls that are already in the database.
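A crawl candidate record could look roughly like the following; the field names are illustrative guesses (including the validators used as recrawl information), not the actual Crawler Database schema:

    use std::time::SystemTime;

    // Illustrative shape of a crawl candidate; only `url` is required,
    // the recrawl information is optional and filled in after a crawl.
    struct CrawlCandidate {
        url: String,
        // Recrawl information (all optional): when the URL was last crawled and
        // validators that let the recrawl detect whether the resource changed.
        last_crawled_at: Option<SystemTime>,
        etag: Option<String>,
        last_modified: Option<String>,
    }

    fn main() {
        // A freshly discovered URL becomes a candidate without recrawl information.
        let candidate = CrawlCandidate {
            url: "https://example.org/".to_string(),
            last_crawled_at: None,
            etag: None,
            last_modified: None,
        };
        println!("candidate: {}", candidate.url);
    }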

Redirects

The crawler never follows redirects. It saves the fact that a URL redirected and marks the target as a Crawl Candidate.
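A minimal sketch of that rule (the helper functions are hypothetical stubs, not unobtanium APIs):

    /// Hypothetical stub: record that the URL redirected.
    fn log_redirect(status: u16, from: &str, to: &str) {
        println!("{from} redirected ({status}) to {to}");
    }

    /// Hypothetical stub: mark the target as a Crawl Candidate.
    fn add_crawl_candidate(url: &str) {
        println!("new crawl candidate: {url}");
    }

    /// The redirect itself is never followed.
    fn handle_redirect(status: u16, from: &str, location: Option<&str>) {
        if (300..400).contains(&status) {
            if let Some(target) = location {
                log_redirect(status, from, target);
                add_crawl_candidate(target);
            }
        }
    }

    fn main() {
        handle_redirect(301, "https://example.org/old", Some("https://example.org/new"));
    }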

Crawl inhibitors

A URL might be discarded from crawling for a number of reasons:

Crawl results will be discarded if:

The Crawl loop

All of the origin crawlers take turns at crawling. This way the crawler can go as fast as possible without concurrency while not hammering the crawled sites; the scheduler ensures a minimum delay between requests to a given origin.
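The turn-taking could be pictured roughly like this; the scheduler shown here (a round-robin queue with a per-origin earliest-next-request time) is a simplified stand-in for the real one:

    use std::collections::VecDeque;
    use std::thread::sleep;
    use std::time::{Duration, Instant};

    // Simplified stand-in for the scheduler: each origin crawler gets one
    // slot and the slots take turns; a slot is only served once its minimum
    // delay has passed.
    struct OriginSlot {
        origin: &'static str,
        delay: Duration,        // minimum delay between requests to this origin
        next_allowed: Instant,  // earliest time this origin may be requested again
    }

    fn main() {
        let now = Instant::now();
        let mut slots: VecDeque<OriginSlot> = VecDeque::from(vec![
            OriginSlot { origin: "https://example.org", delay: Duration::from_millis(500), next_allowed: now },
            OriginSlot { origin: "https://example.net", delay: Duration::from_millis(500), next_allowed: now },
        ]);

        // Four turns of the loop; a real run continues until the action
        // budget is used up or every origin crawler has finished.
        for _ in 0..4 {
            if let Some(mut slot) = slots.pop_front() {
                sleep(slot.next_allowed.saturating_duration_since(Instant::now()));
                println!("one crawl action for {}", slot.origin);
                slot.next_allowed = Instant::now() + slot.delay;
                slots.push_back(slot);
            }
        }
    }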

The Origin Crawlers have the following state relevant for crawling:

Each turn the crawler does one of three things:

[1] Fetch the robots.txt file

If there is no stored robots.txt file, or the stored one has expired (after 30 minutes, currently hard coded), the origin crawler tries to fetch the /robots.txt file, parse it and store it. If no robots.txt is found, the crawler assumes that it is okay to crawl.

If the robots.txt contains a crawl delay, the time to wait between requests is set to that value.

Note: This is implemented by the DomainInformationLibrary struct.
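A sketch of the refresh condition and the crawl-delay field; the type and field names here are invented for illustration and are not the actual DomainInformationLibrary internals:

    use std::time::{Duration, Instant};

    // 30 minutes, mirroring the currently hard-coded expiry described above.
    const ROBOTS_TTL: Duration = Duration::from_secs(30 * 60);

    struct StoredRobots {
        fetched_at: Instant,
        rules: Option<String>,          // None: no robots.txt found, crawling is assumed to be OK
        crawl_delay: Option<Duration>,  // used as the time to wait between requests, if present
    }

    /// The origin crawler fetches /robots.txt when nothing is stored yet or
    /// when the stored copy has expired.
    fn needs_refresh(stored: Option<&StoredRobots>) -> bool {
        match stored {
            None => true,
            Some(robots) => robots.fetched_at.elapsed() >= ROBOTS_TTL,
        }
    }

    fn main() {
        assert!(needs_refresh(None));
        let fresh = StoredRobots { fetched_at: Instant::now(), rules: None, crawl_delay: None };
        assert!(!needs_refresh(Some(&fresh)));
    }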

[2] Fill the todo list

If the todo-list is empty, the crawler will query crawl candidates from the crawl database that:

All URLs that have been through this crawling stage are saved to a temporary ignore-list, handled by the database, to ensure that no URL gets considered twice. This ensures that even if the recrawl interval is set very low, the crawler "finishes" a site and only recrawls it on its next invocation; it also allows the crawler to recognize when it is finished.

If the database doesn't have any URLs for crawling, the origin crawler signals that it has finished and gets removed from the scheduler.

It applies the do_not_crawl policies. If a URL is denied by such a policy, the crawler logs it as a crawl with the BLOCKED_URL_BY_LOCAL_POLICY exit code.

If the URL has a query part, the url_query_parameters policies are evaluated. If no applicable policy allows the combination of URL parameters, the crawler also logs the crawl as BLOCKED_URL_BY_LOCAL_POLICY.

It checks the robots.txt; if the URL shouldn't be crawled, it gets logged as a crawl with the BLOCKED_BY_ROBOTS_TXT exit code.
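Taken together, the checks in this step behave roughly like the pipeline below. The policy functions are placeholder stand-ins (the real do_not_crawl, url_query_parameters and robots.txt checks are configuration driven), and the exit codes are written as a Rust enum here while the text above uses their logged names:

    #[derive(Debug)]
    enum ExitCode {
        BlockedUrlByLocalPolicy, // logged as BLOCKED_URL_BY_LOCAL_POLICY
        BlockedByRobotsTxt,      // logged as BLOCKED_BY_ROBOTS_TXT
    }

    // Placeholder policies for the sketch only.
    fn denied_by_do_not_crawl(url: &str) -> bool { url.contains("/private/") }
    fn query_parameters_allowed(url: &str) -> bool { !url.contains('?') }
    fn allowed_by_robots_txt(url: &str) -> bool { !url.ends_with("/admin") }

    /// Returns Ok if the candidate may go on the todo-list, otherwise the
    /// exit code the crawl is logged with.
    fn filter_candidate(url: &str) -> Result<(), ExitCode> {
        if denied_by_do_not_crawl(url) {
            return Err(ExitCode::BlockedUrlByLocalPolicy);
        }
        if !query_parameters_allowed(url) {
            return Err(ExitCode::BlockedUrlByLocalPolicy);
        }
        if !allowed_by_robots_txt(url) {
            return Err(ExitCode::BlockedByRobotsTxt);
        }
        Ok(())
    }

    fn main() {
        for url in ["https://example.org/a", "https://example.org/a?utm=1", "https://example.org/admin"] {
            match filter_candidate(url) {
                Ok(()) => println!("{url}: put on the todo-list"),
                Err(code) => println!("{url}: logged as a crawl with {code:?}"),
            }
        }
    }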

[3] Crawl a URL

If there is a URL on the todo-list, it is taken off and handed to the scraper for fetching and crawl-time scraping. This extracts URLs relevant for the crawler, such as from links (HTML a elements), redirects and canonical URLs (in case a site marks itself as not canonical), and marks them as crawl candidates.
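For illustration, crawl-time link extraction could look like the snippet below, assuming the scraper crate as an external dependency; unobtanium's actual scraper is a separate component and also handles redirects and canonical URLs:

    use scraper::{Html, Selector};

    /// Extract the href targets of all links in an HTML document.
    /// Relative links would still need to be resolved against the base URL.
    fn extract_links(html: &str) -> Vec<String> {
        let document = Html::parse_document(html);
        let anchors = Selector::parse("a[href]").unwrap();
        document
            .select(&anchors)
            .filter_map(|a| a.value().attr("href").map(|href| href.to_string()))
            .collect()
    }

    fn main() {
        let html = r#"<p><a href="/about">About</a> <a href="https://example.org/docs">Docs</a></p>"#;
        for link in extract_links(html) {
            println!("possible crawl candidate: {link}");
        }
    }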

After that, the CrawlCandidate is updated with the information the recrawl needs to recognize whether the resource has changed since the last crawl.

If the exit code indicates that the crawl has been rate-limited, the time to wait between requests is increased and the URL is un-ignored so that it can be recrawled in the same run.

If the error seems like a temporary condition that might have been caused by the network, the URL is un-ignored so that it can be recrawled in the same run.

If the condition seems like it could persist, the patience counter is decreased by one. When the counter reaches zero, the origin crawler signals that it is "finished", because crawling with a broken connection, an otherwise offline site, or a server that speaks gibberish makes little sense.

If a database error is encountered, the patience counter is set to zero immediately.
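The bookkeeping after a crawl can be summarized roughly as follows; the outcome categories, the field names and the back-off factor are simplifications invented for this sketch:

    enum Outcome {
        Success,
        RateLimited,
        TemporaryError,  // e.g. a network hiccup; worth retrying in the same run
        PersistentError, // a condition that looks like it could persist
        DatabaseError,
    }

    struct OriginCrawler {
        request_delay_secs: u64,
        patience: u32,
        finished: bool,
    }

    impl OriginCrawler {
        /// `unignore` stands in for removing the URL from the temporary
        /// ignore-list so it can be recrawled in the same run.
        fn handle_outcome(&mut self, outcome: Outcome, unignore: &mut dyn FnMut()) {
            match outcome {
                Outcome::Success => {}
                Outcome::RateLimited => {
                    self.request_delay_secs *= 2; // "increased"; the factor here is arbitrary
                    unignore();
                }
                Outcome::TemporaryError => unignore(),
                Outcome::PersistentError => {
                    self.patience = self.patience.saturating_sub(1);
                    if self.patience == 0 {
                        self.finished = true; // give up on this origin for this run
                    }
                }
                Outcome::DatabaseError => {
                    self.patience = 0;
                    self.finished = true;
                }
            }
        }
    }

    fn main() {
        let mut crawler = OriginCrawler { request_delay_secs: 1, patience: 2, finished: false };
        let mut unignore = || {};
        crawler.handle_outcome(Outcome::PersistentError, &mut unignore);
        crawler.handle_outcome(Outcome::PersistentError, &mut unignore);
        assert!(crawler.finished);
    }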