This document describes how the unobtanium Crawler works: how it discovers new pages and deals with pages it already knows about.
Inputs:
- Crawler configuration
- Seed URLs
- Crawl Policies: `do_not_crawl`
- From Crawler Database:
  - Found links and recrawl information (Crawl Candidates)

Outputs:
- To Crawler Database:
  - Found links and recrawl information (Crawl Candidates)
  - Redirect information
  - File metadata
  - File content
The crawler splits itself up into one crawl loop per configured URL origin (OriginCrawler). This keeps the internals simpler, since the data for one site stays neatly together. Which origins are crawled is derived from the seed URLs: if a found link has the same origin as a configured seed, it is considered for crawling.
Crawling stops after a preconfigured number of crawling actions have happened or if there are no more URLs that need crawling.
Example: If the crawler is configured for `example.org` and finds a link to `example.com`, it won't crawl that link with its `example.org` crawler. If `example.com` also happens to be configured, the link will be crawled as part of that crawler; otherwise the link will not be used.
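For illustration, this origin check can be expressed with the `url` crate's `Origin` type. The function name and the surrounding setup are made up for this sketch and are not unobtanium's actual code:

```rust
use url::Url; // the `url` crate

/// Returns true if `link` shares an origin (scheme + host + port)
/// with at least one of the configured seed URLs.
fn is_crawlable_origin(seeds: &[Url], link: &Url) -> bool {
    seeds.iter().any(|seed| seed.origin() == link.origin())
}

fn main() {
    let seeds = vec![Url::parse("https://example.org/").unwrap()];
    let same = Url::parse("https://example.org/about").unwrap();
    let other = Url::parse("https://example.com/").unwrap();
    assert!(is_crawlable_origin(&seeds, &same));
    assert!(!is_crawlable_origin(&seeds, &other));
}
```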
Crawl Candidates
Crawl Candidates are all URLs that are considered for crawling, along with optional recrawl information; they are kept in the Crawler Database.
Initial crawl candidates are the seed URLs and candidates from the previous crawls already in the database.
Redirects
The crawler never follows redirects; it records that a URL redirected and marks the target as a Crawl Candidate.
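The redirect handling could look roughly like the following sketch. It uses the reqwest crate (with its blocking feature) purely as an illustration; unobtanium's actual HTTP client and types are not shown here:

```rust
use reqwest::blocking::{Client, Response};
use reqwest::header::LOCATION;
use reqwest::redirect::Policy;

/// Outcome of fetching a URL without following redirects.
enum FetchOutcome {
    /// The URL redirected; the target becomes a new Crawl Candidate.
    Redirect { target: String },
    /// A normal response that can be handed on to the scraper.
    Fetched(Response),
}

fn fetch_without_following(url: &str) -> Result<FetchOutcome, reqwest::Error> {
    // Policy::none() makes reqwest report redirects instead of following them.
    let client = Client::builder().redirect(Policy::none()).build()?;
    let resp = client.get(url).send()?;

    if resp.status().is_redirection() {
        // The Location header may be relative; a real crawler would resolve
        // it against the request URL before storing it.
        let target = resp
            .headers()
            .get(LOCATION)
            .and_then(|v| v.to_str().ok())
            .unwrap_or_default()
            .to_string();
        return Ok(FetchOutcome::Redirect { target });
    }
    Ok(FetchOutcome::Fetched(resp))
}
```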
Crawl inhibitors
A URL might be discarded from crawling for a number of reasons:
- The URL doesn't share an origin with one of the configured seeds.
- The URL is excluded by a `do_not_crawl` policy in the crawler configuration.
- The URL has a combination of query parameters not explicitly allowed by a `url_query_parameters` policy.
- The `robots.txt` file denies crawling.
- The URL has already been crawled and is not due for a recrawl.
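As a rough illustration of the two configuration policies, the following sketch models `do_not_crawl` as path prefixes and `url_query_parameters` as allowed sets of parameter names. Both interpretations are assumptions of this sketch, not the actual policy format:

```rust
use std::collections::HashSet;
use url::Url;

/// Deliberately simplified stand-in for the policy configuration;
/// unobtanium's real policy types and matching rules differ.
struct LocalPolicies {
    /// `do_not_crawl`: path prefixes that must never be crawled
    /// (prefix matching is an assumption of this sketch).
    do_not_crawl_prefixes: Vec<String>,
    /// `url_query_parameters`: the only parameter-name combinations
    /// that are explicitly allowed.
    allowed_query_combinations: Vec<HashSet<String>>,
}

impl LocalPolicies {
    fn denies(&self, url: &Url) -> bool {
        if self.do_not_crawl_prefixes.iter().any(|p| url.path().starts_with(p)) {
            return true;
        }
        if url.query().is_some() {
            // The *combination* of parameter names must match an allowed set.
            let params: HashSet<String> =
                url.query_pairs().map(|(k, _)| k.into_owned()).collect();
            if !self.allowed_query_combinations.contains(&params) {
                return true;
            }
        }
        false
    }
}
```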
Crawl results will be discarded if:
- The crawled resource marks itself as non-canonical; the canonical URL will be marked as a crawl candidate.
- The crawled resource marks itself as not wanting to be indexed (using `meta robots noindex`); the unobtanium crawler respects that.
- The page marks itself as not containing links that a crawler should follow (using `meta robots nofollow`); in this case links aren't saved to the database (see the sketch below).
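A crawl-time check for these `meta robots` directives could look like the following sketch, which uses the `scraper` crate for illustration; it is not unobtanium's actual scraping code, and the case-sensitive selector is a simplification:

```rust
use scraper::{Html, Selector};

/// Directives from `<meta name="robots" content="...">` that decide
/// whether a crawl result is kept and whether its links are saved.
struct RobotsMeta {
    noindex: bool,
    nofollow: bool,
}

fn robots_meta(html: &str) -> RobotsMeta {
    let doc = Html::parse_document(html);
    // The selector is a constant and known to be valid, so unwrap is fine.
    let sel = Selector::parse(r#"meta[name="robots"]"#).unwrap();
    let content: String = doc
        .select(&sel)
        .filter_map(|m| m.value().attr("content"))
        .collect::<Vec<_>>()
        .join(",")
        .to_ascii_lowercase();
    RobotsMeta {
        noindex: content.contains("noindex"),
        nofollow: content.contains("nofollow"),
    }
}

fn main() {
    let page = r#"<html><head><meta name="robots" content="noindex, nofollow"></head></html>"#;
    let meta = robots_meta(page);
    assert!(meta.noindex && meta.nofollow);
}
```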
The Crawl loop
All of the origin crawlers take turns at crawling. This way the crawler can go as fast as possible without concurrency while not hammering the crawled sites: the scheduler ensures a minimum delay between requests to a given origin.
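A minimal sketch of such a scheduler, assuming a made-up `TakesTurns` interface for the origin crawlers (unobtanium's real scheduler and types are not shown here):

```rust
use std::collections::VecDeque;
use std::thread;
use std::time::{Duration, Instant};

/// What the scheduler needs from each origin crawler; an illustrative
/// trait, not unobtanium's real interface.
trait TakesTurns {
    /// Perform one crawling action; return false once the crawler is finished.
    fn take_turn(&mut self) -> bool;
    /// Minimum delay before this origin may be contacted again.
    fn delay(&self) -> Duration;
}

/// Round-robin over all origin crawlers without concurrency: each origin
/// is only contacted again after its own delay, and the whole crawl stops
/// after `max_actions` actions or when every crawler has finished.
fn run_crawl(crawlers: Vec<Box<dyn TakesTurns>>, max_actions: usize) {
    let mut queue: VecDeque<(Instant, Box<dyn TakesTurns>)> =
        crawlers.into_iter().map(|c| (Instant::now(), c)).collect();
    let mut actions = 0;

    while actions < max_actions && !queue.is_empty() {
        let (next_allowed, mut crawler) = queue.pop_front().unwrap();
        // Simplified: sleep out the remaining per-origin delay; a smarter
        // scheduler would pick whichever origin is ready first.
        let now = Instant::now();
        if next_allowed > now {
            thread::sleep(next_allowed - now);
        }
        actions += 1;
        if crawler.take_turn() {
            // Not finished: queue the next turn after this origin's delay.
            queue.push_back((Instant::now() + crawler.delay(), crawler));
        }
    }
}
```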
The Origin Crawlers have the following state relevant for crawling:
- Their `robots.txt` information
- An expiry time for the `robots.txt` information
- A todo-list - initialized to the seed URLs
- A temporary ignore list in the database
- The time to wait between requests
- Patience counter - initialized to `5`
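Put together, that state might look roughly like the following struct; the field names and types are assumptions for this sketch, not the project's actual definitions:

```rust
use std::time::{Duration, Instant};
use url::Url;

/// Placeholder for whatever parsed robots.txt representation is used.
struct RobotsRules;

/// Per-origin crawler state, mirroring the list above.
struct OriginCrawlerState {
    /// Parsed robots.txt rules; None until fetched or if none exists.
    robots: Option<RobotsRules>,
    /// When the robots.txt information expires and must be refetched.
    robots_expiry: Instant,
    /// URLs queued for crawling; starts out as the seed URLs.
    todo: Vec<Url>,
    /// Minimum time to wait between requests to this origin.
    request_delay: Duration,
    /// Remaining tolerance for persistent errors; starts at 5.
    patience: u8,
    // The temporary ignore list is not held here: it lives in the
    // crawler database, as described above.
}
```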
Each turn the crawler does one of three things:
[1] Fetch the `robots.txt` file
If there is no `robots.txt` information yet, or the old information has expired (after 30 minutes, currently hard-coded), the origin crawler tries to fetch the `/robots.txt` file, parse it, and store it. If no `robots.txt` is found, the crawler assumes that it is okay to crawl.
If the `robots.txt` contains a crawl delay, the time to wait between requests is set to that value.
Note: This is implemented by the `DomainInformationLibrary` struct.
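A simplified version of this step, using reqwest (blocking feature) and a very rough crawl-delay extraction instead of a full robots.txt parser; it is a sketch, not the `DomainInformationLibrary` implementation:

```rust
use std::time::Duration;

/// Simplified result of a robots.txt fetch: the raw rules (if any) and
/// an optional crawl delay.
struct RobotsInfo {
    body: Option<String>,
    crawl_delay: Option<Duration>,
}

fn fetch_robots_txt(origin: &str) -> Result<RobotsInfo, reqwest::Error> {
    let resp = reqwest::blocking::get(format!("{origin}/robots.txt"))?;
    if !resp.status().is_success() {
        // No robots.txt (or an error page): assume crawling is allowed.
        return Ok(RobotsInfo { body: None, crawl_delay: None });
    }
    let body = resp.text()?;
    // Very rough, case-sensitive Crawl-delay extraction; a real parser
    // also has to respect user-agent sections and Allow/Disallow rules.
    let crawl_delay = body
        .lines()
        .filter_map(|l| l.trim().strip_prefix("Crawl-delay:"))
        .filter_map(|v| v.trim().parse::<u64>().ok())
        .map(Duration::from_secs)
        .next();
    Ok(RobotsInfo { body: Some(body), crawl_delay })
}
```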
[2] Fill the todo list
If the todo-list is empty the crawler will query crawl-candidates from the crawl database that:
- match the crawler's assigned origin
- have either never been crawled or are due for a recrawl
- are not on the temporary ignore list
All URLs that have passed through this crawling stage are saved to a temporary ignore-list handled by the database, to ensure no URL gets considered twice. This means that even if the recrawl interval is set very low, the crawler "finishes" a site and only recrawls it on its next invocation; it also allows the crawler to recognize when it is finished.
If the database doesn't have any URLs left for crawling, the origin crawler signals that it has finished and gets removed from the scheduler.
It applies the `do_not_crawl` policies; if a URL is denied by such a policy, the crawler logs it as a crawl with the `BLOCKED_URL_BY_LOCAL_POLICY` exit code.
If the URL has a query part, the `url_query_parameters` policies are evaluated. If no applicable policy allows the combination of URL parameters, the crawler also logs the crawl as `BLOCKED_URL_BY_LOCAL_POLICY`.
It then checks the `robots.txt` rules; if the URL shouldn't be crawled, it gets logged as a crawl with the `BLOCKED_BY_ROBOTS_TXT` exit code.
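The whole refill step could be sketched like this. The database and robots interfaces are stand-ins invented for the sketch, the policy check stands in for the `do_not_crawl` / `url_query_parameters` evaluation above, and only the exit-code names come from this document:

```rust
use url::{Origin, Url};

/// Exit codes named in this document; wrapping them in a Rust enum is
/// just for the sketch.
enum CrawlExitCode {
    BlockedUrlByLocalPolicy, // BLOCKED_URL_BY_LOCAL_POLICY
    BlockedByRobotsTxt,      // BLOCKED_BY_ROBOTS_TXT
}

/// Stand-in interface for the crawl database.
trait CrawlDatabase {
    /// Candidates for `origin` that were never crawled or are due for a
    /// recrawl, excluding everything on the temporary ignore list.
    fn due_candidates_for(&mut self, origin: &Origin) -> Vec<Url>;
    /// Put URLs on the temporary ignore list for the rest of this run.
    fn temporarily_ignore(&mut self, urls: &[Url]);
    /// Record a crawl attempt and its exit code.
    fn log_crawl(&mut self, url: &Url, code: CrawlExitCode);
}

/// Stand-in interface for the parsed robots.txt rules.
trait Robots {
    fn allows(&self, url: &Url) -> bool;
}

fn fill_todo_list(
    db: &mut dyn CrawlDatabase,
    robots: &dyn Robots,
    origin: &Origin,
    policy_denies: impl Fn(&Url) -> bool,
) -> Vec<Url> {
    let candidates = db.due_candidates_for(origin);
    // Ignoring every candidate up front ensures no URL is considered
    // twice in one run, even with a very short recrawl interval.
    db.temporarily_ignore(&candidates);

    let mut todo = Vec::new();
    for url in candidates {
        if policy_denies(&url) {
            db.log_crawl(&url, CrawlExitCode::BlockedUrlByLocalPolicy);
        } else if !robots.allows(&url) {
            db.log_crawl(&url, CrawlExitCode::BlockedByRobotsTxt);
        } else {
            todo.push(url);
        }
    }
    todo
}
```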
[3] Crawl a URL
If there is a URL on the todo-list, it is taken off and handed to the scraper for fetching and crawl-time scraping. This extracts URLs relevant for the crawler, such as from links (HTML `a` elements), redirects, and canonical URLs (in case a site marks itself as not canonical), and marks them as crawl candidates.
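Link extraction at crawl time might look like this sketch, again using the `scraper` and `url` crates for illustration only:

```rust
use scraper::{Html, Selector};
use url::Url;

/// Collect absolute link targets from `a href` elements; relative links
/// are resolved against the page's own URL. (Sketch only: the real
/// crawl-time scraper also picks up redirects and canonical URLs.)
fn extract_links(page_url: &Url, html: &str) -> Vec<Url> {
    let doc = Html::parse_document(html);
    let sel = Selector::parse("a[href]").unwrap();
    doc.select(&sel)
        .filter_map(|a| a.value().attr("href"))
        .filter_map(|href| page_url.join(href).ok())
        .collect()
}
```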
After that has happened, the `CrawlCandidate` is updated with the information needed for a later recrawl to recognize whether the resource changed since the last crawl.
If the exit code indicates that the crawl has been rate-limited, the time to wait between requests is increased and the URL is un-ignored so that it can be recrawled in the same run.
If the error seems like a temporary condition that might have been caused by the network, the URL is un-ignored so that it can be recrawled in the same run.
If the condition seems like it could persist, the patience counter is decreased by one. When the counter reaches zero, the origin crawler signals that it is "finished", because crawling with a broken connection, an otherwise offline site, or a server that speaks gibberish makes little sense.
If a database error is encountered the patience counter is set to zero immediately.
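The handling of one crawl outcome could be sketched as follows. The outcome categories, the surrounding struct, and the doubling of the delay are assumptions of this sketch; the document only states that the delay is increased:

```rust
use std::time::Duration;

/// Rough categories of crawl outcomes as described above.
enum CrawlOutcome {
    Success,
    RateLimited,
    TemporaryError,  // e.g. a transient network problem
    PersistentError, // e.g. broken connection or a server speaking gibberish
    DatabaseError,
}

struct OriginCrawler {
    request_delay: Duration,
    patience: u8,
    finished: bool,
}

impl OriginCrawler {
    /// `un_ignore` stands in for removing the URL from the temporary
    /// ignore list in the database so it can be retried in the same run.
    fn handle_outcome(&mut self, outcome: CrawlOutcome, mut un_ignore: impl FnMut()) {
        match outcome {
            CrawlOutcome::Success => {}
            CrawlOutcome::RateLimited => {
                // Back off (doubling is arbitrary here) and retry later this run.
                self.request_delay *= 2;
                un_ignore();
            }
            CrawlOutcome::TemporaryError => un_ignore(),
            CrawlOutcome::PersistentError => {
                self.patience = self.patience.saturating_sub(1);
                if self.patience == 0 {
                    self.finished = true; // give up on this origin
                }
            }
            CrawlOutcome::DatabaseError => {
                self.patience = 0;
                self.finished = true;
            }
        }
    }
}
```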