The crawl delay is the time the crawler waits between requests to avoid putting too much load on web servers.
Choosing the crawl delay is always a compromise between not causing too much load on a server and getting the crawling done as fast as possible.
Currently unobtanium runs one crawler for each Origin, each with its own crawl loop and its own crawl delay calculation.
The crawl delay is implemented in unobtanium/crawler/src/crawler/crawl_delay.rs.
Calculating the Crawl Delay
Unobtanium chooses from multiple delays:
- A minimum delay
- A politeness based delay
The longer of these is used as the wait time before the next request.
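The selection rule above can be sketched in a few lines of Rust. This is only an illustration, not the code in crawl_delay.rs: the function name `choose_crawl_delay` and its parameters are hypothetical.

```rust
use std::time::Duration;

/// Hypothetical helper (not the actual unobtanium API):
/// returns the longer of the two candidate delays.
fn choose_crawl_delay(minimum_delay: Duration, politeness_delay: Duration) -> Duration {
    // `Duration` is totally ordered, so `max` picks the longer wait.
    minimum_delay.max(politeness_delay)
}
```

For example, with a minimum delay of 1 second and a politeness based delay of 2.4 seconds, the crawler would wait 2.4 seconds.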
The Minimum Delay
The minimum delay is the Crawl-Delay from robots.txt, capped at 2 minutes, if one is available; otherwise it is a configured minimum delay.
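A minimal sketch of this rule follows. The names `minimum_delay`, `MAX_ROBOTS_CRAWL_DELAY`, `robots_crawl_delay` and `configured_minimum` are made up for illustration, and the sketch assumes the configured value is used only as a fallback when robots.txt gives no Crawl-Delay.

```rust
use std::time::Duration;

/// Upper bound applied to a Crawl-Delay taken from robots.txt (2 minutes).
const MAX_ROBOTS_CRAWL_DELAY: Duration = Duration::from_secs(120);

/// Hypothetical helper (not the actual unobtanium API).
/// Assumption: the configured minimum is only a fallback for a missing Crawl-Delay.
fn minimum_delay(robots_crawl_delay: Option<Duration>, configured_minimum: Duration) -> Duration {
    match robots_crawl_delay {
        // Honour robots.txt, but never wait longer than the cap.
        Some(delay) => delay.min(MAX_ROBOTS_CRAWL_DELAY),
        // No Crawl-Delay given: use the configured value.
        None => configured_minimum,
    }
}
```

Under these assumptions a robots.txt Crawl-Delay of 10 minutes results in a minimum delay of 2 minutes, while a Crawl-Delay of 5 seconds is used unchanged.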
The Politeness based Delay
The politeness mechanism is taken from the stract crawler.
In general it takes how long the last request took to respond and multiplies that by 2^politeness. The politeness factor starts at 2 and then decreases, but not below 0.
If an HTTP 429 response is received, the politeness factor is incremented by 1 (doubling the wait time) and automatic decrementing is turned off until the end of the crawl run.
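The mechanism can be sketched as a small state machine. This is an illustration under stated assumptions, not the stract or unobtanium implementation: the type `Politeness` and its methods are hypothetical, and the sketch assumes the factor drops by 1 after each successful response, which the description above does not specify.

```rust
use std::time::Duration;

/// Hypothetical politeness state for one crawl run (not the actual unobtanium type).
struct Politeness {
    factor: u32,
    auto_decrement: bool,
}

impl Politeness {
    /// The factor starts at 2 with automatic decrementing enabled.
    fn new() -> Self {
        Self { factor: 2, auto_decrement: true }
    }

    /// Delay = last response time * 2^factor (saturating instead of overflowing).
    fn delay(&self, last_response_time: Duration) -> Duration {
        last_response_time.saturating_mul(2u32.saturating_pow(self.factor))
    }

    /// Assumption: one decrement per successful response, never below 0.
    fn on_success(&mut self) {
        if self.auto_decrement {
            self.factor = self.factor.saturating_sub(1);
        }
    }

    /// HTTP 429: raise the factor by 1 and stop decrementing for this run.
    fn on_too_many_requests(&mut self) {
        self.factor = self.factor.saturating_add(1);
        self.auto_decrement = false;
    }
}
```

With a response time of 100 ms, the initial factor of 2 gives a 400 ms delay. After one successful response the factor is 1 and the delay is 200 ms. A following HTTP 429 raises the factor back to 2 (400 ms), and it stays there for the rest of the run.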