Crawling the web responsibly also means giving others a chance to indicate how they want to be crawled, if at all.
Since the unobtanium-crawler can be used for multiple purposes, it allows configuring its user agent, either in the crawl configuration or via a command line option to the run subcommand.
If left unconfigured, the crawler uses the user agent unobtanium-crawler-unconfigured-ua. Outside of temporary testing, use of this user agent is strongly discouraged.
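For illustration, setting the user agent on the command line might look like the sketch below; the --user-agent option shown here is a hypothetical stand-in, so check the run subcommand's help for the actual option name:

    unobtanium-crawler run --user-agent "my-project-crawler (for https://example.org/why-we-crawl)"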
When setting your own user agent, please follow this pattern:
{your-project}-crawler (for {url})
Where {your-project} is a name representative of what you are using the data for, and {url} leads to a page describing or showing that use.
An example would be:
example-net-crawler (for https://example.net/why-we-crawl-your-website)
Everything before the first space (i.e. example-net-crawler) will be used to match against the robots.txt file; if you are a site operator, you can copy that part directly into your robots.txt.
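For example, an operator who wants to keep this crawler out of a (hypothetical) /private/ section of their site could add a group like this to their robots.txt:

    User-agent: example-net-crawler
    Disallow: /private/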
If the crawler finds itself not addressed by name in the robots.txt, it follows the rules set for all bots (*).
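So given a robots.txt like the following, the crawler would stay out of /internal/ even though it is not addressed by name:

    User-agent: *
    Disallow: /internal/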
The crawler re-fetches the robots.txt file every 30 minutes, so a rogue crawler can be stopped within a reasonable timeframe. Sending it 429 response codes will also increase the crawl delay slightly every time such a response is received. Alternatively, a crawler can be stopped by simulating an outage with 503 responses; it will give up within a few requests.
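As a sketch of the operator side (not part of unobtanium itself), the following minimal Python server stops the crawler by matching the part of its user agent before the first space; the user agent token and response codes are the examples from above:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    # The part of the crawler's user agent before the first space,
    # i.e. the same token an operator would use in robots.txt.
    THROTTLED_UA = "example-net-crawler"

    class ThrottlingHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            ua = self.headers.get("User-Agent", "")
            if ua.split(" ", 1)[0] == THROTTLED_UA:
                # 429 increases the crawler's delay slightly on every hit;
                # 503 simulates an outage and makes it give up within a
                # few requests.
                self.send_response(503)
                self.end_headers()
                return
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"ok\n")

    if __name__ == "__main__":
        HTTPServer(("", 8080), ThrottlingHandler).serve_forever()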
Note: Unobtanium is free software, and while responsible use is encouraged, it can be modified to not respect all of the above. Such modifications, while not forbidden, are strongly discouraged by the developer.