Manual: About the unobtanium.rocks Crawler

What is the Crawler for?

The crawler feeds the development instance of a fully standalone search engine over at [https://unobtanium.rocks]. The resulting data is also used to test development versions of the unobtanium software. You can find the source on Codeberg.

How will the Crawler behave?

The crawler runs on the same machine as the frontend, and its configuration is public.

The user agent is unobtanium.rocks (for https://unobtanium.rocks, {index} index), where {index} is the name of the configuration file (without the .toml suffix) that is responsible for the request. Each user agent is its own independent crawler process; the results are combined at the end.
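
As an illustration, the user agent for a hypothetical index configured in example.toml could be derived roughly like this (a sketch, not the actual implementation; the file name is made up):

    from pathlib import Path

    def user_agent_for(config_file: str) -> str:
        # Hypothetical sketch: the index name is the configuration
        # file name without its .toml suffix.
        index = Path(config_file).stem  # "example.toml" -> "example"
        return f"unobtanium.rocks (for https://unobtanium.rocks, {index} index)"

    print(user_agent_for("example.toml"))
    # unobtanium.rocks (for https://unobtanium.rocks, example index)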

In general the crawler will start at a configured seed page and follow links from there. For subsequent crawls it will usually remember the last crawl.
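
A very simplified sketch of that link-following idea (fetch and extract_links are placeholders for the real HTTP and HTML handling, and remembering previous crawls is left out):

    from collections import deque

    def crawl(seed_url, fetch, extract_links, request_limit=1000):
        # Start at the configured seed page and follow links breadth-first.
        queue = deque([seed_url])
        seen = {seed_url}
        while queue and request_limit > 0:
            page = fetch(queue.popleft())
            request_limit -= 1
            for link in extract_links(page):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)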

Crawls happen in runs that are started manually, usually every few weeks. During a run the crawler will try to discover new pages and recrawl existing ones.

Unobtanium is a fast but polite crawler. It will try to crawl as fast as possible while still respecting the server's limits.

The delay between requests will be at least the Crawl-Delay set in the robots.txt file (capped at 180 seconds) or a dynamically calculated minimum that is at least the time it took the server to respond; this way, if the server slows down, the crawler slows down as well. The crawler will also react to HTTP 429 responses. Details are on the crawl delay algorithm page.
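
The details live on the crawl delay algorithm page; the following sketch only illustrates the two lower bounds described above (the names are made up, and the 429 handling is not shown):

    def next_delay(robots_crawl_delay, last_response_time, configured_minimum):
        # The delay never drops below the configured minimum, below the
        # time the server needed for the last response, or below the
        # robots.txt Crawl-Delay (capped at 180 seconds).
        bounds = [configured_minimum, last_response_time]
        if robots_crawl_delay is not None:
            bounds.append(min(robots_crawl_delay, 180.0))
        return max(bounds)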

In case the connection itself fails for whatever reason, the crawler will immediately try to send a second request. This is because failing connections are surprisingly common and usually work on the second try. Unobtanium calls these Fluke events.
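
In other words, the Fluke handling amounts to a single immediate retry when the connection fails (a sketch; fetch stands in for the real request function):

    def fetch_with_fluke_retry(url, fetch):
        try:
            return fetch(url)
        except ConnectionError:
            # Fluke: failing connections are surprisingly common
            # and usually work on the second try.
            return fetch(url)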

The crawler will stop once it runs into a preconfigured limit of operations/requests per crawl run or when it finds no more crawlable pages.

Why is the Crawler trying to Access Private Resources?

Most likely this is because it found a link that points there and no robots.txt rule exists to stop it.

Pages that result in a 4xx status code will not be indexed. The crawler will, however, keep checking them on every recrawl, the same way it handles dead links.

If the matter seems bigger to you than a few robots.txt entries, please open an issue on the index repository.

robots.txt

In general the unobtanium crawler will use the part of its user agent up to the first space for matching against the robots.txt. For the unobtanium.rocks crawler this is unobtanium.rocks, independent of which index the site is in.
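
For example, a web admin who wants to slow this crawler down to one request per minute and keep it out of a (hypothetical) /private/ directory could add something like this to their robots.txt:

    User-agent: unobtanium.rocks
    Crawl-delay: 60
    Disallow: /private/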

The robots.txt file will be refetched at a regular interval (~30 minutes) while crawling, which makes it possible for web admins to directly stop or slow down the crawler in case it is going where it shouldn't or is too fast.

The first request of every crawl will be to fetch the robots.txt file.

Who is Responsible for the Crawler?

Like the rest of unobtanium.rocks the unobtanium.rocks crawler is operated by Slatian.

See Also