Manual: About the unobtanium.rocks Crawler

What is the Crawler for?

The crawler feeds the development instance of a fully standalone search engine over at [https://unobtanium.rocks]. The resulting data is also used to test development versions of the unobtanium software. You can find the source on Codeberg.

How will the Crawler behave?

The crawler runs on the same machine as the frontend, and its configuration is public.

The user agent is unobtanium.rocks (for https://unobtanium.rocks, {index} index), where {index} is the name of the configuration file (without the .toml suffix) that is responsible for the request. Each user agent is its own independent crawler process; the results are combined at the end.
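
As an illustration, the user agent for a hypothetical index configured in example.toml could be derived roughly like this (a sketch, not the actual implementation; the file name is made up):

    from pathlib import Path

    def user_agent_for(config_file: str) -> str:
        # Hypothetical sketch: the index name is the configuration
        # file name without its .toml suffix.
        index = Path(config_file).stem  # "example.toml" -> "example"
        return f"unobtanium.rocks (for https://unobtanium.rocks, {index} index)"

    print(user_agent_for("example.toml"))
    # unobtanium.rocks (for https://unobtanium.rocks, example index)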

In general the crawler will start at a configured seed page and follow links from there. For subsequent crawls it will usually remember the last crawl.
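
A very simplified sketch of that link-following idea (fetch and extract_links are placeholders for the real HTTP and HTML handling, and remembering previous crawls is left out):

    from collections import deque

    def crawl(seed_url, fetch, extract_links, request_limit=1000):
        # Start at the configured seed page and follow links breadth-first.
        queue = deque([seed_url])
        seen = {seed_url}
        while queue and request_limit > 0:
            page = fetch(queue.popleft())
            request_limit -= 1
            for link in extract_links(page):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)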

Crawls happen in runs that are started manually, usually every few weeks. During a run the crawler will try to discover new pages and recrawl existing ones.

Unobtanium is a fast but polite crawler. It will try to crawl as fast as possible while still respecting the server's limits.

The delay between requests will be at least the Crawl-Delay set in the robots.txt file (capped at 180 seconds) or a dynamically calculated minimum that is at least the time it took the server to respond; this way, if the server slows down, the crawler slows down as well. The crawler will also react to HTTP 429 responses. Details are on the crawl delay algorithm page.
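
The details live on the crawl delay algorithm page; the following sketch only illustrates the two lower bounds described above (the names are made up, and the 429 handling is not shown):

    def next_delay(robots_crawl_delay, last_response_time, configured_minimum):
        # The delay never drops below the configured minimum, below the
        # time the server needed for the last response, or below the
        # robots.txt Crawl-Delay (capped at 180 seconds).
        bounds = [configured_minimum, last_response_time]
        if robots_crawl_delay is not None:
            bounds.append(min(robots_crawl_delay, 180.0))
        return max(bounds)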

In case the connection itself fails for whatever reason, the crawler will immediately try to send a second request. This is because failing connections are surprisingly common and usually work on the second try. Unobtanium calls these Fluke events.
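
In other words, the Fluke handling amounts to a single immediate retry when the connection fails (a sketch; fetch stands in for the real request function):

    def fetch_with_fluke_retry(url, fetch):
        try:
            return fetch(url)
        except ConnectionError:
            # Fluke: failing connections are surprisingly common
            # and usually work on the second try.
            return fetch(url)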

The crawler will stop once it runs into a preconfigured limit of operations/requests per crawl run or when it finds no more crawlable pages.

Why is the Crawler trying to Access Private Resources?

Most likely this is because it found a link that points there and no robots.txt rule exists to stop it.

Pages that result in a 4xx status code will not be indexed. The crawler will, however, keep checking them on every recrawl, the same way it handles dead links.

If the matter seems bigger to you than a few robots.txt entries, please open an issue on the index repository.

robots.txt

In general the unobtanium crawler will use the part of its user agent up to the first space for matching against the robots.txt. For the unobtanium.rocks crawler this is unobtanium.rocks, independent of which index the site is in.
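
For example, a web admin who wants to slow this crawler down to one request per minute and keep it out of a (hypothetical) /private/ directory could add something like this to their robots.txt:

    User-agent: unobtanium.rocks
    Crawl-delay: 60
    Disallow: /private/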

The robots.txt file will be refetched at a regular interval (~30 minutes) while crawling, which makes it possible for web admins to directly stop or slow down the crawler in case it is going where it shouldn't or is too fast.

The first request of every crawl will be to fetch the robots.txt file.

Who is Responsible for the Crawler?

Like the rest of unobtanium.rocks the unobtanium.rocks crawler is operated by Slatian.

See Also