What is the Crawler for?
The crawler feeds the development instance of a fully standalone search engine over at [https://unobtanium.rocks]. The resulting data is also used to test development versions of the unobtanium software. You can find the source on Codeberg.
How will the Crawler behave?
The crawler runs on the same machine as the frontend with public configuration.
The user agent is unobtanium.rocks (for https://unobtanium.rocks, {index} index), where {index} is the name of the configuration file (without the .toml suffix) that is responsible for the request. Each user agent is its own independent crawler process; the results are combined at the end.
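For example, a hypothetical index configured in a file named example.toml (the name is only an illustration, not an actual index) would send requests with a user agent like:

```
unobtanium.rocks (for https://unobtanium.rocks, example index)
```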
In general the crawler will start at a configured seed page and follow links from there; for subsequent crawls it will usually remember the last crawl.
Crawls happen in runs that are started manually, usually every few weeks. During a run the crawler will try to discover new pages and recrawl existing ones.
The delay between requests will be at least one second if no Crawl-Delay is set in the robots.txt file. Independent of the crawl delay, expect the crawler to make one request every 180 seconds.
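As a hedged illustration, a site that wants to slow the crawler down further could set a Crawl-Delay for the user agent token described in the robots.txt section below (the 30-second value is only an example):

```
User-agent: unobtanium.rocks
Crawl-Delay: 30
```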
The crawler will stop once it runs into a preconfigured limit of operations/requests per crawl run or when it finds no more crawlable pages.
In case the connection itself fails for whatever reason, the crawler will immediately try to send a second request. This is because failing connections are surprisingly common and usually work on the second try; unobtanium calls these Fluke events.
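A minimal sketch of that retry behaviour, assuming a Python-style HTTP helper (the function and names here are illustrative, not the actual unobtanium code):

```python
import requests

def fetch_with_fluke_retry(url: str, timeout: float = 10.0) -> requests.Response:
    """Fetch a URL, retrying exactly once if the connection itself fails."""
    try:
        return requests.get(url, timeout=timeout)
    except requests.ConnectionError:
        # "Fluke" event: the connection failed, but a second attempt
        # usually succeeds, so retry immediately, exactly once.
        return requests.get(url, timeout=timeout)
```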
Why is the Crawler trying to Access Private Resources?
Most likely this is because it found a link that points there and no robots.txt rule exists to stop it.
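If you want to keep the crawler away from such resources, a robots.txt rule along these lines should work (the /private/ path is only an example):

```
User-agent: unobtanium.rocks
Disallow: /private/
```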
Pages that result in a 4xx status code will not be indexed. The crawler will, however, keep checking them on every recrawl, as if they were dead links.
If the matter seems bigger to you than a few robots.txt entries can solve, please open an issue on the index repository.
robots.txt
In general the unobtanium crawler will use the part of its user agent up to the first space for matching against robots.txt rules; for the unobtanium.rocks crawler this will be unobtanium.rocks, independent of which index the site is in.
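In other words, the token matched against User-agent lines is simply everything before the first space in the full user agent string. A quick sketch, reusing the hypothetical "example index" user agent from above (illustrative, not the actual implementation):

```python
user_agent = "unobtanium.rocks (for https://unobtanium.rocks, example index)"

# Everything up to the first space is the token matched against
# User-agent lines in robots.txt.
robots_token = user_agent.split(" ", 1)[0]
assert robots_token == "unobtanium.rocks"
```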
The robots.txt
file will be refected in a regular interval (~30 minutes) while crawling which makes it possible for web admins to directly stop or slow down the crawler while in case it is going where it shouldn't or is too fast.
The first request of every crawl will be to fetch the robots.txt file.
Who is Responsible for the Crawler?
Like the rest of unobtanium.rocks, the unobtanium.rocks crawler is operated by Slatian.