Manual: Selfhosting Unobtanium

So you've heard about unobtanium and want to host your own instance of it? This guide is for you.

Note: Unobtanium is unfinished software; be prepared for a fair bit of tinkering.

Warning: This guide is incomplete, but should be enough to get you started.

I'll assume you already know a few things:

If you haven't yet, please have a look at the overview page.

Resource requirements

CPU
No special requirements; it should have at least 2 cores.
RAM
Plan at least 2GB for unobtanium alone; this includes some buffer for caching and for making sure nothing runs out of memory. In practice you'll very likely use less, but not leaving that RAM free for the kernel to use as cache will noticeably slow down searching.
Disk
For a small index of one medium-sized blog, 100MB should be enough.
For large deployments, plan ~10GB per 100K searchable pages for the crawler database and another 6GB per 100K pages per summary database (in practice you'll probably want two of them so you can bring in new data without downtime).
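
To make the sizing rule above concrete, here is a rough sketch of the arithmetic, written in Python purely for illustration. The constants are the estimates from this guide, and the default of two summary databases is the common case mentioned above.

  # Rough disk-sizing sketch based on the rules of thumb above:
  # ~10GB crawler database plus ~6GB per summary database per 100K pages.
  # The constants are this guide's estimates, not exact measurements.
  def estimate_disk_gb(pages, summary_dbs=2):
      per_100k = pages / 100_000
      crawler_db = 10 * per_100k                  # crawler database
      summaries = 6 * per_100k * summary_dbs      # summary databases (usually two)
      return crawler_db + summaries

  # Example: 500K searchable pages with two summary databases -> about 110GB.
  print(f"{estimate_disk_gb(500_000):.0f} GB")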

What you need to do, overview

Bootstrapping a search engine:

  1. Writing and testing an initial crawler configuration
  2. Expanding the configuration
  3. Doing a full initial crawl and summary
  4. Setting up the frontend

Maintenance:

Your first crawler configuration

Have the crawler configuration manual ready.

This file will define where the crawler is allowed to collect pages from.

You may find the configuration of unobtanium.rocks useful. Use the shared policies file with the --policy-file option; it keeps the crawler away from some common parts of websites it should avoid.

Hint: Start with a small configuration of one to three seeds and a crawler command limit of 1000. This will quickly give you a feel for how the crawler behaves and what it tries to collect.
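
To make the hint concrete, here is a sketch of what a minimal first configuration needs to cover, written as a Python dictionary purely for illustration. The key names, file path, and URLs below are placeholders, not unobtanium's actual syntax; the real configuration format is defined in the crawler configuration manual.

  # Illustration only: the ingredients of a minimal first crawl, NOT
  # unobtanium's actual configuration syntax (see the crawler configuration
  # manual for that). All names, paths, and URLs here are placeholders.
  initial_crawl = {
      # One to three seed sites you know well and expect results from.
      "seeds": [
          "https://blog.example.org/",
          "https://docs.example.net/",
      ],
      # A low crawler command limit keeps the first test runs short.
      "command_limit": 1000,
      # The shared policies file from unobtanium.rocks, passed to the crawler
      # with --policy-file, which skips parts of websites it should avoid.
      "policy_file": "shared-policies.txt",  # placeholder path
  }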

After the first successful crawl, try running the summarizer and pointing a locally running viewer at the resulting summary database. Then type in some keywords you'd expect to find results for.

Once that works, you can expand the crawling configuration and raise the crawler command limit.