So you've heard about unobtanium and want to host your own instance of it? This guide is for you.
Note: Unobtanium is unfinished software; be prepared for a fair bit of tinkering.
Warning: This guide is incomplete, but should be enough to get you started.
I'll assume you already know a few things:
- Linux administration in general
- Configuring a web reverse proxy
- HTML
- SQL
- TOML
- How to use cargo build to build your own binaries (a short build sketch follows below)
- Shell scripting (for your own convenience)
If you haven't yet, please have a look at the overview page.
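Since you will be building your own binaries, here is a minimal build sketch. The repository URL is a placeholder; take the real one from the project page. With a standard Rust toolchain, cargo build --release puts the binaries under target/release/.

```sh
# Minimal build sketch. The repository URL below is a placeholder;
# clone the actual unobtanium repository instead.
git clone https://example.org/unobtanium.git
cd unobtanium
cargo build --release
# release binaries end up under target/release/
ls target/release/
```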
Resource requirements
- CPU
- No special requirements beyond having at least 2 cores.
- RAM
- Plan at least 2GB for unobtanium alone; this includes some buffer for caching and for making sure nothing runs out of memory. In practice it's very likely you'll use less, but not having that RAM free for the kernel to use as cache will noticeably slow down searching.
- Disk
- For a small index of one medium-sized blog, 100MB should be enough.
- For large deployments, plan ~10GB per 100K searchable pages for the crawler database and another 6GB per 100K pages for each summary database (in practice you'll probably want two of them to get new data in without downtime).
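To make the disk figures concrete, here is a rough sizing sketch in shell, using the per-100K-page estimates above and assuming you keep two summary databases:

```sh
# Rough disk estimate from the figures above:
#   ~10 GB per 100K pages for the crawler database
#   ~6 GB per 100K pages per summary database (two kept for downtime-free updates)
PAGES_100K=3   # planned index size in units of 100K searchable pages
CRAWLER_GB=$(( PAGES_100K * 10 ))
SUMMARY_GB=$(( PAGES_100K * 6 * 2 ))
echo "crawler db:  ~${CRAWLER_GB} GB"
echo "summary dbs: ~${SUMMARY_GB} GB"
echo "total:       ~$(( CRAWLER_GB + SUMMARY_GB )) GB"
```

For a 300K-page index this comes out to roughly 30GB for the crawler database plus 36GB for the two summary databases.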
What you need to do, overview
Bootstrapping a search engine:
- Writing and testing an initial crawler configuration
- Expanding the configuration
- Doing a full initial crawl and summary
- Setting up the frontend
Maintenance:
- Semi-regular recrawls
- Updating the crawler configuration
- Keeping unobtanium updated
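For the semi-regular recrawls, a cron entry driving a small wrapper script is usually enough. The schedule, paths, and the recrawl.sh script below are purely illustrative; you would write the wrapper yourself around your own crawler and summarizer invocations.

```sh
# Illustrative crontab entry: run a (hypothetical) recrawl wrapper at 03:00
# on the 1st and 15th of every month and keep a log of its output.
# m h dom mon dow  command
0 3 1,15 * *  /opt/unobtanium/recrawl.sh >> /var/log/unobtanium-recrawl.log 2>&1
```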
Your first crawler configuration
Have the crawler's crawl configuration manual ready.
This file will define where the crawler is allowed to collect pages from.
You may find the configuration of unobtanium.rocks useful. Use the shared policies file with the --policy-file
option; it keeps the crawler away from some common parts of websites that it should not collect.
Hint: Start with a small configuration of one to three seeds and a crawler command limit of 1000. This will quickly give you a feel for how the crawler behaves and what it tries to collect.
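As a concrete illustration of such a small test setup, the sketch below writes a tiny seed configuration and runs a first crawl against it. Only the --policy-file option comes from this guide; the binary name, file names, and TOML keys (seeds, command_limit) are placeholders, so take the real names from the crawl configuration manual.

```sh
# Hypothetical first test crawl: every name below except --policy-file is a
# placeholder; check the crawl configuration manual for the real keys.
cat > crawl-test.toml <<'EOF'
# one to three seeds and a low command limit, as suggested above
seeds = ["https://blog.example.org/"]
command_limit = 1000
EOF

./target/release/unobtanium-crawler \
    --policy-file shared-policies.toml \
    crawl-test.toml
```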
After the first successful crawl, try running the summarizer and pointing a locally running viewer at the resulting summary database. Try typing in some keywords you'd expect to find results for.
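A sketch of that first end-to-end check might look as follows; the summarizer and viewer binary names, options, and database paths are again placeholders for whatever your build produces.

```sh
# Placeholder commands: substitute the real summarizer and viewer binaries,
# options, and database paths from your build.
./target/release/unobtanium-summarizer crawl.db summary.db
./target/release/unobtanium-viewer --summary summary.db
# then open the viewer's local address in a browser and try some keywords
```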
Once that works you can expand the crawling configuration and raise the crawler command limit.