This is a step-by-step guide to setting up your first unobtanium search instance, from installing the dependencies to getting your first search results.
This guide assumes you are running Linux and know how to navigate in the terminal.
In case you get stuck: Getting stuck while following this guide really shouldn't happen; if you do, please open an issue on codeberg.org/unobtanium/unobtanium-documentation.
Resource requirements: This tutorial requires almost 2 GB of disk space; make sure you have that much free.
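If you're not sure how much free space you have, a quick check (not part of the original guide, just a convenience) is:
# Show the free space on the filesystem of the current folder
df -h .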
Installing dependencies
You need the following packages:
- git - The version control system
- rustc - The Rust compiler (sometimes just called rust)
- cargo - The Rust build system
- SQLite development files - The SQLite database
- OpenSSL development files
- A text editor for editing files
- A web-browser for viewing the final result
These packages don't always have the same names, but they should be available for every Linux distribution. Operating systems other than Linux are currently not supported; consider running this inside a Linux virtual machine.
On Alpine Linux:
apk add git rust cargo sqlite-dev openssl-dev
On Debian trixie:
apt install git rustc cargo libsqlite3-dev libssl-dev
On Void Linux:
xbps-install git rust cargo sqlite-devel openssl-devel
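Whichever distribution you use, you can quickly verify that the toolchain is installed before continuing; this is just a sanity check, not a step from the original guide:
# Each of these should print a version number
git --version
rustc --version
cargo --version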
Setting up
To set up, create a folder my-first-unobtanium; everything in this tutorial will happen inside it. (The exact name isn't important, but this tutorial is going to reference it a lot.)
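If you haven't created the folder yet, a single command does it (the name matches the one used throughout this tutorial):
mkdir my-first-unobtanium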
Now run the following commands:
# Navigate to the folder you just created
cd my-first-unobtanium
# Download the source code using git
git clone https://codeberg.org/unobtanium/unobtanium
# Navigate inside the source code
cd unobtanium
# Use a known working version of unobtanium
# that doesn't require extra steps
git checkout v3.0.0
# Git will complain about something it calls 'detached HEAD' state.
# This is okay since we won't be doing any development.
# Run the Rust compiler to build release-optimized versions
# This will take a while ...
cargo build --release
# Create a folder outside the source code
# where we can put the resulting binaries
mkdir ../bin
# Copy the binaries to the bin folder we just created
cp target/release/unobtanium-viewer ../bin/
cp target/release/unobtanium-crawler ../bin/
# Free up some space
cargo clean
# Back to the my-first-unobtanium folder
cd ..
# Tell your shell that there are additional commands
# in the bin folder so you can use them by typing their name.
# This is not permanent:
# If you come back later remember to repeat this step.
export PATH="$PATH:$PWD/bin/"
# Make sure the crawler is there
unobtanium-crawler --help
# Make sure the viewer is there
unobtanium-viewer --help
You now have an environment that will work for the rest of the tutorial.
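If you'd rather not repeat the export PATH step every time you come back, you can store the absolute path of the bin folder in your shell's startup file. This sketch assumes you use bash and are currently inside the my-first-unobtanium folder; other shells have their own startup files:
# Append the absolute bin path to ~/.bashrc so future
# bash sessions pick it up automatically
echo "export PATH=\"\$PATH:$PWD/bin/\"" >> ~/.bashrc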
Your first crawl
To search something you need an index; to build an index you need raw data.
In this step we will:
- Create a first configuration file for the crawler
- Run the crawler to get data from the web
- Summarise the data to get a searchable index
Creating a configuration file
Inside the my-first-unobtanium folder, create a text file example_config.toml:
# This is just a human readable name
name = "Unobtanium example index"
# Wait one second between each request to the same URL origin
default_delay_ms = 1000
# The number of requests to attempt when running the crawler command once.
max_commands_per_run = 100
# Only crawl pages that haven't been crawled within a week
recrawl_interval = "1 week"
# The http `User-Agent`, in this case a placeholder for the tutorial
user_agent = "unobtanium-tutorial-crawler"
# The entry points of the sites that unobtanium should crawl.
seeds = [
"https://doc.unobtanium.rocks/",
"https://slatecave.net/",
]
Please don't change the file for now; you can mix it up after you've gotten it working. I know you're curious.
Running the crawler
In this step the crawler will collect data from the web into the crawl database.
Crawl database: The crawl database contains raw web pages along with information on when and how they were fetched; other search engines call this their "repository".
Back in the terminal, inside the my-first-unobtanium folder, run:
unobtanium-crawler crawl \
--config example_config.toml \
--database example_crawl.db
This will run for about a minute and will collect slightly fewer than 100 documents from doc.unobtanium.rocks and slatecave.net in roughly a 50:50 split.
Running the crawler command again will fetch another (almost) 100 documents.
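If you want a few more documents in the database before moving on, you can simply repeat that command; a small shell loop is one (entirely optional) way to do it:
# Run three more crawl rounds back to back
for i in 1 2 3; do
  unobtanium-crawler crawl \
    --config example_config.toml \
    --database example_crawl.db
done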
Interrupting the crawler: You can stop the crawler like any other well-behaved command line program with Ctrl+C.
The rest of this section explains what is going on.
To break the command down:
- unobtanium-crawler is the unobtanium crawler; it is your go-to multitool that implements almost everything of unobtanium that isn't part of the search interface.
- crawl is a subcommand that tells it to crawl websites.
- --config example_config.toml tells it where the configuration file for crawling is, in this case example_config.toml.
- --database example_crawl.db tells it to create the file example_crawl.db and use it as the database to store the crawling results in.
While the crawler is running you can observe it doing a number of things:
- Initialising its database
- Running the crawler tasks ("Crawl loop")
  - fetching robots.txt for …: the crawler always fetches the robots.txt of a website first so it doesn't go where it isn't welcome.
  - [ file_ingested ]: The crawler successfully downloaded a document and stored it.
  - Scheduling …: This is the crawler telling you what it plans to do next; everything except the seed URLs needs to be scheduled before being crawled.
  - [ file_of_unknown_type ] …: The crawler tried to download a file, but then found out that it doesn't know the format. This happens every few requests and is usually nothing to worry about.
  - [ redirect ] …: This is the same as your browser being redirected; the crawler notes those redirects and will at some point schedule the URL the redirect pointed to.
  - [ blocked_at_request_of_remote ] …: looks scary, but isn't. This is a page requesting not to be indexed, and the crawler just told you that it respected that.
  - Ran out of crawl commands on …: at some point the limit on requests ("crawl commands") you set in the configuration file will be reached.
- [Crawl_Statistics_Report]: These are some nice numbers so you know what the crawler did while you weren't looking.
  - out_of_patience_origins: if any origin (from the seeds in the configuration file) throws so many errors that the crawler gives up on it, it will be listed here. For you this should just be an empty list.
  - total_requests: the total number of requests that were sent.
  - ingested_files: how many files you now have ready for building your search index.
- Optimizing and vacuuming the database ...: this is the step where the database runs some heavy optimisation and cleanup. For larger databases this usually takes a while; for this tutorial it shouldn't take much longer than a second.
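If you're curious what ended up on disk: the crawl database appears to be a plain SQLite file (hence the SQLite dependency), so if you happen to have the sqlite3 command-line tool installed you can peek inside. The table layout is an unobtanium internal, so treat this purely as exploration:
# List the tables inside the crawl database (read-only exploration)
sqlite3 example_crawl.db '.tables'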
Running the summarizer
The crawler gave you raw data from the web, which is as searchable as a pile of random papers someone dumped on your desk without explanation. The summarizer takes this pile of pages and generates the summary database.
Summary Database: The summary database is the search index of unobtanium and contains data and metadata in a way that is easily searchable.
The summarizer is also built into the crawler; you can run it with the following command in the my-first-unobtanium folder:
unobtanium-crawler summarize \
--crawler-db example_crawl.db \
--summary-db example_summary.db
This will run for a few seconds and generate the file example_summary.db.
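A quick way to confirm that both databases exist and to compare their sizes (just a sanity check, not a required step):
# Both files should show up with a non-trivial size
ls -lh example_crawl.db example_summary.db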
Interrupting the summarizer: Like the crawler, the summarizer can be interrupted with Ctrl+C.
The rest of this section explains what is going on.
To break down the command:
- unobtanium-crawler is again the crawler binary that all the tools are built into.
- summarize is the subcommand that starts the summarizer.
- --crawler-db example_crawl.db tells the summarizer where to find the data that the crawler collected.
- --summary-db example_summary.db tells the summarizer to put its output in the file example_summary.db.
The summarizer will do a few things:
- Initialise its database
- Summarizing Page 1 …: This is the summarization loop running batches of up to 1000 documents called pages.
- Found 0 self duplicates.: Self duplicates are pages where the metadata from the server indicated that they have changed, but unobtanium detected that their main content didn't change. On the first run there of course aren't any of these.
- Found … exact duplicates.: Sometimes the same document is available at two different addresses; when unobtanium notices this it picks one of the addresses as the canonical one and flags the others as "exact duplicates".
- (Re)generating full text index ...: in unobtanium 3.0.0 this rebuilds the part of the index that provides the actual search engine, where query words go in and results come out.
- Optimizing and vacuuming the summary database ...: cleaning up the database again; this will take longer the larger the database grows, but for this tutorial it should not take much longer than a second.
Running the search frontend
Now that you have an index, you'll want to search it; you can start the search frontend with the following command in the my-first-unobtanium folder:
unobtanium-viewer \
--summary-db example_summary.db \
--template-location unobtanium/viewer/templates/
You should see it starting up some search workers, starting the templating engine and then telling you Web interface on: 127.0.0.1:3000.
You can now open http://127.0.0.1:3000 in your local web-browser and you'll be greeted by a search box.
Access from the local network: In case you want or need to open the search engine to your local network, pass in --listen 0.0.0.0:3000. In a real deployment on the internet you want to use a reverse proxy.
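For example, a local-network variant of the command above could look like this; it uses the same flags as before plus --listen, so only do this on a network you trust:
unobtanium-viewer \
  --summary-db example_summary.db \
  --template-location unobtanium/viewer/templates/ \
  --listen 0.0.0.0:3000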
Some queries for you to try:
- unobtanium
- slatians hideout
- sql
- your first search engine
You can stop the viewer with Ctrl+C.
File overview
An overview of all files created in this guide:
- 📁 my-first-unobtanium
  - 📁 unobtanium - The unobtanium source code from Codeberg
  - 📁 bin
    - ⚙️ unobtanium-crawler
    - ⚙️ unobtanium-viewer
  - 📄 example_config.toml - Configuration file
  - 📚 example_crawl.db - Crawler database
  - 📚 example_summary.db - Summary database
What now?
Congratulations, you now have a working search engine!
To build this from a tutorial example into a real search engine, whether on your own network or on the internet, the next steps are:
- Change the user_agent in the configuration file (User-Agent manual)
- Raise the max_commands_per_run limit (multiplying by 10 for each step works well)
- Have a look at the crawler configuration manual to know what is possible.
- Add the shared policies file to your crawler by downloading it and adding the --policy-file option to your crawler configuration.
- Add your own sites to the seeds list.
- Rerun crawling and summarizing (see the commands after this list); if the databases grow too large you can always throw them away and start over.
- See the selfhosting guide.
- Put unobtanium on a server behind a reverse proxy.
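Rerunning the crawl and summarize steps uses the same commands as earlier in this tutorial; a sketch, assuming you kept the file names from this guide:
# Fetch another batch of pages into the crawl database
unobtanium-crawler crawl \
  --config example_config.toml \
  --database example_crawl.db
# Update the search index with the newly crawled data
unobtanium-crawler summarize \
  --crawler-db example_crawl.db \
  --summary-db example_summary.db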