<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
    <title>Unobtanium Documentation</title>
    <link rel="self" type="application/atom+xml" href="https://doc.unobtanium.rocks/atom.xml"/>
    <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks"/>
    <generator uri="https://www.getzola.org/">Zola</generator>
    <updated>2026-02-01T00:00:00+00:00</updated>
    <id>https://doc.unobtanium.rocks/atom.xml</id>
    <entry xml:lang="en">
        <title>Your first search engine</title>
        <published>2026-02-01T00:00:00+00:00</published>
        <updated>2026-02-01T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/manual/your-first-search-engine/"/>
        <id>https://doc.unobtanium.rocks/manual/your-first-search-engine/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/manual/your-first-search-engine/">&lt;p&gt;This is a step by step guide on setting up your first unobtanium search instance from installing the dependencies to getting first search results.&lt;&#x2F;p&gt;
&lt;p&gt;This guide assumes you are running on Linux and know how to navigate on the terminal.&lt;&#x2F;p&gt;
&lt;p&gt;!note: &lt;b&gt;In case you get stuck:&lt;&#x2F;b&gt; Getting stuck while following this guide really shouldn&#x27;t happen. If you do get stuck, &lt;a href=&quot;https:&#x2F;&#x2F;codeberg.org&#x2F;unobtanium&#x2F;unobtanium-documentation&#x2F;issues&quot;&gt;please open an issue on codeberg.org&#x2F;unobtanium&#x2F;unobtanium-documentation&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;!note: &lt;b&gt;Resource requirements:&lt;&#x2F;b&gt; This tutorial requires almost 2GB of disk space. Make sure you have that much free.&lt;&#x2F;p&gt;
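&lt;p&gt;As a quick sanity check before you start, the following sketch compares the free space in the current folder against the roughly 2GB this tutorial needs. It assumes the standard &lt;code&gt;df&lt;&#x2F;code&gt; and &lt;code&gt;awk&lt;&#x2F;code&gt; tools are installed; the variable names are made up for illustration.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sh&quot; class=&quot;language-sh &quot;&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;# 2GB expressed in kilobytes (assumed threshold, taken from the note above)
required_kb=$((2 * 1024 * 1024))

# Free space of the filesystem holding the current folder, in kilobytes.
# With -P the data always sits on the second line, column four.
available_kb=$(df -Pk . | awk 'NR==2 {print $4}')

if [ "$available_kb" -ge "$required_kb" ]; then
	echo "Enough free space for the tutorial"
else
	echo "Less than 2GB free, consider cleaning up first"
fi
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;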
&lt;h2 id=&quot;installing-dependencies&quot;&gt;Installing dependencies&lt;&#x2F;h2&gt;
&lt;p&gt;You need the following packages:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;git - The version control system&lt;&#x2F;li&gt;
&lt;li&gt;rustc - The Rust compiler (sometimes just called rust)&lt;&#x2F;li&gt;
&lt;li&gt;cargo - The Rust build system&lt;&#x2F;li&gt;
&lt;li&gt;SQLite development files - The SQLite database&lt;&#x2F;li&gt;
&lt;li&gt;OpenSSL development files&lt;&#x2F;li&gt;
&lt;li&gt;A text editor for editing files&lt;&#x2F;li&gt;
&lt;li&gt;A web-browser for viewing the final result&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;These packages don&#x27;t always have the same names, but they should be available for every Linux distribution. Operating systems other than Linux are currently not supported; consider running this inside a Linux virtual machine.&lt;&#x2F;p&gt;
&lt;p&gt;On Alpine Linux:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sh&quot; class=&quot;language-sh &quot;&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;apk add git rust cargo sqlite-dev openssl-dev
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;On Debian trixie:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sh&quot; class=&quot;language-sh &quot;&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;apt install git rustc cargo libsqlite3-dev libssl-dev
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;On Void Linux:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sh&quot; class=&quot;language-sh &quot;&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;xbps-install git rust cargo sqlite-devel openssl-devel
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h2 id=&quot;setting-up&quot;&gt;Setting up&lt;&#x2F;h2&gt;
&lt;p&gt;To set up, create a folder &lt;code&gt;my-first-unobtanium&lt;&#x2F;code&gt;. Everything in this tutorial will happen inside it. (The exact name isn&#x27;t important, but this tutorial is going to reference it a lot.)&lt;&#x2F;p&gt;
&lt;p&gt;Now run the following commands to set up:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sh&quot; class=&quot;language-sh &quot;&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;# Navigate to the folder you just created
cd my-first-unobtanium

# Download the source code using git
git clone https:&amp;#x2F;&amp;#x2F;codeberg.org&amp;#x2F;unobtanium&amp;#x2F;unobtanium

# Navigate inside the source code
cd unobtanium

# Use a known working version of unobtanium
# that doesn&amp;#x27;t require extra steps
git checkout v3.0.0

# Git will complain about something it calls &amp;#x27;detached HEAD&amp;#x27; state.
# This is okay since we won&amp;#x27;t be doing any development.

# Run the rust compiler to build release optimized versions
# This will take a while ...
cargo build --release

# Create a folder outside the source code
# where we can put the resulting binaries
mkdir ..&amp;#x2F;bin

# Copy the binaries to the bin folder we just created
cp target&amp;#x2F;release&amp;#x2F;unobtanium-viewer ..&amp;#x2F;bin&amp;#x2F;
cp target&amp;#x2F;release&amp;#x2F;unobtanium-crawler ..&amp;#x2F;bin&amp;#x2F;

# Free up some space
cargo clean

# Back to the my-first-unobtanium folder
cd ..

# Tell your shell that there are additional commands
# in the bin folder so you can use them by typing their name.
# This is not permanent:
# If you come back later remember to repeat this step.
export PATH=&amp;quot;$PATH:$PWD&amp;#x2F;bin&amp;#x2F;&amp;quot;

# Make sure the crawler is there
unobtanium-crawler --help

# Make sure the viewer is there
unobtanium-viewer --help
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;You now have an environment that will work for the rest of the tutorial.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;your-first-crawl&quot;&gt;Your first crawl&lt;&#x2F;h2&gt;
&lt;p&gt;To search something you need an index; to build an index you need raw data.&lt;&#x2F;p&gt;
&lt;p&gt;In this step we will:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Create a first configuration file for the crawler&lt;&#x2F;li&gt;
&lt;li&gt;Run the crawler to get data from the web&lt;&#x2F;li&gt;
&lt;li&gt;Summarise the data to get a searchable index&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;creating-a-configuration-file&quot;&gt;Creating a configuration file&lt;&#x2F;h3&gt;
&lt;p&gt;Inside the &lt;code&gt;my-first-unobtanium&lt;&#x2F;code&gt; folder create a text file &lt;code&gt;example_config.toml&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;toml&quot; class=&quot;language-toml &quot;&gt;&lt;code class=&quot;language-toml&quot; data-lang=&quot;toml&quot;&gt;# This is just a human readable name
name = &amp;quot;Unobtanium example index&amp;quot;

# Wait one second between each request to the same URL origin
default_delay_ms = 1000

# The number of requests to attempt when running the crawler command once.
max_commands_per_run = 100

# Only crawl pages that haven&amp;#x27;t been crawled within a week
recrawl_interval = &amp;quot;1 week&amp;quot;

# The http `User-Agent`, in this case a placeholder for the tutorial
user_agent = &amp;quot;unobtanium-tutorial-crawler&amp;quot;

# The entry points of the sites that unobtanium should crawl.
seeds = [
	&amp;quot;https:&amp;#x2F;&amp;#x2F;doc.unobtanium.rocks&amp;#x2F;&amp;quot;,
	&amp;quot;https:&amp;#x2F;&amp;#x2F;slatecave.net&amp;#x2F;&amp;quot;,
]
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Please don&#x27;t change the file for now, you can mix it up after you&#x27;ve gotten it working. I know you&#x27;re curious.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;running-the-crawler&quot;&gt;Running the crawler&lt;&#x2F;h3&gt;
&lt;p&gt;This step will collect the crawl database from the web.&lt;&#x2F;p&gt;
&lt;p&gt;!note: &lt;b&gt;Crawl Database:&lt;&#x2F;b&gt; The crawl database contains raw web pages along with information on when and how they were fetched. Other search engines call this their &quot;repository&quot;.&lt;&#x2F;p&gt;
&lt;p&gt;Back in the terminal, inside the &lt;code&gt;my-first-unobtanium&lt;&#x2F;code&gt; folder, run:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sh&quot; class=&quot;language-sh &quot;&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;unobtanium-crawler crawl \
	--config example_config.toml \
	--database example_crawl.db
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This will run for about a minute and will collect slightly less than 100 documents from &lt;code&gt;doc.unobtanium.rocks&lt;&#x2F;code&gt; and &lt;code&gt;slatecave.net&lt;&#x2F;code&gt; in what is roughly a 50:50 split.&lt;&#x2F;p&gt;
&lt;p&gt;Running the crawler command again will fetch another (almost) 100 documents.&lt;&#x2F;p&gt;
&lt;p&gt;!note: &lt;b&gt;Interrupting the crawler:&lt;&#x2F;b&gt; You can stop the crawler like any other well behaved command line program through &lt;key&gt;&lt;key&gt;Ctrl&lt;&#x2F;key&gt;+&lt;key&gt;C&lt;&#x2F;key&gt;&lt;&#x2F;key&gt;. When you start it again, it will continue where it left off.&lt;&#x2F;p&gt;
&lt;p&gt;The rest of this section is an explanation of what is going on.&lt;&#x2F;p&gt;
&lt;p&gt;To break the command down:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;unobtanium-crawler&lt;&#x2F;code&gt; is the unobtanium crawler. It is your go-to multitool that implements almost everything of unobtanium that isn&#x27;t part of the search interface.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;crawl&lt;&#x2F;code&gt; is a subcommand to tell it to crawl websites&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;--config example_config.toml&lt;&#x2F;code&gt; tells it where the configuration file for crawling is, here it is the &lt;code&gt;example_config.toml&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;--database example_crawl.db&lt;&#x2F;code&gt; tells it to create the file &lt;code&gt;example_crawl.db&lt;&#x2F;code&gt; and use it as the database to store the crawling results in.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;While the crawler is running you can observe it doing a number of things:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Initialising its database&lt;&#x2F;li&gt;
&lt;li&gt;Running the crawler tasks (&quot;Crawl loop&quot;)
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;fetching robots.txt for …&lt;&#x2F;code&gt; the crawler always fetches the &lt;code&gt;robots.txt&lt;&#x2F;code&gt; of a website first so it doesn&#x27;t go where it isn&#x27;t welcome.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;[ file_ingested ]&lt;&#x2F;code&gt; The crawler successfully downloaded a document and stored it&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;Scheduling …&lt;&#x2F;code&gt; This is the crawler telling you what it plans to do next, everything except the seed URLs needs to be scheduled before being crawled.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;[ file_of_unknown_type ] …&lt;&#x2F;code&gt; The crawler tried to download a file, but then found out that it doesn&#x27;t know the format. This happens every few requests and usually is nothing to worry about.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;[ redirect ] …&lt;&#x2F;code&gt; This is the same as your browser being redirected. The crawler notes those redirects and will at some point schedule the URL the redirect pointed it to.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;[ blocked_at_request_of_remote ] …&lt;&#x2F;code&gt; looks scary, but isn&#x27;t. This is a page requesting not to be indexed and the crawler just told you that it respected that.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;Ran out of crawl commands on …&lt;&#x2F;code&gt; at some point the limit on requests (&quot;crawl commands&quot;) you set in the configuration file will be reached.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;[Crawl_Statistics_Report]&lt;&#x2F;code&gt; These are some numbers so you know what the crawler did while you weren&#x27;t looking.
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;out_of_patience_origins&lt;&#x2F;code&gt; If any origin (from the &lt;code&gt;seeds&lt;&#x2F;code&gt; in the configuration file) throws so many errors that the crawler gives up on it, it will be listed here. For you this should just be an empty list.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;total_requests&lt;&#x2F;code&gt; this is the total number of requests that were sent.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;ingested_files&lt;&#x2F;code&gt; how many files you now have ready for building your search index.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;Optimizing and vacuuming the database ...&lt;&#x2F;code&gt; This is the step where the database runs some heavy optimisation and cleanup. For larger databases this usually takes a while; for this tutorial it shouldn&#x27;t take much longer than a second.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;running-the-summarizer&quot;&gt;Running the summarizer&lt;&#x2F;h2&gt;
&lt;p&gt;The crawler gave you raw data from the web, which is as searchable as a pile of random papers someone dumped on your desk without explanation. The summarizer takes this pile of pages and generates the summary database.&lt;&#x2F;p&gt;
&lt;p&gt;!note: &lt;b&gt;Summary Database:&lt;&#x2F;b&gt; The summary database is the search index of unobtanium and contains data and metadata in a way that is easily searchable.&lt;&#x2F;p&gt;
&lt;p&gt;The summarizer is also built into the crawler. You can run it with the following command in the &lt;code&gt;my-first-unobtanium&lt;&#x2F;code&gt; folder:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sh&quot; class=&quot;language-sh &quot;&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;unobtanium-crawler summarize \
	--crawler-db example_crawl.db \
	--summary-db example_summary.db
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This will run for a few seconds and generate the file &lt;code&gt;example_summary.db&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;!note: &lt;b&gt;Interrupting the summarizer:&lt;&#x2F;b&gt; Like the crawler, the summarizer can be interrupted through &lt;key&gt;&lt;key&gt;Ctrl&lt;&#x2F;key&gt;+&lt;key&gt;C&lt;&#x2F;key&gt;&lt;&#x2F;key&gt;. It will also resume where it was interrupted.&lt;&#x2F;p&gt;
&lt;p&gt;The rest of this section is an explanation of what is going on.&lt;&#x2F;p&gt;
&lt;p&gt;To break down the command:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;unobtanium-crawler&lt;&#x2F;code&gt; is again the crawler binary that all the tools are built into&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;summarize&lt;&#x2F;code&gt; is the subcommand that starts the summarizer.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;--crawler-db example_crawl.db&lt;&#x2F;code&gt; tells the summarizer where to find the data that the crawler collected.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;--summary-db example_summary.db&lt;&#x2F;code&gt; tells the summarizer to put its output in the file &lt;code&gt;example_summary.db&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The summarizer will do a few things:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Initialise its database&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;Summarizing Page 1 …&lt;&#x2F;code&gt; This is the summarization loop running batches of up to 1000 documents called pages.
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Found 0 self duplicates.&lt;&#x2F;code&gt; Self duplicates are pages where the metadata from the server indicated that they have changed, but unobtanium detected that their main content didn&#x27;t change. On the first run there of course aren&#x27;t any of these.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;Found … exact duplicates.&lt;&#x2F;code&gt; Sometimes the same document is available on two different addresses. When unobtanium notices this, it picks one of the addresses as the canonical one and flags the others as &quot;exact duplicates&quot;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;(Re)generating full text index ...&lt;&#x2F;code&gt; in unobtanium 3.0.0 this rebuilds the part of the index that provides the actual search engine where query words go in and results come out.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;Optimizing and vacuuming the summary database ...&lt;&#x2F;code&gt; Cleaning up the database again, this will take longer the larger the database grows, but for the tutorial this should not take much longer than a second.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
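&lt;p&gt;Since crawling and summarizing are usually rerun together, the two commands from this guide can be wrapped in a small shell function. This is only a sketch: the function name &lt;code&gt;refresh_index&lt;&#x2F;code&gt; and the &lt;code&gt;crawler&lt;&#x2F;code&gt; variable are made up for this example, while the subcommands and flags are the ones used above.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sh&quot; class=&quot;language-sh &quot;&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;# Name of the crawler binary (hypothetical variable, adjust if needed)
crawler="unobtanium-crawler"

# Crawl once more, then update the summary database from the crawl database.
# If the crawl fails the function stops and the summarizer is not started.
refresh_index() {
	"$crawler" crawl \
		--config example_config.toml \
		--database example_crawl.db || return 1
	"$crawler" summarize \
		--crawler-db example_crawl.db \
		--summary-db example_summary.db
}

# Usage, from inside the my-first-unobtanium folder:
#   refresh_index
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;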
&lt;h2 id=&quot;running-the-search-frontend&quot;&gt;Running the search frontend&lt;&#x2F;h2&gt;
&lt;p&gt;Now that you have an index, you want to search it. You can start the search frontend using the following command in the &lt;code&gt;my-first-unobtanium&lt;&#x2F;code&gt; folder:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sh&quot; class=&quot;language-sh &quot;&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;unobtanium-viewer \
	--summary-db example_summary.db \
	--template-location unobtanium&amp;#x2F;viewer&amp;#x2F;templates&amp;#x2F;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;You should see it starting up some search workers, starting the templating engine and then telling you &lt;code&gt;Web interface on: 127.0.0.1:3000&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;You can now open &lt;a href=&quot;http:&#x2F;&#x2F;127.0.0.1:3000&quot;&gt;http:&#x2F;&#x2F;127.0.0.1:3000&lt;&#x2F;a&gt; in your local web-browser and you&#x27;ll be greeted by a search box.&lt;&#x2F;p&gt;
&lt;p&gt;!note: &lt;b&gt;Access from the local network:&lt;&#x2F;b&gt; In case you want or need to open the search engine for your local network, pass in &lt;code&gt;--listen 0.0.0.0:3000&lt;&#x2F;code&gt;. In a real deployment on the internet you want to use a reverse proxy.&lt;&#x2F;p&gt;
&lt;p&gt;Some queries for you to try:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;unobtanium&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;slatians hideout&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;sql&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;your first search engine&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;You can stop the viewer with &lt;key&gt;&lt;key&gt;Ctrl&lt;&#x2F;key&gt;+&lt;key&gt;C&lt;&#x2F;key&gt;&lt;&#x2F;key&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;file-overview&quot;&gt;File overview&lt;&#x2F;h2&gt;
&lt;p&gt;An overview of all files created in this guide:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;📁 &lt;code&gt;my-first-unobtanium&lt;&#x2F;code&gt;
&lt;ul&gt;
&lt;li&gt;📁 &lt;code&gt;unobtanium&lt;&#x2F;code&gt;
&lt;ul&gt;
&lt;li&gt;The unobtanium source code from Codeberg&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;📁 &lt;code&gt;bin&lt;&#x2F;code&gt;
&lt;ul&gt;
&lt;li&gt;⚙️ &lt;code&gt;unobtanium-crawler&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;⚙️ &lt;code&gt;unobtanium-viewer&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;📄 &lt;code&gt;example_config.toml&lt;&#x2F;code&gt; - Configuration file&lt;&#x2F;li&gt;
&lt;li&gt;📚 &lt;code&gt;example_crawl.db&lt;&#x2F;code&gt; - Crawler database&lt;&#x2F;li&gt;
&lt;li&gt;📚 &lt;code&gt;example_summary.db&lt;&#x2F;code&gt; - Summary database&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;what-now&quot;&gt;What now?&lt;&#x2F;h2&gt;
&lt;p&gt;Congratulations, you now have a working search engine!&lt;&#x2F;p&gt;
&lt;p&gt;To build this from a tutorial example into a real search engine, whether it be on your own network or for the internet, the next steps are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Change the &lt;code&gt;user_agent&lt;&#x2F;code&gt; in the configuration file (&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;crawler-user-agent-and-robots-txt&#x2F;&quot;&gt;User-Agent manual&lt;&#x2F;a&gt;)&lt;&#x2F;li&gt;
&lt;li&gt;Raise the &lt;code&gt;max_commands_per_run&lt;&#x2F;code&gt; limit (multiplying by 10 for each step works well)&lt;&#x2F;li&gt;
&lt;li&gt;Have a look at the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;crawler-crawl-configuration&#x2F;&quot;&gt;crawler configuration manual&lt;&#x2F;a&gt; to know what is possible.&lt;&#x2F;li&gt;
&lt;li&gt;Add the &lt;a href=&quot;https:&#x2F;&#x2F;codeberg.org&#x2F;unobtanium&#x2F;unobtanium-rocks-index&#x2F;src&#x2F;branch&#x2F;main&#x2F;shared-policies.toml&quot;&gt;shared policies file&lt;&#x2F;a&gt; to your crawler by downloading it and adding the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;unobtanium-crawler&#x2F;#crawl-c-database&quot;&gt;&lt;code&gt;--policy-file&lt;&#x2F;code&gt; option&lt;&#x2F;a&gt; to your crawler command.&lt;&#x2F;li&gt;
&lt;li&gt;Add your own sites to the &lt;code&gt;seeds&lt;&#x2F;code&gt; list.&lt;&#x2F;li&gt;
&lt;li&gt;Rerun crawling and summarizing. If the databases grow too large you can always throw them away and start over.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;selfhosting&#x2F;&quot;&gt;See the selfhosting guide&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Put unobtanium on a server behind a reverse proxy.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Text Pile</title>
        <published>2025-11-28T00:00:00+00:00</published>
        <updated>2025-11-28T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/concept/text-pile/"/>
        <id>https://doc.unobtanium.rocks/concept/text-pile/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/concept/text-pile/">&lt;p&gt;The text pile as a concept is that the search engine stores text independent of where it found that text, so that it can reuse the stored instance when a page is reachable via more than one address (&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;duplicate-types&#x2F;&quot;&gt;exact duplicates&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;A text pile stores the plain text of the page it was scraped from in a cleaned-up form, along with some metadata about the semantics of the text and information directly derived from the text.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;implementations&quot;&gt;Implementations&lt;&#x2F;h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;text-pile-gen1&#x2F;&quot;&gt;Text Pile Generation 1 (&quot;Legacy&quot;)&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;text-pile-gen2&#x2F;&quot;&gt;Text Pile Generation 2 (&quot;NG&quot;)&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;!-- Do not forget to update data&#x2F;text_pile.md --&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Text Pile Gen2</title>
        <published>2025-11-28T00:00:00+00:00</published>
        <updated>2025-11-28T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/data/text-pile-gen2/"/>
        <id>https://doc.unobtanium.rocks/data/text-pile-gen2/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/data/text-pile-gen2/">&lt;p&gt;The Text Pile &lt;abbr title=&quot;Generation 2&quot;&gt;Gen2&lt;&#x2F;abbr&gt; (also known as Text Pile v2 or &quot;NG&quot;) is a datastructure for storing text content along with limited formatting and optional segmenting information. It is extracted from pages during the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;summarizing&#x2F;&quot;&gt;summarizing step&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;It is implemented in the &lt;a href=&quot;https:&#x2F;&#x2F;crates.io&#x2F;unobtanium-text-pile&quot;&gt;&lt;code&gt;unobtanium-text-pile&lt;&#x2F;code&gt; crate&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;!note: &lt;b&gt;This replaced the Text Pile Gen1:&lt;&#x2F;b&gt; The &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;text-pile-gen1&#x2F;&quot;&gt;Text Pile&lt;&#x2F;a&gt; was used in unobtanium release 3.0.0 and older in place of the datastructure described on this page.&lt;&#x2F;p&gt;
&lt;p&gt;The Text Pile NG stores the following data:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The raw page text in a cleaned up form&lt;&#x2F;li&gt;
&lt;li&gt;Metadata:
&lt;ul&gt;
&lt;li&gt;Semantic markers (roughly: HTML tag semantics)&lt;&#x2F;li&gt;
&lt;li&gt;The language indicated in the document&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;Optional segmentation information (&quot;segmentation cache&quot;):
&lt;ul&gt;
&lt;li&gt;Segment lengths&lt;&#x2F;li&gt;
&lt;li&gt;A relevance marker for each segment&lt;&#x2F;li&gt;
&lt;li&gt;Sentence lengths&lt;&#x2F;li&gt;
&lt;li&gt;The language guessed by the segmenter&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The segmentation information is intended to be updated along with the segmentation algorithm.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;digest&quot;&gt;Digest&lt;&#x2F;h2&gt;
&lt;p&gt;The digest of the Text Pile NG is calculated using the &lt;a href=&quot;https:&#x2F;&#x2F;www.blake2.net&#x2F;&quot;&gt;Blake2b&lt;&#x2F;a&gt; 512 bit algorithm. It is used for detection of &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;duplicate-types&#x2F;&quot;&gt;exact duplicates&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The digest is calculated across the shortest possible &lt;a href=&quot;https:&#x2F;&#x2F;postcard.jamesmunns.com&#x2F;&quot;&gt;postcard serialization&lt;&#x2F;a&gt; of the text and metadata sections of the Text Pile NG.&lt;&#x2F;p&gt;
&lt;p&gt;The segmentation cache is &lt;strong&gt;not&lt;&#x2F;strong&gt; included as it is derived data and may change over time to allow the segmentation algorithm to be updated.&lt;&#x2F;p&gt;
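&lt;p&gt;To get a feel for how a digest identifies exact duplicates, the sketch below hashes plain strings with &lt;code&gt;b2sum&lt;&#x2F;code&gt; from GNU coreutils, which computes a 512 bit Blake2b digest by default. This is a simplification: unobtanium hashes the postcard serialization of text and metadata, not the bare text, so these digests will not match the ones unobtanium stores. The function and variable names are made up for this example.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sh&quot; class=&quot;language-sh &quot;&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;# Hash a string with Blake2b-512 and print only the hex digest
digest_of() {
	printf '%s' "$1" | b2sum | cut -d ' ' -f 1
}

# Two addresses serving the same text, and one serving different text
page_a=$(digest_of "Hello from the text pile")
page_b=$(digest_of "Hello from the text pile")
page_c=$(digest_of "Hello from another page")

if [ "$page_a" = "$page_b" ]; then
	echo "a and b have the same digest: exact duplicates"
fi
if [ "$page_a" != "$page_c" ]; then
	echo "a and c have different digests"
fi
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;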
&lt;h2 id=&quot;comparison-with-the-previous-text-pile&quot;&gt;Comparison with the previous Text Pile&lt;&#x2F;h2&gt;
&lt;p&gt;The old &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;text-pile-gen1&#x2F;&quot;&gt;Text Pile&lt;&#x2F;a&gt; was optimized for use with the SQLite FTS5 extension and had multiple strings. Depending on its semantics, the text was appended to one of them.&lt;&#x2F;p&gt;
&lt;p&gt;This had the following drawbacks:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;It couldn&#x27;t recall the original text in original order for previews&lt;&#x2F;li&gt;
&lt;li&gt;No segmentation cache means running the expensive segmentation algorithm over and over&lt;&#x2F;li&gt;
&lt;li&gt;No preservation of language metadata, which is important for segmentation&lt;&#x2F;li&gt;
&lt;li&gt;Only limited preservation of document semantics&lt;&#x2F;li&gt;
&lt;li&gt;A lot of duplicate storage as text could be appended to multiple fields&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;see-also&quot;&gt;See Also&lt;&#x2F;h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;codeberg.org&#x2F;unobtanium&#x2F;unobtanium&#x2F;issues&#x2F;37&quot;&gt;Initial tracking issue &quot;TextPile v2 format (TextPileNg) #37&quot;&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;unobtanium-text-pile&#x2F;latest&#x2F;unobtanium_text_pile&#x2F;&quot;&gt;The &lt;code&gt;unobtanium-text-pile&lt;&#x2F;code&gt; on docs.rs&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;codeberg.org&#x2F;unobtanium&#x2F;unobtanium-text-pile&quot;&gt;The &lt;code&gt;unobtanium-text-pile&lt;&#x2F;code&gt; on Codeberg&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Text Pile (Disambiguation)</title>
        <published>2025-11-28T00:00:00+00:00</published>
        <updated>2025-11-28T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/data/text-pile/"/>
        <id>https://doc.unobtanium.rocks/data/text-pile/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/data/text-pile/">&lt;p&gt;You are probably here because you followed an old link or bookmark.&lt;&#x2F;p&gt;
&lt;p&gt;Unobtanium now supports multiple text piles:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;text-pile-gen1&#x2F;&quot;&gt;Text Pile Generation 1 (&quot;Legacy&quot;)&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;text-pile-gen2&#x2F;&quot;&gt;Text Pile Generation 2 (&quot;NG&quot;)&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;There is also a &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;text-pile&#x2F;&quot;&gt;concept page for text piles in general&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;b&gt;Note for documentation editors:&lt;&#x2F;b&gt; Do not link to this page, prefer the concept page instead.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Selfhosting Unobtanium</title>
        <published>2025-08-15T00:00:00+00:00</published>
        <updated>2025-08-15T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/manual/selfhosting/"/>
        <id>https://doc.unobtanium.rocks/manual/selfhosting/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/manual/selfhosting/">&lt;p&gt;&lt;b&gt;Warning:&lt;&#x2F;b&gt; This guide is incomplete, see the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;your-first-search-engine&#x2F;&quot;&gt;Your first search engine guide&lt;&#x2F;a&gt; instead.&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;ll assume you already know a few things:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Linux administration in general&lt;&#x2F;li&gt;
&lt;li&gt;Configuring a web reverse proxy&lt;&#x2F;li&gt;
&lt;li&gt;HTML&lt;&#x2F;li&gt;
&lt;li&gt;SQL&lt;&#x2F;li&gt;
&lt;li&gt;TOML&lt;&#x2F;li&gt;
&lt;li&gt;How to use &lt;code&gt;cargo build&lt;&#x2F;code&gt; to build your own binaries&lt;&#x2F;li&gt;
&lt;li&gt;Shell scripting (for your own convenience)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;If you haven&#x27;t yet, please have a look at the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;overview&#x2F;&quot;&gt;overview page&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;resource-requirements&quot;&gt;Resource requirements&lt;&#x2F;h2&gt;
&lt;dl class=&quot;&quot;&gt;
	&lt;dt&gt;CPU&lt;&#x2F;dt&gt;
	&lt;dd&gt;No special requirements, but it should have at least 2 cores.&lt;&#x2F;dd&gt;
	&lt;dt&gt;RAM&lt;&#x2F;dt&gt;
	&lt;dd&gt;Plan at least 2GB for unobtanium alone, this includes some buffer for caching and making sure nothing runs out of memory. In practice it&#x27;s very likely you&#x27;ll use less, but not having the RAM free for the kernel to use as cache will noticeably slow down searching.&lt;&#x2F;dd&gt;
	&lt;dt&gt;Disk&lt;&#x2F;dt&gt;
	&lt;dd&gt;For a small index of one medium-sized blog, 100MB should be enough.&lt;&#x2F;dd&gt;
	&lt;dd&gt;For large deployments plan ~10GB per 100K searchable pages for the crawler database and another 6GB per 100K pages for one summary database (in practice you&#x27;ll probably want two of them to get new data in without downtime).&lt;&#x2F;dd&gt;
&lt;&#x2F;dl&gt;
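&lt;p&gt;As a rough worked example of the disk figures for large deployments, the following sketch estimates the space for a hypothetical index of 250,000 pages with two summary databases. The page count and the variable names are assumptions for illustration, and shell arithmetic rounds down to whole gigabytes.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sh&quot; class=&quot;language-sh &quot;&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;# Hypothetical deployment size
pages=250000
summary_dbs=2

# ~10GB per 100K pages for the crawler database
crawl_gb=$((pages * 10 / 100000))

# ~6GB per 100K pages for each summary database
summary_gb=$((pages * 6 / 100000 * summary_dbs))

total_gb=$((crawl_gb + summary_gb))

echo "crawler database:  ~${crawl_gb} GB"
echo "summary databases: ~${summary_gb} GB"
echo "total:             ~${total_gb} GB"
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For these numbers the estimate comes out at about 25GB for the crawler database, 30GB for the two summary databases and 55GB in total.&lt;&#x2F;p&gt;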

&lt;h2 id=&quot;what-you-need-to-do-overview&quot;&gt;What you need to do, overview&lt;&#x2F;h2&gt;
&lt;p&gt;Bootstrapping a search engine:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Writing and testing an initial crawler configuration&lt;&#x2F;li&gt;
&lt;li&gt;Expanding the configuration&lt;&#x2F;li&gt;
&lt;li&gt;Doing a full initial crawl and summary&lt;&#x2F;li&gt;
&lt;li&gt;Setting up the frontend&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Maintenance:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Semi-regular recrawls&lt;&#x2F;li&gt;
&lt;li&gt;Updating the crawler configuration&lt;&#x2F;li&gt;
&lt;li&gt;Keeping unobtanium updated&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;your-first-crawler-configuration&quot;&gt;Your first crawler configuration&lt;&#x2F;h2&gt;
&lt;p&gt;Have the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;crawler-crawl-configuration&#x2F;&quot;&gt;crawler crawl configuration manual&lt;&#x2F;a&gt; ready.&lt;&#x2F;p&gt;
&lt;p&gt;This file will define where the crawler is allowed to collect pages from.&lt;&#x2F;p&gt;
&lt;p&gt;You may find the &lt;a href=&quot;https:&#x2F;&#x2F;codeberg.org&#x2F;unobtanium&#x2F;unobtanium-rocks-index&quot;&gt;configuration of unobtanium.rocks&lt;&#x2F;a&gt; useful. Use the &lt;a href=&quot;https:&#x2F;&#x2F;codeberg.org&#x2F;unobtanium&#x2F;unobtanium-rocks-index&#x2F;src&#x2F;branch&#x2F;main&#x2F;shared-policies.toml&quot;&gt;shared policies file&lt;&#x2F;a&gt; with the &lt;code&gt;--policy-file&lt;&#x2F;code&gt; option; it excludes some common parts of websites that the crawler should avoid.&lt;&#x2F;p&gt;
&lt;p&gt;!note: &lt;b&gt;Hint:&lt;&#x2F;b&gt; Start with a small configuration of one to three seeds and a crawler command limit of 1000. This will quickly give you a feel for &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;crawl-loop&#x2F;&quot;&gt;how the crawler behaves&lt;&#x2F;a&gt; and what it tries to collect.&lt;&#x2F;p&gt;
&lt;p&gt;After the first successful crawl, try running the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;unobtanium-crawler&#x2F;#summarize&quot;&gt;summarizer&lt;&#x2F;a&gt; and then pointing a locally running &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;unobtanium-viewer&#x2F;&quot;&gt;viewer&lt;&#x2F;a&gt; at the resulting summary database. Try typing in some keywords you&#x27;d expect to find results for.&lt;&#x2F;p&gt;
&lt;p&gt;Once that works, you can expand the crawling configuration and raise the crawler command limit.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Crawl Data Tree</title>
        <published>2025-08-14T00:00:00+00:00</published>
        <updated>2025-08-14T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/concept/crawl-data-tree/"/>
        <id>https://doc.unobtanium.rocks/concept/crawl-data-tree/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/concept/crawl-data-tree/">&lt;p&gt;The crawl data tree is a construct in the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawler-database&#x2F;&quot;&gt;crawler database&lt;&#x2F;a&gt; surrounding the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-log&#x2F;&quot;&gt;crawl log&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Each crawl data tree &lt;strong&gt;may&lt;&#x2F;strong&gt; have:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;one mandatory &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-log&#x2F;&quot;&gt;crawl log entry&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;multiple requests&lt;&#x2F;li&gt;
&lt;li&gt;multiple files
&lt;ul&gt;
&lt;li&gt;one file text&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;multiple redirects&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Files and redirects reference the request they came from if they were derived from a network request.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Token Index</title>
        <published>2025-08-14T00:00:00+00:00</published>
        <updated>2025-08-14T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/concept/token-index/"/>
        <id>https://doc.unobtanium.rocks/concept/token-index/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/concept/token-index/">&lt;p&gt;The token index is an experimental way to store a full text index in SQL tables in the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;summary-database&#x2F;&quot;&gt;summary database&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;A token in the sense of the token index is roughly equivalent to a word after splitting up &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Compound_(linguistics)&quot;&gt;compound words&lt;&#x2F;a&gt; and applying normalization.&lt;&#x2F;p&gt;
&lt;p&gt;In its current state there is a statistics table that keeps track of how often a token appears in a given &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;text-pile&#x2F;&quot;&gt;text pile&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;open-problems&quot;&gt;Open Problems&lt;&#x2F;h2&gt;
&lt;p&gt;At least the following problems need to be solved for the token index to exit its experimental state:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Page previews must be generated outside of FTS5. &lt;a href=&quot;https:&#x2F;&#x2F;codeberg.org&#x2F;unobtanium&#x2F;unobtanium&#x2F;pulls&#x2F;36&quot;&gt;See pull request #36&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;It must support at least BM25 or another result weighting mechanism.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;how-to-access-the-token-index&quot;&gt;How to access the token index&lt;&#x2F;h2&gt;
&lt;p&gt;Being experimental, the index isn&#x27;t built by default. It can be (re)built using the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;unobtanium-crawler&#x2F;#regenerate-token-index&quot;&gt;&lt;code&gt;unobtanium-crawler regenerate-token&lt;&#x2F;code&gt; command&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;It can be queried using the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;query-syntax&#x2F;&quot;&gt;&lt;code&gt;token:&lt;&#x2F;code&gt; filter&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Crawler Database</title>
        <published>2025-08-13T00:00:00+00:00</published>
        <updated>2025-08-13T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/data/crawler-database/"/>
        <id>https://doc.unobtanium.rocks/data/crawler-database/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/data/crawler-database/">&lt;p&gt;The crawler database schema is implemented on top of the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;base-database&#x2F;&quot;&gt;base database schema&lt;&#x2F;a&gt; and mainly holds data surrounding the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-log&#x2F;&quot;&gt;crawl log&lt;&#x2F;a&gt; (see also: &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;crawl-data-tree&#x2F;&quot;&gt;crawl data tree&lt;&#x2F;a&gt;) and the crawl candidates.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;&#x2F;h2&gt;
&lt;p&gt;Tables in the crawler database are:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;agent&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Information about crawling agents
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;crawl_log&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-log&#x2F;&quot;&gt;Crawl log entries&lt;&#x2F;a&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;request&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Requests that belong to a crawl log entry
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;file&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		File metadata obtained from a request
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;file_text&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Text content belonging to an entry in the &lt;code&gt;file&lt;&#x2F;code&gt; table
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;redirect&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Redirect resulting from a request
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;crawl_candidate&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		A URL that was discovered in a context that makes it a potential crawling target
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h2 id=&quot;tables&quot;&gt;Tables&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;agent&quot;&gt;&lt;code&gt;agent&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;code&gt;agent&lt;&#x2F;code&gt; table has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;agent_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;time_started_unix_utc&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-timestamp&#x27; &gt;Integer &#x2F; Timestamp&lt;&#x2F;a&gt;  When the agent started its work.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;time_finished_unix_utc&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-timestamp&#x27; &gt;Integer &#x2F; Timestamp&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; When the agent finished its work.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that the agent is currently running or was forcefully terminated.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;agent_uuid&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#uuid&#x27; &gt;UUID&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;External Key&lt;&#x2F;span&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;name&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt;  The name of the crawler as specified using the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;unobtanium-crawler&#x2F;#crawl&quot;&gt;&lt;code&gt;--worker-name&lt;&#x2F;code&gt; option&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;http_user_agent&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; The HTTP user agent that was used by the crawler.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that the concept of a user agent isn&#x27;t applicable to the crawler.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h3 id=&quot;crawl-log&quot;&gt;&lt;code&gt;crawl_log&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;The crawl log table is written to after a crawl command has finished, to log the outcome.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;crawl_log&lt;&#x2F;code&gt; table stores &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-log&#x2F;&quot;&gt;crawl log entries&lt;&#x2F;a&gt;. It has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;crawl_log_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;agent_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  Reference to the &lt;a href=&quot;#agent&quot;&gt;&lt;code&gt;agent&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt; of which agent is responsible for the entry.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;url_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  Reference to the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;base-database&#x2F;#url&quot;&gt;&lt;code&gt;url&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt;, which URL was crawled.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;crawl_type&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-enumeration&#x27; &gt;Integer &#x2F; Enumeration&lt;&#x2F;a&gt;  Which &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-log&#x2F;#crawl-type&quot;&gt;crawl type&lt;&#x2F;a&gt; resulted in this entry.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;crawl_uuid&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#uuid&#x27; &gt;UUID&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;External Key&lt;&#x2F;span&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;time_started_unix_utc&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-timestamp&#x27; &gt;Integer &#x2F; Timestamp&lt;&#x2F;a&gt;  When the action that resulted in this entry started.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;time_taken_ms&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-duration&#x27; &gt;Integer &#x2F; Duration&lt;&#x2F;a&gt;  How long the action took, in milliseconds.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;exit_code&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-enumeration&#x27; &gt;Integer &#x2F; Enumeration&lt;&#x2F;a&gt;  Which outcome the action had, see &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-exit-code&#x2F;&quot;&gt;crawl exit code&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;message&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; A place to store an error message, this is intended to help humans with debugging.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;p&gt;Indices:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;crawl_log_quickinfo&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		on fields &lt;code&gt;url_id&lt;&#x2F;code&gt;, &lt;code&gt;time_started_unix_utc&lt;&#x2F;code&gt;, &lt;code&gt;exit_code&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		For speeding up querying the last exit code of a URL
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;
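&lt;p&gt;The lookup this index is meant to speed up can be sketched as follows. This is a minimal, self-contained example using Python&#x27;s &lt;code&gt;sqlite3&lt;&#x2F;code&gt; module and an in-memory database. It assumes an SQLite-style schema, recreates only the columns needed for the query, and uses made-up exit code values; the helper name &lt;code&gt;last_exit_code&lt;&#x2F;code&gt; is hypothetical and not part of unobtanium.&lt;&#x2F;p&gt;
&lt;pre&gt;

```python
import sqlite3


def last_exit_code(conn, url_id):
    """Return the exit code of the most recent crawl of `url_id`, or None."""
    row = conn.execute(
        "SELECT exit_code FROM crawl_log "
        "WHERE url_id = ? "
        "ORDER BY time_started_unix_utc DESC LIMIT 1",
        (url_id,),
    ).fetchone()
    return None if row is None else row[0]


# Reduced stand-in for the crawl_log table (assumption: only the
# columns covered by the index are recreated here).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE crawl_log (
    crawl_log_id INTEGER PRIMARY KEY,
    url_id INTEGER NOT NULL,
    time_started_unix_utc INTEGER NOT NULL,
    exit_code INTEGER NOT NULL
);
CREATE INDEX crawl_log_quickinfo
    ON crawl_log (url_id, time_started_unix_utc, exit_code);
""")

# Placeholder rows: (url_id, start time, exit code). The exit code
# numbers are invented and do not correspond to real crawl exit codes.
conn.executemany(
    "INSERT INTO crawl_log (url_id, time_started_unix_utc, exit_code) "
    "VALUES (?, ?, ?)",
    [(1, 100, 0), (1, 200, 3), (2, 150, 0)],
)

print(last_exit_code(conn, 1))
```

&lt;&#x2F;pre&gt;
&lt;p&gt;Because the index starts with &lt;code&gt;url_id&lt;&#x2F;code&gt; followed by the start time and also contains &lt;code&gt;exit_code&lt;&#x2F;code&gt;, a query of this shape can likely be answered from the index alone.&lt;&#x2F;p&gt;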

&lt;h3 id=&quot;request&quot;&gt;&lt;code&gt;request&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;code&gt;request&lt;&#x2F;code&gt; table has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;request_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;crawl_log_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  Which entry in the &lt;a href=&quot;#crawl-log&quot;&gt;&lt;code&gt;crawl_log&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt; this request belongs to.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;url_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  Reference to the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;base-database&#x2F;#url&quot;&gt;&lt;code&gt;url&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt;, which URL was requested (this may be different from the crawled URL, e.g. when scraping a CMS API).
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;time_sent_unix_utc&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-timestamp&#x27; &gt;Integer &#x2F; Timestamp&lt;&#x2F;a&gt;  When the request was sent.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;request_duration_ms&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-duration&#x27; &gt;Integer &#x2F; Duration&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; How long the request took.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that measuring the time failed.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;robotstxt_approved&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Whether the request was approved by a &lt;code&gt;robots.txt&lt;&#x2F;code&gt; file.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;exit_code&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-enumeration&#x27; &gt;Integer &#x2F; Enumeration&lt;&#x2F;a&gt;  The &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-exit-code&#x2F;&quot;&gt;crawl exit code&lt;&#x2F;a&gt; of this single request.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;server_last_modified_unix_utc&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-timestamp&#x27; &gt;Integer &#x2F; Timestamp&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; When the server claimed that the file was last modified.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that the header information about the last modification time was missing.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;http_status_code&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-enumeration&#x27; &gt;Integer &#x2F; Enumeration&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; The &lt;a href=&quot;https:&#x2F;&#x2F;developer.mozilla.org&#x2F;en-US&#x2F;docs&#x2F;Web&#x2F;HTTP&#x2F;Reference&#x2F;Status&quot;&gt;HTTP Status Code&lt;&#x2F;a&gt; that this request resulted in.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that no HTTP response was received.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;http_etag&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; The content of the &lt;a href=&quot;https:&#x2F;&#x2F;developer.mozilla.org&#x2F;en-US&#x2F;docs&#x2F;Web&#x2F;HTTP&#x2F;Reference&#x2F;Headers&#x2F;ETag&quot;&gt;&lt;code&gt;ETag&lt;&#x2F;code&gt; header&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that no ETag header was received.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h3 id=&quot;file&quot;&gt;&lt;code&gt;file&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;b&gt;Note:&lt;&#x2F;b&gt; The &lt;code&gt;file&lt;&#x2F;code&gt; and &lt;code&gt;redirect&lt;&#x2F;code&gt; tables have the following fields in common: &lt;code&gt;crawl_log_id&lt;&#x2F;code&gt;, &lt;code&gt;request_id&lt;&#x2F;code&gt;, &lt;code&gt;url_id&lt;&#x2F;code&gt;, &lt;code&gt;last_modified_unix_utc&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;file&lt;&#x2F;code&gt; table has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;file_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;crawl_log_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  Reference to the &lt;a href=&quot;#crawl-log&quot;&gt;&lt;code&gt;crawl_log&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt;, which crawl log entry resulted in this file. If a request is linked, the request must be from the same crawl log entry.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;request_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; Reference to the &lt;a href=&quot;#request&quot;&gt;&lt;code&gt;request&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt;, which request this file metadata came from.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that the file didn&#x27;t result from a network request (e.g. it was read from a dump).
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;url_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  Reference to the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;base-database&#x2F;#url&quot;&gt;&lt;code&gt;url&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt;, which URL this file is associated with.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;last_modified_unix_utc&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-timestamp&#x27; &gt;Integer &#x2F; Timestamp&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; When the file was last modified according to file metadata, archive metadata or response headers, the data here should be from the most reliable source.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that the file didn&#x27;t contain any readable metadata on its last modification date.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;file_size&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; The file size in bytes.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that the file size is unknown (e.g. the file wasn&#x27;t fetched and no metadata was present).
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;mimetype_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  Reference to the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;base-database&#x2F;#mimetype&quot;&gt;&lt;code&gt;mimetype&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt;, which MIME-Type (Media Type) the file has.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;canonical_url_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; Reference to the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;base-database&#x2F;#url&quot;&gt;&lt;code&gt;url&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt;, which URL the file claims its canonical version to be at. (See also &lt;a href=&quot;https:&#x2F;&#x2F;www.rfc-editor.org&#x2F;rfc&#x2F;rfc6596&quot;&gt;RFC 6596&lt;&#x2F;a&gt;)
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that the file didn&#x27;t claim to be the non-canonical version of another resource.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h3 id=&quot;file-text&quot;&gt;&lt;code&gt;file_text&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;code&gt;file_text&lt;&#x2F;code&gt; table has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;file_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt; Reference to the &lt;a href=&quot;#file&quot;&gt;&lt;code&gt;file&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt; entry that holds the metadata
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;text&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt;  Unparsed text content of the file translated to UTF-8.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h3 id=&quot;redirect&quot;&gt;&lt;code&gt;redirect&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;b&gt;Note:&lt;&#x2F;b&gt; The &lt;code&gt;file&lt;&#x2F;code&gt; and &lt;code&gt;redirect&lt;&#x2F;code&gt; tables have the following fields in common: &lt;code&gt;crawl_log_id&lt;&#x2F;code&gt;, &lt;code&gt;request_id&lt;&#x2F;code&gt;, &lt;code&gt;url_id&lt;&#x2F;code&gt;, &lt;code&gt;last_modified_unix_utc&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;redirect&lt;&#x2F;code&gt; table has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;redirect_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;crawl_log_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  Reference to the &lt;a href=&quot;#crawl-log&quot;&gt;&lt;code&gt;crawl_log&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt;, which crawl log entry resulted in this redirect. If a request is linked, the request must be from the same crawl log entry.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;request_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; Reference to the &lt;a href=&quot;#request&quot;&gt;&lt;code&gt;request&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt;, which request this redirect came from.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that the redirect didn&#x27;t result from a network request (e.g. it was read from a dump).
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;url_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  Reference to the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;base-database&#x2F;#url&quot;&gt;&lt;code&gt;url&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt;, which URL this redirect is associated with.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;last_modified_unix_utc&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-timestamp&#x27; &gt;Integer &#x2F; Timestamp&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; When the file was last modified according to file metadata, archive metadata or response headers, the data here should be from the most reliable source.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that the file didn&#x27;t contain any readable metadata on its last modification date.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;to_url_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  Reference to the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;base-database&#x2F;#url&quot;&gt;&lt;code&gt;url&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt;, which URL the redirect points at.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;information_source&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-enumeration&#x27; &gt;Integer &#x2F; Enumeration&lt;&#x2F;a&gt;  Where the information for this redirect came from, see &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;information-source&#x2F;&quot;&gt;Information Source&lt;&#x2F;a&gt; for possible values.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;is_permanent&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Whether one can expect future requests to result in the same redirect.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;by_security_policy&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Whether the redirect was caused by a security policy (e.g. an automatic &lt;code&gt;http&lt;&#x2F;code&gt; to &lt;code&gt;https&lt;&#x2F;code&gt; upgrade).
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		🔧 This field is currently &lt;strong&gt;unused&lt;&#x2F;strong&gt; and may be removed in the future. (See &lt;a href=&quot;#possible-future-changes&quot;&gt;Possible future changes&lt;&#x2F;a&gt;).
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h3 id=&quot;crawl-candidate&quot;&gt;&lt;code&gt;crawl_candidate&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;b&gt;Note:&lt;&#x2F;b&gt; This table contains information that is also present in the crawl log. This is to make sure that the information doesn&#x27;t get lost when the crawl log gets cleaned up.&lt;&#x2F;p&gt;
&lt;p&gt;This table attaches crawling-specific metadata to URLs.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;crawl_candidate&lt;&#x2F;code&gt; table contains the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;url_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt; Reference to the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;base-database&#x2F;#url&quot;&gt;&lt;code&gt;url&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt; of the URL that this metadata is for.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;last_crawl_time_unix_utc&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-timestamp&#x27; &gt;Integer &#x2F; Timestamp&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; When the URL was last crawled.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that the URL has only been discovered as crawlable, but not been crawled yet.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;last_crawl_exit_code&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-enumeration&#x27; &gt;Integer &#x2F; Enumeration&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; The &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-exit-code&#x2F;&quot;&gt;crawl exit code&lt;&#x2F;a&gt; of the last crawl.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that the exit code is unavailable.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;last_contentful_crawl_time_unix_utc&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-timestamp&#x27; &gt;Integer &#x2F; Timestamp&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; The last time the crawler exited with a code from the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-exit-code&#x2F;#contentful&quot;&gt;contentful category of exit codes&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that there never was a crawl with a contentful exit code.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;last_contentful_http_etag&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; Set together with &lt;code&gt;last_contentful_crawl_time_unix_utc&lt;&#x2F;code&gt;, this will contain the ETag header that was returned on the last contentful crawl.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that there was no last contentful crawl or the last contentful crawl didn&#x27;t have an ETag header set.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;
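&lt;p&gt;As a rough illustration of how the Null semantics above can be queried, the following sketch uses the Python standard library &lt;code&gt;sqlite3&lt;&#x2F;code&gt; module. The DDL is an assumption reconstructed from the field list (the real schema may differ), the rows and exit code values are made up, and the helper names &lt;code&gt;never_crawled&lt;&#x2F;code&gt; and &lt;code&gt;known_etags&lt;&#x2F;code&gt; are hypothetical.&lt;&#x2F;p&gt;

```python
import sqlite3

# Hypothetical DDL reconstructed from the field list above; the real schema
# may differ (for example, the foreign key to the url table is left out).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE crawl_candidate (
        url_id INTEGER NOT NULL PRIMARY KEY,
        last_crawl_time_unix_utc INTEGER,
        last_crawl_exit_code INTEGER,
        last_contentful_crawl_time_unix_utc INTEGER,
        last_contentful_http_etag TEXT
    )
    """
)

# Made-up rows; the exit code values are placeholders, not real exit codes.
conn.executemany(
    "INSERT INTO crawl_candidate VALUES (?, ?, ?, ?, ?)",
    [
        (1, None, None, None, None),             # discovered, never crawled
        (2, 1700000000, 0, 1700000000, '"v1"'),  # contentful crawl with an ETag
        (3, 1700000500, 1, None, None),          # crawled, never contentful
    ],
)


def never_crawled(connection):
    """Return url_ids that were discovered but not crawled yet."""
    rows = connection.execute(
        "SELECT url_id FROM crawl_candidate "
        "WHERE last_crawl_time_unix_utc IS NULL ORDER BY url_id"
    )
    return [row[0] for row in rows]


def known_etags(connection):
    """Map url_id to the ETag stored for its last contentful crawl."""
    rows = connection.execute(
        "SELECT url_id, last_contentful_http_etag FROM crawl_candidate "
        "WHERE last_contentful_http_etag IS NOT NULL"
    )
    return dict(rows.fetchall())
```

&lt;p&gt;A crawler could use a lookup like &lt;code&gt;known_etags&lt;&#x2F;code&gt; to send conditional requests, though whether unobtanium does so is not stated here.&lt;&#x2F;p&gt;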

&lt;h2 id=&quot;possible-future-changes&quot;&gt;Possible future changes&lt;&#x2F;h2&gt;
&lt;ul&gt;
&lt;li&gt;Addition of UUID fields to the tables &lt;code&gt;request&lt;&#x2F;code&gt;, &lt;code&gt;file&lt;&#x2F;code&gt; and &lt;code&gt;redirect&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;The &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawler-database&#x2F;#redirect&quot;&gt;&lt;code&gt;redirect&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;.&lt;code&gt;by_security_policy&lt;&#x2F;code&gt; field could be integrated into the &lt;code&gt;information_source&lt;&#x2F;code&gt;. It is currently never set to true by the crawler.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;version-history&quot;&gt;Version history&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;1-0-0-de-facto-stable&quot;&gt;1.0.0 - De-facto stable&lt;&#x2F;h3&gt;
&lt;p&gt;This schema has been de-facto stable for a while and was assigned the 1.0.0 version with the introduction of database versioning.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Database Data Types</title>
        <published>2025-08-13T00:00:00+00:00</published>
        <updated>2025-08-13T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/data/database-data-types/"/>
        <id>https://doc.unobtanium.rocks/data/database-data-types/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/data/database-data-types/">&lt;p&gt;This document describes the column datatypes used in the unobtanium databases.&lt;&#x2F;p&gt;
&lt;p&gt;Unless marked with a &lt;span class=&quot;null-badge&quot;&gt;Null&lt;&#x2F;span&gt; badge in the documentation, datatypes are not nullable and use the &lt;code&gt;NOT NULL&lt;&#x2F;code&gt; keyword in SQL. In general, &lt;code&gt;NULL&lt;&#x2F;code&gt; should mean unknown or unavailable data unless documented otherwise. The meaning of a &lt;code&gt;NULL&lt;&#x2F;code&gt; value &lt;strong&gt;must&lt;&#x2F;strong&gt; always be documented.&lt;&#x2F;p&gt;
&lt;p&gt;Columns that are marked as &lt;span class=&quot;unique-badge&quot;&gt;Unique&lt;&#x2F;span&gt; only allow unique values, enforced with the &lt;code&gt;UNIQUE&lt;&#x2F;code&gt; SQL keyword; however, they do allow &lt;strong&gt;multiple null values&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Unobtanium uses SQLite as its database, see the &lt;a href=&quot;https:&#x2F;&#x2F;sqlite.org&#x2F;datatype3.html&quot;&gt;SQLite documentation on datatypes&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Note that some tables do not follow the datatype advice on this page. These are probably the tables where a mistake was made in the past, which resulted in the advice here.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;bool&quot;&gt;Bool&lt;&#x2F;h2&gt;
&lt;p&gt;Booleans are used to store true&#x2F;false values represented using the SQLite &lt;code&gt;BOOL&lt;&#x2F;code&gt; datatype and &lt;code&gt;1&lt;&#x2F;code&gt;&#x2F;&lt;code&gt;0&lt;&#x2F;code&gt; for the values.&lt;&#x2F;p&gt;
&lt;p&gt;Column names should be prefixed with &lt;code&gt;is_&lt;&#x2F;code&gt; or &lt;code&gt;has_&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;text&quot;&gt;Text&lt;&#x2F;h2&gt;
&lt;p&gt;Text is represented using the &lt;code&gt;TEXT&lt;&#x2F;code&gt; SQLite datatype.&lt;&#x2F;p&gt;
&lt;p&gt;Text &lt;strong&gt;must&lt;&#x2F;strong&gt; be encoded as UTF-8.&lt;&#x2F;p&gt;
&lt;p&gt;Advice:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;If possible normalize empty strings to a &lt;code&gt;NULL&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Do not use &lt;code&gt;VARCHAR&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Check to see if there is already a table in the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;base-database&#x2F;&quot;&gt;base schema&lt;&#x2F;a&gt; that you can reference.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
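&lt;p&gt;The first piece of advice can be sketched as follows. This is a minimal example using the Python standard library; the &lt;code&gt;note&lt;&#x2F;code&gt; table and the &lt;code&gt;normalize_text&lt;&#x2F;code&gt; helper are hypothetical and only illustrate two ways of turning empty strings into &lt;code&gt;NULL&lt;&#x2F;code&gt; before they are stored.&lt;&#x2F;p&gt;

```python
import sqlite3


def normalize_text(value):
    """Return None for empty or missing strings, the value unchanged otherwise."""
    if value is None or value == "":
        return None
    return value


conn = sqlite3.connect(":memory:")
# Hypothetical table, used only to show the effect of the normalization.
conn.execute("CREATE TABLE note (note_id INTEGER NOT NULL PRIMARY KEY, body TEXT)")

# Normalization in application code before binding the parameter.
conn.execute("INSERT INTO note VALUES (1, ?)", (normalize_text(""),))
# The same normalization done in SQL with the built-in NULLIF function.
conn.execute("INSERT INTO note VALUES (2, NULLIF(?, ''))", ("",))
# Non-empty text passes through unchanged.
conn.execute("INSERT INTO note VALUES (3, ?)", (normalize_text("hello"),))

stored = [row[0] for row in conn.execute("SELECT body FROM note ORDER BY note_id")]
```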
&lt;h2 id=&quot;integer&quot;&gt;Integer&lt;&#x2F;h2&gt;
&lt;p&gt;Integers are always stored using the SQLite &lt;code&gt;INTEGER&lt;&#x2F;code&gt; datatype; the Rust equivalent is &lt;code&gt;i64&lt;&#x2F;code&gt;. Although integers are always encoded as signed, negative values don&#x27;t make sense for every use.&lt;&#x2F;p&gt;
&lt;p&gt;Integers are usually used to encode some other data.&lt;&#x2F;p&gt;
&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt id=&quot;integer-primary-key&quot;&gt;
		Primary Key
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Used as database-internal identifiers; these must not be handed out via external interfaces.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		The SQL for SQLite must be &lt;code&gt;INTEGER NOT NULL PRIMARY KEY&lt;&#x2F;code&gt;, the columns are called &lt;code&gt;{table_name}_id&lt;&#x2F;code&gt;.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Marked as &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt;
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		To enforce 1:1 relationships, this may double as a reference into another table; in this case the column name follows the reference naming scheme.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;integer-reference-foregin-key&quot;&gt;
		Reference &#x2F; Foreign Key
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		To reference another table&#x27;s primary key.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		The column must be named &lt;code&gt;[(to|from)_]{table_name}[_{purpose}]_id&lt;&#x2F;code&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;integer-enumeration&quot;&gt;
		Enumeration
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		When integers are used for enumerations, there &lt;strong&gt;must&lt;&#x2F;strong&gt; be linked documentation listing the valid values and what they mean.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;integer-timestamp&quot;&gt;
		Timestamp
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		To store a timestamp as Unix time, in seconds since &lt;code&gt;1970-01-01 00:00:00 UTC&lt;&#x2F;code&gt;. This is consistent with how SQLite expects timestamps to be stored.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Column names must end with &lt;code&gt;_unix_utc&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;integer-duration&quot;&gt;
		Duration
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		To store a length of time in some unit.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		The column name must end with a unit mentioned here: &lt;code&gt;_ms&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;
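&lt;p&gt;The timestamp and duration conventions above can be sketched like this, using the Python standard library. The &lt;code&gt;example&lt;&#x2F;code&gt; table, its columns and the &lt;code&gt;to_unix_utc&lt;&#x2F;code&gt; helper are assumptions made for illustration; only the naming suffixes and the use of Unix seconds come from this page.&lt;&#x2F;p&gt;

```python
import sqlite3
from datetime import datetime, timezone


def to_unix_utc(moment):
    """Convert a timezone-aware datetime to whole seconds since the Unix epoch."""
    return int(moment.timestamp())


conn = sqlite3.connect(":memory:")
# Hypothetical table following the naming rules: _id, _unix_utc and _ms suffixes.
conn.execute(
    """
    CREATE TABLE example (
        example_id INTEGER NOT NULL PRIMARY KEY,
        started_unix_utc INTEGER NOT NULL,
        duration_ms INTEGER
    )
    """
)
started = datetime(2025, 8, 13, 0, 0, 0, tzinfo=timezone.utc)
conn.execute("INSERT INTO example VALUES (1, ?, ?)", (to_unix_utc(started), 250))

# SQLite can render the stored integer with the 'unixepoch' modifier.
unix_seconds, rendered = conn.execute(
    "SELECT started_unix_utc, datetime(started_unix_utc, 'unixepoch') FROM example"
).fetchone()

# Reading the value back in application code restores the original moment.
restored = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
```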

&lt;h2 id=&quot;uuid&quot;&gt;UUID&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Universally_unique_identifier&quot;&gt;UUIDs&lt;&#x2F;a&gt; are represented as &lt;code&gt;BLOB(16)&lt;&#x2F;code&gt; types in SQLite. Use the &lt;code&gt;hex()&lt;&#x2F;code&gt; function to get the human-readable representation.&lt;&#x2F;p&gt;
&lt;p&gt;UUIDs are used as identifiers that can be generated without the database, are unique across multiple databases and can be handed out via APIs as external identifiers.&lt;&#x2F;p&gt;
&lt;p&gt;If a UUID column fulfills a role similar to a primary key, the column name should be &lt;code&gt;{table_name}_uuid&lt;&#x2F;code&gt;; in this case it is marked using &lt;span class=&quot;primary-key-badge&quot;&gt;External Key&lt;&#x2F;span&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Advice:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;If unsure, generate &lt;code&gt;v7&lt;&#x2F;code&gt; (timestamp+random, preferred) or &lt;code&gt;v4&lt;&#x2F;code&gt; (random) UUIDs.&lt;&#x2F;li&gt;
&lt;li&gt;Treat UUIDs as opaque identifiers.&lt;&#x2F;li&gt;
&lt;li&gt;Use UUIDs to hand out identifiers via interfaces outside the application.&lt;&#x2F;li&gt;
&lt;li&gt;Do not use UUIDs for database internal references.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
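&lt;p&gt;A minimal sketch of storing a UUID as a 16 byte blob and reading it back, using the Python standard library. The &lt;code&gt;thing&lt;&#x2F;code&gt; table is hypothetical. The example generates a v4 UUID because that generator is available in every current Python version; a v7 generator would be used the same way.&lt;&#x2F;p&gt;

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# Hypothetical table with an internal integer key and an external UUID key.
conn.execute(
    """
    CREATE TABLE thing (
        thing_id INTEGER NOT NULL PRIMARY KEY,
        thing_uuid BLOB(16) NOT NULL UNIQUE
    )
    """
)

# The UUID is generated without asking the database and stored as raw bytes.
external_id = uuid.uuid4()
conn.execute("INSERT INTO thing VALUES (1, ?)", (external_id.bytes,))

# hex() gives the human-readable form; the raw bytes round-trip into a UUID.
stored_bytes, hex_text = conn.execute(
    "SELECT thing_uuid, hex(thing_uuid) FROM thing WHERE thing_id = 1"
).fetchone()
round_tripped = uuid.UUID(bytes=stored_bytes)
```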
&lt;h2 id=&quot;blob&quot;&gt;Blob&lt;&#x2F;h2&gt;
&lt;p&gt;Blobs are represented as &lt;code&gt;BLOB&lt;&#x2F;code&gt; types in SQLite. They can be used to store arbitrary binary information.&lt;&#x2F;p&gt;
&lt;p&gt;The format &lt;strong&gt;must&lt;&#x2F;strong&gt; be documented.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Information Source</title>
        <published>2025-08-13T00:00:00+00:00</published>
        <updated>2025-08-13T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/data/information-source/"/>
        <id>https://doc.unobtanium.rocks/data/information-source/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/data/information-source/">&lt;p&gt;The information source enumeration describes which part of a document a piece of information came from.&lt;&#x2F;p&gt;
&lt;p&gt;Possible values are:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Id&lt;&#x2F;th&gt;&lt;th&gt;Name&lt;&#x2F;th&gt;&lt;th&gt;Description&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;unknown&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;The source is officially unknown&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;response_header&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;From a (network) response header&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;file_metadata&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;From the metadata section of a file&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;file_body&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;From the content section of a file&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;summary&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;The information is a summary derived from other data&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
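&lt;p&gt;In application code the table above could be mirrored by an integer enumeration. The following Python sketch is only an illustration; the class name &lt;code&gt;InformationSource&lt;&#x2F;code&gt; and the helper &lt;code&gt;decode_information_source&lt;&#x2F;code&gt; are hypothetical, and rejecting undocumented ids is a design choice of this sketch, not documented behaviour.&lt;&#x2F;p&gt;

```python
from enum import IntEnum


class InformationSource(IntEnum):
    """Ids and names copied from the table above."""

    UNKNOWN = 0
    RESPONSE_HEADER = 1
    FILE_METADATA = 2
    FILE_BODY = 3
    SUMMARY = 4


def decode_information_source(raw):
    """Turn a stored integer into an enum member; reject ids not in the table."""
    try:
        return InformationSource(raw)
    except ValueError:
        raise ValueError(f"undocumented information source id: {raw}") from None
```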
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Link destination type</title>
        <published>2025-08-13T00:00:00+00:00</published>
        <updated>2025-08-13T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/data/link-destination-type/"/>
        <id>https://doc.unobtanium.rocks/data/link-destination-type/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/data/link-destination-type/">&lt;p&gt;The link destination type enumerates what to expect when following a link across all of its redirects.&lt;&#x2F;p&gt;
&lt;p&gt;It is mostly derived from the HTML element that represents the link.&lt;&#x2F;p&gt;
&lt;p&gt;Available values are:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Id&lt;&#x2F;th&gt;&lt;th&gt;Name&lt;&#x2F;th&gt;&lt;th&gt;Description&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;file&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Unknown content&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;document&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Browsable document&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;feed&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;A feed as in RSS, Atom or similar&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;stylesheet&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;A stylesheet&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;script&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Executable script intended for a browser&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;100&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;media&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Media or media-related file (image, audio, video, playlist, subtitles, etc.)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;101&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;media_image&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Image File&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;102&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;media_audio&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Audio File&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;103&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;media_video&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Video File&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1000&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;communication&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Communication link like &lt;code&gt;mailto&lt;&#x2F;code&gt; or &lt;code&gt;tel&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
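&lt;p&gt;A consumer that only needs to know whether a link points at some kind of media could group the values as sketched below. The grouping is an assumption based on the value names in the table; the class name &lt;code&gt;LinkDestinationType&lt;&#x2F;code&gt;, the set &lt;code&gt;MEDIA_TYPES&lt;&#x2F;code&gt; and the helper &lt;code&gt;is_media&lt;&#x2F;code&gt; are hypothetical.&lt;&#x2F;p&gt;

```python
from enum import IntEnum


class LinkDestinationType(IntEnum):
    """Ids and names copied from the table above."""

    FILE = 0
    DOCUMENT = 1
    FEED = 2
    STYLESHEET = 3
    SCRIPT = 4
    MEDIA = 100
    MEDIA_IMAGE = 101
    MEDIA_AUDIO = 102
    MEDIA_VIDEO = 103
    COMMUNICATION = 1000


# Assumption: the generic media value and the three media_* values form one family.
MEDIA_TYPES = frozenset(
    {
        LinkDestinationType.MEDIA,
        LinkDestinationType.MEDIA_IMAGE,
        LinkDestinationType.MEDIA_AUDIO,
        LinkDestinationType.MEDIA_VIDEO,
    }
)


def is_media(raw):
    """Return True if the stored integer is one of the media values."""
    return LinkDestinationType(raw) in MEDIA_TYPES
```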
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Summary Database</title>
        <published>2025-08-13T00:00:00+00:00</published>
        <updated>2025-08-13T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/data/summary-database/"/>
        <id>https://doc.unobtanium.rocks/data/summary-database/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/data/summary-database/">&lt;p&gt;The summary database schema is implemented on top of the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;base-database&#x2F;&quot;&gt;base database schema&lt;&#x2F;a&gt; and mainly houses the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;entity-data-tree&#x2F;&quot;&gt;entity data tree&lt;&#x2F;a&gt;. It is mainly built by the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;summarizing&#x2F;&quot;&gt;summarizing algorithm&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;&#x2F;h2&gt;
&lt;p&gt;Tables in the summary database are:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;entity_generation&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		The root of all summary data, spacetime coordinates for websites. See &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;entity-generation&#x2F;&quot;&gt;entity generation&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;duplicate_summary&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Information about &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;duplicate-types&#x2F;#exact-duplicate&quot;&gt;exact duplicates&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;crawl_summary&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Stores the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-summary&#x2F;&quot;&gt;crawl summaries&lt;&#x2F;a&gt; that are derived from the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-log&#x2F;&quot;&gt;crawl log&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;http_summary&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		HTTP extension for the &lt;code&gt;crawl_summary&lt;&#x2F;code&gt; table.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;file_summary&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		File metadata derived from the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawler-database&#x2F;#file&quot;&gt;&lt;code&gt;file&lt;&#x2F;code&gt; table in the crawler database&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;redirect_summary&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Redirect metadata derived from the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawler-database&#x2F;#redirect&quot;&gt;&lt;code&gt;redirect&lt;&#x2F;code&gt; table in the crawler database&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;link_summary&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Lists parsed links in files.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;document_description&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Document metadata.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;text_pile&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Collection of document text indexed by hash (deprecated), see &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;text-pile-gen1&#x2F;&quot;&gt;text pile gen1&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;text_pile_v0_2&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Contains the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;text-pile-gen2&#x2F;&quot;&gt;Text Pile Gen2&lt;&#x2F;a&gt; as generated by version 0.2 of the &lt;code&gt;unobtanium-text-pile&lt;&#x2F;code&gt; crate.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Since &lt;code&gt;0.0.0-textrank-preview.0&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;token&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Enumerates all possible tokens that can make up a document in this database.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;token_statistics&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Maps tokens to text piles, also referred to as the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;token-index&#x2F;&quot;&gt;token index&lt;&#x2F;a&gt;, this is experimental.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h2 id=&quot;tables&quot;&gt;Tables&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;entity-generation&quot;&gt;&lt;code&gt;entity_generation&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;code&gt;entity_generation&lt;&#x2F;code&gt; table stores &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;entity-generation&#x2F;&quot;&gt;entity generations&lt;&#x2F;a&gt;, it has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;entity_generation_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;entity_generation_uuid&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#uuid&#x27; &gt;UUID&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;External Key&lt;&#x2F;span&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;url_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  Reference to the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;base-database&#x2F;#url&quot;&gt;&lt;code&gt;url&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt;, which URL this entity generation is about.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;url_fragment&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; The id of the document element this entity generation is about.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that this entity generation is about the whole document
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;first_seen_unix_utc&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-timestamp&#x27; &gt;Integer &#x2F; Timestamp&lt;&#x2F;a&gt;  First known existence of this generation (may change if better data is integrated)
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;last_seen_unix_utc&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-timestamp&#x27; &gt;Integer &#x2F; Timestamp&lt;&#x2F;a&gt;  Last known existence, may equal &lt;code&gt;first_seen_unix_utc&lt;&#x2F;code&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;confirmed_end_unix_utc&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-timestamp&#x27; &gt;Integer &#x2F; Timestamp&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; The time the entity generation is known to be closed. (i.e. the &lt;code&gt;first_seen&lt;&#x2F;code&gt; of the next entity generation)
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that the entity generation is still open.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;marked_duplicate&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Whether this is marked as a duplicate in the &lt;a href=&quot;#duplicate-summary&quot;&gt;&lt;code&gt;duplicate_summary&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt;. This is a cache value intended to accelerate queries by requiring fewer joins.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;text_pile_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; Reference to the &lt;a href=&quot;#text-pile&quot;&gt;&lt;code&gt;text_pile&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt;, that contains the text content that this entity generation represents.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that this entity generation does not have any searchable text attached.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;text_pile_v0_2_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; Reference to the &lt;a href=&quot;#text-pile-v0-2&quot;&gt;&lt;code&gt;text_pile_v0_2&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt;, that contains the text content this entity generation represents.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Since &lt;code&gt;0.0.0-textrank-preview.0&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;p&gt;There is a check that &lt;code&gt;first_seen_unix_utc &amp;lt;= last_seen_unix_utc&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
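&lt;p&gt;The effect of such a check can be sketched with a reduced table that has only the two timestamp columns. This is an illustration using the Python standard library, not the actual DDL: the constraint below expresses the same condition with the operands swapped, and the &lt;code&gt;try_insert&lt;&#x2F;code&gt; helper is hypothetical.&lt;&#x2F;p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, hypothetical table: only the two columns involved in the check.
conn.execute(
    """
    CREATE TABLE entity_generation (
        entity_generation_id INTEGER NOT NULL PRIMARY KEY,
        first_seen_unix_utc INTEGER NOT NULL,
        last_seen_unix_utc INTEGER NOT NULL,
        CHECK (last_seen_unix_utc >= first_seen_unix_utc)
    )
    """
)


def try_insert(row_id, first_seen, last_seen):
    """Return True if SQLite accepts the row, False if the check rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO entity_generation VALUES (?, ?, ?)",
                (row_id, first_seen, last_seen),
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    try_insert(1, 100, 200),  # seen over a span of time: accepted
    try_insert(2, 100, 100),  # seen exactly once: accepted
    try_insert(3, 200, 100),  # last seen before first seen: rejected
]
```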
&lt;p&gt;Indices:&lt;&#x2F;p&gt;
&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;entity_generation_by_uuid&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		On field &lt;code&gt;entity_generation_uuid&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Speeds up looking for entity generations by UUID.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;entity_generation_by_first_seen&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		On fields &lt;code&gt;url_id&lt;&#x2F;code&gt;, &lt;code&gt;url_fragment&lt;&#x2F;code&gt; and &lt;code&gt;first_seen_unix_utc&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Speeds up looking for entity generations based on when they were first seen for a given URL.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;entity_generation_by_text_pile&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		On fields &lt;code&gt;text_pile_id&lt;&#x2F;code&gt; and &lt;code&gt;confirmed_end_unix_utc&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Speeds up looking for non-closed entity generations given a &lt;code&gt;text_pile_id&lt;&#x2F;code&gt;; this is needed for searching.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;entity_generation_by_active&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		On fields &lt;code&gt;marked_duplicate&lt;&#x2F;code&gt;, &lt;code&gt;confirmed_end_unix_utc&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Speeds up querying for (non) duplicates.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h3 id=&quot;duplicate-summary&quot;&gt;&lt;code&gt;duplicate_summary&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;b&gt;Note:&lt;&#x2F;b&gt; There is no way to directly address the duplicate summaries, they are always referenced as being the attachments of an entity generation.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;duplicate_summary&lt;&#x2F;code&gt; table has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;subject_entity_generation_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  Reference to the &lt;a href=&quot;#entity-generation&quot;&gt;&lt;code&gt;entity_generation&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt;, of the entity generation that is flagged as duplicate.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;duplicate_of_entity_generation_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  Reference to the &lt;a href=&quot;#entity-generation&quot;&gt;&lt;code&gt;entity_generation&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt;, of the entity generation that the flagged entity generation is a duplicate of.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;duplicate_status_start_unix_utc&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-timestamp&#x27; &gt;Integer &#x2F; Timestamp&lt;&#x2F;a&gt;  When the duplicate status started (usually the time it was detected)
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;duplicate_status_end_unix_utc&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-timestamp&#x27; &gt;Integer &#x2F; Timestamp&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; When the duplicate status ended
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that the duplicate status hasn&#x27;t ended yet.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;p&gt;Indices:&lt;&#x2F;p&gt;
&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;duplicate_summary_by_duplicate&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		On fields &lt;code&gt;subject_entity_generation_id&lt;&#x2F;code&gt; and &lt;code&gt;duplicate_status_end_unix_utc&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Speeds up looking if some entity generation is a non-expired duplicate.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;
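&lt;p&gt;The lookup that this index is meant to speed up might look like the sketch below, written with the Python standard library. The DDL is an assumption derived from the field and index lists above (foreign keys are omitted), the rows are made up, and the helper &lt;code&gt;is_current_duplicate&lt;&#x2F;code&gt; is hypothetical.&lt;&#x2F;p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical DDL based on the field and index lists above; foreign keys are omitted.
conn.executescript(
    """
    CREATE TABLE duplicate_summary (
        subject_entity_generation_id INTEGER NOT NULL,
        duplicate_of_entity_generation_id INTEGER NOT NULL,
        duplicate_status_start_unix_utc INTEGER NOT NULL,
        duplicate_status_end_unix_utc INTEGER
    );
    CREATE INDEX duplicate_summary_by_duplicate
        ON duplicate_summary (subject_entity_generation_id, duplicate_status_end_unix_utc);
    """
)
conn.executemany(
    "INSERT INTO duplicate_summary VALUES (?, ?, ?, ?)",
    [
        (10, 1, 1700000000, None),        # still a duplicate of generation 1
        (11, 1, 1700000000, 1700003600),  # duplicate status has ended
    ],
)


def is_current_duplicate(connection, entity_generation_id):
    """Return True if the entity generation has a duplicate status that has not ended."""
    row = connection.execute(
        "SELECT 1 FROM duplicate_summary "
        "WHERE subject_entity_generation_id = ? "
        "AND duplicate_status_end_unix_utc IS NULL LIMIT 1",
        (entity_generation_id,),
    ).fetchone()
    return row is not None
```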

&lt;h3 id=&quot;crawl-summary&quot;&gt;&lt;code&gt;crawl_summary&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;code&gt;crawl_summary&lt;&#x2F;code&gt; table has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;crawl_summary_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;entity_generation_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  Reference to the &lt;a href=&quot;#entity-generation&quot;&gt;&lt;code&gt;entity_generation&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt;, of the entity generation that this crawl resulted in.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;was_robotstxt_approved&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Whether the crawl was approved by a &lt;code&gt;robots.txt&lt;&#x2F;code&gt; file.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;crawl_type&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-enumeration&#x27; &gt;Integer &#x2F; Enumeration&lt;&#x2F;a&gt;  Which &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-log&#x2F;#crawl-type&quot;&gt;crawl type&lt;&#x2F;a&gt; resulted in this entry.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;crawl_uuid&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#uuid&#x27; &gt;UUID&lt;&#x2F;a&gt;  The UUID of the corresponding &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-log&#x2F;&quot;&gt;crawl log entry&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;agent_uuid&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#uuid&#x27; &gt;UUID&lt;&#x2F;a&gt;  The UUID of the crawl agent that made the requests.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;time_started_unix_utc&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-timestamp&#x27; &gt;Integer &#x2F; Timestamp&lt;&#x2F;a&gt;  The time when the crawl action was started.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;exit_code&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-enumeration&#x27; &gt;Integer &#x2F; Enumeration&lt;&#x2F;a&gt;  The &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-exit-code&#x2F;&quot;&gt;crawl exit code&lt;&#x2F;a&gt; of the crawl action.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;request_duration_ms&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-duration&#x27; &gt;Integer &#x2F; Duration&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; How long it took to execute the crawl action, in milliseconds.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that the measurement failed.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;p&gt;Indices:&lt;&#x2F;p&gt;




	
		
			
		
	
		
			
			
				
			
		
	
		
			
			
				
			
		
	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;crawl_summary_by_crawl_uuid&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		On field &lt;code&gt;crawl_uuid&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Speeds up looking for already integrated crawls via &lt;code&gt;test_has_crawl_summary_with_uuid_bulk()&lt;&#x2F;code&gt;.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;
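&lt;p&gt;As a rough illustration of what this index is for, the following Python sketch runs a bulk lookup on &lt;code&gt;crawl_uuid&lt;&#x2F;code&gt; against an in-memory SQLite database. The reduced table layout, the text encoding of the UUID and the helper &lt;code&gt;find_integrated_crawls()&lt;&#x2F;code&gt; are assumptions made for this example; it is not the actual implementation of &lt;code&gt;test_has_crawl_summary_with_uuid_bulk()&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code&gt;
```python
import sqlite3
import uuid

# Simplified stand-in for the crawl_summary table: only the columns needed here.
# Storing the UUID as text is an assumption made for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE crawl_summary ("
    " crawl_summary_id INTEGER PRIMARY KEY,"
    " crawl_uuid TEXT NOT NULL)"
)
conn.execute(
    "CREATE INDEX crawl_summary_by_crawl_uuid ON crawl_summary (crawl_uuid)"
)


def find_integrated_crawls(conn, crawl_uuids):
    """Return the subset of crawl_uuids that already has a crawl_summary row.

    Hypothetical helper, not the real test_has_crawl_summary_with_uuid_bulk().
    """
    crawl_uuids = list(crawl_uuids)
    if not crawl_uuids:
        return set()
    # One placeholder per UUID so the whole batch is checked in one statement.
    placeholders = ", ".join("?" for _ in crawl_uuids)
    rows = conn.execute(
        "SELECT crawl_uuid FROM crawl_summary"
        f" WHERE crawl_uuid IN ({placeholders})",
        crawl_uuids,
    )
    return {row[0] for row in rows}


known = str(uuid.uuid4())
unknown = str(uuid.uuid4())
conn.execute("INSERT INTO crawl_summary (crawl_uuid) VALUES (?)", (known,))

already_integrated = find_integrated_crawls(conn, [known, unknown])
```
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For very large batches the list would have to be split into chunks, because SQLite limits the number of bound parameters per statement.&lt;&#x2F;p&gt;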

&lt;h3 id=&quot;http-summary&quot;&gt;&lt;code&gt;http_summary&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;An entry in this table is only present if the crawl resulted in an HTTP response.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;http_summary&lt;&#x2F;code&gt; table has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;crawl_summary_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt; Reference to the &lt;a href=&quot;#crawl-summary&quot;&gt;&lt;code&gt;crawl_summary&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt; of the extended crawl summary.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;status_code&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-enumeration&#x27; &gt;Integer &#x2F; Enumeration&lt;&#x2F;a&gt;  The resulting &lt;a href=&quot;https:&#x2F;&#x2F;developer.mozilla.org&#x2F;en-US&#x2F;docs&#x2F;Web&#x2F;HTTP&#x2F;Reference&#x2F;Status&quot;&gt;HTTP status code&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;etag&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; The resulting HTTP ETag header.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that no ETag header was returned.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;
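&lt;p&gt;Because &lt;code&gt;http_summary&lt;&#x2F;code&gt; only has a row when there was an HTTP response, a consumer would typically reach it through a left join. The Python sketch below shows this on a reduced, in-memory copy of the two tables; the column constraints and the exit code values are placeholders chosen for the example, not the real schema or enumeration values.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code&gt;
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE crawl_summary (
        crawl_summary_id INTEGER PRIMARY KEY,
        exit_code INTEGER NOT NULL
    );
    CREATE TABLE http_summary (
        crawl_summary_id INTEGER PRIMARY KEY
            REFERENCES crawl_summary (crawl_summary_id),
        status_code INTEGER NOT NULL,
        etag TEXT
    );
    -- Crawl 1 got an HTTP response, crawl 2 did not.
    -- The exit_code values are placeholders for this sketch.
    INSERT INTO crawl_summary (crawl_summary_id, exit_code) VALUES (1, 0), (2, 1);
    INSERT INTO http_summary (crawl_summary_id, status_code, etag)
        VALUES (1, 200, NULL);
    """
)

# LEFT JOIN keeps crawls without an HTTP response; their columns come back as NULL.
rows = conn.execute(
    "SELECT c.crawl_summary_id, h.status_code, h.etag"
    " FROM crawl_summary AS c"
    " LEFT JOIN http_summary AS h USING (crawl_summary_id)"
    " ORDER BY c.crawl_summary_id"
).fetchall()
```
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In such a query a missing &lt;code&gt;status_code&lt;&#x2F;code&gt; indicates that there was no HTTP response at all, while a missing &lt;code&gt;etag&lt;&#x2F;code&gt; next to a status code only means that no ETag header was returned.&lt;&#x2F;p&gt;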

&lt;h3 id=&quot;file-summary&quot;&gt;&lt;code&gt;file_summary&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;code&gt;file_summary&lt;&#x2F;code&gt; table has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;entity_generation_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt;&lt;span class=&#x27;unique-badge&#x27;&gt;Unique&lt;&#x2F;span&gt; Reference to the &lt;a href=&quot;#entity-generation&quot;&gt;&lt;code&gt;entity_generation&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt; of the extended entity generation. (There can only be one file per entity generation)
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;file_size&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; The file size in bytes
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that the file size is not known.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;mimetype_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; Reference to the &lt;a href=&quot;&#x2F;data&#x2F;base_database.md#mimetype&quot;&gt;&lt;code&gt;mimetype&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt; of the file&#x27;s mimetype.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that the mimetype is not known (i.e. when a conflict was detected between the claimed mimetype and the file content).
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;canonical_url_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; Reference to the &lt;a href=&quot;&#x2F;data&#x2F;base_database.md#url&quot;&gt;&lt;code&gt;url&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt;, which URL the file claims its canonical version to be at.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that the file didn&#x27;t claim to be the non-canonical version of another resource.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h3 id=&quot;redirect-summary&quot;&gt;&lt;code&gt;redirect_summary&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;code&gt;redirect_summary&lt;&#x2F;code&gt; table has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;entity_generation_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt; Reference to the &lt;a href=&quot;#entity-generation&quot;&gt;&lt;code&gt;entity_generation&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt; of the extended entity generation. (There can only be one redirect per entity generation)
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;to_url_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  Reference to the &lt;a href=&quot;&#x2F;data&#x2F;base_database.md#url&quot;&gt;&lt;code&gt;url&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt;, which URL the redirect points at.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;information_source&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-enumeration&#x27; &gt;Integer &#x2F; Enumeration&lt;&#x2F;a&gt;  Where the information for this redirect came from, see &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;information-source&#x2F;&quot;&gt;Information Source&lt;&#x2F;a&gt; for possible values.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;by_security_policy&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Whether the redirect was because of a security policy.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		🔧 This field is currently &lt;strong&gt;unused&lt;&#x2F;strong&gt; and may be removed in the future.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;to_url_fragment&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; Fragment part of the URL that the redirect points at.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that the destination URL has no fragment part.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		🔧 This field is currently &lt;strong&gt;unused&lt;&#x2F;strong&gt; as there is no equivalent in the crawler database; it could be populated from HTML redirects.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h3 id=&quot;link-summary&quot;&gt;&lt;code&gt;link_summary&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;code&gt;link_summary&lt;&#x2F;code&gt; table has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;link_summary_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;entity_generation_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt; Reference to the &lt;a href=&quot;#entity-generation&quot;&gt;&lt;code&gt;entity_generation&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt; of the entity generation this link is part of.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;link_to_url&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  Reference to the &lt;a href=&quot;&#x2F;data&#x2F;base_database.md#url&quot;&gt;&lt;code&gt;url&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt;, which URL the link points at.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;link_to_fragment&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; Fragment part of the URL that the link points at.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that the destination URL has no fragment part.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;rel_nofollow&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Whether the link was tagged with &lt;code&gt;rel=&quot;nofollow&quot;&lt;&#x2F;code&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;rel_me&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Whether the link was tagged with &lt;code&gt;rel=&quot;me&quot;&lt;&#x2F;code&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;rel_tag&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Whether the link was tagged with &lt;code&gt;rel=&quot;tag&quot;&lt;&#x2F;code&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;in_header&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Part of the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;location-signature&#x2F;&quot;&gt;location signature&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;in_footer&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Part of the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;location-signature&#x2F;&quot;&gt;location signature&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;in_aside&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Part of the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;location-signature&#x2F;&quot;&gt;location signature&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;in_nav&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Part of the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;location-signature&#x2F;&quot;&gt;location signature&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;in_form&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Part of the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;location-signature&#x2F;&quot;&gt;location signature&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;in_main&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Part of the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;location-signature&#x2F;&quot;&gt;location signature&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;in_article&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Part of the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;location-signature&#x2F;&quot;&gt;location signature&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;in_section&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Part of the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;location-signature&#x2F;&quot;&gt;location signature&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;in_table&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Part of the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;location-signature&#x2F;&quot;&gt;location signature&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;in_figure&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Part of the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;location-signature&#x2F;&quot;&gt;location signature&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;in_address&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Part of the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;location-signature&#x2F;&quot;&gt;location signature&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;in_code&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Part of the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;location-signature&#x2F;&quot;&gt;location signature&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;in_headline&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Part of the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;location-signature&#x2F;&quot;&gt;location signature&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;in_list&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Part of the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;location-signature&#x2F;&quot;&gt;location signature&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;in_paragraph&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Part of the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;location-signature&#x2F;&quot;&gt;location signature&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;contains_headline&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#bool&#x27; &gt;Bool&lt;&#x2F;a&gt;  Whether the link contains a headline element.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;destination_type&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-enumeration&#x27; &gt;Integer &#x2F; Enumeration&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; The &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;link-destination-type&#x2F;&quot;&gt;link destination type&lt;&#x2F;a&gt; of this link
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that the link destination type couldn&#x27;t be derived for some unknown reason.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;link_locality&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-enumeration&#x27; &gt;Integer &#x2F; Enumeration&lt;&#x2F;a&gt;  The &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;link-locality&#x2F;&quot;&gt;link locality&lt;&#x2F;a&gt; describing how far reaching this link is.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;html_tag_name&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Varchar&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; The name of the HTML element the link appeared in.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that the link wasn&#x27;t inside an HTML document or that there is no HTML equivalent for the link container (prefer using the &lt;code&gt;destination_type&lt;&#x2F;code&gt;).
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;text&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Varchar&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; Human readable text that describes the link.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means the link didn&#x27;t have any text to describe itself.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h3 id=&quot;document-description&quot;&gt;&lt;code&gt;document_description&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;b&gt;Note:&lt;&#x2F;b&gt; All data in this table is derived from the document on a &lt;strong&gt;best effort&lt;&#x2F;strong&gt; basis.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;document_description&lt;&#x2F;code&gt; table has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;entity_generation_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt; Reference to the &lt;a href=&quot;#entity-generation&quot;&gt;&lt;code&gt;entity_generation&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt; of the entity generation extended by this document description.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;time_created_unix_utc&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-timestamp&#x27; &gt;Integer &#x2F; Timestamp&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; When the document was created.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that this information is unavailable.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;time_updated_unix_utc&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer-timestamp&#x27; &gt;Integer &#x2F; Timestamp&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; When the document was last updated.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that this information is unavailable.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;primary_language&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Varchar&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; The language code that represents the primary document language. It &lt;strong&gt;should&lt;&#x2F;strong&gt; be a member of the &lt;a href=&quot;https:&#x2F;&#x2F;www.iana.org&#x2F;assignments&#x2F;language-subtag-registry&#x2F;language-subtag-registry&quot;&gt;IANA language subtag registry&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that this information is unavailable.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;title&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; The title of the document that would be appropriate to display on a browser tab.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that this information is unavailable.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;primary_headline&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; The main headline of the document. This should not be the site name.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that this information is unavailable.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;description&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; A short description of the document.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that this information is unavailable.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;indexiness&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  The &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;calculating-indexiness&#x2F;&quot;&gt;indexiness score&lt;&#x2F;a&gt; of the document.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h3 id=&quot;text-pile&quot;&gt;&lt;code&gt;text_pile&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;code&gt;text_pile&lt;&#x2F;code&gt; table stores &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;text-pile-gen1&#x2F;&quot;&gt;text piles Gen1&lt;&#x2F;a&gt;. It has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;text_pile_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;blake2b512_digest&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#blob&#x27; &gt;Blob&lt;&#x2F;a&gt; &lt;span class=&#x27;unique-badge&#x27;&gt;Unique&lt;&#x2F;span&gt; The &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;text-pile-gen1&#x2F;#digest&quot;&gt;text pile digest&lt;&#x2F;a&gt; that is used for &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;duplicate-types&#x2F;#exact-duplicates&quot;&gt;exact duplicate&lt;&#x2F;a&gt; detection. It also makes sure that each possible text pile is stored only once.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;text&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt;  Main file content, empty string if not present
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;secondary_text&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt;  Secondary file content, empty string if not present
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;big_headlines&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt;  Bigger headlines, empty string if not present
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;small_headlines&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt;  Smaller headlines, empty string if not present
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;code_text&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt;  Text marked up as code, empty string if not present
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;quote_text&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt;  Text marked up as quote, empty string if not present
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;
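&lt;p&gt;The role of the unique &lt;code&gt;blake2b512_digest&lt;&#x2F;code&gt; column can be sketched in Python as follows. This is only an illustration under stated assumptions: the table is reduced to two text columns, and &lt;code&gt;toy_digest()&lt;&#x2F;code&gt; and &lt;code&gt;store_text_pile()&lt;&#x2F;code&gt; are hypothetical helpers. In particular, the way the digest is computed here is made up for the example and is not the documented text pile digest.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code&gt;
```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced text_pile table: only two of the text columns are kept for brevity.
conn.execute(
    "CREATE TABLE text_pile ("
    " text_pile_id INTEGER PRIMARY KEY,"
    " blake2b512_digest BLOB NOT NULL UNIQUE,"
    " text TEXT NOT NULL,"
    " secondary_text TEXT NOT NULL)"
)


def toy_digest(text, secondary_text):
    """Illustrative 64-byte BLAKE2b digest; NOT the documented text pile digest."""
    hasher = hashlib.blake2b()  # default digest size is 64 bytes (512 bits)
    for part in (text, secondary_text):
        encoded = part.encode("utf-8")
        # The length prefix keeps ("ab", "c") and ("a", "bc") distinct.
        hasher.update(len(encoded).to_bytes(8, "big"))
        hasher.update(encoded)
    return hasher.digest()


def store_text_pile(conn, text, secondary_text=""):
    """Insert the text pile unless its digest exists; return its text_pile_id."""
    digest = toy_digest(text, secondary_text)
    # The UNIQUE constraint turns a second insert of the same content into a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO text_pile"
        " (blake2b512_digest, text, secondary_text) VALUES (?, ?, ?)",
        (digest, text, secondary_text),
    )
    row = conn.execute(
        "SELECT text_pile_id FROM text_pile WHERE blake2b512_digest = ?",
        (digest,),
    ).fetchone()
    return row[0]


first_id = store_text_pile(conn, "Hello world")
duplicate_id = store_text_pile(conn, "Hello world")
other_id = store_text_pile(conn, "Something else")
row_count = conn.execute("SELECT COUNT(*) FROM text_pile").fetchone()[0]
```
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Storing the same content twice yields the same &lt;code&gt;text_pile_id&lt;&#x2F;code&gt;, so two entities with identical text can be recognised as exact duplicates by comparing ids.&lt;&#x2F;p&gt;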

&lt;h3 id=&quot;text-pile-v0-2&quot;&gt;&lt;code&gt;text_pile_v0_2&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;code&gt;text_pile_v0_2&lt;&#x2F;code&gt; table stores &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;text-pile-gen2&#x2F;&quot;&gt;text piles Gen2&lt;&#x2F;a&gt;. It has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;text_pile_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;blake2b512_digest&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#blob&#x27; &gt;Blob&lt;&#x2F;a&gt; &lt;span class=&#x27;unique-badge&#x27;&gt;Unique&lt;&#x2F;span&gt; The &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;text-pile-gen2&#x2F;#digest&quot;&gt;text pile Gen2 digest&lt;&#x2F;a&gt; that is used for &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;duplicate-types&#x2F;#exact-duplicates&quot;&gt;exact duplicate&lt;&#x2F;a&gt; detection. It also makes sure that each possible text pile Gen2 is stored only once.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;text&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt;  The raw text behind the text pile Gen2.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;metadata&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#blob&#x27; &gt;Blob&lt;&#x2F;a&gt;  The text pile Gen2 metadata encoded as postcard.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;segmentation_cache&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#blob&#x27; &gt;Blob&lt;&#x2F;a&gt;  The segmentation metadata of the text pile Gen2, non-optional.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;p&gt;&lt;b&gt;Note on the &lt;code&gt;segmentation_cache&lt;&#x2F;code&gt;:&lt;&#x2F;b&gt; The segmentation cache in the database is deliberately non-optional to make handling of text data easier on the consumer side. Elsewhere, by contrast, the segmentation cache is an optional component of a text pile Gen2.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;token&quot;&gt;&lt;code&gt;token&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;code&gt;token&lt;&#x2F;code&gt; table has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;token_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;token_text&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;unique-badge&#x27;&gt;Unique&lt;&#x2F;span&gt; The text represented by this token.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h3 id=&quot;token-statistics&quot;&gt;&lt;code&gt;token_statistics&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;code&gt;token_statistics&lt;&#x2F;code&gt; table has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;token_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  Reference to the &lt;a href=&quot;#token&quot;&gt;&lt;code&gt;token&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt; of the token this is about.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;text_pile_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  Reference to the &lt;a href=&quot;#text-pile&quot;&gt;&lt;code&gt;text_pile&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt; of the text pile this is about.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;occurances&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  How often the token occurs in the text pile.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;p&gt;The combination of &lt;code&gt;token_id&lt;&#x2F;code&gt; and &lt;code&gt;text_pile_id&lt;&#x2F;code&gt; must be unique.&lt;&#x2F;p&gt;
&lt;p&gt;🔧 This table is a work in progress and may be subject to major changes in the future.&lt;&#x2F;p&gt;
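&lt;p&gt;The uniqueness rule on &lt;code&gt;token_id&lt;&#x2F;code&gt; and &lt;code&gt;text_pile_id&lt;&#x2F;code&gt; can be illustrated with a small Python sketch. Everything beyond the column names is an assumption for the sake of the example: the whitespace tokenizer, the helpers &lt;code&gt;token_id_for()&lt;&#x2F;code&gt; and &lt;code&gt;record_statistics()&lt;&#x2F;code&gt;, the use of &lt;code&gt;INSERT OR REPLACE&lt;&#x2F;code&gt; and the omission of the &lt;code&gt;text_pile&lt;&#x2F;code&gt; table itself.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code&gt;
```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE token (
        token_id INTEGER PRIMARY KEY,
        token_text TEXT NOT NULL UNIQUE
    );
    CREATE TABLE token_statistics (
        token_id INTEGER NOT NULL REFERENCES token (token_id),
        text_pile_id INTEGER NOT NULL,
        occurances INTEGER NOT NULL,
        UNIQUE (token_id, text_pile_id)
    );
    """
)


def token_id_for(conn, token_text):
    """Return the id of token_text, creating the token row if needed."""
    conn.execute(
        "INSERT OR IGNORE INTO token (token_text) VALUES (?)", (token_text,)
    )
    row = conn.execute(
        "SELECT token_id FROM token WHERE token_text = ?", (token_text,)
    ).fetchone()
    return row[0]


def record_statistics(conn, text_pile_id, text):
    """Count whitespace-separated lowercase tokens (a naive stand-in tokenizer)."""
    for token_text, count in Counter(text.lower().split()).items():
        # The unique pair makes a repeated run replace the row instead of adding one.
        conn.execute(
            "INSERT OR REPLACE INTO token_statistics"
            " (token_id, text_pile_id, occurances) VALUES (?, ?, ?)",
            (token_id_for(conn, token_text), text_pile_id, count),
        )


record_statistics(conn, 1, "the quick fox and the lazy dog")
record_statistics(conn, 1, "the quick fox and the lazy dog")  # same pile again

rows = conn.execute(
    "SELECT token_text, occurances FROM token_statistics"
    " JOIN token USING (token_id) WHERE text_pile_id = 1"
    " ORDER BY token_text"
).fetchall()
```
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Recording the same text pile twice leaves exactly one row per token, which is the property the uniqueness constraint guarantees.&lt;&#x2F;p&gt;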
&lt;h3 id=&quot;full-text-entity-index&quot;&gt;&lt;code&gt;full_text_entity_index&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;This table is backed by the &lt;code&gt;fts5&lt;&#x2F;code&gt; SQLite extension. It is mainly derived from the &lt;code&gt;text_pile&lt;&#x2F;code&gt; and &lt;code&gt;document_description&lt;&#x2F;code&gt; tables.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;full_text_entity_index&lt;&#x2F;code&gt; table has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;text_pile_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  Reference to the &lt;a href=&quot;#text-pile&quot;&gt;&lt;code&gt;text_pile&lt;&#x2F;code&gt; table&lt;&#x2F;a&gt; of the text pile this was derived from.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;any&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt;  Always contains the string &quot;any&quot;. This is needed to work around limitations of the fts5 &lt;code&gt;NOT&lt;&#x2F;code&gt; operator.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;title&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt;  The document title taken from the &lt;code&gt;document_description&lt;&#x2F;code&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;description&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; The description taken from the &lt;code&gt;document_description&lt;&#x2F;code&gt;.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Null means that no description is available.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;text&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt;  The text taken from the &lt;code&gt;text_pile&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;secondary_text&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt;  The secondary text taken from the &lt;code&gt;text_pile&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;big_headlines&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt;  The big headlines taken from the &lt;code&gt;text_pile&lt;&#x2F;code&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;small_headlines&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt;  The small headlines taken from the &lt;code&gt;text_pile&lt;&#x2F;code&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;code_text&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt;  The code text taken from the &lt;code&gt;text_pile&lt;&#x2F;code&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;quote_text&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt;  The quote text taken from the &lt;code&gt;text_pile&lt;&#x2F;code&gt;.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;
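&lt;p&gt;The fts5 &lt;code&gt;NOT&lt;&#x2F;code&gt; operator is binary: it needs a left-hand side that matches before it can exclude anything, so a query cannot consist of a negation alone. Because the &lt;code&gt;any&lt;&#x2F;code&gt; column contains the term &lt;code&gt;any&lt;&#x2F;code&gt; on every row, it can serve as that left-hand side. The following is a minimal sketch of such a query and not necessarily what unobtanium generates. The search term &lt;code&gt;example&lt;&#x2F;code&gt; and the selected column are assumptions chosen for illustration:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sql&quot; class=&quot;language-sql &quot;&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;-- Hypothetical query: every entry whose text does not contain the term example.
-- The always-matching any column provides the left-hand side for NOT.
SELECT text_pile_id
FROM full_text_entity_index
WHERE full_text_entity_index MATCH &amp;#x27;any:any NOT text:example&amp;#x27;;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;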

&lt;h2 id=&quot;version-history&quot;&gt;Version History&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;0-0-0-the-first-versioned&quot;&gt;0.0.0 - The first versioned schema&lt;&#x2F;h3&gt;
&lt;p&gt;This is the first versioned and documented version of this database.&lt;&#x2F;p&gt;
&lt;p&gt;This schema is released with unobtanium 3.0.0.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;0-0-0-textrank-preview-0&quot;&gt;0.0.0-textrank-preview.0&lt;&#x2F;h3&gt;
&lt;p&gt;This is an unstable development version to introduce and test the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;text-pile-gen2&#x2F;&quot;&gt;Text Pile Gen2&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>unobtanium-viewer</title>
        <published>2025-08-10T00:00:00+00:00</published>
        <updated>2025-08-10T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/manual/unobtanium-viewer/"/>
        <id>https://doc.unobtanium.rocks/manual/unobtanium-viewer/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/manual/unobtanium-viewer/">&lt;p&gt;The &lt;code&gt;unobtanium-viewer&lt;&#x2F;code&gt; takes a summary database and makes it accessible as a search frontend.&lt;&#x2F;p&gt;
&lt;p&gt;It exposes an HTTP socket, which you should put behind a reverse proxy.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;synopsis&quot;&gt;Synopsis&lt;&#x2F;h2&gt;
&lt;pre&gt;&lt;code&gt;unobtanium-viewer [OPTIONS...]
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h2 id=&quot;options&quot;&gt;Options&lt;&#x2F;h2&gt;
&lt;p&gt;Accepted options are:&lt;&#x2F;p&gt;

&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;-s&lt;&#x2F;code&gt;, &lt;code&gt;--summary-db&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;file.db&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Path to the summary database.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Default is &lt;code&gt;summary.db&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;-t&lt;&#x2F;code&gt;, &lt;code&gt;--template-location&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;template-directory&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Path to the &lt;a href=&quot;https:&#x2F;&#x2F;codeberg.org&#x2F;unobtanium&#x2F;unobtanium&#x2F;src&#x2F;branch&#x2F;main&#x2F;viewer&#x2F;templates&quot;&gt;template directory&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Default is &lt;code&gt;&#x2F;templates&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;-e&lt;&#x2F;code&gt;, &lt;code&gt;--extra-config&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;extra.toml&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Path to the template configuration file. Its content depends on the template. An example &lt;a href=&quot;https:&#x2F;&#x2F;codeberg.org&#x2F;unobtanium&#x2F;unobtanium&#x2F;src&#x2F;branch&#x2F;main&#x2F;viewer&#x2F;templates&#x2F;extra.toml&quot;&gt;file for the default template can be found in the source&lt;&#x2F;a&gt;. Make sure to keep this file up to date.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Default is &lt;code&gt;extra.toml&lt;&#x2F;code&gt; inside the template location
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;-l&lt;&#x2F;code&gt;, &lt;code&gt;--listen-on&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;ip:port&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		HTTP Socket to listen on.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Default is &lt;code&gt;127.0.0.1:3000&lt;&#x2F;code&gt; (local machine only, tcp port 3000)
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;--request-workers&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;n&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		How many worker threads to keep available for searching. Keep this number below the number of available CPU cores. It determines the number of concurrent queries that unobtanium can process.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Default is &lt;code&gt;3&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;--help&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Print help text describing the available options and exit.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h2 id=&quot;http-headers&quot;&gt;HTTP headers&lt;&#x2F;h2&gt;
&lt;p&gt;It is recommended to set the following security headers in your reverse proxy:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code&gt;X-Content-Type-Options &amp;quot;nosniff&amp;quot;
Referrer-Policy &amp;quot;strict-origin&amp;quot;
Permissions-Policy &amp;quot;interest-cohort=()&amp;quot;
Content-Security-Policy &amp;quot;default-src &amp;#x27;none&amp;#x27;; script-src &amp;#x27;self&amp;#x27; blob: ; style-src &amp;#x27;self&amp;#x27;; img-src &amp;#x27;self&amp;#x27;; media-src &amp;#x27;self&amp;#x27;; base-uri &amp;#x27;none&amp;#x27;; form-action &amp;#x27;self&amp;#x27; echoip.slatecave.net ; frame-ancestors &amp;#x27;none&amp;#x27;&amp;quot;
Cache-Control &amp;quot;max-age=3600, min-fresh=600, stale-if-error=86400, must-revalidate&amp;quot;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;b&gt;Note:&lt;&#x2F;b&gt; Currently the unobtanium frontend contains no JavaScript. This may change in the future to enable progressive enhancements or interactive features.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;sqlite-caching&quot;&gt;SQLite caching&lt;&#x2F;h2&gt;
&lt;p&gt;Unobtanium itself implements no caching mechanisms. Keeping some memory free so that the Linux kernel can hold frequently needed parts of the database in RAM is recommended and will greatly improve query speed.&lt;&#x2F;p&gt;
&lt;p&gt;Plan to keep at least 10% of the summary database size as &quot;unused&quot; RAM so that the kernel can use it for caching.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;serving-static-files-directly&quot;&gt;Serving static files directly&lt;&#x2F;h2&gt;
&lt;p&gt;Inside the template directory there is a &lt;code&gt;static&lt;&#x2F;code&gt; directory which can be served by your web server directly by overlaying it on top of the reverse proxy functionality.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;b&gt;Note:&lt;&#x2F;b&gt; This step is optional. If you skip it, the unobtanium-viewer will serve these files itself with the same functionality.&lt;&#x2F;p&gt;
&lt;p&gt;Example configuration for the Caddy web server assuming the template directory is &lt;code&gt;&#x2F;opt&#x2F;unobtanium&#x2F;default-template&#x2F;&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code&gt;your.domain {
	[… other configuration …]

	# Static file matcher
	@isStatic file {
		root &amp;#x2F;opt&amp;#x2F;unobtanium&amp;#x2F;default-template&amp;#x2F;static&amp;#x2F;
	}

	# Static file handler
	handle @isStatic {
		root * &amp;#x2F;opt&amp;#x2F;unobtanium&amp;#x2F;default-template&amp;#x2F;static&amp;#x2F;
		file_server
	}

	# Reverse proxy, the address must match the --listen-on option of the viewer
	reverse_proxy * localhost:11006 {
		header_up Host your.domain
	}
}
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h2 id=&quot;anubis&quot;&gt;Anubis&lt;&#x2F;h2&gt;
&lt;p&gt;In case you want to put the &lt;a href=&quot;https:&#x2F;&#x2F;anubis.techaro.lol&#x2F;&quot;&gt;Anubis scraper blocker&lt;&#x2F;a&gt; in front of the viewer because you&#x27;re receiving a high volume of automated queries, the default configuration for Anubis works fine.&lt;&#x2F;p&gt;
&lt;p&gt;It should not be necessary though, as unobtanium can handle high volumes of requests pretty well.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Database Info</title>
        <published>2025-08-09T00:00:00+00:00</published>
        <updated>2025-08-09T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/data/database-info/"/>
        <id>https://doc.unobtanium.rocks/data/database-info/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/data/database-info/">&lt;p&gt;The &lt;code&gt;unobtanium_database_info&lt;&#x2F;code&gt; table is part of the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;base-database&#x2F;&quot;&gt;Base Database Schema&lt;&#x2F;a&gt;. It contains schema versioning metadata about a given database.&lt;&#x2F;p&gt;
&lt;p&gt;Valid keys are:&lt;&#x2F;p&gt;

&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt id=&quot;key-unobtanium-base-schema-version&quot;&gt;
		&lt;code&gt;unobtanium_base_schema_version&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		The version of the base schema, e.g. &lt;code&gt;0.1.0&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;key-unobtanium-database-kind&quot;&gt;
		&lt;code&gt;unobtanium_database_kind&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Which kind of database this is, either &lt;code&gt;crawler&lt;&#x2F;code&gt; or &lt;code&gt;summary&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;key-unobtanium-crawler-schema-version&quot;&gt;
		&lt;code&gt;unobtanium_crawler_schema_version&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		The version of the crawler database schema, e.g. &lt;code&gt;1.0.0&lt;&#x2F;code&gt; (only if the database is a crawler database)
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;key-unobtanium-summary-schema-version&quot;&gt;
		&lt;code&gt;unobtanium_summary_schema_version&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		The version of the summary database schema, e.g. &lt;code&gt;0.0.0&lt;&#x2F;code&gt; (only if the database is a summary database)
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;p&gt;&lt;b&gt;Note:&lt;&#x2F;b&gt; The base schema may increment without the summary or crawler schema getting a version bump.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;manually-adding-version-information&quot;&gt;Manually adding version information&lt;&#x2F;h2&gt;
&lt;p&gt;In case you have an &quot;old&quot; database, run the following SQL manually.&lt;&#x2F;p&gt;
&lt;p&gt;For this to work, the database must already have the right schema. Otherwise it will fail as it did previously: with an error on some arbitrary SQL statement that happens to no longer work.&lt;&#x2F;p&gt;
&lt;p&gt;For &lt;strong&gt;crawler&lt;&#x2F;strong&gt; databases:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sql&quot; class=&quot;language-sql &quot;&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;INSERT INTO unobtanium_database_info(key,value) VALUES
	(&amp;#x27;unobtanium_base_schema_version&amp;#x27;, &amp;#x27;0.1.0&amp;#x27;),
	(&amp;#x27;unobtanium_database_kind&amp;#x27;, &amp;#x27;crawler&amp;#x27;),
	(&amp;#x27;unobtanium_crawler_schema_version&amp;#x27;,&amp;#x27;1.0.0&amp;#x27;);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For &lt;strong&gt;summary&lt;&#x2F;strong&gt; databases:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sql&quot; class=&quot;language-sql &quot;&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;INSERT INTO unobtanium_database_info(key,value) VALUES
	(&amp;#x27;unobtanium_base_schema_version&amp;#x27;, &amp;#x27;0.1.0&amp;#x27;),
	(&amp;#x27;unobtanium_database_kind&amp;#x27;, &amp;#x27;summary&amp;#x27;),
	(&amp;#x27;unobtanium_summary_schema_version&amp;#x27;,&amp;#x27;0.0.0&amp;#x27;);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
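&lt;p&gt;To check which version information a database carries, the table can be queried directly. This is a minimal sketch that assumes nothing beyond the &lt;code&gt;key&lt;&#x2F;code&gt; and &lt;code&gt;value&lt;&#x2F;code&gt; columns used in the statements above:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sql&quot; class=&quot;language-sql &quot;&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;-- List all version metadata stored in this database.
SELECT key, value FROM unobtanium_database_info ORDER BY key;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;After a successful insert the result should contain the base schema version, the database kind and the schema version matching that kind.&lt;&#x2F;p&gt;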
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Crawl Delay between Requests</title>
        <published>2025-08-06T00:00:00+00:00</published>
        <updated>2025-08-06T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/algorithm/crawl-delay/"/>
        <id>https://doc.unobtanium.rocks/algorithm/crawl-delay/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/algorithm/crawl-delay/">&lt;p&gt;The crawl delay is the time the crawler waits between requests to avoid causing too much load on web servers.&lt;&#x2F;p&gt;
&lt;p&gt;Choosing the crawl delay is always a compromise between not causing too much load on a server and getting the crawling done as fast as possible.&lt;&#x2F;p&gt;
&lt;p&gt;Currently unobtanium runs one crawler for each &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;url&#x2F;#origin&quot;&gt;Origin&lt;&#x2F;a&gt;, each with its own &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;crawl-loop&#x2F;&quot;&gt;crawl loop&lt;&#x2F;a&gt; and its own crawl delay calculation.&lt;&#x2F;p&gt;
&lt;p&gt;The crawl delay is implemented in &lt;a href=&quot;https:&#x2F;&#x2F;codeberg.org&#x2F;unobtanium&#x2F;unobtanium&#x2F;src&#x2F;branch&#x2F;main&#x2F;crawler&#x2F;src&#x2F;crawler&#x2F;crawl_delay.rs&quot;&gt;unobtanium&#x2F;crawler&#x2F;src&#x2F;crawler&#x2F;crawl_delay.rs&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;calculating-the-crawl-delay&quot;&gt;Calculating the Crawl Delay&lt;&#x2F;h2&gt;
&lt;p&gt;Unobtanium chooses from multiple delays:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;A minimum delay&lt;&#x2F;li&gt;
&lt;li&gt;A politeness based delay&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Of those the longer one will be chosen as the wait time until the next request.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-minimum-delay&quot;&gt;The Minimum Delay&lt;&#x2F;h3&gt;
&lt;p&gt;The minimum delay is either the &lt;code&gt;Crawl-Delay&lt;&#x2F;code&gt; from &lt;code&gt;robots.txt&lt;&#x2F;code&gt; (capped at 2 minutes), if available, or otherwise a &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;crawler-crawl-configuration&#x2F;#option-default-delay-ms&quot;&gt;configured minimum delay&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-politeness-based-delay&quot;&gt;The Politeness based Delay&lt;&#x2F;h3&gt;
&lt;p&gt;The politeness mechanism is &lt;a href=&quot;https:&#x2F;&#x2F;stract.com&#x2F;webmasters&quot;&gt;taken from the stract crawler&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;In general it is based on taking how long the last request took to respond and multiplying it by &lt;code&gt;2&lt;sup&gt;politeness&lt;&#x2F;sup&gt;&lt;&#x2F;code&gt;. The politeness &lt;i&gt;factor&lt;&#x2F;i&gt; starts at 2 and then decreases, but not below 0.&lt;&#x2F;p&gt;
&lt;p&gt;If an HTTP &lt;code&gt;429&lt;&#x2F;code&gt; response is received, the politeness factor will be incremented by 1 (doubling the wait time) and automatic decrementing will be turned off until the end of the crawl run.&lt;&#x2F;p&gt;
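&lt;p&gt;As a worked illustration with made-up numbers (these are not defaults taken from unobtanium): assume a minimum delay of 1000 ms and a server that answers every request in 300 ms. The politeness based delay is then &lt;code&gt;300 ms × 2&lt;sup&gt;politeness&lt;&#x2F;sup&gt;&lt;&#x2F;code&gt;, and the longer of the two delays is used:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Politeness 2: &lt;code&gt;300 ms × 2&lt;sup&gt;2&lt;&#x2F;sup&gt; = 1200 ms&lt;&#x2F;code&gt;, which is longer than the minimum delay, so the crawler waits 1200 ms.&lt;&#x2F;li&gt;
&lt;li&gt;Politeness 1: &lt;code&gt;300 ms × 2&lt;sup&gt;1&lt;&#x2F;sup&gt; = 600 ms&lt;&#x2F;code&gt;, which is shorter than the minimum delay, so the crawler waits 1000 ms.&lt;&#x2F;li&gt;
&lt;li&gt;Politeness 0: &lt;code&gt;300 ms × 2&lt;sup&gt;0&lt;&#x2F;sup&gt; = 300 ms&lt;&#x2F;code&gt;, so the crawler again waits the minimum of 1000 ms.&lt;&#x2F;li&gt;
&lt;li&gt;An HTTP &lt;code&gt;429&lt;&#x2F;code&gt; received at politeness 2 raises the factor to 3: &lt;code&gt;300 ms × 2&lt;sup&gt;3&lt;&#x2F;sup&gt; = 2400 ms&lt;&#x2F;code&gt;, and the factor no longer decreases for the rest of the crawl run.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;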
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Boolean</title>
        <published>2025-07-26T00:00:00+00:00</published>
        <updated>2025-07-26T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/criteria/boolean/"/>
        <id>https://doc.unobtanium.rocks/criteria/boolean/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/criteria/boolean/">&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;criterium&#x2F;latest&#x2F;criterium&#x2F;boolean&#x2F;enum.BooleanCriterium.html&quot;&gt;Boolean matching criterium from the &lt;code&gt;criterium&lt;&#x2F;code&gt; crate&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The boolean matching criterium has the following variants:&lt;&#x2F;p&gt;

&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;equals&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a boolean as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches if the boolean to match equals the given value.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;is_none&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes no argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches if the boolean to match is a None value.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Criterium Chain</title>
        <published>2025-07-26T00:00:00+00:00</published>
        <updated>2025-07-26T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/criteria/criterium-chain/"/>
        <id>https://doc.unobtanium.rocks/criteria/criterium-chain/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/criteria/criterium-chain/">&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;criterium&#x2F;latest&#x2F;criterium&#x2F;chain&#x2F;enum.CriteriumChain.html&quot;&gt;Criterium Chain from the &lt;code&gt;criterium&lt;&#x2F;code&gt; crate&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The criterium chain is used to combine multiple criteria of the same type using boolean logic. Multiple chains can be nested.&lt;&#x2F;p&gt;
&lt;p&gt;Which criterium a chain contains depends on the use case. It is stated in the documentation wherever a criterium chain is accepted as a value.&lt;&#x2F;p&gt;
&lt;p&gt;You can use the variants of the contained criterium directly in place of where the chain would go. Under the hood this will generate a chain of the &lt;code&gt;match&lt;&#x2F;code&gt; variant that matches when the contained criterium matches.&lt;&#x2F;p&gt;
&lt;p&gt;The criterium chain has the following variants:&lt;&#x2F;p&gt;

&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;and&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a list of chains of the same type as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches if all of the given chains match.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;or&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a list of chains of the same type as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches if at least one of the given chains matches.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;not&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a single chain as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches if the contained chain does not match.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;always&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes no argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Always matches.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;never&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes no argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Never matches.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;not_match&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a criterium matching the chain type as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches if the contained criterium does not match.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;match&lt;&#x2F;code&gt; (invisible variant)
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a criterium matching the chain type as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches if the contained criterium matches. This variant is just a type adaptor.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		This variant is &lt;strong&gt;invisible&lt;&#x2F;strong&gt; (untagged): you can use the argument it would take directly in place of this variant and it will be constructed automatically.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Number</title>
        <published>2025-07-26T00:00:00+00:00</published>
        <updated>2025-07-26T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/criteria/number/"/>
        <id>https://doc.unobtanium.rocks/criteria/number/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/criteria/number/">&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;criterium&#x2F;latest&#x2F;criterium&#x2F;number&#x2F;enum.NumberCriterium.html&quot;&gt;Number matching criterium from the &lt;code&gt;criterium&lt;&#x2F;code&gt; crate&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Number matching works the same on integers, floats, signed and unsigned numbers.&lt;&#x2F;p&gt;
&lt;p&gt;The number matching criterium has the following variants:&lt;&#x2F;p&gt;

&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;equals&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a number as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches if the number to match equals the given number.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;less_than&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a number as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches if the number to match is less than the given number.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;less_than_or_equal&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a number as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches if the number to match is less than or equal to the given number.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;greater_than&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a number as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches if the number to match is greater than the given number.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;greater_than_or_equal&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a number as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches if the number to match is greater than or equal to the given number.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;in_list&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a list of numbers as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches if the number to match is equal to one of the given numbers.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;is_none&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes no argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches if the number to match is a None value.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Origin</title>
        <published>2025-07-26T00:00:00+00:00</published>
        <updated>2025-07-26T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/criteria/origin/"/>
        <id>https://doc.unobtanium.rocks/criteria/origin/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/criteria/origin/">&lt;p&gt;This criterium is for matching &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;url&#x2F;#origin&quot;&gt;URL origins&lt;&#x2F;a&gt; in unobtanium.&lt;&#x2F;p&gt;
&lt;p&gt;The origin matching criterium has the following variants:&lt;&#x2F;p&gt;

&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt id=&quot;scheme&quot;&gt;
		&lt;code&gt;scheme&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;criteria&#x2F;string&#x2F;&quot;&gt;String Criterium&lt;&#x2F;a&gt; as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches the origin&#x27;s scheme against the given string criterium.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;host&quot;&gt;
		&lt;code&gt;host&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;criteria&#x2F;string&#x2F;&quot;&gt;String Criterium&lt;&#x2F;a&gt; as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches the origin&#x27;s host against the given string criterium.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;port&quot;&gt;
		&lt;code&gt;port&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;criteria&#x2F;number&#x2F;&quot;&gt;Number Criterium&lt;&#x2F;a&gt; for a &lt;code&gt;u16&lt;&#x2F;code&gt; as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches the origin&#x27;s port against the given number criterium.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

</content>
        
    </entry>
    <entry xml:lang="en">
        <title>String</title>
        <published>2025-07-26T00:00:00+00:00</published>
        <updated>2025-07-26T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/criteria/string/"/>
        <id>https://doc.unobtanium.rocks/criteria/string/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/criteria/string/">&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;criterium&#x2F;latest&#x2F;criterium&#x2F;string&#x2F;enum.StringCriterium.html&quot;&gt;String matching criterium from the &lt;code&gt;criterium&lt;&#x2F;code&gt; crate&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The String criterium has the following variants:&lt;&#x2F;p&gt;

&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;equals&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a string as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches if the string to match equals the given string.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;has_prefix&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a string as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches if the string to match has the given string as prefix.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;has_suffix&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a string as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches if the string to match has the given string as suffix.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;contains&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a string as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches if the string to match contains the given string as a substring.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;length&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;criteria&#x2F;number&#x2F;&quot;&gt;Number Criterium&lt;&#x2F;a&gt; as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches the length of the string to match, measured in Unicode code points, against the given Number Criterium.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;is_none&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes no argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches if the string to match is a None value.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Url</title>
        <published>2025-07-26T00:00:00+00:00</published>
        <updated>2025-07-26T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/criteria/url/"/>
        <id>https://doc.unobtanium.rocks/criteria/url/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/criteria/url/">&lt;p&gt;A criterium for matching against &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;url&#x2F;&quot;&gt;URLs&lt;&#x2F;a&gt; in unobtanium.&lt;&#x2F;p&gt;
&lt;p&gt;The URL matching criterium has the following variants:&lt;&#x2F;p&gt;

&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt id=&quot;user&quot;&gt;
		&lt;code&gt;user&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;criteria&#x2F;string&#x2F;&quot;&gt;String Criterium&lt;&#x2F;a&gt; as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches the URL&#x27;s user part against the given string criterium.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Defaults to an empty string, will never be a none value.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;password&quot;&gt;
		&lt;code&gt;password&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;criteria&#x2F;string&#x2F;&quot;&gt;String Criterium&lt;&#x2F;a&gt; as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches the URL&#x27;s password part against the given string criterium.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Defaults to a none value.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;path&quot;&gt;
		&lt;code&gt;path&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;criteria&#x2F;string&#x2F;&quot;&gt;String Criterium&lt;&#x2F;a&gt; as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches the URL&#x27;s path part against the given string criterium.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Defaults to an empty string, will never be a none value.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;query&quot;&gt;
		&lt;code&gt;query&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;criteria&#x2F;string&#x2F;&quot;&gt;String Criterium&lt;&#x2F;a&gt; as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches the URL&#x27;s query part against the given string criterium.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Defaults to a none value.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;fragment&quot;&gt;
		&lt;code&gt;fragment&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;criteria&#x2F;string&#x2F;&quot;&gt;String Criterium&lt;&#x2F;a&gt; as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches the URL&#x27;s fragment part against the given string criterium.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Defaults to a none value.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;scheme&quot;&gt;
		&lt;code&gt;scheme&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;criteria&#x2F;origin&#x2F;#scheme&quot;&gt;Same as &lt;code&gt;scheme&lt;&#x2F;code&gt; on the Origin criterium&lt;&#x2F;a&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;host&quot;&gt;
		&lt;code&gt;host&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;criteria&#x2F;origin&#x2F;#host&quot;&gt;Same as &lt;code&gt;host&lt;&#x2F;code&gt; on the Origin criterium&lt;&#x2F;a&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;port&quot;&gt;
		&lt;code&gt;port&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;criteria&#x2F;origin&#x2F;#port&quot;&gt;Same as &lt;code&gt;port&lt;&#x2F;code&gt; on the Origin criterium&lt;&#x2F;a&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;equals&quot;&gt;
		&lt;code&gt;equals&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a URL as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches if the URL to match equals the given one.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;same-document-as&quot;&gt;
		&lt;code&gt;same_document_as&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a URL without a fragment part as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches if the URL to match equals the given URL ignoring the fragment part.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;
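
&lt;p&gt;As a minimal sketch of how these variants might be written in a crawl policy, assuming the &lt;code&gt;equals&lt;&#x2F;code&gt; and &lt;code&gt;has_prefix&lt;&#x2F;code&gt; string criteria that appear in the crawl configuration examples. The host &lt;code&gt;example.org&lt;&#x2F;code&gt;, the paths and the reasons are made up for illustration:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;toml&quot; class=&quot;language-toml &quot;&gt;&lt;code class=&quot;language-toml&quot; data-lang=&quot;toml&quot;&gt;# Hypothetical policy: match one document regardless of the fragment part
[[do_not_crawl]]
reason = &amp;quot;Skip the changelog, whichever section is linked&amp;quot;
if.url.same_document_as = &amp;quot;https:&amp;#x2F;&amp;#x2F;example.org&amp;#x2F;changelog&amp;quot;

# Hypothetical policy: match two URL parts, each with a string criterium
[[do_not_crawl]]
reason = &amp;quot;Skip everything below &amp;#x2F;private on example.org&amp;quot;
if.and = [
	{ url.host.equals = &amp;quot;example.org&amp;quot; },
	{ url.path.has_prefix = &amp;quot;&amp;#x2F;private&amp;quot; },
]
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;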

</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Crawler crawl configuration</title>
        <published>2025-07-26T00:00:00+00:00</published>
        <updated>2025-07-26T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/manual/crawler-crawl-configuration/"/>
        <id>https://doc.unobtanium.rocks/manual/crawler-crawl-configuration/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/manual/crawler-crawl-configuration/">&lt;p&gt;This page documents the configuration for the &lt;code&gt;unobtanium-crawler crawl&lt;&#x2F;code&gt; subcommand that is given with the &lt;code&gt;--config&lt;&#x2F;code&gt; option.&lt;&#x2F;p&gt;
&lt;p&gt;!note: &lt;b&gt;New here?&lt;&#x2F;b&gt; &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;your-first-search-engine&#x2F;&quot;&gt;A guided tour for your first search engine&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;!note: &lt;b&gt;More Examples:&lt;&#x2F;b&gt;
Examples of working configuration files can be found in the &lt;a href=&quot;https:&#x2F;&#x2F;codeberg.org&#x2F;unobtanium&#x2F;unobtanium-rocks-index&#x2F;src&#x2F;branch&#x2F;main&#x2F;index-config&quot;&gt;unobtanium.rocks index configuration&lt;&#x2F;a&gt;. (Those files are behind the index on &lt;a href=&quot;https:&#x2F;&#x2F;unobtanium.rocks&quot;&gt;unobtanium.rocks&lt;&#x2F;a&gt;.)&lt;&#x2F;p&gt;
&lt;p&gt;Available settings are:&lt;&#x2F;p&gt;




	
		
			
		
	
		
			
			
				
			
		
	
		
	
		
			
		
	
		
			
			
		
	
		
	
		
			
		
	
		
			
			
		
	
		
			
			
				
			
		
	
		
	
		
			
		
	
		
			
			
		
	
		
	
		
			
		
	
		
			
			
		
	
		
			
			
		
	
		
	
		
			
		
	
		
			
			
		
	
		
			
			
		
	
		
	
		
			
		
	
		
			
			
		
	
		
			
			
		
	
		
	
		
			
		
	
		
			
			
		
	
		
			
			
		
	
		
	
		
			
		
	
		
			
			
		
	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt id=&quot;option-name&quot;&gt;
		&lt;code&gt;name&lt;&#x2F;code&gt; (optional)
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Human readable name of the index for documentation purposes
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;option-description&quot;&gt;
		&lt;code&gt;description&lt;&#x2F;code&gt; (optional)
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Human readable description of the index for documentation purposes. When writing rich text here, format it as Markdown.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;option-database-file&quot;&gt;
		&lt;code&gt;database_file&lt;&#x2F;code&gt; (optional)
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		If given, the crawler database file&#x27;s path must end with the same path segments as given here, otherwise the crawler will throw an error. This is part of a safety mechanism to prevent mixing up multiple databases.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		The &lt;code&gt;--ignore-db-name-from-config&lt;&#x2F;code&gt; option can be used to ignore this setting.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;option-default-delay-ms&quot;&gt;
		&lt;code&gt;default_delay_ms&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		The default crawl delay in milliseconds between requests to the same site, used if the site&#x27;s &lt;code&gt;robots.txt&lt;&#x2F;code&gt; didn&#x27;t request a different one. A value between &lt;code&gt;1000&lt;&#x2F;code&gt; and &lt;code&gt;2000&lt;&#x2F;code&gt; milliseconds is a sane default. A larger value is more polite, but crawling will take longer.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;option-max-commands-per-run&quot;&gt;
		&lt;code&gt;max_commands_per_run&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		The number of crawling commands the crawler will execute before exiting automatically. A crawling command is roughly equivalent to one request. This stops the crawler from running too long. Setting this to &lt;code&gt;60&lt;&#x2F;code&gt; to &lt;code&gt;100&lt;&#x2F;code&gt; &lt;strong&gt;per indexed site&lt;&#x2F;strong&gt; is a sane default.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		The &lt;code&gt;--max-commands&lt;&#x2F;code&gt; option can be used to override this setting.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;option-recrawl-interval&quot;&gt;
		&lt;code&gt;recrawl_interval&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		How long the crawler waits before it schedules a site that has already been crawled successfully to be crawled again, in case it was updated. A sane default would be &lt;code&gt;4 week&lt;&#x2F;code&gt;, but it can be more or less depending on the use case. Guaranteed supported suffixes are &lt;code&gt;second&lt;&#x2F;code&gt;, &lt;code&gt;minute&lt;&#x2F;code&gt;, &lt;code&gt;hour&lt;&#x2F;code&gt;, &lt;code&gt;day&lt;&#x2F;code&gt;, &lt;code&gt;week&lt;&#x2F;code&gt;, &lt;code&gt;month&lt;&#x2F;code&gt;, &lt;code&gt;year&lt;&#x2F;code&gt;. Other suffixes might work but are not guaranteed to be supported long term.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		The &lt;code&gt;--force-recrawl&lt;&#x2F;code&gt; option can be used to ignore this setting.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;option-user-agent&quot;&gt;
		&lt;code&gt;user_agent&lt;&#x2F;code&gt; (optional, but highly recommended)
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Configure the user agent that will be used while crawling. See the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;crawler-user-agent-and-robots-txt&#x2F;&quot;&gt;Crawler User-Agent and robots.txt&lt;&#x2F;a&gt; page for more information.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Can be overridden using the &lt;code&gt;--user-agent&lt;&#x2F;code&gt; option.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;option-seeds&quot;&gt;
		&lt;code&gt;seeds&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		List of URLs that are the entry points for the crawler. The crawler will start at these and follow links from there, provided they point to an origin that is part of one of these seed URLs. See the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;crawl-loop&#x2F;&quot;&gt;Crawling Loop algorithm&lt;&#x2F;a&gt; page for more information.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Can be overridden using the &lt;code&gt;--schedule&lt;&#x2F;code&gt; option.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;option-do-not-crawl&quot;&gt;
		&lt;code&gt;do_not_crawl&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		List of policies to limit which pages are crawled on top of a site&#x27;s own &lt;code&gt;robots.txt&lt;&#x2F;code&gt;.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;
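
&lt;p&gt;Put together, a small configuration file might look like the following sketch. The values are illustrative, &lt;code&gt;example.org&lt;&#x2F;code&gt; and &lt;code&gt;example.db&lt;&#x2F;code&gt; are placeholders, and the value types (integers, strings, a list of strings) are assumptions based on the descriptions above. The &lt;code&gt;user_agent&lt;&#x2F;code&gt; setting is left out here because its notation is described on the page linked from its entry.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;toml&quot; class=&quot;language-toml &quot;&gt;&lt;code class=&quot;language-toml&quot; data-lang=&quot;toml&quot;&gt;# Illustrative values only, adjust them for your own index
name = &amp;quot;Example index&amp;quot;
description = &amp;quot;A small **example** index.&amp;quot;

# The path of the crawler database file has to end with this
database_file = &amp;quot;example.db&amp;quot;

default_delay_ms = 1500
# One seed site, so up to 100 commands per run
max_commands_per_run = 100
recrawl_interval = &amp;quot;4 week&amp;quot;

seeds = [&amp;quot;https:&amp;#x2F;&amp;#x2F;example.org&amp;#x2F;&amp;quot;]

# Policies go at the end of the file
[[do_not_crawl]]
reason = &amp;quot;Do not crawl login pages&amp;quot;
if.url.path.has_prefix = &amp;quot;&amp;#x2F;login&amp;quot;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;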

&lt;h2 id=&quot;policies&quot;&gt;Policies&lt;&#x2F;h2&gt;
&lt;p&gt;Policies are lists of configuration objects. The canonical way to write them is to use the &lt;a href=&quot;https:&#x2F;&#x2F;toml.io&#x2F;en&#x2F;v1.0.0#array-of-tables&quot;&gt;Array of Tables&lt;&#x2F;a&gt; TOML syntax at the end of the file.&lt;&#x2F;p&gt;
&lt;p&gt;See the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;crawl-loop&#x2F;&quot;&gt;Crawling Loop algorithm&lt;&#x2F;a&gt; page on the impact of crawl policies.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;do-not-crawl&quot;&gt;&lt;code&gt;do_not_crawl&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;code&gt;do_not_crawl&lt;&#x2F;code&gt; Policy can inhibit crawling when a &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;crawler-crawl-configuration&#x2F;#crawl-policy-criterium&quot;&gt;Crawl Policy Criterium&lt;&#x2F;a&gt; matches.&lt;&#x2F;p&gt;
&lt;p&gt;Settings for the do not crawl policy are:&lt;&#x2F;p&gt;




	
		
			
		
	
		
			
			
				
			
		
	
		
	
		
			
		
	
		
			
			
		
	


&lt;dl class=&quot;max-one-dd&quot;&gt;

	
		&lt;dt id=&quot;do-not-crawl--reason&quot;&gt;
		&lt;code&gt;reason&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Documents the reason this policy exists, usually a human readable version of the rule.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;do-not-crawl--if&quot;&gt;
		&lt;code&gt;if&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		The serialised &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;criteria&#x2F;criterium-chain&#x2F;&quot;&gt;criterium chain&lt;&#x2F;a&gt; carrying crawl policy criteria that will prevent crawling if matched. (Example: &lt;code&gt;if.url.path.has_prefix = &quot;login&quot;&lt;&#x2F;code&gt;)
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;p&gt;Example policy that prevents crawling of the issue tracker on &lt;code&gt;osmocom.org&lt;&#x2F;code&gt; that is under the path &lt;code&gt;&#x2F;issues&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;toml&quot; class=&quot;language-toml &quot;&gt;&lt;code class=&quot;language-toml&quot; data-lang=&quot;toml&quot;&gt;[[do_not_crawl]]
reason=&amp;quot;Don&amp;#x27;t crawl the osmocom issue tracker&amp;quot;
if.and = [
	{ url.host.equals = &amp;quot;osmocom.org&amp;quot; },
	{ url.path.has_prefix = &amp;quot;&amp;#x2F;issues&amp;quot; },
]
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h3 id=&quot;url-query-paramters&quot;&gt;&lt;code&gt;url_query_paramters&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;This crawl policy can be used to allow a combination of URL query parameters when a &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;crawler-crawl-configuration&#x2F;#crawl-policy-criterium&quot;&gt;Crawl Policy Criterium&lt;&#x2F;a&gt; matches.&lt;&#x2F;p&gt;
&lt;p&gt;By default, URLs with query parameters are ignored, as they usually point to page variants that are not useful for search applications.&lt;&#x2F;p&gt;
&lt;p&gt;Not all of the allowed parameters have to be present. If additional parameters are present, that URL won&#x27;t be crawled unless another rule allows that combination.&lt;&#x2F;p&gt;
&lt;p&gt;Rules do not combine their allow lists. Example: If one rule for a page only allows &lt;code&gt;foo&lt;&#x2F;code&gt; and another only allows &lt;code&gt;bar&lt;&#x2F;code&gt;, a URL containing both &lt;code&gt;foo&lt;&#x2F;code&gt; and &lt;code&gt;bar&lt;&#x2F;code&gt; parameters is still not allowed.&lt;&#x2F;p&gt;
&lt;p&gt;Settings for the URL query parameters policy are:&lt;&#x2F;p&gt;




	
		
			
		
	
		
			
			
				
			
		
	
		
	
		
			
		
	
		
			
			
		
	
		
	
		
			
		
	
		
			
			
		
	


&lt;dl class=&quot;max-one-dd&quot;&gt;

	
		&lt;dt id=&quot;url-query-parameters--reason&quot;&gt;
		&lt;code&gt;reason&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		The reason this policy exists, usually why this combination of query parameters is allowed.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;url-query-parameters--allow&quot;&gt;
		&lt;code&gt;allow&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		List of URL query parameters to allow. (Example: &lt;code&gt;allow = [&quot;page&quot;]&lt;&#x2F;code&gt;)
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;url-query-parameters--if&quot;&gt;
		&lt;code&gt;if&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		The serialised &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;criteria&#x2F;criterium-chain&#x2F;&quot;&gt;criterium chain&lt;&#x2F;a&gt; carrying crawl policy criteria that will activate this policy when matched.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;p&gt;Example policy that allows the crawler to use the &lt;code&gt;page&lt;&#x2F;code&gt; parameter on the site &lt;code&gt;theorangeone.net&lt;&#x2F;code&gt; to navigate the paginated index pages:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;toml&quot; class=&quot;language-toml &quot;&gt;&lt;code class=&quot;language-toml&quot; data-lang=&quot;toml&quot;&gt;[[url_query_paramters]]
if.url.host.equals = &amp;quot;theorangeone.net&amp;quot;
allow = [&amp;quot;page&amp;quot;]
reason = &amp;quot;used for pagination&amp;quot;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
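&lt;p&gt;To illustrate that allow lists are not combined, here is a sketch with the placeholder host &lt;code&gt;example.org&lt;&#x2F;code&gt; and the made-up parameter names &lt;code&gt;page&lt;&#x2F;code&gt; and &lt;code&gt;sort&lt;&#x2F;code&gt;. With only the first two rules, a URL carrying both parameters would not be crawled; the third rule allows the combination explicitly.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;toml&quot; class=&quot;language-toml &quot;&gt;&lt;code class=&quot;language-toml&quot; data-lang=&quot;toml&quot;&gt;# Allows URLs whose only query parameter is page
[[url_query_paramters]]
if.url.host.equals = &amp;quot;example.org&amp;quot;
allow = [&amp;quot;page&amp;quot;]
reason = &amp;quot;used for pagination&amp;quot;

# Allows URLs whose only query parameter is sort
[[url_query_paramters]]
if.url.host.equals = &amp;quot;example.org&amp;quot;
allow = [&amp;quot;sort&amp;quot;]
reason = &amp;quot;used for sorting&amp;quot;

# Needed for URLs that carry both page and sort
[[url_query_paramters]]
if.url.host.equals = &amp;quot;example.org&amp;quot;
allow = [&amp;quot;page&amp;quot;, &amp;quot;sort&amp;quot;]
reason = &amp;quot;sorted listings are paginated too&amp;quot;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Since not all allowed parameters have to be present, the third rule on its own would also cover URLs that carry only one of the two parameters.&lt;&#x2F;p&gt;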
&lt;h3 id=&quot;crawl-policy-criterium&quot;&gt;Crawl Policy Criterium&lt;&#x2F;h3&gt;
&lt;p&gt;The crawl policy criterium can match page metadata that is available at the scheduling stage, &lt;strong&gt;before&lt;&#x2F;strong&gt; the request happens.&lt;&#x2F;p&gt;
&lt;p&gt;The crawl policy criterium has the following variants:&lt;&#x2F;p&gt;




	
		
			
		
	
		
			
			
				
			
		
	
		
			
			
				
			
		
	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;url&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Takes a &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;criteria&#x2F;url&#x2F;&quot;&gt;URL Criterium&lt;&#x2F;a&gt; as argument.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Matches if the URL of the page to be crawled matches the given URL criterium.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

</content>
        
    </entry>
    <entry xml:lang="en">
        <title>unobtanium-crawler</title>
        <published>2025-07-26T00:00:00+00:00</published>
        <updated>2025-07-26T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/manual/unobtanium-crawler/"/>
        <id>https://doc.unobtanium.rocks/manual/unobtanium-crawler/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/manual/unobtanium-crawler/">&lt;p&gt;The &lt;code&gt;unobtanium-crawler&lt;&#x2F;code&gt; collects data from the web and summarizes it.&lt;&#x2F;p&gt;
&lt;p&gt;It maintains both the crawler and summary databases.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;synopsis&quot;&gt;Synopsis&lt;&#x2F;h2&gt;
&lt;p&gt;The crawler is split up into many subcommands.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;unobtanium-crawler crawl [OPTIONS]...&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;unobtanium-crawler summarize [OPTIONS]...&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;unobtanium-crawler optimize-db --database &amp;lt;database-file&amp;gt;&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;unobtanium-crawler regenerate-token-index [OPTIONS]...&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;unobtanium-crawler delete [SUBCOMMAND]&lt;&#x2F;code&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;old-crawl-log-entries [OPTIONS]...&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;unobtanium-crawler debug [SUBCOMMAND]&lt;&#x2F;code&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;indexiness [OPTIONS]...&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;query-crawl-log [OPTIONS]...&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;sqlite-version&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;crawl&quot;&gt;&lt;code&gt;crawl&lt;&#x2F;code&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;The &lt;code&gt;crawl&lt;&#x2F;code&gt; subcommand starts the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;crawl-loop&#x2F;&quot;&gt;Crawl Loop&lt;&#x2F;a&gt; for a given &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;crawler-crawl-configuration&#x2F;&quot;&gt;crawl configuration&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;b&gt;Note:&lt;&#x2F;b&gt; For testing, the crawler can be configured with command line options only. However, this setup isn&#x27;t recommended for long-term deployments.&lt;&#x2F;p&gt;
&lt;p&gt;Accepted options are:&lt;&#x2F;p&gt;




	
		
			
		
	
		
			
			
				
			
		
	
		
			
			
				
			
		
	
		
	
		
			
		
	
		
			
			
		
	
		
			
			
		
	
		
	
		
			
		
	
		
			
			
		
	
		
			
			
		
	
		
	
		
			
		
	
		
			
			
		
	
		
			
			
		
	
		
	
		
			
		
	
		
			
			
		
	
		
			
			
		
	
		
	
		
			
		
	
		
			
			
		
	
		
			
			
		
	
		
	
		
			
		
	
		
			
			
		
	
		
	
		
			
		
	
		
			
			
		
	
		
	
		
			
		
	
		
			
			
		
	
		
			
			
		
	
		
	
		
			
		
	
		
			
			
		
	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt id=&quot;crawl-c-database&quot;&gt;
		&lt;code&gt;-c&lt;&#x2F;code&gt;, &lt;code&gt;--database&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;file&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		The crawler database file to store the crawl results in.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		If the file doesn&#x27;t exist yet, it will be created.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;crawl-u-user-agent&quot;&gt;
		&lt;code&gt;-u&lt;&#x2F;code&gt;, &lt;code&gt;--user-agent&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;user_agent&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Set the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;crawler-user-agent-and-robots-txt&#x2F;&quot;&gt;user agent&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Overrides the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;crawler-crawl-configuration&#x2F;#option-user-agent&quot;&gt;&lt;code&gt;user_agent&lt;&#x2F;code&gt; setting from the configuration file&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;crawl-w-worker-name&quot;&gt;
		&lt;code&gt;-w&lt;&#x2F;code&gt;, &lt;code&gt;--worker-name&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;name&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Set the worker name to be logged to the database.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Default is &lt;code&gt;ant&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;crawl-m-max-commands&quot;&gt;
		&lt;code&gt;-m&lt;&#x2F;code&gt;, &lt;code&gt;--max-commands&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;number&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		The maximum number of commands to process in this run
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Overrides the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;crawler-crawl-configuration&#x2F;#option-max-commands-per-run&quot;&gt;&lt;code&gt;max_commands_per_run&lt;&#x2F;code&gt; setting from the configuration file&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;crawl-d-default-delay&quot;&gt;
		&lt;code&gt;-d&lt;&#x2F;code&gt;, &lt;code&gt;--default-delay&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;milliseconds&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		The default wait time between requests.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Overrides the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;crawler-crawl-configuration&#x2F;#option-default-delay-ms&quot;&gt;&lt;code&gt;default_delay_ms&lt;&#x2F;code&gt; setting from the configuration file&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;crawl-schedule&quot;&gt;
		&lt;code&gt;--schedule&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;url&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Manually schedule a seed URL.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Overrides the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;crawler-crawl-configuration&#x2F;#option-seeds&quot;&gt;&lt;code&gt;seeds&lt;&#x2F;code&gt; setting from the configuration file&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;crawl-config&quot;&gt;
		&lt;code&gt;--config&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;path&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Specify a path to the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;crawler-crawl-configuration&#x2F;&quot;&gt;configuration file&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;crawl-policy-file&quot;&gt;
		&lt;code&gt;--policy-file&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;path&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Specify a path to a policy configuration file. It can contain additional &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;crawler-crawl-configuration&#x2F;#policies&quot;&gt;policies with the same notation as in the configuration file&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;crawl-force-recrawl&quot;&gt;
		&lt;code&gt;--force-recrawl&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Ignore when pages were last crawled and schedule them for recrawling immediately.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Overrides the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;crawler-crawl-configuration&#x2F;#option-recrawl-interval&quot;&gt;&lt;code&gt;recrawl_interval&lt;&#x2F;code&gt; setting from the configuration file&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt id=&quot;crawl-ignore-db-name-from-config&quot;&gt;
		&lt;code&gt;--ignore-db-name-from-config&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Ignore the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;crawler-crawl-configuration&#x2F;#option-database-file&quot;&gt;&lt;code&gt;database_file&lt;&#x2F;code&gt; setting from the configuration file&lt;&#x2F;a&gt;.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;
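
&lt;p&gt;A separate file passed with &lt;code&gt;--policy-file&lt;&#x2F;code&gt; might look like the following sketch. It assumes the file holds only policy tables in the notation of the configuration file; the file name &lt;code&gt;extra-policies.toml&lt;&#x2F;code&gt;, the host and the path are hypothetical.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;toml&quot; class=&quot;language-toml &quot;&gt;&lt;code class=&quot;language-toml&quot; data-lang=&quot;toml&quot;&gt;# extra-policies.toml (hypothetical file name)
[[do_not_crawl]]
reason = &amp;quot;Do not crawl the search result pages&amp;quot;
if.and = [
	{ url.host.equals = &amp;quot;example.org&amp;quot; },
	{ url.path.has_prefix = &amp;quot;&amp;#x2F;search&amp;quot; },
]
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;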

&lt;h2 id=&quot;summarize&quot;&gt;&lt;code&gt;summarize&lt;&#x2F;code&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;The summarize subcommand takes a crawler database and integrates it into a summary database using the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;summarizing&#x2F;&quot;&gt;Summarizing algorithm&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Accepted options are:&lt;&#x2F;p&gt;




	
		
			
		
	
		
			
			
				
			
		
	
		
	
		
			
		
	
		
			
			
		
	
		
			
			
				
			
		
	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;-c&lt;&#x2F;code&gt;, &lt;code&gt;--crawler-db&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;file&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Database file of the crawler database.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;-s&lt;&#x2F;code&gt;, &lt;code&gt;--summary-db&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;file&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Database file of the summary database.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		If the file doesn&#x27;t exist yet, it will be created.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h2 id=&quot;optimize-db&quot;&gt;&lt;code&gt;optimize-db&lt;&#x2F;code&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Runs the SQLite internal analyze and optimize commands on the given crawler or summary database.&lt;&#x2F;p&gt;
&lt;p&gt;Accepted options are:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;-c&lt;&#x2F;code&gt;, &lt;code&gt;--database&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;file&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		The database file to optimize.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h2 id=&quot;regenerate-token-index&quot;&gt;&lt;code&gt;regenerate-token-index&lt;&#x2F;code&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Regenerates the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;token-index&#x2F;&quot;&gt;experimental token based index&lt;&#x2F;a&gt; for use with the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;query-syntax&#x2F;&quot;&gt;&lt;code&gt;token:&lt;&#x2F;code&gt; filter&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Accepted options are:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;-s&lt;&#x2F;code&gt;, &lt;code&gt;--summary-database&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;file&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		The summary database to generate the token index for.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h2 id=&quot;delete-old-crawl-log-entries&quot;&gt;&lt;code&gt;delete old-crawl-log-entries&lt;&#x2F;code&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Delete old entries from the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-log&#x2F;&quot;&gt;crawl log&lt;&#x2F;a&gt; in the crawler database along with their associated data.&lt;&#x2F;p&gt;
&lt;p&gt;Accepted options are:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;-c&lt;&#x2F;code&gt;, &lt;code&gt;--crawler-db&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;file&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		The crawler database to delete crawl log entries from.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;--keep-latest&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;n&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		How many of the latest entries for each page to keep.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;--apply&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Actually apply the deletion instead of running a simulation.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h2 id=&quot;debug-indexiness&quot;&gt;&lt;code&gt;debug indexiness&lt;&#x2F;code&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Prints a breakdown of the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;calculating-indexiness&#x2F;&quot;&gt;indexiness calculation&lt;&#x2F;a&gt; for a given page.&lt;&#x2F;p&gt;
&lt;p&gt;Accepted options are:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;-c&lt;&#x2F;code&gt;, &lt;code&gt;--database&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;file&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		The crawler database to fetch the source data for the indexiness calculation from.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;-u&lt;&#x2F;code&gt;, &lt;code&gt;--url&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;url&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		The URL to run the calculation for. If there are multiple instances, the latest one will be used.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h2 id=&quot;debug-query-crawl-log&quot;&gt;&lt;code&gt;debug query-crawl-log&lt;&#x2F;code&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Query entries from the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-log&#x2F;&quot;&gt;crawl log&lt;&#x2F;a&gt; in the crawler database.&lt;&#x2F;p&gt;
&lt;p&gt;Accepted options are:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;-c&lt;&#x2F;code&gt;, &lt;code&gt;--crawler-db&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;file&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		The crawler database to query crawl log entries from.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;--uuid&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;uuid&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Query by crawl log entry UUID.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;--host&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;host&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Filter the results by hostname.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;--url&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;url&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Filter the results by URL.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;--exit-code&lt;&#x2F;code&gt; &lt;em&gt;&amp;lt;exit-code&amp;gt;&lt;&#x2F;em&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Filter the results by &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-exit-code&#x2F;&quot;&gt;exit-code&lt;&#x2F;a&gt;, both name and id are accepted.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h2 id=&quot;debug-sqlite-version&quot;&gt;&lt;code&gt;debug sqlite-version&lt;&#x2F;code&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;Prints the SQLite version that this version of unobtanium is using.&lt;&#x2F;p&gt;
&lt;p&gt;This command takes no options.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>About the unobtanium.rocks Crawler</title>
        <published>2025-07-10T00:00:00+00:00</published>
        <updated>2025-07-10T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/manual/about-the-unobtanium-rocks-crawler/"/>
        <id>https://doc.unobtanium.rocks/manual/about-the-unobtanium-rocks-crawler/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/manual/about-the-unobtanium-rocks-crawler/">&lt;h2 id=&quot;what-is-the-crawler-for&quot;&gt;What is the Crawler for?&lt;&#x2F;h2&gt;
&lt;p&gt;The crawler feeds the development instance of a fully standalone search engine over at &lt;a href=&quot;https:&#x2F;&#x2F;unobtanium.rocks&quot;&gt;https:&#x2F;&#x2F;unobtanium.rocks&lt;&#x2F;a&gt;. The resulting data is also used to test development versions of the unobtanium software. You can find the &lt;a href=&quot;https:&#x2F;&#x2F;codeberg.org&#x2F;unobtanium&#x2F;unobtanium&quot;&gt;source on Codeberg&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;how-will-the-crawler-behave&quot;&gt;How will the Crawler behave?&lt;&#x2F;h2&gt;
&lt;p&gt;The crawler runs on the same machine as the frontend with &lt;a href=&quot;https:&#x2F;&#x2F;codeberg.org&#x2F;unobtanium&#x2F;unobtanium-rocks-index&#x2F;src&#x2F;branch&#x2F;main&#x2F;index-config&quot;&gt;public configuration&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The user agent is &lt;code&gt;unobtanium.rocks (for https:&#x2F;&#x2F;unobtanium.rocks, {index} index)&lt;&#x2F;code&gt;, where &lt;code&gt;{index}&lt;&#x2F;code&gt; is the name of the configuration file, without the &lt;code&gt;.toml&lt;&#x2F;code&gt; suffix, that is responsible for the request. Each user agent is its own independent crawler process; the results are combined at the end.&lt;&#x2F;p&gt;
&lt;p&gt;In general the crawler starts at a configured seed page and follows links from there. For subsequent crawls it will usually remember the last crawl.&lt;&#x2F;p&gt;
&lt;p&gt;Crawls happen in runs that are started manually, usually every few weeks. During a run the crawler tries to discover new pages and recrawl existing ones.&lt;&#x2F;p&gt;
&lt;p&gt;Unobtanium is a fast, but polite crawler. It tries to crawl as fast as possible while still respecting the server&#x27;s limits.&lt;&#x2F;p&gt;
&lt;p&gt;The delay between requests will be at least the &lt;code&gt;Crawl-Delay&lt;&#x2F;code&gt; set in the &lt;code&gt;robots.txt&lt;&#x2F;code&gt; file (capped at 180 seconds) or a dynamically calculated minimum that is at least the time it took the server to respond. This way, if the server slows down, the crawler slows down too. The crawler also reacts to HTTP 429 responses. Details are on the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;crawl-delay&#x2F;&quot;&gt;crawl delay algorithm page&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;If the connection itself fails for whatever reason, the crawler will immediately send a second request, because failing connections are surprisingly common and usually work on the second try. Unobtanium calls these &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;patience-and-fluke-events&#x2F;&quot;&gt;Fluke events&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The crawler will stop once it runs into a preconfigured limit of operations&#x2F;requests per crawl run or when it finds no more crawlable pages.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;why-is-the-crawler-trying-to-access-private-resources&quot;&gt;Why is the Crawler trying to Access Private Resources?&lt;&#x2F;h2&gt;
&lt;p&gt;Most likely this is because it found a link that points there and no &lt;code&gt;robots.txt&lt;&#x2F;code&gt; rule exists to stop it.&lt;&#x2F;p&gt;
&lt;p&gt;Pages that result in a &lt;code&gt;4xx&lt;&#x2F;code&gt; status code will not be indexed, though the crawler will keep checking them on every recrawl as if they were dead links.&lt;&#x2F;p&gt;
&lt;p&gt;If the matter seems bigger than some &lt;code&gt;robots.txt&lt;&#x2F;code&gt; entries to you, please open an issue on the &lt;a href=&quot;https:&#x2F;&#x2F;codeberg.org&#x2F;unobtanium&#x2F;unobtanium-rocks-index&quot;&gt;index repository&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;robots-txt&quot;&gt;robots.txt&lt;&#x2F;h2&gt;
&lt;p&gt;In general the unobtanium crawler uses the part of its user agent up to the first space for matching the &lt;code&gt;robots.txt&lt;&#x2F;code&gt;. For the unobtanium.rocks crawler this is &lt;code&gt;unobtanium.rocks&lt;&#x2F;code&gt;, independent of which index the site is in.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;robots.txt&lt;&#x2F;code&gt; file is refetched at a regular interval (~30 minutes) while crawling, which makes it possible for web admins to directly stop or slow down the crawler in case it is going where it shouldn&#x27;t or is too fast.&lt;&#x2F;p&gt;
&lt;p&gt;The first request of every crawl will be to fetch the robots.txt file.&lt;&#x2F;p&gt;
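&lt;p&gt;As an illustration of the rules above, a &lt;code&gt;robots.txt&lt;&#x2F;code&gt; along the following lines should slow down the unobtanium.rocks crawler and keep it out of one directory. This is a sketch, not a recommendation: the 10 second delay and the &lt;code&gt;&#x2F;private&#x2F;&lt;&#x2F;code&gt; path are made-up placeholders, only the &lt;code&gt;unobtanium.rocks&lt;&#x2F;code&gt; user agent token comes from this page.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code&gt;# Applies to the unobtanium.rocks crawler, whichever index the site is in
User-agent: unobtanium.rocks
# Ask for at least 10 seconds between requests (placeholder value)
Crawl-delay: 10
# Keep the crawler out of this directory (placeholder path)
Disallow: &amp;#x2F;private&amp;#x2F;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Since the file is fetched again roughly every 30 minutes during a crawl, a change like this should take effect without waiting for the next crawl run.&lt;&#x2F;p&gt;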
&lt;h2 id=&quot;who-is-responsible-for-the-crawler&quot;&gt;Who is Responsible for the Crawler?&lt;&#x2F;h2&gt;
&lt;p&gt;Like the rest of unobtanium.rocks the unobtanium.rocks crawler is operated by &lt;a href=&quot;https:&#x2F;&#x2F;slatecave.net&#x2F;about&#x2F;me#contact&quot;&gt;Slatian&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;see-also&quot;&gt;See Also&lt;&#x2F;h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;crawl-loop&#x2F;&quot;&gt;Details about the crawling algorithm&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;crawler-user-agent-and-robots-txt&#x2F;&quot;&gt;Configuring the &lt;code&gt;User-Agent&lt;&#x2F;code&gt; for crawler Admins&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;unobtanium-crawler&#x2F;&quot;&gt;The crawler manual for Admins&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;codeberg.org&#x2F;unobtanium&#x2F;unobtanium-rocks-index&quot;&gt;The index configuration for unobtanium.rocks&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Configuring SearxNG to use Unobtanium</title>
        <published>2025-02-23T00:00:00+00:00</published>
        <updated>2025-02-23T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/manual/searxng/"/>
        <id>https://doc.unobtanium.rocks/manual/searxng/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/manual/searxng/">&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;searxng.org&quot;&gt;SearxNG&lt;&#x2F;a&gt; is a metasearch engine that can combine the results of multiple search engines.&lt;&#x2F;p&gt;
&lt;p&gt;To use an Unobtanium instance with SearxNG, one can combine the &lt;a href=&quot;https:&#x2F;&#x2F;docs.searxng.org&#x2F;dev&#x2F;engines&#x2F;json_engine.html&quot;&gt;&lt;code&gt;json_engine&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; module with the fact that all HTML views of unobtanium can be turned into an API by passing in &lt;code&gt;format=json&lt;&#x2F;code&gt; as an additional URL query parameter.&lt;&#x2F;p&gt;
&lt;p&gt;Add the following to the &lt;a href=&quot;https:&#x2F;&#x2F;docs.searxng.org&#x2F;admin&#x2F;settings&#x2F;settings_engines.html&quot;&gt;&lt;code&gt;engines:&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; list in the settings.yaml file of SearxNG and replace &lt;code&gt;unobtanium.rocks&lt;&#x2F;code&gt; with the domain name of your preferred instance for the &lt;code&gt;search_url&lt;&#x2F;code&gt; and &lt;code&gt;website&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;yaml&quot; class=&quot;language-yaml &quot;&gt;&lt;code class=&quot;language-yaml&quot; data-lang=&quot;yaml&quot;&gt;  - name: unobtanium
    engine: json_engine
    shortcut: uo
    paging: true
    search_url: https:&amp;#x2F;&amp;#x2F;unobtanium.rocks&amp;#x2F;search?format=json&amp;amp;search={query}&amp;amp;page={pageno}
    first_page_num: 0
    results_query: list
    title_query: title
    url_query: url
    content_query: description
    categories: [general, web]
    timeout: 3
    about:
      website: https:&amp;#x2F;&amp;#x2F;unobtanium.rocks
      use_official_api: true
      require_api_key: false
      results: JSON
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Depending on how big your database is and how fast your server is, you may want to adjust the &lt;code&gt;timeout&lt;&#x2F;code&gt; option. While unobtanium is pretty fast for most queries, it is possible to submit queries that take a long time to complete.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Base Database</title>
        <published>2025-02-09T00:00:00+00:00</published>
        <updated>2025-04-10T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/data/base-database/"/>
        <id>https://doc.unobtanium.rocks/data/base-database/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/data/base-database/">&lt;p&gt;The base database schema is shared between the crawler database and the summary database and provides basic facilities for storing data that is always needed in the context of a search engine.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;b&gt;Note:&lt;&#x2F;b&gt; While the base database schema is often referred to as &quot;The base database&quot;, it is never a standalone database. It is always embedded into another database.&lt;&#x2F;p&gt;
&lt;p&gt;Tables in the base database are:




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;unobtanium_database_info&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Used to store global database metadata.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;origin&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		URL Origins, queryable by component
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;url&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		URLs, queryable by component
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;mimetype&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Mimetypes&#x2F;Mediatypes
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;mime_parameter&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Extends the &lt;code&gt;mimetype&lt;&#x2F;code&gt; table with SQL queryable parameter information
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;&#x2F;p&gt;
&lt;h2 id=&quot;tables&quot;&gt;Tables&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;unobtanium-database-info&quot;&gt;&lt;code&gt;unobtanium_database_info&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;Added in version 0.1.0, the &lt;code&gt;unobtanium_database_info&lt;&#x2F;code&gt; table has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;key&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;value&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt;  &quot;Schemaless&quot; configuration data
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;p&gt;The main purpose of this table is storing global configuration information about the database and its schema.&lt;&#x2F;p&gt;
&lt;p&gt;See also: &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;database-info&#x2F;&quot;&gt;Database Info&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
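&lt;p&gt;To look at what a particular database stores in this table, a query along these lines can be used. It is a minimal sketch that assumes the database file is opened with an SQLite client such as the &lt;code&gt;sqlite3&lt;&#x2F;code&gt; shell and relies only on the two columns listed above; which keys actually show up depends on the database.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sql&quot; class=&quot;language-sql &quot;&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;-- List every metadata entry stored in the database
select key, value
from unobtanium_database_info
order by key;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;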
&lt;h3 id=&quot;origin&quot;&gt;&lt;code&gt;origin&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;code&gt;origin&lt;&#x2F;code&gt; table has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;origin_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;port&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; The port as specified in the URL or a well known default
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;scheme&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Varchar&lt;&#x2F;a&gt;  The scheme as specified in the URL
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;host&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Varchar&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; The hostname as specified in the URL, if present
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;str_origin&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;unique-badge&#x27;&gt;Unique&lt;&#x2F;span&gt; The origin as a string, as if it was a URL consisting only of &lt;code&gt;scheme&lt;&#x2F;code&gt;, &lt;code&gt;host&lt;&#x2F;code&gt; and &lt;code&gt;port&lt;&#x2F;code&gt;. Its main purpose is deduplication.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;s&gt;&lt;code&gt;crawl_delay_ms&lt;&#x2F;code&gt;&lt;&#x2F;s&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; Delay in milliseconds that the crawler should wait between requests, taken from robots.txt
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		&lt;strong&gt;Deprecated&lt;&#x2F;strong&gt; since 0.0.0: This is crawler specific information that shouldn&#x27;t be part of the base database.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		&lt;strong&gt;Removed&lt;&#x2F;strong&gt; in 0.1.0
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;p&gt;&lt;s&gt;There is a unique constraint across the fields &lt;code&gt;port&lt;&#x2F;code&gt;, &lt;code&gt;scheme&lt;&#x2F;code&gt; and &lt;code&gt;host&lt;&#x2F;code&gt;.&lt;&#x2F;s&gt; (Removed in Version 0.1.0)&lt;&#x2F;p&gt;
&lt;p&gt;See the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;url&#x2F;#origin&quot;&gt;Origin data type&lt;&#x2F;a&gt; for more information.&lt;&#x2F;p&gt;
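&lt;p&gt;As a small example of querying origins by component, the following sketch counts the known origins per scheme. It uses only the &lt;code&gt;scheme&lt;&#x2F;code&gt; column described above; the &lt;code&gt;origin_count&lt;&#x2F;code&gt; alias is just a name picked for this example.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sql&quot; class=&quot;language-sql &quot;&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;-- Count the known origins per URL scheme, most common first
select scheme, count(*) as origin_count
from origin
group by scheme
order by origin_count desc;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;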
&lt;h3 id=&quot;url&quot;&gt;&lt;code&gt;url&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;code&gt;url&lt;&#x2F;code&gt; table has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;url_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;origin_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  Reference into the &lt;code&gt;origin&lt;&#x2F;code&gt; table
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;path&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; Path part of the URL, if present
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;query&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; Query part of the URL, if present
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;username&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; Username part of the URL, if present
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;password&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; Password part of the URL, if present
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;str_url&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;unique-badge&#x27;&gt;Unique&lt;&#x2F;span&gt; String representation of the whole URL for easy retrieval
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;p&gt;The &lt;code&gt;url&lt;&#x2F;code&gt; table stores URLs that point to documents. Notably, it does not store the fragment part of a URL, since that points to somewhere inside a document.&lt;&#x2F;p&gt;
&lt;p&gt;Fragment fields are stored together with URL ids where appropriate.&lt;&#x2F;p&gt;
&lt;p&gt;See the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;url&#x2F;&quot;&gt;URL data type&lt;&#x2F;a&gt; for more information.&lt;&#x2F;p&gt;
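&lt;p&gt;Because every URL references its origin through &lt;code&gt;origin_id&lt;&#x2F;code&gt;, the two tables can be joined. The following sketch lists the origins with the most stored URLs; the limit of 20 and the &lt;code&gt;url_count&lt;&#x2F;code&gt; alias are arbitrary choices made for this example.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sql&quot; class=&quot;language-sql &quot;&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;-- Show the 20 origins that have the most URLs stored
select origin.str_origin, count(*) as url_count
from url
inner join origin
	on url.origin_id = origin.origin_id
group by origin.origin_id, origin.str_origin
order by url_count desc
limit 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;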
&lt;h3 id=&quot;mimetype&quot;&gt;&lt;code&gt;mimetype&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;code&gt;mimetype&lt;&#x2F;code&gt; table has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;mimetype_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;mime_type&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt;  The part before the &lt;code&gt;&#x2F;&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;mime_subtype&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt;  The part after the &lt;code&gt;&#x2F;&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;mime_suffix&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; The part after the &lt;code&gt;+&lt;&#x2F;code&gt; if present
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;charset&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;null-badge&#x27;&gt;Null&lt;&#x2F;span&gt; The &lt;code&gt;charset&lt;&#x2F;code&gt; parameter. This is in here since it is by far the most common and webservers love sending it.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;str_mimetype&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt; &lt;span class=&#x27;unique-badge&#x27;&gt;Unique&lt;&#x2F;span&gt; String representation of the whole mimetype with the parameters sorted in alphabetical order.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h3 id=&quot;mime-parameter&quot;&gt;&lt;code&gt;mime_parameter&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;code&gt;mime_parameter&lt;&#x2F;code&gt; table has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;mime_parameter_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt; &lt;span class=&#x27;primary-key-badge&#x27;&gt;Primary Key&lt;&#x2F;span&gt; Internal database id for the mime parameter, not used.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;mimetype_id&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#integer&#x27; &gt;Integer&lt;&#x2F;a&gt;  Reference into the &lt;code&gt;mimetype&lt;&#x2F;code&gt; table.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;key&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt;  Key of the mimetype parameter
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;value&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&#x27;&#x2F;data&#x2F;database-data-types#text&#x27; &gt;Text&lt;&#x2F;a&gt;  Value of the mimetype parameter
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;
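&lt;p&gt;To illustrate how &lt;code&gt;mime_parameter&lt;&#x2F;code&gt; extends &lt;code&gt;mimetype&lt;&#x2F;code&gt;, the following sketch lists the parameters recorded for all HTML mimetypes. It uses only the columns documented above; the filter on &lt;code&gt;text&lt;&#x2F;code&gt; and &lt;code&gt;html&lt;&#x2F;code&gt; is just an example value.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sql&quot; class=&quot;language-sql &quot;&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;-- List the parameters recorded for every text&amp;#x2F;html mimetype
select mimetype.str_mimetype, mime_parameter.key, mime_parameter.value
from mimetype
inner join mime_parameter
	on mime_parameter.mimetype_id = mimetype.mimetype_id
where mimetype.mime_type = &amp;#x27;text&amp;#x27;
	and mimetype.mime_subtype = &amp;#x27;html&amp;#x27;
order by mimetype.str_mimetype, mime_parameter.key;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;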

&lt;h2 id=&quot;version-history&quot;&gt;Version history&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;0-0-0-the-last-unversioned&quot;&gt;0.0.0 - The last unversioned&lt;&#x2F;h3&gt;
&lt;p&gt;Version 0.0.0 represents the last unversioned version of the database schema.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;0-1-0-basic-cleanup-2025-04-10&quot;&gt;0.1.0 - Basic cleanup (2025-04-10)&lt;&#x2F;h3&gt;
&lt;p&gt;Version 0.1.0 cleans up some historic design choices.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;For the &lt;code&gt;origin&lt;&#x2F;code&gt; table add a unique &lt;code&gt;str_origin&lt;&#x2F;code&gt; and remove the unique constraint from &lt;code&gt;scheme&lt;&#x2F;code&gt;, &lt;code&gt;host&lt;&#x2F;code&gt; and &lt;code&gt;port&lt;&#x2F;code&gt;, which had some &lt;a href=&quot;https:&#x2F;&#x2F;codeberg.org&#x2F;unobtanium&#x2F;unobtanium&#x2F;issues&#x2F;15&quot;&gt;edge case problems because of &lt;code&gt;host&lt;&#x2F;code&gt; and &lt;code&gt;port&lt;&#x2F;code&gt; being nullable&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Remove the &lt;code&gt;crawl_delay_ms&lt;&#x2F;code&gt; column from the &lt;code&gt;origin&lt;&#x2F;code&gt; table. It stems from a very early version of the crawler and is unused.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Crawl Summary</title>
        <published>2025-01-06T00:00:00+00:00</published>
        <updated>2025-01-06T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/data/crawl-summary/"/>
        <id>https://doc.unobtanium.rocks/data/crawl-summary/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/data/crawl-summary/">&lt;p&gt;This document describes the crawl summary data structure stored in the summary database and is part of the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;entity-data-tree&#x2F;&quot;&gt;entity data tree&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;It is &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;summarizing&#x2F;#integrate-crawl-information&quot;&gt;almost directly derived&lt;&#x2F;a&gt; from the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-log&#x2F;&quot;&gt;crawl log&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;A crawl summary has the following fields:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;crawl_type&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Which kind of crawl was performed. (&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-log&#x2F;#crawl-types&quot;&gt;Same as in the crawl log&lt;&#x2F;a&gt;)
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;crawl_uuid&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		UUID to identify a crawl across databases, taken from the crawl log.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;crawl_time&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Taken from &lt;code&gt;time_started&lt;&#x2F;code&gt; in the crawl log.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;agent_uuid&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		The UUID identifying the agent that did the crawling.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;exit_code&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		The &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-exit-code&#x2F;&quot;&gt;crawl exit code&lt;&#x2F;a&gt; that summarizes the outcome of the crawl.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;server_last_modified&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		When the resource was last modified according to the server (UTC timestamp)
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;request_duration_ms&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		How long the request took in milliseconds
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;was_robotstxt_approved&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Whether the request was approved by robots.txt.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;http&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		One optional http summary
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;p&gt;The URL is not part of this data structure and is assumed to match the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;entity-generation&#x2F;&quot;&gt;entity generation&lt;&#x2F;a&gt; URL.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;http-summary&quot;&gt;HTTP Summary&lt;&#x2F;h2&gt;
&lt;p&gt;The http summary fields are:&lt;&#x2F;p&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;status_code&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		The &lt;a href=&quot;https:&#x2F;&#x2F;developer.mozilla.org&#x2F;en-US&#x2F;docs&#x2F;Web&#x2F;HTTP&#x2F;Status&quot;&gt;status code&lt;&#x2F;a&gt; returned by the server
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;etag&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Optional, &lt;a href=&quot;https:&#x2F;&#x2F;developer.mozilla.org&#x2F;en-US&#x2F;docs&#x2F;Web&#x2F;HTTP&#x2F;Headers&#x2F;ETag&quot;&gt;ETag&lt;&#x2F;a&gt; returned by the server
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Useful database queries</title>
        <published>2025-01-06T00:00:00+00:00</published>
        <updated>2025-01-06T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/manual/useful-database-queries/"/>
        <id>https://doc.unobtanium.rocks/manual/useful-database-queries/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/manual/useful-database-queries/">&lt;h2 id=&quot;count-active-entity-generations-in-summary-database&quot;&gt;Count Active entity generations in summary database&lt;&#x2F;h2&gt;
&lt;p&gt;This query counts all &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;entity-generation&#x2F;&quot;&gt;entity generations&lt;&#x2F;a&gt; that are not marked as duplicate, have not been marked as EOL yet, have a non-empty title and have at least a bit of text or secondary text in their &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;text-pile-gen1&#x2F;&quot;&gt;text pile generation 1&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sql&quot; class=&quot;language-sql &quot;&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;select count(*)
from entity_generation
inner join document_description
	on entity_generation.entity_generation_id = document_description.entity_generation_id
inner join text_pile
	on entity_generation.text_pile_id = text_pile.text_pile_id
where (
	(NOT entity_generation.marked_duplicate) AND
	entity_generation.confirmed_end_unix_utc is NULL AND
	(document_description.title != &amp;#x27;&amp;#x27;) AND
	(text_pile.text != &amp;#x27;&amp;#x27; OR text_pile.secondary_text != &amp;#x27;&amp;#x27;)
);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h2 id=&quot;count-and-delete-unused-tokens-in-token-index&quot;&gt;Count and delete unused tokens in token index&lt;&#x2F;h2&gt;
&lt;p&gt;This is for the experimental &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;token-index&#x2F;&quot;&gt;token index&lt;&#x2F;a&gt; that is generated using the &lt;code&gt;regenerate-token-index&lt;&#x2F;code&gt; on the crawler.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sql&quot; class=&quot;language-sql &quot;&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;select count(*)
from token
where not exists (
	select 1
	from token_statistics
	where token_statistics.token_id = token.token_id
	limit 1
);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And to delete them:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sql&quot; class=&quot;language-sql &quot;&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;delete
from token
where not exists (
	select 1
	from token_statistics
	where token_statistics.token_id = token.token_id
	limit 1
);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Crawl exit code</title>
        <published>2024-12-22T00:00:00+00:00</published>
        <updated>2025-08-10T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/data/crawl-exit-code/"/>
        <id>https://doc.unobtanium.rocks/data/crawl-exit-code/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/data/crawl-exit-code/">&lt;p&gt;Exit codes enumerate the outcome of a crawling request on a high level. Each one has a fixed integer id. Referring to them by name should be preferred; the integer representation is mainly for compact storage in databases.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;b&gt;Note:&lt;&#x2F;b&gt; Some of those exit codes are from a time when unobtanium had a very different and more database coupled architecture. They are unused now, but remain here to keep the number reserved.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Id&lt;&#x2F;th&gt;&lt;th&gt;Name&lt;&#x2F;th&gt;&lt;th&gt;Description&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;-3&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;database_error&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;The crawl couldn&#x27;t be finished because of a database error&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;-2&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;cancelled&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;The crawl was cancelled mid-way (it would also be okay to discard those)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;-1&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;someone_stole_my_work&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;If a race condition was detected in a task queue&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;20&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;file_ingested&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;A file was crawled and the content ingested&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;29&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;file_of_unknown_type&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;A file was found at the requested location, but the crawler doesn&#x27;t know what to do&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;31&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;permanent_redirect&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;The server indicated a redirect and hinted that it isn&#x27;t going away soon.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;32&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;redirect&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;The server responded with a redirect, this may change in the future&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;34&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;file_did_not_change&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;The server communicated that the file did not change since the last request&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;40&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;server_blamed_client&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;The server response indicated that the client&#x27;s request was wrong, this includes authentication errors&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;41&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;file_gone&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;The server indicated that the requested resource isn&#x27;t available and won&#x27;t come back&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;42&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;did_not_understand_answer&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;The crawler couldn&#x27;t receive the server&#x27;s answer because of a protocol error or encoding issue&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;44&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;file_not_found&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;The server communicated that there is no resource at the requested location, this may change in the future&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;49&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;rate_limited&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;The server communicated that the crawler was going too fast&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;50&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;server_internal_error&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;The server couldn&#x27;t answer because of an internal error&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;100&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;connection_failed&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Could not connect to the server&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;101&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;request_timeout&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;The server took too long to respond to the request.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;102&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;error_reading_response&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;A problem occurred while reading the server response (e.g. unexpected connection termination)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;170&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;blocked_by_robots_txt&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;At the request of the server&#x27;s &lt;code&gt;robots.txt&lt;&#x2F;code&gt; the resource wasn&#x27;t crawled&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;171&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;blocked_at_request_of_remote&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;The server requested to not crawl the given resource&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;172&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;blocked_origin_by_local_policy&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Crawling was blocked by a local policy on the origin level&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;173&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;blocked_url_by_local_policy&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Crawling was blocked by a local policy on the url level&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;174&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;blocked_by_challenge&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;The server returned a challenge&#x2F;captcha page of some sort&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;180&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;not_canonical&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;The content was discarded because it marked itself as not being canonical&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;181&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;duplicate&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;The resource was found to be a duplicate by the crawler or a post-processing stage&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;-999&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;unknown_error&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Placeholder for errors that don&#x27;t have an exit code assigned yet.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;h2 id=&quot;redirects&quot;&gt;Redirects&lt;&#x2F;h2&gt;
&lt;p&gt;The codes &lt;code&gt;redirect&lt;&#x2F;code&gt; and &lt;code&gt;permanent_redirect&lt;&#x2F;code&gt; represent redirects.&lt;&#x2F;p&gt;
&lt;p&gt;They may have meaningful metadata about the destination of the redirect attached.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;blocked&quot;&gt;Blocked&lt;&#x2F;h2&gt;
&lt;p&gt;The codes &lt;code&gt;blocked_by_robots_txt&lt;&#x2F;code&gt;, &lt;code&gt;blocked_at_request_of_remote&lt;&#x2F;code&gt;, &lt;code&gt;blocked_origin_by_local_policy&lt;&#x2F;code&gt;, &lt;code&gt;blocked_url_by_local_policy&lt;&#x2F;code&gt; represent crawls where either no request was sent or the returned data was discarded because the resource indicated that it didn&#x27;t want to be indexed.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;blocked_by_challenge&lt;&#x2F;code&gt; code indicates that the crawler ran into a challenge&#x2F;captcha page that is likely intended to keep bad bots out. This is unfortunate and not a clear signal. (The crawler should be stopped by &lt;code&gt;robots.txt&lt;&#x2F;code&gt; before letting it run into a challenge page.)&lt;&#x2F;p&gt;
&lt;h2 id=&quot;could-be-a-fluke&quot;&gt;Could be a Fluke&lt;&#x2F;h2&gt;
&lt;p&gt;The codes &lt;code&gt;unknown_error&lt;&#x2F;code&gt;, &lt;code&gt;connection_failed&lt;&#x2F;code&gt;, &lt;code&gt;request_timeout&lt;&#x2F;code&gt; and &lt;code&gt;error_reading_response&lt;&#x2F;code&gt; can all be caused by temporary networking problems and are therefore worth retrying immediately &lt;strong&gt;once&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;patience-and-fluke-events&#x2F;&quot;&gt;See the Fluke Event concept for more information.&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
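&lt;p&gt;The retry-once rule can be sketched as follows. This is a minimal illustration, not the crawler&#x27;s actual code: the exit code names come from the table above, while &lt;code&gt;crawl_with_fluke_retry&lt;&#x2F;code&gt; and the &lt;code&gt;crawl&lt;&#x2F;code&gt; callable are hypothetical names introduced for this example.&lt;&#x2F;p&gt;

```python
# Exit code names are taken from the table above.
# The function and its `crawl` parameter are hypothetical, for illustration only.
FLUKE_CODES = frozenset({
    "unknown_error",
    "connection_failed",
    "request_timeout",
    "error_reading_response",
})


def crawl_with_fluke_retry(crawl, url):
    """Run `crawl(url)` and retry immediately, exactly once, on a possible fluke.

    `crawl` is assumed to be a callable that returns an exit code name.
    Returns the list of exit codes observed, in order.
    """
    outcomes = [crawl(url)]
    if outcomes[0] in FLUKE_CODES:
        # A second fluke-like result is kept as-is, there is no third attempt.
        outcomes.append(crawl(url))
    return outcomes
```

&lt;p&gt;Returning every observed outcome rather than only the last one keeps the sketch compatible with an append-only crawl log, where both attempts would be recorded.&lt;&#x2F;p&gt;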
&lt;h2 id=&quot;contentful&quot;&gt;Contentful&lt;&#x2F;h2&gt;
&lt;p&gt;The following codes are considered &lt;i&gt;contentful&lt;&#x2F;i&gt;:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;file_ingested&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;file_of_unknown_type&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;permanent_redirect&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;redirect&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;file_not_found&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;file_gone&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;not_canonical&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;duplicate&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;These have in common that from this request the crawler explicitly learned something about a resource or its absence.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;file_did_not_change&lt;&#x2F;code&gt; code is excluded because the crawler explicitly didn&#x27;t learn anything new from it.&lt;&#x2F;p&gt;
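&lt;p&gt;As a small sketch, the contentful classification is a plain set membership test. The code names are the ones listed above; &lt;code&gt;CONTENTFUL_CODES&lt;&#x2F;code&gt; and &lt;code&gt;is_contentful&lt;&#x2F;code&gt; are hypothetical names used only for this example.&lt;&#x2F;p&gt;

```python
# Exit code names from the list above; the helper itself is hypothetical.
CONTENTFUL_CODES = frozenset({
    "file_ingested",
    "file_of_unknown_type",
    "permanent_redirect",
    "redirect",
    "file_not_found",
    "file_gone",
    "not_canonical",
    "duplicate",
})


def is_contentful(exit_code_name):
    """Return True if the crawl taught the crawler something about the resource."""
    return exit_code_name in CONTENTFUL_CODES
```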
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Crawl Log</title>
        <published>2024-12-22T00:00:00+00:00</published>
        <updated>2024-12-22T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/data/crawl-log/"/>
        <id>https://doc.unobtanium.rocks/data/crawl-log/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/data/crawl-log/">&lt;p&gt;The crawl log is an append-only &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawler-database&#x2F;#crawl-log&quot;&gt;table in the crawler database&lt;&#x2F;a&gt; that stores which URL was crawled when, how long that crawl took and what the outcome was.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;crawl-types&quot;&gt;Crawl Types&lt;&#x2F;h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Id&lt;&#x2F;th&gt;&lt;th&gt;Name&lt;&#x2F;th&gt;&lt;th&gt;Description&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;file_crawl&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;File crawl for general purpose indexing&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;robotstxt_fetch&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Fetching of &lt;code&gt;robots.txt&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;metadata_crawl&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;HEAD&lt;&#x2F;code&gt;-only request, no content was indexed&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Unobtanium Overview</title>
        <published>2024-12-07T00:00:00+00:00</published>
        <updated>2024-12-07T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/manual/overview/"/>
        <id>https://doc.unobtanium.rocks/manual/overview/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/manual/overview/">&lt;h2 id=&quot;components&quot;&gt;Components&lt;&#x2F;h2&gt;
&lt;p&gt;Unobtanium is made up of:&lt;&#x2F;p&gt;
&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;unobtanium&lt;&#x2F;code&gt; (lib-unobtanium)
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Main application library that implements most data-structures and database logic.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;criterium&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Query framework for matching Data in memory and in the DB against criteria.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;criterium&quot;&gt;API Documentation&lt;&#x2F;a&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;unobtanium-crawler&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Web Crawling and summarizing application of Unobtanium.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;unobtanium-crawler&#x2F;&quot;&gt;Manual&lt;&#x2F;a&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;unobtanium-viewer&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Web frontend for querying an Unobtanium summary database.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;unobtanium-viewer&#x2F;&quot;&gt;Manual&lt;&#x2F;a&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		crawler database
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Database schema optimized for crawling.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawler-database&#x2F;&quot;&gt;Schema Documentation&lt;&#x2F;a&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		summary database
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Database schema optimized for querying&#x2F;searching.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;summary-database&#x2F;&quot;&gt;Schema Documentation&lt;&#x2F;a&gt;
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h2 id=&quot;data-pipeline&quot;&gt;Data Pipeline&lt;&#x2F;h2&gt;
&lt;p&gt;The main Unobtanium data pipeline consists of three steps:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;crawl-loop&#x2F;&quot;&gt;Crawling&lt;&#x2F;a&gt; (Web to crawler database)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;summarizing&#x2F;&quot;&gt;Summarizing&lt;&#x2F;a&gt; (crawler database to summary database)&lt;&#x2F;li&gt;
&lt;li&gt;Querying&#x2F;Searching (summary database to curious Creature) (&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;query-syntax&#x2F;&quot;&gt;Query Syntax&lt;&#x2F;a&gt;)&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Each step is independent of the previous one, so no huge setup is needed to get Unobtanium working.&lt;&#x2F;p&gt;
&lt;p&gt;Crawling and summarizing are decoupled to make iterating on code and configuration easier, as summarizing is a quite complex step.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Duplicate Types</title>
        <published>2024-11-18T00:00:00+00:00</published>
        <updated>2024-11-18T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/concept/duplicate-types/"/>
        <id>https://doc.unobtanium.rocks/concept/duplicate-types/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/concept/duplicate-types/">&lt;p&gt;In general, duplicates are different files that have the same main content, that is, the same content relevant for indexing.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;exact-duplicates&quot;&gt;Exact Duplicates&lt;&#x2F;h2&gt;
&lt;p&gt;Exact duplicates are detected using the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;duplicate-types&#x2F;@data&#x2F;text_pile.md#digest&quot;&gt;Text Pile Digest&lt;&#x2F;a&gt;, they are byte for byte duplicates of the original. Link destinations and Formatting (unless relevant for the Text Pile) are ignored.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;self-duplicates&quot;&gt;Self-Duplicates&lt;&#x2F;h2&gt;
&lt;p&gt;Self-Duplicates are a special case of exact duplicates where the content duplicates content that came from the same URL earlier. If it matches exactly the previous version, unobtanium knows that the main content of the page hasn&#x27;t changed, even if the server didn&#x27;t explicitly say so.&lt;&#x2F;p&gt;
&lt;p&gt;!note: &lt;b&gt;Important quirk:&lt;&#x2F;b&gt;
Self-Duplicates aren&#x27;t treated as duplicates, but rather as &quot;content didn&#x27;t change&quot;.&lt;&#x2F;p&gt;
&lt;p&gt;They are not recorded as duplicates.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Summarizing</title>
        <published>2024-11-14T00:00:00+00:00</published>
        <updated>2025-11-28T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/algorithm/summarizing/"/>
        <id>https://doc.unobtanium.rocks/algorithm/summarizing/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/algorithm/summarizing/">&lt;p&gt;&lt;b&gt;Note:&lt;&#x2F;b&gt; The summarizing algorithm is built into the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;unobtanium-crawler&#x2F;#summarize&quot;&gt;&lt;code&gt;unobtanium-crawler summarize&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; command.&lt;&#x2F;p&gt;
&lt;p&gt;The summarizing algorithm takes a crawl database and turns it into the summary database. It processes all files that are to be available for search later.&lt;&#x2F;p&gt;
&lt;p&gt;This is done by iterating all &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-log&#x2F;&quot;&gt;crawl log entries&lt;&#x2F;a&gt; and then deciding what to do with them.&lt;&#x2F;p&gt;
&lt;p&gt;For easier understanding the algorithm is described as if every file was processed individually, in reality it is &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;database-batching-pattern&#x2F;&quot;&gt;batched&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The algorithm is implemented in &lt;a href=&quot;https:&#x2F;&#x2F;codeberg.org&#x2F;unobtanium&#x2F;unobtanium&#x2F;src&#x2F;branch&#x2F;main&#x2F;crawler&#x2F;src&#x2F;summarizer&#x2F;&quot;&gt;crawler&#x2F;src&#x2F;summarizer&lt;&#x2F;a&gt; as part of the crawler.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;&#x2F;h2&gt;
&lt;ul&gt;
&lt;li&gt;Fetch file information from the crawl database&lt;&#x2F;li&gt;
&lt;li&gt;Cross check with the summary database to only integrate non-integrated entities&lt;&#x2F;li&gt;
&lt;li&gt;Turn the raw information from the crawler into scrape results&lt;&#x2F;li&gt;
&lt;li&gt;Detect &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;duplicate-types&#x2F;#self-duplicates&quot;&gt;self-duplicates&lt;&#x2F;a&gt; and generate candidates for &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;duplicate-types&#x2F;#exact-duplicates&quot;&gt;exact duplicate&lt;&#x2F;a&gt; detection&lt;&#x2F;li&gt;
&lt;li&gt;Turn self duplicates into &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-summary&#x2F;&quot;&gt;crawl summaries&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Generate metadata for all non self-duplicates:
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-summary&#x2F;&quot;&gt;Crawl summaries&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;entity-generation&#x2F;&quot;&gt;Entity generations&lt;&#x2F;a&gt; to create&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;entity-generation&#x2F;&quot;&gt;Entity generations&lt;&#x2F;a&gt; to close&lt;&#x2F;li&gt;
&lt;li&gt;Link summaries&lt;&#x2F;li&gt;
&lt;li&gt;File summaries&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;Store all data derived from both self-duplicates and non self-duplicates into the summary database.&lt;&#x2F;li&gt;
&lt;li&gt;Flag &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;duplicate-types&#x2F;#exact-duplicates&quot;&gt;exact duplicates&lt;&#x2F;a&gt; using the duplicate candidates from earlier&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;steps-in-detail&quot;&gt;Steps in Detail&lt;&#x2F;h2&gt;
&lt;p&gt;Fetch &lt;code&gt;file_info&lt;&#x2F;code&gt; from the crawl database.&lt;&#x2F;p&gt;
&lt;p&gt;Fetch the corresponding &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-log&#x2F;&quot;&gt;&lt;code&gt;crawl_log_entry&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Test if the summary database already has a &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-summary&#x2F;&quot;&gt;crawl summary&lt;&#x2F;a&gt; matching the &lt;code&gt;crawl_log_entry.uuid&lt;&#x2F;code&gt;. If yes, that file has already been summarized in a previous run and the algorithm &lt;strong&gt;skips the file&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;self-duplicate-detection&quot;&gt;Self-Duplicate Detection&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;b&gt;Note:&lt;&#x2F;b&gt; At this point a lot of independent things will be started concurrently. This is mostly done to keep the batch iteration count low.&lt;&#x2F;p&gt;
&lt;p&gt;Derive the &lt;code&gt;file_summary&lt;&#x2F;code&gt; (&lt;code&gt;text_pile&lt;&#x2F;code&gt; (deprecated) + &lt;code&gt;document_description&lt;&#x2F;code&gt;), &lt;code&gt;text_pile_ng&lt;&#x2F;code&gt; and &lt;code&gt;link_summaries&lt;&#x2F;code&gt;. See &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;summarizing&#x2F;#deriving-content&quot;&gt;Deriving Content&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;b&gt;If the summary database already contains a self duplicate Entity generation:&lt;&#x2F;b&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Use its UUID as the &lt;code&gt;entity_generation_uuid&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Add it to the mapping from &lt;code&gt;file_id&lt;&#x2F;code&gt; to &lt;code&gt;entity_generation_uuids&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;b&gt;Otherwise, not a self-duplicate:&lt;&#x2F;b&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Generate a new &lt;code&gt;entity_generation_uuid&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Remember URL and text pile digest for duplicate detection later&lt;&#x2F;li&gt;
&lt;li&gt;Remember &lt;code&gt;file_summary&lt;&#x2F;code&gt; and &lt;code&gt;link_summaries&lt;&#x2F;code&gt; for adding to the database later.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;b&gt;Endif&lt;&#x2F;b&gt;&lt;&#x2F;p&gt;
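&lt;p&gt;The branch above can be sketched as follows. This is a simplified illustration under stated assumptions, not the summarizer&#x27;s implementation: &lt;code&gt;resolve_entity_generation&lt;&#x2F;code&gt; is a hypothetical helper, the summary database lookup is stood in for by a plain dict from URL to the UUID and digest of the latest known generation, and the use of random (version 4) UUIDs is an assumption.&lt;&#x2F;p&gt;

```python
import uuid


def resolve_entity_generation(url, digest, known_generations):
    """Decide whether a freshly summarized file is a self-duplicate.

    `known_generations` is a hypothetical stand-in for the summary database:
    a dict mapping a URL to a (uuid, digest) tuple of its latest generation.
    Returns (entity_generation_uuid, is_self_duplicate).
    """
    previous = known_generations.get(url)
    if previous is not None and previous[1] == digest:
        # Same URL, same digest: reuse the existing generation.
        return previous[0], True
    # Otherwise a new generation is started (UUID version is an assumption).
    return str(uuid.uuid4()), False
```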
&lt;h3 id=&quot;generate-entity-generations&quot;&gt;Generate Entity Generations&lt;&#x2F;h3&gt;
&lt;p&gt;Fetch &lt;code&gt;request_info&lt;&#x2F;code&gt; for the file.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;b&gt;If not a self duplicate:&lt;&#x2F;b&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Generate a new &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;entity-generation&#x2F;&quot;&gt;&lt;code&gt;entity_generation&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; from the data collected so far:&lt;&#x2F;p&gt;
&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;url&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		taken from &lt;code&gt;file_info&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;uuid&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		The &lt;code&gt;entity_generation_uuid&lt;&#x2F;code&gt; generated earlier.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;first_seen&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Taken from when the request was started from &lt;code&gt;request_info&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;last_seen&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Same as &lt;code&gt;first_seen&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;time_end_confirmed&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Set to &lt;code&gt;None&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;marked_duplicate&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Set to &lt;code&gt;false&lt;&#x2F;code&gt; (innocent until proven guilty)
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;p&gt;Store &lt;code&gt;entity_generation&lt;&#x2F;code&gt; into the database.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;b&gt;Note:&lt;&#x2F;b&gt; From this point on, other records in the database can be connected to the entity generation through its &lt;code&gt;entity_generation_uuid&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Close any old Entity generation based on URL and the &lt;code&gt;first_seen&lt;&#x2F;code&gt; time.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;b&gt;Endif&lt;&#x2F;b&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Store &lt;code&gt;file_summary&lt;&#x2F;code&gt; and &lt;code&gt;link_summaries&lt;&#x2F;code&gt; into the database.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;integrate-crawl-information&quot;&gt;Integrate Crawl Information&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;b&gt;If there is a request associated with the file:&lt;&#x2F;b&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-summary&#x2F;&quot;&gt;&lt;code&gt;crawl_summary&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; is generated from the information in &lt;code&gt;request_info&lt;&#x2F;code&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-log&#x2F;&quot;&gt;&lt;code&gt;crawl_log_entry&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;crawl_time&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		taken from &lt;code&gt;request_info&lt;&#x2F;code&gt; time sent
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;crawl_uuid&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		taken from &lt;code&gt;request_info&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;agent_uuid&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		resolved from the &lt;code&gt;crawl_log_entry&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;crawl_type&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		taken from the &lt;code&gt;crawl_log_entry&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;was_robotstxt_approved&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		taken from &lt;code&gt;request_info&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;server_last_modified&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		taken from &lt;code&gt;request_info&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;exit_code&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		taken from &lt;code&gt;request_info&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;request_duration_ms&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		taken from &lt;code&gt;request_info&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;http&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		taken from &lt;code&gt;request_info&lt;&#x2F;code&gt;
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;p&gt;Store the &lt;code&gt;crawl_summary&lt;&#x2F;code&gt; into the database.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;b&gt;Endif&lt;&#x2F;b&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;finding-duplicates&quot;&gt;Finding duplicates&lt;&#x2F;h3&gt;
&lt;p&gt;To find duplicates query the database for entity generations that match the following criteria:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Must not have a confirmed end time.&lt;&#x2F;li&gt;
&lt;li&gt;Must be from the same origin as the original &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;entity-generation&#x2F;&quot;&gt;&lt;code&gt;entity_generation&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; (as a proxy for being on the same website)&lt;&#x2F;li&gt;
&lt;li&gt;Must have the same &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;text-pile-gen2&#x2F;&quot;&gt;&lt;code&gt;text_pile_ng&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; digest as the &lt;code&gt;text_pile_ng&lt;&#x2F;code&gt; of the original.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;From the results plus the original &lt;code&gt;entity_generation&lt;&#x2F;code&gt; the one with the shortest URL (other criteria are possible if better ones are available) is picked as the one to be marked as the original, all other entity generations in the list get marked as duplicates.&lt;&#x2F;p&gt;
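&lt;p&gt;The selection described above can be sketched like this. It is a minimal in-memory illustration, not the database query the summarizer runs: the dict keys, both function names, the definition of origin as scheme plus host and port, and the alphabetical tie-breaker for equally long URLs are all assumptions made for this example.&lt;&#x2F;p&gt;

```python
from urllib.parse import urlsplit


def origin_of(url):
    """Approximate the origin as (scheme, host:port). This is an assumption."""
    parts = urlsplit(url)
    return (parts.scheme, parts.netloc)


def find_duplicate_candidates(original, generations):
    """Filter entity generations by the three criteria listed above.

    Each generation is a hypothetical dict with the keys
    "uuid", "url", "digest" and "time_end_confirmed".
    """
    return [
        g for g in generations
        if g["uuid"] != original["uuid"]
        and g["time_end_confirmed"] is None
        and origin_of(g["url"]) == origin_of(original["url"])
        and g["digest"] == original["digest"]
    ]


def mark_duplicates(original, candidates):
    """Map each UUID to True if it should be marked as a duplicate.

    The generation with the shortest URL is kept as the original.
    Ties are broken alphabetically, which is an assumption of this sketch.
    """
    group = [original] + list(candidates)
    keeper = min(group, key=lambda g: (len(g["url"]), g["url"]))
    return {g["uuid"]: g["uuid"] != keeper["uuid"] for g in group}
```

&lt;p&gt;Note that the entity generation that triggered the search can itself end up marked as the duplicate if one of the candidates has a shorter URL.&lt;&#x2F;p&gt;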
&lt;h2 id=&quot;deriving-content&quot;&gt;Deriving Content&lt;&#x2F;h2&gt;
&lt;p&gt;The &lt;code&gt;text_pile&lt;&#x2F;code&gt;, &lt;code&gt;text_pile_ng&lt;&#x2F;code&gt;, &lt;code&gt;link_summaries&lt;&#x2F;code&gt; and &lt;code&gt;document_description&lt;&#x2F;code&gt; are assembled by fetching file content from the database and running an appropriate scraping algorithm.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;b&gt;Todo:&lt;&#x2F;b&gt; Link scraping algorithms here.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;document_description&lt;&#x2F;code&gt; indexiness is calculated from the &lt;code&gt;link_summaries&lt;&#x2F;code&gt; and &lt;code&gt;document_description&lt;&#x2F;code&gt; using the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;calculating-indexiness&#x2F;&quot;&gt;indexiness algorithm&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;file_summary&lt;&#x2F;code&gt; is generated from the &lt;code&gt;file_info&lt;&#x2F;code&gt;, &lt;code&gt;document_description&lt;&#x2F;code&gt;, and &lt;code&gt;text_pile&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;changes-to-the-algorithm&quot;&gt;Changes to the Algorithm&lt;&#x2F;h2&gt;
&lt;p&gt;After release &lt;code&gt;3.0.0&lt;&#x2F;code&gt; the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;text-pile-gen1&#x2F;&quot;&gt;Text Pile (Gen1)&lt;&#x2F;a&gt; has been replaced by the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;text-pile-gen2&#x2F;&quot;&gt;Text Pile Gen2&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Text Pile Gen1</title>
        <published>2024-10-31T00:00:00+00:00</published>
        <updated>2024-10-31T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/data/text-pile-gen1/"/>
        <id>https://doc.unobtanium.rocks/data/text-pile-gen1/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/data/text-pile-gen1/">&lt;p&gt;The Text Pile &lt;abbr title=&quot;Generation 1&quot;&gt;Gen1&lt;&#x2F;abbr&gt; (&quot;Legacy&quot;) is a datastructure for storing unformatted content text that is extracted from pages during the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;summarizing&#x2F;&quot;&gt;summarizing step&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;!warning: &lt;b&gt;Deprecated:&lt;&#x2F;b&gt; The Text Pile is deprecated after version &lt;code&gt;3.0.0&lt;&#x2F;code&gt; see the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;text-pile-gen2&#x2F;&quot;&gt;Text Pile Gen2&lt;&#x2F;a&gt; for the datastructure that replaced it.&lt;&#x2F;p&gt;
&lt;p&gt;The purpose of the text pile is to have a data source for the full text search that can also be used to generate preview snippets from any relevant point in the text.&lt;&#x2F;p&gt;
&lt;p&gt;The text pile has the following fields:&lt;&#x2F;p&gt;
&lt;dl class=&quot;max-one-dd&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;text&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		What has been determined to be the main file content, roughly one line per paragraph. Also includes headlines, code, quotes etc.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;secondary_text&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Like &lt;code&gt;text&lt;&#x2F;code&gt;, but for everything that is not marked as the main file content.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;big_headlines&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Bigger headlines (down to level 2 or 3, depending on derivation algorithm), headlines are newline separated.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;small_headlines&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Headlines below the level of what goes into &lt;code&gt;big_headlines&lt;&#x2F;code&gt;, same format.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;code_text&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Text that was marked up as code.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;quote_text&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Text that is marked up as being quoted.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;p&gt;Note that &lt;code&gt;text&lt;&#x2F;code&gt; and &lt;code&gt;secondary_text&lt;&#x2F;code&gt; together contain all of the page&#x27;s content. The other fields are effectively duplicates of specific subsections for the purpose of weighting them differently.&lt;&#x2F;p&gt;
&lt;p&gt;If there is no text for a given field it must contain an empty string.&lt;&#x2F;p&gt;
&lt;p&gt;Leading and trailing space characters must be stripped.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;digest&quot;&gt;Digest&lt;&#x2F;h2&gt;
&lt;p&gt;To quickly detect and eliminate &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;duplicate-types&#x2F;#exact-duplicates&quot;&gt;exact duplicates&lt;&#x2F;a&gt; each Text Pile has a digest that is calculated using the &lt;a href=&quot;https:&#x2F;&#x2F;www.blake2.net&#x2F;&quot;&gt;Blake2b&lt;&#x2F;a&gt; 512bit algorithm.&lt;&#x2F;p&gt;
&lt;p&gt;The hash is fed each field of the text pile, with a delimiter made up of three linebreaks (&lt;code&gt;\n\n\n&lt;&#x2F;code&gt;) &lt;i&gt;between&lt;&#x2F;i&gt; each two fields. If a field is empty it is fed as an empty string.&lt;&#x2F;p&gt;
&lt;p&gt;They are fed to the hash function in the following order:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;text&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;secondary_text&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;big_headlines&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;small_headlines&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;code_text&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;quote_text&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
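&lt;p&gt;The feeding order and delimiter rule above can be sketched in Python. This is a minimal illustration, not the unobtanium implementation: the dictionary representation, the function name &lt;code&gt;text_pile_digest&lt;&#x2F;code&gt;, the UTF-8 encoding and the hex output are assumptions, and the fields are expected to be stripped already.&lt;&#x2F;p&gt;

```python
import hashlib

# Field order as listed above.
FIELD_ORDER = (
    "text",
    "secondary_text",
    "big_headlines",
    "small_headlines",
    "code_text",
    "quote_text",
)

# Three line breaks, fed only between two fields.
DELIMITER = b"\n\n\n"


def text_pile_digest(pile):
    """Return the hex Blake2b-512 digest for a dict of text pile fields.

    Assumption: `pile` maps field names to already stripped strings.
    """
    hasher = hashlib.blake2b(digest_size=64)  # 64 bytes = 512 bits
    for position, name in enumerate(FIELD_ORDER):
        if position > 0:
            hasher.update(DELIMITER)  # no leading or trailing delimiter
        # A missing field is fed as an empty string.
        hasher.update(pile.get(name, "").encode("utf-8"))
    return hasher.hexdigest()
```

&lt;p&gt;Because empty fields still contribute their delimiters, two piles that hold the same text in different fields produce different digests.&lt;&#x2F;p&gt;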
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Calculating Indexiness</title>
        <published>2024-10-28T00:00:00+00:00</published>
        <updated>2024-10-28T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/algorithm/calculating-indexiness/"/>
        <id>https://doc.unobtanium.rocks/algorithm/calculating-indexiness/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/algorithm/calculating-indexiness/">&lt;p&gt;Indexiness is a numeric score that helps to distinguish between index and leaf pages, where index pages are navigation heavy and leaf pages have some content. The score is mostly derived from a page&#x27;s link elements.&lt;&#x2F;p&gt;
&lt;p&gt;A negative indexiness score means that a document is a leaf page (not very indexy); a positive score means that the document is a navigation page. A higher absolute score correlates with how confident the algorithm is in its classification. The absolute score also correlates with the number of links on a page.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;relevant-data-structures&quot;&gt;Relevant Data-Structures&lt;&#x2F;h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;link-locality&#x2F;&quot;&gt;Link Locality&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;location-signature&#x2F;&quot;&gt;Location signature&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;what-makes-a-page-an-index-or-a-leaf&quot;&gt;What makes a Page an Index or a Leaf?&lt;&#x2F;h2&gt;
&lt;p&gt;The following are some stereotypical criteria for each page type:&lt;&#x2F;p&gt;
&lt;p&gt;Index pages have:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Fewer outlinks&lt;&#x2F;li&gt;
&lt;li&gt;No publishing date&lt;&#x2F;li&gt;
&lt;li&gt;Inlinks in lists&lt;&#x2F;li&gt;
&lt;li&gt;Inlinks in headers&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Leaf pages have:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;More outlinks&lt;&#x2F;li&gt;
&lt;li&gt;Selflinks in headers (to get link-able headers)&lt;&#x2F;li&gt;
&lt;li&gt;Selflinks in lists (table of contents)&lt;&#x2F;li&gt;
&lt;li&gt;A publishing date&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;There are other criteria too, like the paragraph size&#x2F;count that &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;mozilla&#x2F;readability&#x2F;blob&#x2F;main&#x2F;Readability.js&quot;&gt;readability.js&lt;&#x2F;a&gt; uses, but the above seem enough to classify most pages correctly.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;metrics&quot;&gt;Metrics&lt;&#x2F;h2&gt;
&lt;p&gt;To calculate the indexiness the following metrics have to be known about a document.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Metric&lt;&#x2F;th&gt;&lt;th&gt;How to get&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;has_publishing_date&lt;&#x2F;td&gt;&lt;td&gt;Did the crawler find a publishing date in the page metadata?&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;selflinks_in_headers&lt;&#x2F;td&gt;&lt;td&gt;How many selflinks either are in a headline or contain a headline?&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;selflinks_in_lists&lt;&#x2F;td&gt;&lt;td&gt;How many selflinks are in some kind of list?&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;inlinks_in_headers&lt;&#x2F;td&gt;&lt;td&gt;How many inlinks either are in a headline or contain a headline?&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;inlinks_in_lists&lt;&#x2F;td&gt;&lt;td&gt;How many inlinks are in some kind of list?&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;number_of_outlinks&lt;&#x2F;td&gt;&lt;td&gt;How many outlinks are on the page?&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;h3 id=&quot;counting-links&quot;&gt;Counting Links&lt;&#x2F;h3&gt;
&lt;p&gt;To be relevant for indexiness, a link must fulfill all of the following criteria:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;It is a link intended for document navigation (HTML &lt;code&gt;a&lt;&#x2F;code&gt;-tag)&lt;&#x2F;li&gt;
&lt;li&gt;Not &lt;code&gt;in_nav&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Not &lt;code&gt;in_aside&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;One of:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;in_main&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;in_article&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Neither &lt;code&gt;in_footer&lt;&#x2F;code&gt; nor &lt;code&gt;in_header&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Each relevant link only gets counted once. Being relevant for the &lt;code&gt;_in_headers&lt;&#x2F;code&gt; metrics beats being relevant for the &lt;code&gt;_in_lists&lt;&#x2F;code&gt; metrics if both would apply.&lt;&#x2F;p&gt;
&lt;p&gt;Note: the &lt;code&gt;in_nav&lt;&#x2F;code&gt;&#x2F;&lt;code&gt;in_aside&lt;&#x2F;code&gt; requirement might be lifted for selflinks to not penalize correctly marked up tables of contents.&lt;&#x2F;p&gt;
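&lt;p&gt;The relevance criteria and the headers-before-lists rule could be expressed roughly as follows. This is a sketch under assumptions: &lt;code&gt;LinkContext&lt;&#x2F;code&gt; is a hypothetical container, its location flags are named after the list above, and &lt;code&gt;in_headline&lt;&#x2F;code&gt; and &lt;code&gt;in_list&lt;&#x2F;code&gt; are made-up names for the headline and list checks.&lt;&#x2F;p&gt;

```python
from dataclasses import dataclass


@dataclass
class LinkContext:
    """Hypothetical location flags for one HTML `a` element."""
    in_nav: bool = False
    in_aside: bool = False
    in_main: bool = False
    in_article: bool = False
    in_footer: bool = False
    in_header: bool = False
    in_headline: bool = False  # is in a headline or contains one
    in_list: bool = False


def is_relevant(link):
    """Apply the criteria from the list above to one link."""
    if link.in_nav or link.in_aside:
        return False
    return (
        link.in_main
        or link.in_article
        or not (link.in_footer or link.in_header)
    )


def metric_suffix(link):
    """Return the metric suffix a selflink or inlink counts towards."""
    if not is_relevant(link):
        return None
    if link.in_headline:
        return "in_headers"  # headers beat lists when both apply
    if link.in_list:
        return "in_lists"
    return None  # relevant, but in neither a headline nor a list
```

&lt;p&gt;Returning a single suffix per link is what keeps every relevant link from being counted more than once.&lt;&#x2F;p&gt;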
&lt;h2 id=&quot;metrics-to-indexiness&quot;&gt;Metrics to Indexiness&lt;&#x2F;h2&gt;
&lt;pre data-lang=&quot;lua&quot; class=&quot;language-lua &quot;&gt;&lt;code class=&quot;language-lua&quot; data-lang=&quot;lua&quot;&gt;(selflinks_in_headers * -20 ) +
(selflinks_in_lists   * -5 ) +
(inlinks_in_headers   * 11 ) +
(inlinks_in_lists     * 5 ) +
(number_of_outlinks   * -1 ) +
(if has_publishing_date { -25 } else { 0 }) +
-10
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The metrics are weighted by how strong of an indicator they are compared to the weakest indicator.&lt;&#x2F;p&gt;
&lt;p&gt;The headstart of &lt;code&gt;-10&lt;&#x2F;code&gt; is chosen so that the page starts out relatively leafy, but one headline-inlink with no additional clues makes the score barely positive.&lt;&#x2F;p&gt;
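&lt;p&gt;For experimenting with the weights, the formula can be written as a small Python function. The function name and the keyword defaults are assumptions made for this sketch; the weights and the head start are the ones given above.&lt;&#x2F;p&gt;

```python
def indexiness(
    selflinks_in_headers=0,
    selflinks_in_lists=0,
    inlinks_in_headers=0,
    inlinks_in_lists=0,
    number_of_outlinks=0,
    has_publishing_date=False,
):
    """Return the indexiness score for one document's metrics."""
    score = -10  # head start: pages start out relatively leafy
    score += selflinks_in_headers * -20
    score += selflinks_in_lists * -5
    score += inlinks_in_headers * 11
    score += inlinks_in_lists * 5
    score += number_of_outlinks * -1
    if has_publishing_date:
        score += -25
    return score
```

&lt;p&gt;A page with exactly one headline inlink and no other clues scores &lt;code&gt;-10 + 11 = 1&lt;&#x2F;code&gt;, which is the barely positive case described above.&lt;&#x2F;p&gt;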
&lt;h2 id=&quot;known-misclassifications&quot;&gt;Known Misclassifications&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;inside-rust-blog-table-based-index-found-2024-08-24&quot;&gt;Inside Rust Blog, table based index (found: 2024-08-24)&lt;&#x2F;h3&gt;
&lt;p&gt;https:&#x2F;&#x2F;blog.rust-lang.org&#x2F;inside-rust&#x2F;&lt;&#x2F;p&gt;
&lt;p&gt;The index page makes use of tables, which are not recognized by the current algorithm iteration.&lt;&#x2F;p&gt;
&lt;p&gt;A quickfix would be to consider links in tables equivalent to links in lists, though this will clash with other usage modes of tables.&lt;&#x2F;p&gt;
&lt;p&gt;Another way of dealing with this is comparing the number of table links to the total number of relevant links and only applying that metric if a significant share of links is inside tables. This threshold should be pretty high though.&lt;&#x2F;p&gt;
&lt;p&gt;Note that tables are also often used on leafy pages for inlinks to other pages:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;https:&#x2F;&#x2F;www.freedesktop.org&#x2F;software&#x2F;libqmi&#x2F;libqmi-glib&#x2F;latest&#x2F;libqmi-glib-DMS-Set-FCC-Authentication-response.html (gtk-doc generated, currently correct negative indexiness)&lt;&#x2F;li&gt;
&lt;li&gt;https:&#x2F;&#x2F;wiki.postmarketos.org&#x2F;wiki&#x2F;Qualcomm_Snapdragon_415&#x2F;615&#x2F;616_(MSM8929&#x2F;MSM8939) (Table-heavy wiki page, currently semi-correct indexiness of 6)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;blogger-multiple-issues-with-blog-entries-fully-replicated-on-index-page-found-2024-10-20&quot;&gt;Blogger, multiple issues with blog entries fully replicated on index page (found: 2024-10-20)&lt;&#x2F;h3&gt;
&lt;p&gt;https:&#x2F;&#x2F;www.righto.com&#x2F;2019&#x2F;&lt;&#x2F;p&gt;
&lt;p&gt;The index pages basically contain the full article content here, including an &lt;code&gt;&amp;lt;a href=&quot;https:&#x2F;&#x2F;static.…&quot;&amp;gt;&amp;lt;img …&amp;gt;&amp;lt;&#x2F;a&amp;gt;&lt;&#x2F;code&gt; construct. The links that make the images interactive shouldn&#x27;t affect the indexiness algorithm and should be treated as part of the images.&lt;&#x2F;p&gt;
&lt;p&gt;The alt text of the &lt;code&gt;img&lt;&#x2F;code&gt; tags doesn&#x27;t get counted as text on the link wrapped around the images, so these are basically empty.&lt;&#x2F;p&gt;
&lt;p&gt;A quickfix would be ignoring links without text on them.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;b&gt;Note:&lt;&#x2F;b&gt; After applying the quickfix the page still gets misclassified as a leaf-page with a score of -279 (instead of roughly -380 before).&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Crawl Loop</title>
        <published>2024-10-28T00:00:00+00:00</published>
        <updated>2025-08-06T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/algorithm/crawl-loop/"/>
        <id>https://doc.unobtanium.rocks/algorithm/crawl-loop/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/algorithm/crawl-loop/">&lt;p&gt;&lt;b&gt;Note:&lt;&#x2F;b&gt; The crawl loop is built into the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;unobtanium-crawler&#x2F;#crawl&quot;&gt;&lt;code&gt;unobtanium-crawler crawl&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; command.&lt;&#x2F;p&gt;
&lt;p&gt;This document describes how the unobtanium crawler works: how it discovers new pages and deals with existing pages.&lt;&#x2F;p&gt;
&lt;p&gt;Inputs:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;crawler-crawl-configuration&#x2F;&quot;&gt;Crawler configuration&lt;&#x2F;a&gt;
&lt;ul&gt;
&lt;li&gt;Seed URLs&lt;&#x2F;li&gt;
&lt;li&gt;Crawl Policies:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;do_not_crawl&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;url_query_parameters&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;From Crawler Database:
&lt;ul&gt;
&lt;li&gt;Found links and recrawl information (Crawl Candidates)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Outputs:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;To Crawler Database:
&lt;ul&gt;
&lt;li&gt;Found links and recrawl information (Crawl Candidates)&lt;&#x2F;li&gt;
&lt;li&gt;Redirect information&lt;&#x2F;li&gt;
&lt;li&gt;File metadata&lt;&#x2F;li&gt;
&lt;li&gt;File content&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The crawler splits itself up into one crawl loop for each URL origin that is configured (OriginCrawler). This makes internals simpler, as the data for one site is neatly kept together.
Which origins are crawled is derived from the seed URLs. If a found link has the same origin as a configured seed, it will be considered for crawling.&lt;&#x2F;p&gt;
&lt;p&gt;Crawling stops after a preconfigured number of crawling actions have happened or if there are no more URLs that need crawling.&lt;&#x2F;p&gt;
&lt;p&gt;Example: If the crawler is configured for &lt;code&gt;example.org&lt;&#x2F;code&gt; and finds a link to &lt;code&gt;example.com&lt;&#x2F;code&gt;, it won&#x27;t crawl that link with its &lt;code&gt;example.org&lt;&#x2F;code&gt; crawler. If &lt;code&gt;example.com&lt;&#x2F;code&gt; also happens to be configured, it will be crawled as part of that; otherwise the link will not be used.&lt;&#x2F;p&gt;
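&lt;p&gt;The origin matching in this example can be sketched as follows. The sketch assumes that an origin is the combination of scheme, hostname and port, as in the usual web origin concept; whether unobtanium uses exactly this definition is not stated here, and both function names are hypothetical.&lt;&#x2F;p&gt;

```python
from urllib.parse import urlsplit


def origin(url):
    """Return (scheme, hostname, port) for a URL.

    Simplification: an explicit default port such as :443 is not
    normalized, so it counts as a different origin here.
    """
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname, parts.port)


def responsible_origin(link, seeds):
    """Return the seed origin that would crawl `link`, or None."""
    link_origin = origin(link)
    for seed in seeds:
        if origin(seed) == link_origin:
            return link_origin
    return None  # no configured origin matches, the link is not used
```

&lt;p&gt;With only &lt;code&gt;example.org&lt;&#x2F;code&gt; seeded, a link to &lt;code&gt;example.com&lt;&#x2F;code&gt; yields &lt;code&gt;None&lt;&#x2F;code&gt; and is therefore never handed to a crawl loop.&lt;&#x2F;p&gt;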
&lt;h2 id=&quot;crawl-candidates&quot;&gt;Crawl Candidates&lt;&#x2F;h2&gt;
&lt;p&gt;Crawl Candidates are all URLs that are considered for crawling, along with optional recrawl information. They are kept in the Crawler Database.&lt;&#x2F;p&gt;
&lt;p&gt;Initial crawl candidates are the seed URLs and candidates from the previous crawls already in the database.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;redirects&quot;&gt;Redirects&lt;&#x2F;h2&gt;
&lt;p&gt;The crawler never follows redirects; it saves the fact that a URL redirected and marks the target as a Crawl Candidate.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;crawl-inhibitors&quot;&gt;Crawl inhibitors&lt;&#x2F;h2&gt;
&lt;p&gt;A URL might be discarded from crawling for a number of reasons:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The URL doesn&#x27;t share an origin with one of the configured seeds.&lt;&#x2F;li&gt;
&lt;li&gt;The URL is excluded by a &lt;code&gt;do_not_crawl&lt;&#x2F;code&gt; policy in the crawler configuration.&lt;&#x2F;li&gt;
&lt;li&gt;The URL has a combination of query parameters not explicitly allowed by a &lt;code&gt;url_query_parameters&lt;&#x2F;code&gt; policy.&lt;&#x2F;li&gt;
&lt;li&gt;The &lt;code&gt;robots.txt&lt;&#x2F;code&gt; file denies crawling.&lt;&#x2F;li&gt;
&lt;li&gt;The URL has already been crawled and is not due for a recrawl.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Crawl results will be discarded if:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The crawled resource marks itself as non-canonical, the canonical URL will be marked as a crawl candidate.&lt;&#x2F;li&gt;
&lt;li&gt;The crawled resource marks itself as not wanting to be indexed (using &lt;code&gt;meta&lt;&#x2F;code&gt; &lt;code&gt;robots&lt;&#x2F;code&gt; &lt;code&gt;noindex&lt;&#x2F;code&gt;), the unobtanium crawler respects that.&lt;&#x2F;li&gt;
&lt;li&gt;The page marks itself as not containing links that a crawler should follow (using &lt;code&gt;meta&lt;&#x2F;code&gt; &lt;code&gt;robots&lt;&#x2F;code&gt; &lt;code&gt;nofollow&lt;&#x2F;code&gt;), in this case links aren&#x27;t saved to the database.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;the-crawl-loop&quot;&gt;The Crawl loop&lt;&#x2F;h2&gt;
&lt;p&gt;All crawlers run simultaneously, alternating between doing requests and &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;crawl-delay&#x2F;&quot;&gt;idling to not overwhelm web servers&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The Origin Crawlers have the following state relevant for crawling:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Their robots.txt information&lt;&#x2F;li&gt;
&lt;li&gt;An expiry time for the robots.txt information&lt;&#x2F;li&gt;
&lt;li&gt;A todo-list - initialized to the seed URLs&lt;&#x2F;li&gt;
&lt;li&gt;A temporary ignore list in the database&lt;&#x2F;li&gt;
&lt;li&gt;The time to wait between requests&lt;&#x2F;li&gt;
&lt;li&gt;Patience counter - initialized to &lt;code&gt;5&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Each iteration the crawler does one of three things:&lt;&#x2F;p&gt;
&lt;h3 id=&quot;1-fetch-the-robots-txt-file&quot;&gt;[1] Fetch the &lt;code&gt;robots.txt&lt;&#x2F;code&gt; file&lt;&#x2F;h3&gt;
&lt;p&gt;If there is no robots.txt file or after expiry of the old one (after 30 minutes, currently hard coded) the origin crawler tries to fetch the &lt;code&gt;&#x2F;robots.txt&lt;&#x2F;code&gt; file, parse it and store it. If no robots.txt is found the crawler assumes that it is okay to crawl.&lt;&#x2F;p&gt;
&lt;p&gt;The time to wait between requests is set to the one given in the robots.txt if it contains a crawl delay.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;b&gt;Note:&lt;&#x2F;b&gt; This is implemented by the &lt;code&gt;DomainInformationLibrary&lt;&#x2F;code&gt; struct.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;2-fill-the-todo-list&quot;&gt;[2] Fill the todo list&lt;&#x2F;h3&gt;
&lt;p&gt;If the todo-list is empty the crawler will query crawl-candidates from the crawl database that:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;match the crawler&#x27;s assigned origin&lt;&#x2F;li&gt;
&lt;li&gt;have either never been crawled or are due for a recrawl&lt;&#x2F;li&gt;
&lt;li&gt;are not on the temporary ignore list&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;All URLs that have been in this crawling stage are saved to a temporary ignore-list that is handled by the database to ensure no URL gets considered twice. This ensures that even if the recrawl interval is set very low the crawler &quot;finishes&quot; a site and only recrawls on its next invocation; it also allows the crawler to recognize when it is finished.&lt;&#x2F;p&gt;
&lt;p&gt;If the database doesn&#x27;t have any URLs for crawling the origin crawler signals that it has finished and gets removed from the scheduler.&lt;&#x2F;p&gt;
&lt;p&gt;It applies the &lt;code&gt;do_not_crawl&lt;&#x2F;code&gt; policies. If a URL is denied by such a policy, the crawler logs it as a crawl with the &lt;code&gt;BLOCKED_URL_BY_LOCAL_POLICY&lt;&#x2F;code&gt; exit code.&lt;&#x2F;p&gt;
&lt;p&gt;If the URL has a query part, the &lt;code&gt;url_query_parameters&lt;&#x2F;code&gt; policies are evaluated. If no applicable policy allows the combination of URL parameters, the crawler also logs the crawl as &lt;code&gt;BLOCKED_URL_BY_LOCAL_POLICY&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;It checks the &lt;code&gt;robots.txt&lt;&#x2F;code&gt; information. If the URL shouldn&#x27;t be crawled, it gets logged as a crawl with the &lt;code&gt;BLOCKED_BY_ROBOTS_TXT&lt;&#x2F;code&gt; exit code.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;3-crawl-an-url&quot;&gt;[3] Crawl an URL&lt;&#x2F;h3&gt;
&lt;p&gt;If there is a URL on the todo-list, it is taken off and handed to the scraper for fetching and crawl-time scraping. This extracts URLs relevant for the crawler, such as those from links (HTML &lt;code&gt;a&lt;&#x2F;code&gt; elements), redirects and canonical URLs (in case a site marks itself as not canonical), and marks them as crawl candidates.&lt;&#x2F;p&gt;
&lt;p&gt;After that, the CrawlCandidate is updated with the information necessary for the recrawl to recognize whether a resource changed since the last crawl.&lt;&#x2F;p&gt;
&lt;p&gt;If the exit code indicates that the crawl has been rate-limited, the time to wait between requests is increased and the URL is un-ignored so that it can be recrawled in the same run.&lt;&#x2F;p&gt;
&lt;p&gt;If the error seems like a temporary condition that might have been caused by the network, the URL is un-ignored so that it can be recrawled in the same run.&lt;&#x2F;p&gt;
&lt;p&gt;If the condition seems like it could persist, the patience counter is decreased by one. When the counter reaches zero the origin crawler signals that it is &quot;finished&quot;, because crawling with a broken connection, an otherwise offline site or a server that speaks gibberish makes little sense.&lt;&#x2F;p&gt;
&lt;p&gt;If a database error is encountered the patience counter is set to zero immediately.&lt;&#x2F;p&gt;
&lt;p&gt;In any case the final crawl result is saved to the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-log&#x2F;&quot;&gt;crawl log&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
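&lt;p&gt;The choice between the three actions and the patience handling can be summarized in a small sketch. It is an illustration only: &lt;code&gt;OriginCrawlerState&lt;&#x2F;code&gt;, the action names and the outcome labels are hypothetical, and checking the three conditions in the order [1], [2], [3] is an assumption rather than something the text above guarantees.&lt;&#x2F;p&gt;

```python
from dataclasses import dataclass, field
from typing import Optional

ROBOTS_TXT_LIFETIME_SECONDS = 30 * 60  # expiry mentioned in step [1]


@dataclass
class OriginCrawlerState:
    """Hypothetical subset of the origin crawler state listed above."""
    robots_fetched_at: Optional[float] = None  # None: never fetched
    todo: list = field(default_factory=list)
    patience: int = 5


def next_action(state, now):
    """Pick the action for one loop iteration (assumed priority order)."""
    robots_expired = (
        state.robots_fetched_at is None
        or now - state.robots_fetched_at >= ROBOTS_TXT_LIFETIME_SECONDS
    )
    if robots_expired:
        return "fetch_robots_txt"
    if not state.todo:
        return "fill_todo_list"
    return "crawl_url"


def register_outcome(state, outcome):
    """Update the patience counter; return True if the crawler finishes.

    `outcome` is one of the made-up labels "ok", "rate_limited",
    "temporary", "persistent" or "database_error".
    """
    if outcome == "persistent":
        state.patience -= 1
    elif outcome == "database_error":
        state.patience = 0
    return state.patience == 0
```

&lt;p&gt;Rate-limited and temporary outcomes leave the patience counter untouched in this sketch, matching the description that those URLs are simply un-ignored and retried.&lt;&#x2F;p&gt;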
&lt;h2 id=&quot;notable-changes&quot;&gt;Notable Changes&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;strong&gt;2025-08-05:&lt;&#x2F;strong&gt; The crawlers now run concurrently as part of the &lt;a href=&quot;https:&#x2F;&#x2F;codeberg.org&#x2F;unobtanium&#x2F;unobtanium&#x2F;issues&#x2F;19&quot;&gt;Crawling at the speed of politeness&lt;&#x2F;a&gt; goal. This version introduced the politeness-based delay algorithm. Before that, the non-concurrent crawlers were taking turns in a round-robin-like scheduling pattern.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Database Batching (Pattern)</title>
        <published>2024-10-28T00:00:00+00:00</published>
        <updated>2024-10-28T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/algorithm/database-batching-pattern/"/>
        <id>https://doc.unobtanium.rocks/algorithm/database-batching-pattern/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/algorithm/database-batching-pattern/">&lt;p&gt;Batching up database queries is something you&#x27;ll see a lot in unobtanium, but mostly in the summarizer, which should, in the best case, process as many pages as possible as fast as possible.&lt;&#x2F;p&gt;
&lt;p&gt;The reason why it is faster boils down to SQL parsing and query overhead, which is the same no matter whether the database query is used once or 1024 times. So just let the database do the work once, and then let the main logic do some work before calling out to the database again.&lt;&#x2F;p&gt;
&lt;p&gt;The methods on the database explicitly made for batching are suffixed &lt;code&gt;_bulk&lt;&#x2F;code&gt;, otherwise they have the same name as the regular version (if there is one). These methods usually take a &lt;code&gt;Vec&lt;&#x2F;code&gt; of data and then return a &lt;code&gt;HashMap&lt;&#x2F;code&gt; of data.&lt;&#x2F;p&gt;
&lt;p&gt;When something is implemented as batched, the code alternates between building the list of arguments for the next database queries (possibly filtering them) and doing the queries. Where it makes sense there are checks in place that make the algorithm abort early if all work has been filtered out (e.g. in the summarizer after checking for already integrated files).&lt;&#x2F;p&gt;
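&lt;p&gt;The alternation between building argument lists and issuing bulk queries might look like the following sketch. The two callables stand in for hypothetical &lt;code&gt;_bulk&lt;&#x2F;code&gt; database methods (one list in, one mapping out); they are not part of the unobtanium API.&lt;&#x2F;p&gt;

```python
def process_in_batches(items, batch_size, is_integrated_bulk, integrate_bulk):
    """Process `items` in batches using two bulk database calls.

    is_integrated_bulk(list) returns a dict of item to bool,
    integrate_bulk(list) stores the remaining items. Both are
    hypothetical stand-ins for `_bulk` methods.
    """
    processed = 0
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # One bulk query answers the question for the whole batch.
        already_done = is_integrated_bulk(batch)
        todo = [item for item in batch if not already_done.get(item, False)]
        if not todo:
            continue  # all work was filtered out, skip further queries
        integrate_bulk(todo)
        processed += len(todo)
    return processed
```

&lt;p&gt;Note that the filter only consults the database, not the items of the batch that is currently in flight, which is where edge cases like duplicates inside a single batch can slip through.&lt;&#x2F;p&gt;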
&lt;h2 id=&quot;implementation-considerations&quot;&gt;Implementation Considerations&lt;&#x2F;h2&gt;
&lt;p&gt;When batching, make sure you know your data flow and be aware that batching may interfere with how data is selected.&lt;&#x2F;p&gt;
&lt;p&gt;This may make checks against the current batch necessary in addition to checks against the database.&lt;&#x2F;p&gt;
&lt;p&gt;A case &lt;a href=&quot;https:&#x2F;&#x2F;codeberg.org&#x2F;unobtanium&#x2F;unobtanium&#x2F;issues&#x2F;22&quot;&gt;where batching led to an undesired edge case&lt;&#x2F;a&gt; is in the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;summarizing&#x2F;&quot;&gt;summarizing algorithm&lt;&#x2F;a&gt;, where the test for &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;duplicate-types&#x2F;#self-duplicates&quot;&gt;self-duplicates&lt;&#x2F;a&gt; relies on testing against all already integrated data but fails to check against the data that is in the pipeline, which causes problems if the crawl results are shorter than the batch size.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Entity Data Tree</title>
        <published>2024-10-28T00:00:00+00:00</published>
        <updated>2024-10-28T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/concept/entity-data-tree/"/>
        <id>https://doc.unobtanium.rocks/concept/entity-data-tree/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/concept/entity-data-tree/">&lt;p&gt;Every Entity in the Unobtanium summary database may have multiple kinds of data attached to it; apart from the generation metadata, all of that data is optional. Because the database doesn&#x27;t know any Entity data structure directly, entities are uniquely identified by their entity generation UUID for organically grown reasons.&lt;&#x2F;p&gt;
&lt;p&gt;An entity may have:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;one mandatory &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;entity-generation&#x2F;&quot;&gt;Entity Generation&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;multiple &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-summary&#x2F;&quot;&gt;Crawl Summaries&lt;&#x2F;a&gt;
&lt;ul&gt;
&lt;li&gt;one &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;crawl-summary&#x2F;#http-summary&quot;&gt;HTTP Summary&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;one File Summary
&lt;ul&gt;
&lt;li&gt;one Document Description&lt;&#x2F;li&gt;
&lt;li&gt;one &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;text-pile-gen1&#x2F;&quot;&gt;Text Pile Gen1&lt;&#x2F;a&gt;
&lt;ul&gt;
&lt;li&gt;multiple Token Statistics entries&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;one &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;text-pile-gen2&#x2F;&quot;&gt;Text Pile Gen2&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;one Redirect Summary&lt;&#x2F;li&gt;
&lt;li&gt;multiple Link Summaries&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Note that Text Piles are deduplicated in the actual database: an entity generation points to the text pile via its ID. This is abstracted away by the database code in lib-unobtanium.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Entity Generation</title>
        <published>2024-10-28T00:00:00+00:00</published>
        <updated>2025-08-14T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/data/entity-generation/"/>
        <id>https://doc.unobtanium.rocks/data/entity-generation/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/data/entity-generation/">&lt;p&gt;An Entity Generation is a concept and data structure used by Unobtanium to have some kind of space&#x2F;time coordinate for any given version of a queryable resource (an entity).&lt;&#x2F;p&gt;
&lt;p&gt;Entity Generations live in the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;summary-database&#x2F;#entity-generation&quot;&gt;summary database&lt;&#x2F;a&gt; and are given a UUID for unique and stable identification.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;b&gt;Note:&lt;&#x2F;b&gt; Recommended are UUID v7 or v4 as they both have huge random components. v7 is interesting for data transparency as it in theory allows tracing when the data was generated.&lt;&#x2F;p&gt;
&lt;p&gt;See also:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;entity-data-tree&#x2F;&quot;&gt;The Entity Data Tree Concept&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;unobtanium&#x2F;latest&#x2F;unobtanium&#x2F;summary&#x2F;struct.EntityGeneration.html&quot;&gt;EntityGeneration on docs.rs&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;open-closed-state&quot;&gt;Open &#x2F; Closed State&lt;&#x2F;h2&gt;
&lt;p&gt;Each entity generation has a few timestamps:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;first_seen&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;last_seen&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;time_end_confirmed&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The &lt;code&gt;first_seen&lt;&#x2F;code&gt; and &lt;code&gt;last_seen&lt;&#x2F;code&gt; timestamps are always set and can be used to determine when the entity generation definitely existed in the recorded form.&lt;&#x2F;p&gt;
&lt;p&gt;As long as the entity can be assumed to exist in the recorded form the entity generation is considered &lt;i&gt;open&lt;&#x2F;i&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;time_end_confirmed&lt;&#x2F;code&gt; timestamp is set when unobtanium learns that the entity has changed. Once that has happened, the entity generation is considered &lt;i&gt;closed&lt;&#x2F;i&gt;.&lt;&#x2F;p&gt;
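&lt;p&gt;As a small illustration of the open and closed states, the three timestamps can be modelled like this. The class is a simplified stand-in written for this example, not the &lt;code&gt;EntityGeneration&lt;&#x2F;code&gt; struct from the crate, and the method names &lt;code&gt;is_open&lt;&#x2F;code&gt; and &lt;code&gt;close&lt;&#x2F;code&gt; are assumptions.&lt;&#x2F;p&gt;

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class EntityGenerationTimes:
    """Simplified, hypothetical model of the three timestamps."""
    first_seen: datetime
    last_seen: datetime
    time_end_confirmed: Optional[datetime] = None

    def is_open(self):
        # Open as long as no end has been confirmed.
        return self.time_end_confirmed is None

    def close(self, confirmed_at):
        # Called once a change of the entity has been observed.
        self.time_end_confirmed = confirmed_at


seen = datetime(2024, 10, 12, tzinfo=timezone.utc)
generation = EntityGenerationTimes(first_seen=seen, last_seen=seen)
```

&lt;p&gt;Here &lt;code&gt;generation&lt;&#x2F;code&gt; starts out open, with &lt;code&gt;first_seen&lt;&#x2F;code&gt; equal to &lt;code&gt;last_seen&lt;&#x2F;code&gt;, and becomes closed only after &lt;code&gt;close&lt;&#x2F;code&gt; has been called.&lt;&#x2F;p&gt;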
&lt;h2 id=&quot;fields&quot;&gt;Fields&lt;&#x2F;h2&gt;




	


&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;code&gt;url&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;code&gt;Url&lt;&#x2F;code&gt; The URL under which the given entity lives
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;uuid&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;code&gt;Uuid&lt;&#x2F;code&gt; The UUID that identifies the entity generation
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;first_seen&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;code&gt;UtcTimestamp&lt;&#x2F;code&gt; First known existence of this generation (may change if better data is integrated)
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;last_seen&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;code&gt;UtcTimestamp&lt;&#x2F;code&gt; Last known existence, may equal &lt;code&gt;first_seen&lt;&#x2F;code&gt;, updated every time a newer request confirms that the entity generation is still open.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;marked_duplicate&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;code&gt;bool&lt;&#x2F;code&gt; Caches the current &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;duplicate-types&#x2F;&quot;&gt;duplicate&lt;&#x2F;a&gt; status to avoid unnecessary queries to the duplicate summary table.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;code&gt;time_end_confirmed&lt;&#x2F;code&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;code&gt;Option&amp;lt;UtcTimestamp&amp;gt;&lt;&#x2F;code&gt; If set, the entity generation is considered no longer live; the value is the time at which this was confirmed (usually the &lt;code&gt;first_seen&lt;&#x2F;code&gt; of the next entity generation). In cases of temporary outages this might reset back to &lt;code&gt;None&lt;&#x2F;code&gt; when the entity comes back.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h2 id=&quot;updates&quot;&gt;Updates&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;2024-10-12-improved-timestamping&quot;&gt;2024-10-12 Improved Timestamping&lt;&#x2F;h3&gt;
&lt;p&gt;Removed fields:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;time_started&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;known_lifetime_seconds&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Added fields:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;first_seen&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;last_seen&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;time_end_confirmed&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Before this update every entity generation had a &lt;code&gt;time_started&lt;&#x2F;code&gt; and a &lt;code&gt;known_lifetime_seconds&lt;&#x2F;code&gt;. This was put in place before unobtanium was really able to work with multiple versions of entity generation for one URL. As it turns out, these are quite impractical to work with.&lt;&#x2F;p&gt;
&lt;p&gt;They were replaced by start and end times for the range in which the entity generation was observed (&quot;seen&quot;) and a definitive end time at which it was confirmed to be no longer there.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;2024-10-26-duplicate-marker&quot;&gt;2024-10-26 Duplicate Marker&lt;&#x2F;h3&gt;
&lt;p&gt;Added fields:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;marked_duplicate&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;A duplicate marker has been added that follows the current duplicate status for an entity generation as that saves a lot of query time during search.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Link Locality</title>
        <published>2024-10-28T00:00:00+00:00</published>
        <updated>2024-10-28T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/data/link-locality/"/>
        <id>https://doc.unobtanium.rocks/data/link-locality/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/data/link-locality/">&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;unobtanium&#x2F;latest&#x2F;unobtanium&#x2F;content&#x2F;enum.LinkLocality.html&quot;&gt;LinkLocality&lt;&#x2F;a&gt; describes how the destination of a link relates to the URL of the document the link is on.&lt;&#x2F;p&gt;
&lt;p&gt;Link locality is a tool for analyzing a site&#x27;s structure; it is &lt;strong&gt;not intended or fit for security purposes&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Possible Values are:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Id&lt;&#x2F;th&gt;&lt;th&gt;Name&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;SelfLink&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;InLink&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;OutLink&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Self-links refer to the same document they are on; they only change the &lt;code&gt;fragment&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;In-links stay on the same site; they are allowed to change the &lt;code&gt;path&lt;&#x2F;code&gt;, &lt;code&gt;query&lt;&#x2F;code&gt; and &lt;code&gt;fragment&lt;&#x2F;code&gt;. In addition, they are allowed to switch between the encrypted version of a protocol and the non-encrypted one. (The default implementation allows appending or removing an &lt;code&gt;s&lt;&#x2F;code&gt; suffix in the &lt;code&gt;scheme&lt;&#x2F;code&gt;, independent of the protocol used.)&lt;&#x2F;p&gt;
&lt;p&gt;Out-links are all other links. Links without a hostname are also considered out-links by the default implementation. Out-links are also the sane default to fall back to in case of doubt or lack of information.&lt;&#x2F;p&gt;
&lt;p&gt;Please note that the default derivation implementation is currently made for websites and is intended to be wrapped by a custom implementation should the need arise.&lt;&#x2F;p&gt;
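&lt;p&gt;The three cases can be sketched roughly as follows. This is an illustrative Python sketch, not the default derivation from the unobtanium crate: the names &lt;code&gt;link_locality&lt;&#x2F;code&gt; and &lt;code&gt;schemes_compatible&lt;&#x2F;code&gt; are hypothetical, and treating &quot;same site&quot; as &quot;same hostname and same explicit port&quot; is a simplifying assumption.&lt;&#x2F;p&gt;

```python
from enum import IntEnum
from urllib.parse import urlsplit


class LinkLocality(IntEnum):
    # Ids follow the table above.
    SELF_LINK = 0
    IN_LINK = 1
    OUT_LINK = 2


def schemes_compatible(a, b):
    # Equal, or one is the other with an "s" suffix appended (http/https, ws/wss).
    return a == b or a + "s" == b or b + "s" == a


def link_locality(document_url, link_url):
    doc = urlsplit(document_url)
    dst = urlsplit(link_url)
    # Without a hostname there is not enough information: fall back to OutLink.
    if not doc.hostname or not dst.hostname:
        return LinkLocality.OUT_LINK
    # Only the fragment differs: the link stays on the same document.
    if doc._replace(fragment="") == dst._replace(fragment=""):
        return LinkLocality.SELF_LINK
    # Assumption: "same site" means same hostname and same explicit port.
    same_host = (doc.hostname, doc.port) == (dst.hostname, dst.port)
    if same_host and schemes_compatible(doc.scheme, dst.scheme):
        return LinkLocality.IN_LINK
    return LinkLocality.OUT_LINK
```

&lt;p&gt;Anything the sketch cannot classify with confidence ends up as &lt;code&gt;OUT_LINK&lt;&#x2F;code&gt;, which mirrors the fall-back described above.&lt;&#x2F;p&gt;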
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Location Signature</title>
        <published>2024-10-28T00:00:00+00:00</published>
        <updated>2024-10-28T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/data/location-signature/"/>
        <id>https://doc.unobtanium.rocks/data/location-signature/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/data/location-signature/">&lt;p&gt;The Location Signature is a data structure for describing where, semantically speaking, an element is in an (HTML) document. It is implemented as a set of flags that get set according to the element&#x27;s parent elements.&lt;&#x2F;p&gt;
&lt;p&gt;See &lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;unobtanium&#x2F;latest&#x2F;unobtanium&#x2F;content&#x2F;struct.LocationSignature.html&quot;&gt;LocationSignature on docs.rs&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The fields are named after the corresponding HTML element.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Field&lt;&#x2F;th&gt;&lt;th&gt;HTML-Tags&lt;&#x2F;th&gt;&lt;th&gt;Notes&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;in_header&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;header&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;in_footer&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;footer&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;in_aside&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;aside&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;in_nav&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;nav&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;in_form&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;form&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;in_main&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;main&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;in_article&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;article&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;in_section&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;section&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;in_table&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;table&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;in_figure&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;figure&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;in_address&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;address&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;in_code&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;code&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;in_headline&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;h1&lt;&#x2F;code&gt; – &lt;code&gt;h6&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;in_list&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;ul&lt;&#x2F;code&gt;, &lt;code&gt;ol&lt;&#x2F;code&gt;, &lt;code&gt;dl&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;in_paragraph&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;p&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The location signature is mainly used to extract the purpose of a link.&lt;&#x2F;p&gt;
&lt;dl class=&quot;max-one-dd&quot;&gt;

	
		&lt;dt&gt;
		1
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Depending on document structure, if no &lt;code&gt;in_main&lt;&#x2F;code&gt; is present, &lt;code&gt;in_article&lt;&#x2F;code&gt; might be a &quot;good enough&quot; alternative.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		2
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		&lt;code&gt;section&lt;&#x2F;code&gt; has no real semantic meaning, but it might be useful to separate useful content from fluff on sites that have little to no other semantic markup.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h2 id=&quot;serialized-representation&quot;&gt;Serialized Representation&lt;&#x2F;h2&gt;
&lt;p&gt;The location signature is always represented as key-value pairs; fields set to false are not serialized.&lt;&#x2F;p&gt;
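&lt;p&gt;How the flags are derived and then serialized can be sketched as follows. This is a Python illustration only, not the crate&#x27;s code: the helper names &lt;code&gt;location_signature&lt;&#x2F;code&gt; and &lt;code&gt;serialize&lt;&#x2F;code&gt; are hypothetical, the list of ancestor tag names is assumed to come from an HTML parser, and a plain dictionary stands in for whatever concrete key-value encoding is used.&lt;&#x2F;p&gt;

```python
# Field names and tags follow the table above.
TAG_TO_FIELD = {
    "header": "in_header",
    "footer": "in_footer",
    "aside": "in_aside",
    "nav": "in_nav",
    "form": "in_form",
    "main": "in_main",
    "article": "in_article",
    "section": "in_section",
    "table": "in_table",
    "figure": "in_figure",
    "address": "in_address",
    "code": "in_code",
    "ul": "in_list",
    "ol": "in_list",
    "dl": "in_list",
    "p": "in_paragraph",
}
# h1 to h6 all map to the same flag.
for level in range(1, 7):
    TAG_TO_FIELD["h" + str(level)] = "in_headline"


def location_signature(ancestor_tags):
    # Start with every flag unset, then set one flag per matching ancestor.
    signature = {field: False for field in TAG_TO_FIELD.values()}
    for tag in ancestor_tags:
        field = TAG_TO_FIELD.get(tag.lower())
        if field is not None:
            signature[field] = True
    return signature


def serialize(signature):
    # Fields set to false are left out of the serialized key-value pairs.
    return {field: True for field, is_set in signature.items() if is_set}
```

&lt;p&gt;For a link inside &lt;code&gt;main&lt;&#x2F;code&gt;, &lt;code&gt;article&lt;&#x2F;code&gt; and &lt;code&gt;p&lt;&#x2F;code&gt;, only the three corresponding keys would appear in the serialized form under these assumptions.&lt;&#x2F;p&gt;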
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>URL</title>
        <published>2024-10-28T00:00:00+00:00</published>
        <updated>2024-10-28T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/data/url/"/>
        <id>https://doc.unobtanium.rocks/data/url/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/data/url/">&lt;p&gt;Unobtanium makes heavy use of the &lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;url&quot;&gt;url crate&lt;&#x2F;a&gt; and inherits its way of representing parsed URLs.&lt;&#x2F;p&gt;
&lt;p&gt;Also see: &lt;a href=&quot;http:&#x2F;&#x2F;url.spec.whatwg.org&#x2F;&quot;&gt;The URL Standard&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;URL Component&lt;&#x2F;th&gt;&lt;th&gt;Datatype&lt;&#x2F;th&gt;&lt;th&gt;If absent&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;scheme&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;str&lt;&#x2F;td&gt;&lt;td&gt;empty&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;username&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;str&lt;&#x2F;td&gt;&lt;td&gt;empty&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;password&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Option&amp;lt;str&amp;gt;&lt;&#x2F;td&gt;&lt;td&gt;None&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;host&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Option&amp;lt;str&amp;gt;&lt;&#x2F;td&gt;&lt;td&gt;None&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;port&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Option&amp;lt;u16&amp;gt;&lt;&#x2F;td&gt;&lt;td&gt;None&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;path&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;str&lt;&#x2F;td&gt;&lt;td&gt;empty&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;query&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Option&amp;lt;str&amp;gt;&lt;&#x2F;td&gt;&lt;td&gt;None&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;fragment&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Option&amp;lt;str&amp;gt;&lt;&#x2F;td&gt;&lt;td&gt;None&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;All values are taken as is. For &quot;special&quot; URLs (mostly &lt;code&gt;http&lt;&#x2F;code&gt; and &lt;code&gt;https&lt;&#x2F;code&gt;) the hostnames are Punycode-encoded; otherwise they are percent-encoded.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;origin&quot;&gt;Origin&lt;&#x2F;h2&gt;
&lt;p&gt;Unobtanium does not follow the url crate when it comes to the URL origin implementation. For unobtanium the &lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;unobtanium&#x2F;latest&#x2F;unobtanium&#x2F;struct.Origin.html&quot;&gt;Origin&lt;&#x2F;a&gt; is the &lt;code&gt;scheme&lt;&#x2F;code&gt;, &lt;code&gt;host&lt;&#x2F;code&gt; and &lt;code&gt;port&lt;&#x2F;code&gt; taken from the URL and put into a struct. For &lt;code&gt;http&lt;&#x2F;code&gt;, &lt;code&gt;https&lt;&#x2F;code&gt;, &lt;code&gt;ws&lt;&#x2F;code&gt;, &lt;code&gt;wss&lt;&#x2F;code&gt; and &lt;code&gt;ftp&lt;&#x2F;code&gt;, if no port is mentioned in the URL, the default port number is filled in automatically using the &lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;url&#x2F;latest&#x2F;url&#x2F;struct.Url.html#method.port_or_known_default&quot;&gt;port_or_known_default() getter&lt;&#x2F;a&gt;. If two origins are represented using equal data, they are equal; this simplifies matching and comparing logic across the whole codebase.&lt;&#x2F;p&gt;
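&lt;p&gt;The idea can be illustrated with a small Python sketch. It is not the Rust struct linked above: &lt;code&gt;Origin&lt;&#x2F;code&gt; as a named tuple, &lt;code&gt;origin_of&lt;&#x2F;code&gt; and &lt;code&gt;DEFAULT_PORTS&lt;&#x2F;code&gt; are hypothetical names, and Python&#x27;s &lt;code&gt;urllib.parse&lt;&#x2F;code&gt; stands in for the url crate. The port numbers are the well-known defaults for the five schemes.&lt;&#x2F;p&gt;

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit

# Well-known default ports for the schemes named above.
DEFAULT_PORTS = {"http": 80, "https": 443, "ws": 80, "wss": 443, "ftp": 21}


class Origin(NamedTuple):
    scheme: str
    host: Optional[str]
    port: Optional[int]


def origin_of(url):
    parts = urlsplit(url)
    # Use the explicit port if there is one, otherwise the known default (if any).
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return Origin(parts.scheme, parts.hostname, port)
```

&lt;p&gt;Because the default port is filled in before comparing, an origin with an omitted port and one with the explicit standard port compare equal through plain value equality.&lt;&#x2F;p&gt;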
&lt;h2 id=&quot;criteria&quot;&gt;Criteria&lt;&#x2F;h2&gt;
&lt;p&gt;URLs and Origins can be matched using the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;criteria&#x2F;url&#x2F;&quot;&gt;Url Criterium&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;criteria&#x2F;origin&#x2F;&quot;&gt;Origin Criterium&lt;&#x2F;a&gt;. Their naming and types follow the above table.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Crawler User-Agent and robots.txt</title>
        <published>2024-10-28T00:00:00+00:00</published>
        <updated>2024-10-28T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/manual/crawler-user-agent-and-robots-txt/"/>
        <id>https://doc.unobtanium.rocks/manual/crawler-user-agent-and-robots-txt/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/manual/crawler-user-agent-and-robots-txt/">&lt;p&gt;Crawling responsibly on the web also means giving others a chance to indicate how they want to be crawled, if at all.&lt;&#x2F;p&gt;
&lt;p&gt;Since the unobtanium-crawler can be used for multiple purposes, it allows configuring its user agent, either in the crawl configuration or via a command line option to the &lt;code&gt;crawl&lt;&#x2F;code&gt; subcommand.&lt;&#x2F;p&gt;
&lt;p&gt;If left unconfigured, the crawler uses the user agent &lt;code&gt;unobtanium-crawler-unconfigured-ua&lt;&#x2F;code&gt;. Outside of temporary testing, use of this user agent is &lt;strong&gt;heavily discouraged&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;When setting your own user agent, please use the following pattern:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code&gt;{your-project}-crawler (for {url})
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Where &lt;code&gt;{your-project}&lt;&#x2F;code&gt; is a representative name of what you are using the data for and &lt;code&gt;{url}&lt;&#x2F;code&gt; leads to a website describing or showing what you are using the data for.&lt;&#x2F;p&gt;
&lt;p&gt;An example would be:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code&gt;example-net-crawler (for https:&amp;#x2F;&amp;#x2F;example.net&amp;#x2F;why-we-crawl-your-website)
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h2 id=&quot;robots-txt&quot;&gt;robots.txt&lt;&#x2F;h2&gt;
&lt;p&gt;Everything before the first space in the user agent (i.e. &lt;code&gt;example-net-crawler&lt;&#x2F;code&gt;) will be used to match against the &lt;code&gt;robots.txt&lt;&#x2F;code&gt; file. If you are an operator, you can just copy that part over.&lt;&#x2F;p&gt;
&lt;p&gt;If the crawler finds itself not addressed by name in the robots.txt it follows the rules set for all bots (&lt;code&gt;*&lt;&#x2F;code&gt;).&lt;&#x2F;p&gt;
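&lt;p&gt;The matching described above can be tried out with a short Python sketch. It uses Python&#x27;s standard &lt;code&gt;urllib.robotparser&lt;&#x2F;code&gt; as a stand-in, so details of how agent names are compared may differ from what the crawler does; &lt;code&gt;robots_token&lt;&#x2F;code&gt;, &lt;code&gt;may_fetch&lt;&#x2F;code&gt; and the sample &lt;code&gt;robots.txt&lt;&#x2F;code&gt; contents are hypothetical and only serve the illustration.&lt;&#x2F;p&gt;

```python
from urllib.robotparser import RobotFileParser


def robots_token(user_agent):
    # Everything before the first space is matched against robots.txt.
    return user_agent.split(" ", 1)[0]


def may_fetch(robots_txt, user_agent, url):
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Rules for the named agent win; otherwise the rules for "*" apply.
    return parser.can_fetch(robots_token(user_agent), url)


USER_AGENT = "example-net-crawler (for https://example.net/why-we-crawl-your-website)"

# Hypothetical robots.txt: the named crawler may fetch everything except
# /private/, all other bots are kept out entirely.
ROBOTS_TXT = """\
User-agent: example-net-crawler
Disallow: /private/

User-agent: *
Disallow: /
"""
```

&lt;p&gt;With this sample file, the example crawler is addressed by name and only kept out of the private path, while a crawler with a different name falls under the rules for &lt;code&gt;*&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;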
&lt;p&gt;The crawler re-fetches the robots.txt file every 30 minutes, so stopping a rogue crawler is possible within a reasonable timeframe. Sending it &lt;code&gt;429&lt;&#x2F;code&gt; response codes will also increase the crawl delay slightly every time such a response is received. Alternatively, a crawler can be stopped by simulating an outage with &lt;code&gt;503&lt;&#x2F;code&gt; responses; it will give up within a few requests.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;b&gt;Note:&lt;&#x2F;b&gt; Unobtanium is free software, and while responsible use is encouraged, it can be modified to not respect all of the above. Such modifications, while not forbidden, are &lt;strong&gt;heavily discouraged&lt;&#x2F;strong&gt; by the developer.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;how-to-set-the-user-agent-for-unobtanium&quot;&gt;How to set the user agent for unobtanium?&lt;&#x2F;h2&gt;
&lt;p&gt;There are two ways to set the user agent for the unobtanium crawler:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The recommended way is to use the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;crawler-crawl-configuration&#x2F;#option-user-agent&quot;&gt;&lt;code&gt;user_agent&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; option in the configuration file.&lt;&#x2F;li&gt;
&lt;li&gt;For testing you can use the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;unobtanium-crawler&#x2F;#crawl-u-user-agent&quot;&gt;&lt;code&gt;--user-agent&lt;&#x2F;code&gt; command line option on the &lt;code&gt;crawl&lt;&#x2F;code&gt; subcommand&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Terminology</title>
        <published>2024-10-28T00:00:00+00:00</published>
        <updated>2024-10-28T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://doc.unobtanium.rocks/manual/terminology/"/>
        <id>https://doc.unobtanium.rocks/manual/terminology/</id>
        
        <content type="html" xml:base="https://doc.unobtanium.rocks/manual/terminology/">&lt;p&gt;This page describes some terms used across Unobtanium which have a specific meaning within Unobtanium.&lt;&#x2F;p&gt;
&lt;p&gt;Outside of Unobtanium you might see these terms used more broadly.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;terms-referring-to-things-outside-of-unobtanium&quot;&gt;Terms referring to things outside of Unobtanium&lt;&#x2F;h2&gt;
&lt;dl class=&quot;max-one-dd&quot;&gt;

	
		&lt;dt&gt;
		Homepage
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		A webpage that is the entry point and main navigation hub for a website. Each site may only have one homepage. It is usually linked from the header of each page via a name or icon representing the site.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		Webpage
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		A single document with one main file and maybe some auxiliary files that can usually be fetched over HTTP(S).
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		Website&#x2F;Site
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		A website is a collection of one or more pages that belong together in some sense (even if it is just a collection of random pages). That collection usually is the set of all pages sharing the same origin, though this may not always be the case (e.g. the same site being accessible via HTTP and HTTPS, or a public Unix system where each &lt;code&gt;&#x2F;~user&#x2F;&lt;&#x2F;code&gt; path has its own website; a big web presence available in many languages may also be considered one site per language).
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;data&#x2F;url&#x2F;#origin&quot;&gt;URL-Origin&#x2F;Origin&lt;&#x2F;a&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		By a URL&#x27;s origin, Unobtanium refers to the combination of the protocol, hostname and port fields. Origins that omit the port and origins that explicitly specify the standard port for the URL&#x27;s protocol are considered equivalent.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h2 id=&quot;terms-referring-to-things-inside-unobtanium&quot;&gt;Terms referring to things inside Unobtanium&lt;&#x2F;h2&gt;
&lt;dl class=&quot;&quot;&gt;

	
		&lt;dt&gt;
		&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;manual&#x2F;crawler-crawl-configuration&#x2F;#policies&quot;&gt;Crawl Policy&lt;&#x2F;a&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		A crawl policy is a rule that configures the &lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;crawl-loop&#x2F;&quot;&gt;Crawl Loop&lt;&#x2F;a&gt;, mostly regarding which pages it is allowed to crawl.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		Entity
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		An entity refers to a (search-)queryable entry in the Unobtanium database. Each entity is uniquely identified by its entity-generation-UUID.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		Entity-Generation
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		An entity-generation refers to the generation metadata attached to an entity describing its location (URL) and lifetime. If the result from querying a URL changes in a way that is significant to Unobtanium, a new entity-generation is created.
		&lt;&#x2F;dd&gt;
	

	
		&lt;dd&gt;
		Since &quot;entity-generation&quot; is pretty long, it is abbreviated to &lt;code&gt;eg&lt;&#x2F;code&gt; in some places.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;duplicate-types&#x2F;#exact-duplicates&quot;&gt;Exact Duplicate&lt;&#x2F;a&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		A page with exactly the same main content as another.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;patience-and-fluke-events&#x2F;#fluke-events&quot;&gt;Fluke-Event&lt;&#x2F;a&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		An unlikely and temporary error where immediately retrying is a valid strategy.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;patience-and-fluke-events&#x2F;#patience&quot;&gt;Patience&lt;&#x2F;a&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Patience is a metric in the crawler, represented by a countdown, that makes the crawler eventually give up on unreachable or broken origins.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;concept&#x2F;duplicate-types&#x2F;#self-duplicates&quot;&gt;Self-Duplicate&lt;&#x2F;a&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Self-duplicates are exact duplicates that come from the same URL. They are used as an indicator of whether a page has changed or not.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

&lt;h2 id=&quot;terms-referring-to-actions-inside-unobtanium&quot;&gt;Terms referring to actions inside Unobtanium&lt;&#x2F;h2&gt;
&lt;dl class=&quot;max-one-dd&quot;&gt;

	
		&lt;dt&gt;
		&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;crawl-loop&#x2F;&quot;&gt;Crawling&lt;&#x2F;a&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Crawling is the process of collecting resources.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		Scraping
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Scraping is the process of extracting useful data from raw data that is mainly aimed at human readers. The raw data usually comes from a crawling step.
		&lt;&#x2F;dd&gt;
	

	

	
		&lt;dt&gt;
		&lt;a href=&quot;https:&#x2F;&#x2F;doc.unobtanium.rocks&#x2F;algorithm&#x2F;summarizing&#x2F;&quot;&gt;Summarizing&lt;&#x2F;a&gt;
		&lt;&#x2F;dt&gt;
	

	
		&lt;dd&gt;
		Summarizing is turning a lot of raw information into a usable and searchable summary. This usually involves scraping, but also other steps.
		&lt;&#x2F;dd&gt;
	

&lt;&#x2F;dl&gt;

</content>
        
    </entry>
</feed>
