The base database schema is shared between the crawler database and the summary database and provides basic facilities for storing data that is always needed in the context of a search engine.
Note: While the base database schema is often referred to as "The base database", it is never a standalone database. It is always embedded into another database.
Tables in the base database are:
unobtanium_database_info
- Used to store global database metadata.
origin
- URL Origins, queryable by component
url
- URLs, queryable by component
mimetype
- Mimetypes/Mediatypes
mime_parameter
- Extends the mimetype table with SQL queryable parameter information
Tables
unobtanium_database_info
Added in version 0.1.0, the unobtanium_database_info table has the following fields:
The main purpose of this table is storing global configuration information about the database and its schema.
See also: Database Info
origin
The origin table has the following fields:
origin_id
- Integer Primary Key
port
- Integer Null The port as specified in the URL or a well known default
scheme
- Varchar The scheme as specified in the URL
host
- Varchar Null The hostname as specified in the URL, if present
str_origin
- Text Unique The origin as a string, as if it were a URL consisting only of scheme, host and port. Its main purpose is deduplication.
crawl_delay_ms
- Integer Null Delay in milliseconds that the crawler should wait between requests, taken from robots.txt
- Deprecated since 0.0.0: This is crawler specific information that shouldn't be part of the base database.
- Removed in 0.1.0
There was a unique constraint across the fields port, scheme and host (removed in version 0.1.0).
See the Origin data type for more information.
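The deduplication role of str_origin can be sketched as follows. This is a minimal illustration, assuming an SQLite backend; the DDL is reconstructed from the field list above, and the helper names origin_string and get_or_insert_origin as well as the exact string format of str_origin are hypothetical and not taken from the actual implementation.

```python
import sqlite3

# Hypothetical DDL reconstructed from the field list (schema version 0.1.0).
# The real schema may use different type names or constraints.
ORIGIN_DDL = """
CREATE TABLE origin (
    origin_id  INTEGER PRIMARY KEY,
    port       INTEGER NULL,
    scheme     VARCHAR NOT NULL,
    host       VARCHAR NULL,
    str_origin TEXT NOT NULL UNIQUE
)
"""


def origin_string(scheme, host, port):
    """Build a str_origin-like key (assumed format: scheme://host[:port])."""
    text = f"{scheme}://{host or ''}"
    if port is not None:
        text += f":{port}"
    return text


def get_or_insert_origin(conn, scheme, host, port):
    """Return the origin_id for the components, inserting a row if needed."""
    key = origin_string(scheme, host, port)
    # The unique str_origin column makes a repeated insert a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO origin (port, scheme, host, str_origin) "
        "VALUES (?, ?, ?, ?)",
        (port, scheme, host, key),
    )
    row = conn.execute(
        "SELECT origin_id FROM origin WHERE str_origin = ?", (key,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(ORIGIN_DDL)
first = get_or_insert_origin(conn, "https", "example.org", 443)
second = get_or_insert_origin(conn, "https", "example.org", 443)
other = get_or_insert_origin(conn, "file", None, None)
origin_count = conn.execute("SELECT COUNT(*) FROM origin").fetchone()[0]
```

Looking up the same origin twice yields the same origin_id, while an origin without host and port still gets exactly one row.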
url
The url table has the following fields:
url_id
- Integer Primary Key
origin_id
- Integer Reference into the origin table
path
- Text Null Path part of the URL, if present
query
- Text Null Query part of the URL, if present
username
- Text Null Username part of the URL, if present
password
- Text Null Password part of the URL, if present
str_url
- Text Unique String representation of the whole URL for easy retrieval
The url table stores URLs that point to documents. Notably, it does not store the fragment part of a URL, since that points to somewhere inside a document.
Fragment fields are stored together with URL ids where appropriate.
See the URL data type for more information.
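To illustrate how a URL maps onto these columns while the fragment is kept apart, here is a small sketch using Python's standard urllib.parse. The function name split_url and the returned dictionary layout are hypothetical; the origin_id lookup is left out, and the real implementation may normalize URLs differently.

```python
from urllib.parse import urlsplit, urlunsplit


def split_url(raw):
    """Split a URL into url-table-like columns plus a separate fragment."""
    parts = urlsplit(raw)
    # Rebuild the URL without the fragment for the str_url column.
    str_url = urlunsplit(
        (parts.scheme, parts.netloc, parts.path, parts.query, "")
    )
    row = {
        "path": parts.path or None,    # NULL when the URL has no path
        "query": parts.query or None,  # NULL when the URL has no query
        "username": parts.username,    # None when absent
        "password": parts.password,    # None when absent
        "str_url": str_url,
    }
    # The fragment is returned separately, to be stored next to a url_id.
    return row, (parts.fragment or None)


row, fragment = split_url(
    "https://user:pw@example.org/docs/page?lang=en#install"
)
```

Two links that differ only in their fragment produce the same str_url in this sketch, so they would share one row in the url table.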
mimetype
The mimetype table has the following fields:
mimetype_id
- Integer Primary Key
mime_type
- Text The part before the /
mime_subtype
- Text The part after the /
mime_suffix
- Text Null The part after the +, if present
charset
- Text Null The charset parameter. This is in here since it is by far the most common and webservers love sending it.
str_mimetype
- Text Unique String representation of the whole mimetype with the parameters sorted in alphabetical order.
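The following sketch shows one way a media type string could be broken down into these columns. It is an illustration only: the function name parse_mimetype is hypothetical, and several details are assumptions, namely that mime_subtype keeps the suffix, that type and parameter keys are lowercased, and that str_mimetype joins parameters as "; key=value".

```python
def parse_mimetype(raw):
    """Split a media type string into mimetype-table-like columns."""
    essence, *raw_params = [piece.strip() for piece in raw.split(";")]
    essence = essence.lower()
    mime_type, _, subtype = essence.partition("/")
    # The suffix is whatever follows the last "+" in the subtype, if any.
    _, plus, suffix = subtype.rpartition("+")
    params = {}
    for piece in raw_params:
        key, sep, value = piece.partition("=")
        if sep:  # skip empty or malformed pieces without "="
            params[key.strip().lower()] = value.strip().strip('"')
    # Sort parameters by key so equal media types render identically.
    rendered = "".join(f"; {k}={v}" for k, v in sorted(params.items()))
    return {
        "mime_type": mime_type,
        "mime_subtype": subtype,  # assumption: suffix stays in the subtype
        "mime_suffix": suffix if plus else None,
        "charset": params.get("charset"),
        "str_mimetype": essence + rendered,
        "parameters": params,  # candidates for mime_parameter rows
    }


feed = parse_mimetype("Application/Atom+XML; type=entry; charset=UTF-8")
page = parse_mimetype("text/html")
```

Because the parameters are sorted before rendering, two headers that list the same parameters in a different order end up with the same str_mimetype and therefore the same row.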
mime_parameter
The mime_parameter table has the following fields:
mime_parameter_id
- Integer Primary Key Internal database id for the mime parameter, not used.
mimetype_id
- Integer Reference into the mimetype table.
key
- Text Key of the mimetype parameter
value
- Text Value of the mimetype parameter
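As a rough illustration of what "SQL queryable" parameters allow, the sketch below finds all mimetypes carrying a given parameter. It assumes an SQLite backend; the DDL is reconstructed from the field lists, and the sample rows and the helper name mimetypes_with_parameter are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical DDL and sample data, reconstructed from the field lists.
conn.executescript("""
CREATE TABLE mimetype (
    mimetype_id  INTEGER PRIMARY KEY,
    mime_type    TEXT NOT NULL,
    mime_subtype TEXT NOT NULL,
    mime_suffix  TEXT NULL,
    charset      TEXT NULL,
    str_mimetype TEXT NOT NULL UNIQUE
);
CREATE TABLE mime_parameter (
    mime_parameter_id INTEGER PRIMARY KEY,
    mimetype_id       INTEGER NOT NULL REFERENCES mimetype (mimetype_id),
    key               TEXT NOT NULL,
    value             TEXT NOT NULL
);
INSERT INTO mimetype VALUES
    (1, 'text', 'html', NULL, NULL, 'text/html'),
    (2, 'text', 'plain', NULL, NULL, 'text/plain; format=flowed'),
    (3, 'application', 'atom+xml', 'xml', NULL,
     'application/atom+xml; type=entry');
INSERT INTO mime_parameter (mimetype_id, key, value) VALUES
    (2, 'format', 'flowed'),
    (3, 'type', 'entry');
""")


def mimetypes_with_parameter(conn, key, value):
    """Return str_mimetype for every mimetype that has the given parameter."""
    rows = conn.execute(
        """
        SELECT m.str_mimetype
        FROM mimetype AS m
        JOIN mime_parameter AS p ON p.mimetype_id = m.mimetype_id
        WHERE p.key = ? AND p.value = ?
        ORDER BY m.str_mimetype
        """,
        (key, value),
    ).fetchall()
    return [row[0] for row in rows]


flowed = mimetypes_with_parameter(conn, "format", "flowed")
```

Without the extra table, the same question would require string matching against str_mimetype.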
Version history
0.0.0 - The last unversioned
Version 0.0.0 represents the last unversioned version of the database schema.
0.1.0 - Basic cleanup (2025-04-10)
Version 0.1.0 cleans up some historic design choices.
- For the origin table, add a unique str_origin and remove the unique constraint from scheme, host and port, which had some edge case problems because of host and port being nullable.
- Remove the crawl_delay_ms column from the origin table. It stems from a very early version of the crawler and is unused.
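The text does not spell out which edge cases occurred. One well-known candidate, shown here under the assumption of an SQLite backend, is that NULL values count as distinct in a UNIQUE constraint, so rows with a NULL host or port are never treated as duplicates. The table names origin_old and origin_new and both DDL statements are hypothetical reconstructions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical pre-0.1.0 layout: uniqueness on the nullable components.
conn.execute("""
    CREATE TABLE origin_old (
        origin_id INTEGER PRIMARY KEY,
        port      INTEGER NULL,
        scheme    VARCHAR NOT NULL,
        host      VARCHAR NULL,
        UNIQUE (port, scheme, host)
    )
""")

# Hypothetical 0.1.0 layout: uniqueness on the string form instead.
conn.execute("""
    CREATE TABLE origin_new (
        origin_id  INTEGER PRIMARY KEY,
        port       INTEGER NULL,
        scheme     VARCHAR NOT NULL,
        host       VARCHAR NULL,
        str_origin TEXT NOT NULL UNIQUE
    )
""")

# Insert the same host-less, port-less origin twice into each table.
for _ in range(2):
    conn.execute(
        "INSERT OR IGNORE INTO origin_old (port, scheme, host) "
        "VALUES (NULL, 'file', NULL)"
    )
    conn.execute(
        "INSERT OR IGNORE INTO origin_new (port, scheme, host, str_origin) "
        "VALUES (NULL, 'file', NULL, 'file://')"
    )

old_rows = conn.execute("SELECT COUNT(*) FROM origin_old").fetchone()[0]
new_rows = conn.execute("SELECT COUNT(*) FROM origin_new").fetchone()[0]
```

In this sketch the old layout ends up with two rows for the same origin, because the NULLs defeat the composite constraint, while the str_origin constraint keeps a single row.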