Data: Base Database

The base database schema is shared between the crawler database and the summary database and provides basic facilities for storing data that is always needed in the context of a search engine.

Note: While the base database schema is often referred to as "the base database", it is never a standalone database. It is always embedded into another database.

Tables in the base database are:

unobtanium_database_info
Used to store global database metadata.
origin
URL Origins, queryable by component
url
URLs, queryable by component
mimetype
Mimetypes/Mediatypes
mime_parameter
Extends the mimetype table with SQL queryable parameter information

Tables

unobtanium_database_info

Added in version 0.1.0, the unobtanium_database_info table has the following fields:

key
Text Primary Key
value
Text "Schemaless" configuration data

The main purpose of this table is storing global configuration information about the database and its schema.

See also: Database Info
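
As a rough illustration of how such a key/value table can be used, here is a minimal sketch with Python's built-in sqlite3 module. The DDL is reconstructed from the field list above and may differ from the real schema. The helper names set_info and get_info and the key example_schema_version are made up for this example.

```python
import sqlite3

# Hypothetical DDL derived from the field list above; the real schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE unobtanium_database_info ("
    " key TEXT PRIMARY KEY,"
    " value TEXT"
    ")"
)


def set_info(conn, key, value):
    """Insert a metadata entry, overwriting any existing value for the key."""
    conn.execute(
        "INSERT OR REPLACE INTO unobtanium_database_info (key, value)"
        " VALUES (?, ?)",
        (key, value),
    )


def get_info(conn, key):
    """Return the stored value for the key, or None if it is not set."""
    row = conn.execute(
        "SELECT value FROM unobtanium_database_info WHERE key = ?",
        (key,),
    ).fetchone()
    return row[0] if row else None


# "example_schema_version" is an illustrative key, not a documented one.
set_info(conn, "example_schema_version", "0.0.0")
set_info(conn, "example_schema_version", "0.1.0")
print(get_info(conn, "example_schema_version"))
```

Because key is the primary key, writing the same key twice leaves a single row holding the latest value.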

origin

The origin table has the following fields:

origin_id
Integer Primary Key
port
Integer Null The port as specified in the URL or a well-known default
scheme
Varchar The scheme as specified in the URL
host
Varchar Null The hostname as specified in the URL, if present
str_origin
Text Unique The origin as a string, as if it were a URL consisting only of scheme, host and port. Its main purpose is deduplication.
crawl_delay_ms
Integer Null Delay in milliseconds that the crawler should wait between requests, taken from robots.txt
Deprecated since 0.0.0: This is crawler specific information that shouldn't be part of the base database.
Removed in 0.1.0

There is a unique constraint across the fields port, scheme and host. (Removed in Version 0.1.0)

See the Origin data type for more information.
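
To make the relationship between the fields more concrete, the following sketch derives them from a URL using Python's urllib.parse. It is only an approximation: the DEFAULT_PORTS mapping, the function name origin_fields and the exact text format of str_origin are assumptions, and the real implementation may normalize differently (for example for IPv6 hosts).

```python
from urllib.parse import urlsplit

# Assumed well-known defaults; the real implementation may know more schemes.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_fields(url):
    """Split a URL into the origin columns described above (sketch only)."""
    parts = urlsplit(url)
    scheme = parts.scheme
    host = parts.hostname  # None when the URL has no host
    # Use the explicit port if present, else fall back to a known default.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)

    # Assumed string form: a URL made of scheme, host and port only.
    str_origin = f"{scheme}://{host or ''}"
    if port is not None:
        str_origin += f":{port}"

    return {
        "scheme": scheme,
        "host": host,
        "port": port,
        "str_origin": str_origin,
    }


a = origin_fields("https://example.org/a")
b = origin_fields("https://example.org:443/b?c=d")
print(a["str_origin"] == b["str_origin"])
```

Both example URLs map to the same str_origin, which is what makes the unique column usable for deduplication.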

url

The url table has the following fields:

url_id
Integer Primary Key
origin_id
Integer Reference into the origin table
path
Text Null Path part of the URL, if present
query
Text Null Query part of the URL, if present
username
Text Null Username part of the URL, if present
password
Text Null Password part of the URL, if present
str_url
Text Unique String representation of the whole URL for easy retrieval

The url table stores URLs that point to documents. Notably, it does not store the fragment part of a URL, since a fragment points to a location inside a document.

Fragment fields are stored together with URL ids where appropriate.

See the URL data type for more information.
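
The split into columns, including the dropped fragment, can be sketched with Python's urllib.parse. The function name url_fields is made up for this example, and mapping empty components to None is an assumption about how "if present" is handled. The origin_id lookup is left out.

```python
from urllib.parse import urlsplit, urlunsplit


def url_fields(url):
    """Split a URL into the url table columns described above (sketch only)."""
    parts = urlsplit(url)
    # Rebuild the URL without its fragment for the str_url column.
    str_url = urlunsplit(
        (parts.scheme, parts.netloc, parts.path, parts.query, "")
    )
    return {
        "path": parts.path or None,
        "query": parts.query or None,
        "username": parts.username,
        "password": parts.password,
        "str_url": str_url,
        # Not a column of the url table; returned so a caller can store it
        # next to the url_id where appropriate.
        "fragment": parts.fragment or None,
    }


fields = url_fields("https://user:pw@example.org/docs/page?lang=en#intro")
print(fields["str_url"])
```

Two URLs that differ only in their fragment produce the same str_url and would therefore share one row.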

mimetype

The mimetype table has the following fields:

mimetype_id
Integer Primary Key
mime_type
Text The part before the /
mime_subtype
Text The part after the /
mime_suffix
Text Null The part after the + if present
charset
Text Null The charset parameter. It has its own column because it is by far the most common parameter and web servers love sending it.
str_mimetype
Text Unique String representation of the whole mimetype with the parameters sorted in alphabetical order.
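
As an illustration of how a raw Content-Type style value could map onto these columns, here is a simplified sketch. Several details are assumptions: the function name mimetype_fields, that mime_subtype keeps the suffix (a literal reading of "the part after the /"), the "; " separator in str_mimetype, and the naive split on ";" that does not handle quoted values containing semicolons.

```python
def mimetype_fields(raw):
    """Split a mimetype string into the columns described above (sketch only)."""
    essence, *raw_params = [piece.strip() for piece in raw.split(";")]
    mime_type, _, mime_subtype = essence.lower().partition("/")
    # The suffix is whatever follows the last "+", if there is one.
    _, plus, suffix = mime_subtype.rpartition("+")

    params = {}
    for piece in raw_params:
        key, sep, value = piece.partition("=")
        if sep:  # skip empty or malformed pieces
            params[key.strip().lower()] = value.strip().strip('"')

    # Sorting the parameters makes the string independent of input order.
    ordered = sorted(params.items())
    str_mimetype = "; ".join(
        [f"{mime_type}/{mime_subtype}"] + [f"{k}={v}" for k, v in ordered]
    )
    return {
        "mime_type": mime_type,
        "mime_subtype": mime_subtype,
        "mime_suffix": suffix if plus else None,
        "charset": params.get("charset"),
        "str_mimetype": str_mimetype,
        "parameters": ordered,
    }


one = mimetype_fields("application/ld+json; profile=x; charset=utf-8")
two = mimetype_fields("application/ld+json;charset=utf-8;profile=x")
print(one["str_mimetype"] == two["str_mimetype"])
```

Since the parameters are sorted before the string is built, both inputs yield the same str_mimetype and the unique column can deduplicate them.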

mime_parameter

The mime_parameter table has the following fields:

mime_parameter_id
Integer Primary Key Internal database id for the mime parameter, not used.
mimetype_id
Integer Reference into the mimetype table.
key
Text Key of the mimetype parameter
value
Text Value of the mimetype parameter
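
The point of the mime_parameter table is that parameters become queryable with plain SQL. The following sqlite3 sketch shows one such lookup. The DDL is reconstructed from the field lists above (the NOT NULL constraints are assumptions), and the sample rows and the helper mimetypes_with_parameter are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical DDL derived from the field lists; the real schema may differ.
conn.executescript(
    """
    CREATE TABLE mimetype (
        mimetype_id INTEGER PRIMARY KEY,
        mime_type TEXT NOT NULL,
        mime_subtype TEXT NOT NULL,
        mime_suffix TEXT,
        charset TEXT,
        str_mimetype TEXT UNIQUE
    );
    CREATE TABLE mime_parameter (
        mime_parameter_id INTEGER PRIMARY KEY,
        mimetype_id INTEGER NOT NULL REFERENCES mimetype(mimetype_id),
        key TEXT NOT NULL,
        value TEXT NOT NULL
    );
    """
)

# Illustrative sample data.
conn.execute(
    "INSERT INTO mimetype VALUES (1, 'text', 'html', NULL, 'utf-8',"
    " 'text/html; charset=utf-8')"
)
conn.execute(
    "INSERT INTO mimetype VALUES (2, 'text', 'plain', NULL, 'utf-8',"
    " 'text/plain; charset=utf-8; format=flowed')"
)
conn.executemany(
    "INSERT INTO mime_parameter (mimetype_id, key, value) VALUES (?, ?, ?)",
    [(1, "charset", "utf-8"), (2, "charset", "utf-8"), (2, "format", "flowed")],
)


def mimetypes_with_parameter(conn, key, value):
    """Return str_mimetype of every mimetype carrying the given parameter."""
    rows = conn.execute(
        "SELECT m.str_mimetype FROM mimetype AS m"
        " JOIN mime_parameter AS p ON p.mimetype_id = m.mimetype_id"
        " WHERE p.key = ? AND p.value = ?"
        " ORDER BY m.mimetype_id",
        (key, value),
    ).fetchall()
    return [row[0] for row in rows]


print(mimetypes_with_parameter(conn, "format", "flowed"))
```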

Version history

0.0.0 - The last unversioned

Version 0.0.0 represents the last unversioned version of the database schema.

0.1.0 - Basic cleanup (2025-04-10)

Version 0.1.0 cleans up some historic design choices.