The base database schema is shared between the crawler database and the summary database and provides basic facilities for storing data that is always needed in the context of a search engine.
Note: While the base database schema is often referred to as "The base database", it is never a standalone database. It's always embedded into another database.
Tables in the base database are:
unobtanium_database_info
- 🔧 Under Development: Will be used to store global database metadata.
origin
- URL Origins, queryable by component
url
- URLs, queryable by component
mimetype
- Mimetypes/Mediatypes
mime_parameter
- Extends the mimetype table with SQL-queryable parameter information
Tables
unobtanium_database_info
Added in version 0.1.0, the unobtanium_database_info table has the following fields:
key
- [Text][Primary Key]
value
- [Text] "Schemaless" configuration data
The main purpose of this table will be storing global configuration information about the database and its schema.
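A minimal sketch of how such a key/value table could be created and used, assuming SQLite as the storage engine. The DDL is inferred from the field list above, and the helper names `set_info` and `get_info` as well as the key `base_schema_version` are hypothetical.

```python
import sqlite3

# Assumption: SQLite storage; the column types follow the field list above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE unobtanium_database_info (
        key   TEXT PRIMARY KEY,
        value TEXT
    )
    """
)


def set_info(conn, key, value):
    """Insert a metadata entry, overwriting any existing value for the key."""
    conn.execute(
        "INSERT OR REPLACE INTO unobtanium_database_info (key, value) VALUES (?, ?)",
        (key, value),
    )


def get_info(conn, key, default=None):
    """Return the stored value for the key, or the default if it is missing."""
    row = conn.execute(
        "SELECT value FROM unobtanium_database_info WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else default


# Hypothetical key; the documentation does not define any keys yet.
set_info(conn, "base_schema_version", "0.1.0")
```

Since the value column is schemaless text, any structure (numbers, JSON, version strings) would have to be interpreted by the reader of the entry.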
origin
The origin table has the following fields:
origin_id
- [Integer][Primary Key]
port
- [Integer][Null] The port as specified in the URL or a well known default
scheme
- [Varchar] The scheme as specified in the URL
host
- [Varchar][Null] The hostname as specified in the URL, if present
str_origin
- [Text][Unique] The origin as a string, as if it was a URL consisting only of scheme, host and port. Its main purpose is deduplication.
crawl_delay_ms
- [Integer][Null] Delay in milliseconds that the crawler should wait between requests, taken from robots.txt
- Deprecated since 0.0.0: This is crawler specific information that shouldn't be part of the base database.
- Removed in 0.1.0
A unique constraint across the fields port, scheme and host existed up to version 0.0.0 and was removed in version 0.1.0.
See the Origin data type for more information.
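The deduplication role of str_origin can be sketched as a get-or-create lookup. This is an illustration under assumptions, not the project's implementation: it assumes SQLite, the NOT NULL markers are guesses, and `format_origin` is a hypothetical serialisation (the real string form is defined by the Origin data type).

```python
import sqlite3

# Assumption: SQLite storage; DDL inferred from the field list (version 0.1.0).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE origin (
        origin_id  INTEGER PRIMARY KEY,
        port       INTEGER,
        scheme     VARCHAR NOT NULL,
        host       VARCHAR,
        str_origin TEXT NOT NULL UNIQUE
    )
    """
)


def format_origin(scheme, host, port):
    """Hypothetical string form of an origin; not the project's actual format."""
    if host is None:
        return f"{scheme}:"
    if port is None:
        return f"{scheme}://{host}"
    return f"{scheme}://{host}:{port}"


def get_or_create_origin(conn, scheme, host, port):
    """Return the origin_id for the origin, inserting a row only if it is new."""
    str_origin = format_origin(scheme, host, port)
    # The unique str_origin column makes a repeated insert a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO origin (port, scheme, host, str_origin) "
        "VALUES (?, ?, ?, ?)",
        (port, scheme, host, str_origin),
    )
    row = conn.execute(
        "SELECT origin_id FROM origin WHERE str_origin = ?", (str_origin,)
    ).fetchone()
    return row[0]


first = get_or_create_origin(conn, "https", "example.org", 443)
second = get_or_create_origin(conn, "https", "example.org", 443)
# first and second refer to the same row.
```

The component columns stay available for queries such as "all origins on a given host", while the string column carries the uniqueness guarantee.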
url
The url table has the following fields:
url_id
- [Integer][Primary Key]
origin_id
- [Integer] Reference into the origin table
path
- [Text][Null] Path part of the URL, if present
query
- [Text][Null] Query part of the URL, if present
username
- [Text][Null] Username part of the URL, if present
password
- [Text][Null] Password part of the URL, if present
str_url
- [Text][Unique] String representation of the whole URL for easy retrieval
The url table stores URLs that point to documents. Notably, it does not store the fragment part of a URL, since that points to somewhere inside a document.
Fragment fields are stored together with URL ids where appropriate.
See the URL data type for more information.
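As an illustration of the split described above, the following sketch breaks a URL into the url table columns and returns the fragment separately. It uses Python's standard `urllib.parse` rather than the project's own URL type, and the function name `url_columns` is hypothetical. The origin_id lookup is left out.

```python
from urllib.parse import urlsplit, urlunsplit


def url_columns(raw_url):
    """Split a URL into url-table columns plus the separately kept fragment."""
    parts = urlsplit(raw_url)
    columns = {
        # Empty components are mapped to None to mirror the nullable columns.
        # Simplification: an empty query ("?") and a missing query both become None.
        "path": parts.path or None,
        "query": parts.query or None,
        "username": parts.username,
        "password": parts.password,
        # String form of the whole URL with the fragment stripped.
        "str_url": urlunsplit(parts._replace(fragment="")),
    }
    return columns, (parts.fragment or None)


columns, fragment = url_columns("https://example.org/docs/page?x=1#intro")
# columns["str_url"] is "https://example.org/docs/page?x=1", fragment is "intro"
```

Two links that differ only in their fragment therefore map to the same url row, and the fragment travels alongside the URL id in whichever table needs it.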
mimetype
The mimetype table has the following fields:
mimetype_id
- [Integer][Primary Key]
mime_type
- [Text] The part before the /
mime_subtype
- [Text] The part after the /
mime_suffix
- [Text][Null] The part after the +, if present
charset
- [Text][Null] The charset parameter. It is stored here since it is by far the most common parameter and web servers love sending it.
str_mimetype
- [Text][Unique] String representation of the whole mimetype with the parameters sorted in alphabetical order.
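The following sketch shows one way a mimetype string could be broken into these columns. Several details are assumptions that the field list does not settle: mime_subtype is taken to include the suffix (everything after the /), charset is assumed to also appear in str_mimetype, parameters are joined with "; ", and the function name `parse_mimetype` is hypothetical. The parser is deliberately naive and does not handle semicolons inside quoted parameter values.

```python
def parse_mimetype(raw):
    """Split a mimetype string into the columns of the mimetype table."""
    essence, *raw_params = [piece.strip() for piece in raw.split(";")]
    mime_type, _, mime_subtype = essence.lower().partition("/")
    # rpartition returns an empty separator when the subtype has no "+".
    _, plus, suffix = mime_subtype.rpartition("+")

    params = {}
    for piece in raw_params:
        key, equals, value = piece.partition("=")
        if equals:
            params[key.strip().lower()] = value.strip().strip('"')

    # Sorting the parameters gives equal mimetypes an equal string form.
    str_mimetype = "; ".join(
        [f"{mime_type}/{mime_subtype}"]
        + [f"{key}={value}" for key, value in sorted(params.items())]
    )
    return {
        "mime_type": mime_type,
        "mime_subtype": mime_subtype,
        "mime_suffix": suffix if plus else None,
        "charset": params.get("charset"),
        "str_mimetype": str_mimetype,
        "parameters": params,
    }


parsed = parse_mimetype("Image/SVG+XML; q=1; charset=UTF-8")
# parsed["str_mimetype"] is "image/svg+xml; charset=UTF-8; q=1"
```

Sorting before serialising is what lets the unique str_mimetype column deduplicate values that differ only in parameter order.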
mime_parameter
The mime_parameter table has the following fields:
mime_parameter_id
- [Integer][Primary Key] Internal database id for the mime parameter, not used.
mimetype_id
- [Integer] Reference into the mimetype table.
key
- [Text] Key of the mimetype parameter
value
- [Text] Value of the mimetype parameter
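To illustrate what "SQL-queryable" parameters allow, the sketch below stores a mimetype together with its parameters and then finds mimetypes by parameter with a join. It assumes SQLite; the DDL is inferred from the field lists, the NOT NULL and REFERENCES clauses are guesses, and the helper names are hypothetical.

```python
import sqlite3

# Assumption: SQLite storage; constraints beyond the field lists are guesses.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE mimetype (
        mimetype_id  INTEGER PRIMARY KEY,
        mime_type    TEXT NOT NULL,
        mime_subtype TEXT NOT NULL,
        mime_suffix  TEXT,
        charset      TEXT,
        str_mimetype TEXT NOT NULL UNIQUE
    );
    CREATE TABLE mime_parameter (
        mime_parameter_id INTEGER PRIMARY KEY,
        mimetype_id       INTEGER NOT NULL REFERENCES mimetype(mimetype_id),
        key               TEXT NOT NULL,
        value             TEXT NOT NULL
    );
    """
)


def add_mimetype(conn, mime_type, mime_subtype, str_mimetype, parameters):
    """Insert a mimetype row plus one mime_parameter row per parameter."""
    cursor = conn.execute(
        "INSERT INTO mimetype (mime_type, mime_subtype, str_mimetype) "
        "VALUES (?, ?, ?)",
        (mime_type, mime_subtype, str_mimetype),
    )
    mimetype_id = cursor.lastrowid
    conn.executemany(
        "INSERT INTO mime_parameter (mimetype_id, key, value) VALUES (?, ?, ?)",
        [(mimetype_id, key, value) for key, value in parameters.items()],
    )
    return mimetype_id


def mimetypes_with_parameter(conn, key, value):
    """Return the string forms of all mimetypes carrying the given parameter."""
    rows = conn.execute(
        """
        SELECT m.str_mimetype
        FROM mimetype AS m
        JOIN mime_parameter AS p ON p.mimetype_id = m.mimetype_id
        WHERE p.key = ? AND p.value = ?
        ORDER BY m.str_mimetype
        """,
        (key, value),
    ).fetchall()
    return [row[0] for row in rows]


add_mimetype(conn, "text", "plain", "text/plain; format=flowed", {"format": "flowed"})
matches = mimetypes_with_parameter(conn, "format", "flowed")
```

Without the separate table, the same question would need string matching against str_mimetype.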
Version history
0.0.0 - The last unversioned
Version 0.0.0 represents the last unversioned version of the database schema.
0.1.0 - Basic cleanup (2025-04-10)
Version 0.1.0 cleans up some historic design choices.
- For the origin table, add a unique str_origin and remove the unique constraint from scheme, host and port, which had some edge case problems because of host and port being nullable.
- Remove the crawl_delay_ms column from the origin table. It stems from a very early version of the crawler and is unused.
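One plausible example of such an edge case, assuming SQLite: NULL values count as distinct from each other in a UNIQUE constraint, so a constraint over nullable columns does not prevent duplicate rows. The sketch below contrasts the two designs; the table names `origin_old` and `origin_new` and the string `file:` are hypothetical, and the changelog does not say that this is the exact problem that was encountered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Pre-0.1.0 style: uniqueness over columns that may be NULL.
    CREATE TABLE origin_old (
        origin_id INTEGER PRIMARY KEY,
        port      INTEGER,
        scheme    VARCHAR NOT NULL,
        host      VARCHAR,
        UNIQUE (port, scheme, host)
    );
    -- 0.1.0 style: uniqueness over a single string column.
    CREATE TABLE origin_new (
        origin_id  INTEGER PRIMARY KEY,
        port       INTEGER,
        scheme     VARCHAR NOT NULL,
        host       VARCHAR,
        str_origin TEXT NOT NULL UNIQUE
    );
    """
)

# Insert the same host-less, port-less origin twice into each table.
for _ in range(2):
    conn.execute(
        "INSERT OR IGNORE INTO origin_old (port, scheme, host) "
        "VALUES (NULL, 'file', NULL)"
    )
    conn.execute(
        "INSERT OR IGNORE INTO origin_new (port, scheme, host, str_origin) "
        "VALUES (NULL, 'file', NULL, 'file:')"
    )

old_rows = conn.execute("SELECT COUNT(*) FROM origin_old").fetchone()[0]
new_rows = conn.execute("SELECT COUNT(*) FROM origin_new").fetchone()[0]
# old_rows is 2 (the duplicate slipped through), new_rows is 1
```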