Base schema as of release 3.0

Overview

Tables in the base database are:

unobtanium_database_info
Global configuration information about the database and its schema.
Since 0.1.0
origin
URL Origins, queryable by component.
url
The url table stores URLs that point to documents. Notably, it does not store the fragment part of a URL, since the fragment points to a location inside a document. Fragment fields are stored together with URL ids where appropriate.
mimetype
Mimetypes/Mediatypes
mime_parameter
Extends the mimetype table with SQL queryable parameter information.

Tables

unobtanium_database_info

Global configuration information about the database and its schema.

The unobtanium_database_info table has the following fields:

key
Text Primary Key
value
Text "Schemaless" configuration data
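
The key/value layout above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the project's actual DDL: the CREATE TABLE statement is reconstructed from the field list, and the helper names set_info and get_info as well as the key "example_key" are made up for this example.

```python
import sqlite3

# Hypothetical DDL reconstructed from the field list above;
# the real schema may declare additional constraints.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE unobtanium_database_info (key TEXT PRIMARY KEY, value TEXT)"
)


def set_info(conn, key, value):
    """Insert a configuration entry, overwriting any existing value for key."""
    conn.execute(
        "INSERT OR REPLACE INTO unobtanium_database_info (key, value) VALUES (?, ?)",
        (key, value),
    )


def get_info(conn, key, default=None):
    """Return the stored value for key, or default if the key is absent."""
    row = conn.execute(
        "SELECT value FROM unobtanium_database_info WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else default


# "example_key" is an invented key, used only to show the round trip.
set_info(conn, "example_key", "example value")
```

Because value is plain text, any structure inside it (numbers, JSON, version strings) has to be interpreted by the reader of the entry.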

origin

URL Origins, queryable by component.

The origin table has the following fields:

origin_id
Integer Primary Key
port
Integer Null The port as specified in the URL.
Null means a well known default.
scheme
Varchar The scheme as specified in the URL.
host
Varchar Null The hostname as specified in the URL.
Null means that no hostname was part of the URL.
str_origin
Text Unique The origin as a string, as if it were a URL consisting only of scheme, host and port. Its main purpose is deduplication.
crawl_delay_ms
Integer Delay in milliseconds that the crawler should wait between requests, taken from robots.txt.
🔧 Deprecated since 0.0.0: This is crawler-specific information that shouldn't be part of the base database.
Removed in 0.1.0
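
As a rough illustration of how the origin fields relate to a URL, the following sketch derives them with Python's urllib.parse. The function name origin_fields and the exact str_origin formatting are assumptions; the crawler's real normalization rules may differ (for example for IPv6 hosts, which this sketch does not handle).

```python
from urllib.parse import urlsplit


def origin_fields(url):
    """Derive origin-table values from a URL (hypothetical helper)."""
    parts = urlsplit(url)
    scheme = parts.scheme
    host = parts.hostname  # None when the URL has no host part
    port = parts.port      # None when no port was written in the URL

    # Assumed format: scheme, host and port only, as described for str_origin.
    str_origin = f"{scheme}:"
    if host is not None:
        str_origin += f"//{host}"
        if port is not None:
            str_origin += f":{port}"

    return {"scheme": scheme, "host": host, "port": port, "str_origin": str_origin}
```

Two URLs that differ only in path, query or fragment yield the same str_origin here, which is what makes the column usable for deduplication.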

Constraints:

url

The url table stores URLs that point to documents. Notably, it does not store the fragment part of a URL, since the fragment points to a location inside a document. Fragment fields are stored together with URL ids where appropriate.

The url table has the following fields:

url_id
Integer Primary Key
origin_id
Integer Reference into the origin table. Origin of the URL.
path
Text Null Path part of the URL.
Null means that no path was present.
query
Text Null Query part of the URL.
Null means that no query was present.
username
Text Null Username part of the URL.
Null means that no username was present.
password
Text Null Password part of the URL.
Null means that no password was present.
str_url
Text Unique String representation of the whole URL for easy retrieval.
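
The split between the stored URL and its fragment can be sketched as follows. The helper name url_fields is hypothetical, and the mapping of empty components to None is an assumption based on the "Null means ... not present" notes above. Python's urlsplit cannot distinguish an empty query ("?") from an absent one, so this sketch treats both as absent.

```python
from urllib.parse import urlsplit, urlunsplit


def url_fields(url):
    """Split a URL into url-table values plus the fragment (hypothetical helper)."""
    parts = urlsplit(url)
    # Rebuild the URL with an empty fragment; the fragment is returned separately
    # so a caller can store it next to the URL id where appropriate.
    str_url = urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))
    return {
        "path": parts.path or None,
        "query": parts.query or None,
        "username": parts.username,  # None when no username is present
        "password": parts.password,  # None when no password is present
        "str_url": str_url,
        "fragment": parts.fragment or None,
    }
```

With this sketch, "https://example.org/docs#intro" and "https://example.org/docs#usage" produce the same str_url and differ only in the returned fragment.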

mimetype

Mimetypes/Mediatypes

The mimetype table has the following fields:

mimetype_id
Integer Primary Key
mime_type
Text The part before the /.
mime_subtype
Text The part after the /.
mime_suffix
Text Null The part after the +.
Null means that this part is not present.
charset
Text Null The charset parameter. It is stored here because it is by far the most common parameter and web servers love sending it.
Null means that no charset parameter was present.
str_mimetype
Text Unique String representation of the whole mimetype with the parameters sorted in alphabetical order.
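
To make the field breakdown concrete, here is a minimal sketch that splits a Content-Type style value into the columns above. Several details are assumptions rather than documented behavior: that mime_subtype is stored without the suffix, that names are lowercased, that charset also appears in str_mimetype, and the "; key=value" rendering. The function name mimetype_fields is hypothetical. The naive split on ";" does not handle quoted parameter values that contain a semicolon.

```python
def mimetype_fields(value):
    """Split a mimetype string into mimetype-table values (hypothetical helper)."""
    essence, *raw_params = [part.strip() for part in value.split(";")]
    essence = essence.lower()

    mime_type, _, rest = essence.partition("/")
    # Assumption: the suffix is whatever follows the last "+" in the subtype.
    subtype, plus, suffix = rest.rpartition("+")
    if not plus:
        subtype, suffix = rest, None

    params = {}
    for raw in raw_params:
        key, eq, val = raw.partition("=")
        if eq:  # skip empty or malformed segments
            params[key.strip().lower()] = val.strip().strip('"')

    # Parameters sorted alphabetically by key, as described for str_mimetype.
    rendered = "".join(f"; {k}={params[k]}" for k in sorted(params))
    return {
        "mime_type": mime_type,
        "mime_subtype": subtype,
        "mime_suffix": suffix,
        "charset": params.get("charset"),
        "parameters": params,
        "str_mimetype": essence + rendered,
    }
```

Sorting the parameters means that values differing only in parameter order map to the same str_mimetype, so the Unique column can deduplicate them.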

mime_parameter

Extends the mimetype table with SQL queryable parameter information.

The mime_parameter table has the following fields:

mime_parameter_id
Text Primary Key Internal database id for the mime parameter; not used.
mimetype_id
Integer Reference into the mimetype table.
key
Text Key of the mimetype parameter.
value
Text Value of the mimetype parameter.
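
The following sketch shows the kind of query the mime_parameter table is meant to enable: finding mimetypes by parameter with plain SQL. The DDL is reconstructed from the field lists and is an assumption, as are the sample rows and the helper name mimetypes_with_parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical DDL derived from the field lists; the real schema may differ.
conn.executescript(
    """
    CREATE TABLE mimetype (
        mimetype_id INTEGER PRIMARY KEY,
        mime_type TEXT,
        mime_subtype TEXT,
        mime_suffix TEXT,
        charset TEXT,
        str_mimetype TEXT UNIQUE
    );
    CREATE TABLE mime_parameter (
        mime_parameter_id TEXT PRIMARY KEY,
        mimetype_id INTEGER REFERENCES mimetype(mimetype_id),
        key TEXT,
        value TEXT
    );
    -- Invented sample rows, for illustration only.
    INSERT INTO mimetype VALUES (1, 'text', 'plain', NULL, NULL, 'text/plain; format=flowed');
    INSERT INTO mimetype VALUES (2, 'text', 'plain', NULL, NULL, 'text/plain; format=fixed');
    INSERT INTO mime_parameter VALUES ('p1', 1, 'format', 'flowed');
    INSERT INTO mime_parameter VALUES ('p2', 2, 'format', 'fixed');
    """
)


def mimetypes_with_parameter(conn, key, value):
    """Return str_mimetype of every mimetype carrying the given parameter."""
    rows = conn.execute(
        "SELECT m.str_mimetype FROM mimetype AS m "
        "JOIN mime_parameter AS p ON p.mimetype_id = m.mimetype_id "
        "WHERE p.key = ? AND p.value = ? "
        "ORDER BY m.str_mimetype",
        (key, value),
    ).fetchall()
    return [row[0] for row in rows]
```

Without the separate table, the same lookup would need string matching against str_mimetype, which is harder to index and easier to get wrong.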