Data: Base Database

The base database schema is shared between the crawler database and the summary database. It provides basic facilities for storing data that is always needed in the context of a search engine.

Note: While the base database schema is often referred to as "the base database", it is never a standalone database. It is always embedded into another database.

Tables in the base database are:

unobtanium_database_info
🔧 Under Development: Will be used to store global database metadata.
origin
URL Origins, queryable by component
url
URLs, queryable by component
mimetype
Mimetypes/Mediatypes
mime_parameter
Extends the mimetype table with SQL queryable parameter information

Tables

unobtanium_database_info

The unobtanium_database_info table has the following fields:

key
[Text][Primary Key]
value
[Text] "Schemaless" configuration data

The main purpose of this table will be to store global configuration information about the database and its schema.

origin

The origin table has the following fields:

origin_id
[Integer][Primary Key]
port
[Integer][Null] The port as specified in the URL or a well known default
scheme
[Varchar] The scheme as specified in the URL
host
[Varchar][Null] The hostname as specified in the URL, if present
crawl_delay_ms
[Integer][Null] Delay in milliseconds that the crawler should wait between requests, taken from robots.txt
Deprecated since 0.0.0: This is crawler specific information that shouldn't be part of the base database.

There is a unique constraint across the fields port, scheme and host.

See the Origin data type for more information.
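
As a rough illustration of the field list and the unique constraint above, the following sketch builds an origin table in an in-memory SQLite database. The DDL is an assumption reconstructed from the field descriptions (the actual schema definition, the NOT NULL choices and the storage engine may differ), and insert_origin is a hypothetical helper.

```python
import sqlite3

# Hypothetical DDL: column names follow the field list above, but the exact
# types, NOT NULL choices and constraint syntax are assumptions.
ORIGIN_DDL = """
CREATE TABLE origin (
    origin_id      INTEGER PRIMARY KEY,
    port           INTEGER,
    scheme         VARCHAR NOT NULL,
    host           VARCHAR,
    crawl_delay_ms INTEGER,
    UNIQUE (port, scheme, host)
)
"""


def insert_origin(conn, scheme, host, port):
    """Insert one origin and return its id (hypothetical helper).

    Raises sqlite3.IntegrityError if the (port, scheme, host) triple exists.
    """
    cur = conn.execute(
        "INSERT INTO origin (port, scheme, host) VALUES (?, ?, ?)",
        (port, scheme, host),
    )
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.execute(ORIGIN_DDL)
first_id = insert_origin(conn, "https", "example.org", 443)

# A second insert of the same triple violates the unique constraint.
try:
    insert_origin(conn, "https", "example.org", 443)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

If the backing engine is SQLite, note that NULL values count as distinct inside a unique constraint, so this constraint alone would not deduplicate rows whose port or host is NULL.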

url

The url table has the following fields:

url_id
[Integer][Primary Key]
origin_id
[Integer] Reference into the origin table
path
[Text][Null] Path part of the URL, if present
query
[Text][Null] Query part of the URL, if present
username
[Text][Null] Username part of the URL, if present
password
[Text][Null] Password part of the URL, if present
str_url
[Text][Unique] String representation of the whole URL for easy retrieval

The url table stores URLs that point to documents. Notably, it does not store the fragment part of a URL, since the fragment points to a location inside a document.

Fragment fields are stored together with URL ids where appropriate.

See the URL data type for more information.
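
The split into components, with the fragment kept out of the url row, could look roughly like the sketch below. It uses Python's standard urllib.parse and is not the project's own URL handling; url_row is a hypothetical helper, and mapping empty components to None (Null) is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def url_row(raw_url):
    """Split a URL into url-table fields and return the fragment separately.

    Hypothetical helper: empty components are mapped to None (an assumption).
    """
    parts = urlsplit(raw_url)
    # Rebuild the string form without the fragment, since the fragment
    # points inside a document rather than at the document itself.
    str_url = urlunsplit(
        (parts.scheme, parts.netloc, parts.path, parts.query, "")
    )
    row = {
        "path": parts.path or None,
        "query": parts.query or None,
        "username": parts.username,
        "password": parts.password,
        "str_url": str_url,
    }
    return row, (parts.fragment or None)


row, fragment = url_row(
    "https://user:secret@example.org/docs/page?lang=en#intro"
)
```

Here row["str_url"] carries no fragment, and the fragment is returned on its own so that a caller can store it next to the URL id where needed.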

mimetype

The mimetype table has the following fields:

mimetype_id
[Integer][Primary Key]
mime_type
[Text] The part before the /
mime_subtype
[Text] The part after the /
mime_suffix
[Text][Null] The part after the + if present
charset
[Text][Null] The charset parameter. It has its own field because it is by far the most common parameter and web servers love sending it.
str_mimetype
[Text][Unique] String representation of the whole mimetype with the parameters sorted in alphabetical order.
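
A minimal sketch of how a raw mimetype string might be broken into these fields is shown below. parse_mimetype is a hypothetical helper, not the project's parser. Several details are assumptions: the essence and parameter keys are lowercased, mime_subtype keeps everything after the / (including any suffix), and the canonical string joins parameters with "; ".

```python
def parse_mimetype(raw):
    """Split a mimetype string into mimetype-table fields (hypothetical).

    Simplification: parameters are split on ';', so quoted values that
    contain a semicolon are not handled.
    """
    essence, *raw_params = [part.strip() for part in raw.split(";")]
    essence = essence.lower()
    mime_type, _, mime_subtype = essence.partition("/")
    # The suffix is whatever follows the last '+' in the subtype, if any.
    _, plus, suffix = mime_subtype.rpartition("+")

    params = {}
    for item in raw_params:
        key, sep, value = item.partition("=")
        if sep:
            params[key.strip().lower()] = value.strip().strip('"')

    # Canonical form: parameters sorted alphabetically by key.
    # The "; " separator is an assumption.
    canonical = "; ".join(
        [essence] + [f"{k}={v}" for k, v in sorted(params.items())]
    )
    row = {
        "mime_type": mime_type,
        "mime_subtype": mime_subtype,
        "mime_suffix": suffix if plus else None,
        "charset": params.get("charset"),
        "str_mimetype": canonical,
    }
    return row, params


row, params = parse_mimetype("Application/LD+JSON; profile=x; charset=UTF-8")
```

Sorting the parameters means that two headers differing only in parameter order map to the same str_mimetype, which is what makes the unique constraint on that field useful.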

mime_parameter

The mime_parameter table has the following fields:

mime_parameter_id
[Integer][Primary Key] Internal database id for the mime parameter, not used.
mimetype_id
[Integer] Reference into the mimetype table.
key
[Text] Key of the mimetype parameter
value
[Text] Value of the mimetype parameter
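
To show what "SQL queryable parameter information" can mean in practice, the sketch below joins mime_parameter against mimetype in an in-memory SQLite database. The DDL is an assumption derived from the field lists (types, NOT NULL choices and the foreign key are guesses), and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical DDL reconstructed from the field descriptions.
conn.executescript(
    """
    CREATE TABLE mimetype (
        mimetype_id  INTEGER PRIMARY KEY,
        mime_type    TEXT NOT NULL,
        mime_subtype TEXT NOT NULL,
        mime_suffix  TEXT,
        charset      TEXT,
        str_mimetype TEXT UNIQUE
    );
    CREATE TABLE mime_parameter (
        mime_parameter_id INTEGER PRIMARY KEY,
        mimetype_id       INTEGER NOT NULL REFERENCES mimetype (mimetype_id),
        key               TEXT NOT NULL,
        value             TEXT NOT NULL
    );
    """
)

# Illustrative sample data, not taken from a real crawl.
conn.executemany(
    "INSERT INTO mimetype VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "text", "plain", None, None, "text/plain; format=flowed"),
        (2, "text", "plain", None, None, "text/plain; format=fixed"),
    ],
)
conn.executemany(
    "INSERT INTO mime_parameter (mimetype_id, key, value) VALUES (?, ?, ?)",
    [(1, "format", "flowed"), (2, "format", "fixed")],
)

# Find all mimetypes that carry a given parameter value.
rows = conn.execute(
    """
    SELECT m.str_mimetype
    FROM mimetype AS m
    JOIN mime_parameter AS p ON p.mimetype_id = m.mimetype_id
    WHERE p.key = ? AND p.value = ?
    """,
    ("format", "flowed"),
).fetchall()
```

Without the mime_parameter table, the same lookup would need string matching against str_mimetype.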

Version history

0.0.0 - The last unversioned

Version 0.0.0 represents the last unversioned version of the database schema.