The base database schema is shared between the crawler database and the summary database and provides basic facilities for storing data that is always needed in the context of a search engine.
Note: Although the base database schema is often referred to as "the base database", it is never a standalone database. It is always embedded in another database.
Tables in the base database are:
unobtanium_database_info
- 🔧 Under Development: Will be used to store global database metadata.
origin
- URL Origins, queryable by component
url
- URLs, queryable by component
mimetype
- Mimetypes/Mediatypes
mime_parameter
- Extends the mimetype table with SQL-queryable parameter information
Tables
unobtanium_database_info
The unobtanium_database_info table has the following fields:
key
- [Text][Primary Key]
value
- [Text] "Schemaless" configuration data
The main purpose of this table will be to store global configuration information about the database and its schema.
origin
The origin table has the following fields:
origin_id
- [Integer][Primary Key]
port
- [Integer][Null] The port as specified in the URL, or a well-known default
scheme
- [Varchar] The scheme as specified in the URL
host
- [Varchar][Null] The hostname as specified in the URL, if present
crawl_delay_ms
- [Integer][Null] Delay in milliseconds that the crawler should wait between requests, taken from robots.txt
- Deprecated since 0.0.0: This is crawler specific information that shouldn't be part of the base database.
There is a unique constraint across the fields port, scheme and host.
See the Origin data type for more information.
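The unique constraint described above can be sketched as follows. This is a minimal illustration, not the project's actual DDL: it assumes SQLite (through Python's sqlite3 module), guesses the column types from the field list, and uses a hypothetical helper named get_or_create_origin.

```python
import sqlite3

# Hypothetical DDL reconstructed from the field list above; the real
# schema may use different types, defaults, or constraint definitions.
ORIGIN_DDL = """
CREATE TABLE origin (
    origin_id      INTEGER PRIMARY KEY,
    port           INTEGER,
    scheme         VARCHAR NOT NULL,
    host           VARCHAR,
    crawl_delay_ms INTEGER,
    UNIQUE (port, scheme, host)
)
"""

def get_or_create_origin(conn, scheme, host, port):
    """Return the origin_id for (port, scheme, host), inserting it if missing."""
    # "IS" is SQLite's NULL-safe comparison, so NULL ports and hosts match too.
    row = conn.execute(
        "SELECT origin_id FROM origin "
        "WHERE port IS ? AND scheme = ? AND host IS ?",
        (port, scheme, host),
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute(
        "INSERT INTO origin (port, scheme, host) VALUES (?, ?, ?)",
        (port, scheme, host),
    )
    return cur.lastrowid

conn = sqlite3.connect(":memory:")
conn.execute(ORIGIN_DDL)
first = get_or_create_origin(conn, "https", "example.org", 443)
second = get_or_create_origin(conn, "https", "example.org", 443)
other = get_or_create_origin(conn, "http", "example.org", 80)
```

In this sketch, first and second refer to the same row, while other gets its own row because the scheme and port differ. Note that SQLite treats NULL values as distinct inside a UNIQUE constraint, which is why the lookup happens before the insert.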
url
The url table has the following fields:
url_id
- [Integer][Primary Key]
origin_id
- [Integer] Reference into the origin table
path
- [Text][Null] Path part of the URL, if present
query
- [Text][Null] Query part of the URL, if present
username
- [Text][Null] Username part of the URL, if present
password
- [Text][Null] Password part of the URL, if present
str_url
- [Text][Unique] String representation of the whole URL for easy retrieval
The url table stores URLs that point to documents. Notably, it does not store the fragment part of a URL, since the fragment points to a location inside a document.
Fragment fields are stored together with URL ids where appropriate.
See the URL data type for more information.
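To illustrate how a URL maps onto these fields, here is a rough sketch using Python's urllib.parse. The function name split_for_url_table and the dictionary layout are assumptions made for this example; the project's own URL handling and the exact normalization of str_url may differ.

```python
from urllib.parse import urlsplit, urlunsplit

def split_for_url_table(raw_url):
    """Split a URL into fragment-free column values plus the fragment."""
    parts = urlsplit(raw_url)
    # Drop the fragment: it addresses a location inside the document,
    # so it is not part of the row in the url table.
    without_fragment = urlunsplit(parts._replace(fragment=""))
    row = {
        "path": parts.path or None,
        "query": parts.query or None,
        "username": parts.username,
        "password": parts.password,
        "str_url": without_fragment,
    }
    return row, (parts.fragment or None)

row, fragment = split_for_url_table(
    "https://user:pw@example.org/docs/page?lang=en#intro"
)
```

The scheme, host and port are not part of row here because they would go into the origin table, with the url row pointing at it through origin_id. The returned fragment would be stored alongside the URL id wherever a table needs it.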
mimetype
The mimetype table has the following fields:
mimetype_id
- [Integer][Primary Key]
mime_type
- [Text] The part before the /
mime_subtype
- [Text] The part after the /
mime_suffix
- [Text][Null] The part after the +, if present
charset
- [Text][Null] The charset parameter. It is stored here because it is by far the most common parameter and web servers love sending it.
str_mimetype
- [Text][Unique] String representation of the whole mimetype with the parameters sorted in alphabetical order.
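As a sketch of how a mimetype string could be broken down into these fields, consider the following. Several details are assumptions, since the field list does not pin them down: that mime_subtype excludes the +suffix, that keys are lowercased, and that str_mimetype joins parameters with "; ". The function name split_mimetype is hypothetical.

```python
def split_mimetype(raw):
    """Split a media type string into the columns described above."""
    essence, *raw_params = [piece.strip() for piece in raw.split(";")]
    mime_type, _, rest = essence.lower().partition("/")
    # Assumption: the subtype column excludes the "+suffix" part.
    mime_subtype, _, mime_suffix = rest.partition("+")

    params = {}
    for piece in raw_params:
        if not piece:
            continue  # tolerate a trailing ";"
        key, _, value = piece.partition("=")
        params[key.strip().lower()] = value.strip().strip('"')

    # Assumed normal form: parameters sorted alphabetically by key.
    sorted_params = "".join(f"; {k}={v}" for k, v in sorted(params.items()))
    return {
        "mime_type": mime_type,
        "mime_subtype": mime_subtype,
        "mime_suffix": mime_suffix or None,
        "charset": params.get("charset"),
        "str_mimetype": essence.lower() + sorted_params,
        "params": params,
    }

atom = split_mimetype("application/atom+xml; type=entry; charset=utf-8")
```

Sorting the parameters before building str_mimetype means that two strings differing only in parameter order collapse to the same value, which is what makes a unique constraint on that column useful.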
mime_parameter
The mime_parameter table has the following fields:
mime_parameter_id
- [Integer][Primary Key] Internal database id for the mime parameter, not used.
mimetype_id
- [Integer] Reference into the mimetype table.
key
- [Text] Key of the mimetype parameter
value
- [Text] Value of the mimetype parameter
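The point of this table is that parameters become queryable with plain SQL. The sketch below shows one way such a query might look. It assumes SQLite through Python's sqlite3 module, uses a simplified and hypothetical DDL with only the columns needed for the join, and the sample rows and the function name mimetypes_with_parameter are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified DDL based on the field lists above.
conn.executescript("""
CREATE TABLE mimetype (
    mimetype_id  INTEGER PRIMARY KEY,
    str_mimetype TEXT UNIQUE
);
CREATE TABLE mime_parameter (
    mime_parameter_id INTEGER PRIMARY KEY,
    mimetype_id       INTEGER REFERENCES mimetype (mimetype_id),
    key               TEXT,
    value             TEXT
);
INSERT INTO mimetype VALUES
    (1, 'text/html; charset=utf-8'),
    (2, 'application/atom+xml; type=entry'),
    (3, 'application/atom+xml; type=feed');
INSERT INTO mime_parameter (mimetype_id, key, value) VALUES
    (1, 'charset', 'utf-8'),
    (2, 'type', 'entry'),
    (3, 'type', 'feed');
""")

def mimetypes_with_parameter(conn, key, value):
    """Return str_mimetype for every mimetype carrying key=value."""
    rows = conn.execute(
        """
        SELECT m.str_mimetype
        FROM mimetype AS m
        JOIN mime_parameter AS p ON p.mimetype_id = m.mimetype_id
        WHERE p.key = ? AND p.value = ?
        ORDER BY m.str_mimetype
        """,
        (key, value),
    ).fetchall()
    return [row[0] for row in rows]

feeds = mimetypes_with_parameter(conn, "type", "feed")
```

Without the mime_parameter table, the same lookup would need string matching against str_mimetype, which is harder to index and easier to get wrong.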
Version history
0.0.0 - The last unversioned
Version 0.0.0 represents the last unversioned version of the database schema.