Overview
Tables in the base database are:
unobtanium_database_info- Global configuration information about the database and its schema.
- Since 0.1.0
origin- URL Origins, queryable by component.
url- Stores URLs that point to documents. It does not store the fragment part of a URL, because the fragment points to a location inside a document. Where appropriate, fragments are stored alongside URL ids.
mimetype- Mimetypes/Mediatypes
mime_parameter- Extends the mimetype table with SQL queryable parameter information.
Tables
unobtanium_database_info
Global configuration information about the database and its schema.
The unobtanium_database_info table has the following fields:
origin
URL Origins, queryable by component.
The origin table has the following fields:
origin_id- Integer Primary Key
port- Integer Null The port as specified in the URL.
- Null means a well known default.
scheme- Varchar The scheme as specified in the URL.
host- Varchar Null The hostname as specified in the URL.
- Null means that no hostname was part of the URL.
str_origin- Text Unique The origin as a string, as if it were a URL consisting only of scheme, host and port. Its main purpose is deduplication.
crawl_delay_ms- Integer Delay in milliseconds that the crawler should wait between requests, taken from robots.txt.
- 🔧 Deprecated since 0.0.0: This is crawler-specific information that shouldn't be part of the base database.
- Removed in 0.1.0
Constraints:
Unique on port, host and scheme (Removed in 0.1.0)
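To illustrate how str_origin can serve deduplication, here is a minimal sketch using Python's sqlite3 and urllib.parse modules. The DDL is reconstructed from the field list above and is an assumption, not the project's actual schema. The helper names origin_parts and get_or_create_origin are hypothetical, and the exact string format of str_origin is guessed. The removed crawl_delay_ms column is left out.

```python
import sqlite3
from urllib.parse import urlsplit

# Hypothetical DDL derived from the documented fields; the real schema may differ.
ORIGIN_DDL = """
CREATE TABLE origin (
    origin_id  INTEGER PRIMARY KEY,
    port       INTEGER NULL,          -- NULL: well known default port
    scheme     VARCHAR NOT NULL,      -- assumed NOT NULL, the docs list no Null marker
    host       VARCHAR NULL,          -- NULL: URL had no hostname
    str_origin TEXT NOT NULL UNIQUE   -- used for deduplication
)
"""


def origin_parts(url):
    """Split a URL into (port, scheme, host, str_origin).

    The str_origin format (scheme://host[:port]) is an assumption.
    The port is kept exactly as written in the URL, so an explicit
    default port is not normalized away in this sketch.
    """
    parts = urlsplit(url)
    host = parts.hostname  # None when the URL has no host
    port = parts.port      # None when no port was written in the URL
    str_origin = f"{parts.scheme}://{host or ''}"
    if port is not None:
        str_origin += f":{port}"
    return port, parts.scheme, host, str_origin


def get_or_create_origin(conn, url):
    """Return the origin_id for the URL's origin, inserting a row if needed."""
    port, scheme, host, str_origin = origin_parts(url)
    # The UNIQUE constraint on str_origin turns a repeated insert into a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO origin (port, scheme, host, str_origin) "
        "VALUES (?, ?, ?, ?)",
        (port, scheme, host, str_origin),
    )
    row = conn.execute(
        "SELECT origin_id FROM origin WHERE str_origin = ?", (str_origin,)
    ).fetchone()
    return row[0]
```

Calling get_or_create_origin twice with two URLs on the same scheme, host and port should return the same origin_id, while a different port yields a new row.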
url
The url table stores URLs that point to documents. It does not store the fragment part of a URL, because the fragment points to a location inside a document. Where appropriate, fragments are stored alongside URL ids.
The url table has the following fields:
url_id- Integer Primary Key
origin_id- Integer Reference into the origin table. Origin of the URL.
path- Text Null Path part of the URL.
- Null means that no path was present.
query- Text Null Query part of the URL.
- Null means that no query was present.
username- Text Null Username part of the URL.
- Null means that no username was present.
password- Text Null Password part of the URL.
- Null means that no password was present.
str_url- Text Unique String representation of the whole URL for easy retrieval.
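The mapping from a URL to these fields can be sketched with Python's urllib.parse. This is only an illustration under assumptions: the function name url_row is hypothetical, the origin lookup is omitted, and the way the real implementation normalizes str_url is not documented here. The fragment is returned separately because the url table does not store it.

```python
from urllib.parse import urldefrag, urlsplit


def url_row(url):
    """Return (row, fragment) for a URL.

    `row` holds values for the documented url columns except url_id and
    origin_id. Empty components become None, mirroring the Null semantics
    described above. Note that urlsplit cannot tell an empty query ("?")
    from a missing one; both end up as None in this sketch.
    """
    str_url, fragment = urldefrag(url)  # strip the fragment first
    parts = urlsplit(str_url)
    row = {
        "path": parts.path or None,
        "query": parts.query or None,
        "username": parts.username,
        "password": parts.password,
        "str_url": str_url,
    }
    return row, (fragment or None)


if __name__ == "__main__":
    row, fragment = url_row("https://user:pw@example.org/docs/page?lang=en#intro")
    print(row)
    print(fragment)
```

Two URLs that differ only in their fragment produce the same str_url, so they would map to the same row under the Unique constraint.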
mimetype
Mimetypes/Mediatypes
The mimetype table has the following fields:
mimetype_id- Integer Primary Key
mime_type- Text The part before the /.
mime_subtype- Text The part after the /.
mime_suffix- Text Null The part after the +.
- Null means that this part is not present.
charset- Text Null The charset parameter. It is stored here because it is by far the most common parameter and webservers love sending it.
- Null means that no charset parameter was present.
str_mimetype- Text Unique String representation of the whole mimetype with the parameters sorted in alphabetical order.
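As a rough sketch of how a Content-Type value could be broken into these fields, consider the following Python function. Several points are assumptions rather than documented behaviour: the name mimetype_row is hypothetical, the subtype is assumed to exclude the suffix, type and parameter names are lowercased, and the "; name=value" layout of str_mimetype is a guess. Only the alphabetical ordering of parameters comes from the description above. The parser splits naively on ";" and will mishandle quoted values that contain a semicolon.

```python
def mimetype_row(value):
    """Split a mimetype string into the documented mimetype columns."""
    essence, *raw_params = [part.strip() for part in value.split(";")]
    essence = essence.lower()

    mime_type, _, rest = essence.partition("/")
    if "+" in rest:
        # Assumption: the suffix is whatever follows the last "+".
        mime_subtype, _, mime_suffix = rest.rpartition("+")
    else:
        mime_subtype, mime_suffix = rest, None

    params = {}
    for raw in raw_params:
        name, sep, val = raw.partition("=")
        if sep:  # skip empty or malformed parameters
            params[name.strip().lower()] = val.strip().strip('"')

    # Parameters sorted alphabetically by name so equal mimetypes
    # always yield the same string.
    str_mimetype = essence + "".join(
        f"; {name}={params[name]}" for name in sorted(params)
    )

    return {
        "mime_type": mime_type,
        "mime_subtype": mime_subtype,
        "mime_suffix": mime_suffix,
        "charset": params.get("charset"),
        "str_mimetype": str_mimetype,
    }
```

Sorting the parameters means that two header values listing the same parameters in a different order collapse to a single str_mimetype, which is what makes the Unique constraint useful for deduplication.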
mime_parameter
Extends the mimetype table with SQL queryable parameter information.
The mime_parameter table has the following fields: