Crawler schema as of release 3.0

The summary database schema is implemented on top of the base database schema and mainly houses the entity data tree. It is built mainly by the summarizer algorithm.

Overview

Tables in the crawler database are:

agent: Information about crawling agents.
crawl_log: The crawl log table is written after a crawl command has finished, logging the outcome as a crawl log entry.
request: (Network) Requests that belong to a crawl log entry.
file: File metadata obtained from a request.
file_text: Text content belonging to an entry in the file table.
redirect: Redirect resulting from a request.
crawl_candidate: A URL that was discovered in a context that makes it a potential crawling target.

Tables

`agent`

Information about crawling agents.

The agent table has the following fields:

agent_id: Integer Primary Key
time_started_unix_utc: Integer/Timestamp When the agent started its work.
time_finished_unix_utc: Integer/Timestamp Null When the agent finished its work.; Null means that the agent is currently running or was forcefully terminated.
agent_uuid: UUID External Key A random UUID to cross correlate the agent.
name: Text The name of the crawler as specified using the --worker-name option.
http_user_agent: Text Null The HTTP user agent that was used by the crawler.; Null means that the concept of a user agent isn't applicable to the crawler.
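
Since a Null in time_finished_unix_utc covers both running and forcefully terminated agents, a query on this table alone cannot tell the two cases apart. The following is a minimal sketch of listing such agents together with how long ago they started. It assumes an SQLite database with simplified column types; the helper name `unfinished_agents` and the sample rows are hypothetical, and the real DDL may differ.

```python
import sqlite3

# Assumption: SQLite with simplified column types; the real schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE agent (
        agent_id INTEGER PRIMARY KEY,
        time_started_unix_utc INTEGER NOT NULL,
        time_finished_unix_utc INTEGER,
        name TEXT NOT NULL
    )
""")

# Made-up sample rows: one finished agent, one without a finish time.
conn.executemany(
    "INSERT INTO agent (time_started_unix_utc, time_finished_unix_utc, name) "
    "VALUES (?, ?, ?)",
    [(1000, 1600, "worker-a"), (2000, None, "worker-b")],
)


def unfinished_agents(conn, now_unix_utc):
    """Return (name, seconds since start) for agents without a finish time.

    Such agents are either still running or were forcefully terminated.
    """
    return conn.execute(
        "SELECT name, ? - time_started_unix_utc FROM agent "
        "WHERE time_finished_unix_utc IS NULL ORDER BY agent_id",
        (now_unix_utc,),
    ).fetchall()
```

An agent that has been unfinished for much longer than a typical run is a likely candidate for having been terminated, but that threshold is a judgment call outside the schema.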

`crawl_log`

The crawl log table is written after a crawl command has finished, logging the outcome as a crawl log entry.

The crawl_log table has the following fields:

crawl_log_id: Integer Primary Key
agent_id: Integer Reference to the agent table of which agent is responsible for the entry.
url_id: Integer Reference to the url table, which URL was crawled.
crawl_type: Integer/Enumeration Which crawl type resulted in this entry.
crawl_uuid: UUID External Key
time_started_unix_utc: Integer/Timestamp When the action that resulted in this entry started.
time_taken_ms: Integer How long the action took in whole milliseconds.
exit_code: Integer/Enumeration Which outcome the action had, see crawl exit code.
message: Text Null A place to store an error message, intended to help humans with debugging.; Null means that no information is available beyond what is already encoded in other metadata.

Indices:

crawl_log_quickinfo: On fields url_id, time_started_unix_utc and exit_code; For speeding up querying the last exit code of a URL
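
The lookup this index is meant to speed up can be sketched as follows. This is an illustration only, assuming an SQLite database with simplified column types; the helper name `last_exit_code` and the sample values (including the exit code numbers) are hypothetical.

```python
import sqlite3

# Assumption: SQLite with simplified column types; the real DDL may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE crawl_log (
        crawl_log_id INTEGER PRIMARY KEY,
        url_id INTEGER NOT NULL,
        time_started_unix_utc INTEGER NOT NULL,
        exit_code INTEGER NOT NULL
    );
    CREATE INDEX crawl_log_quickinfo
        ON crawl_log (url_id, time_started_unix_utc, exit_code);
""")

# Made-up sample rows: (url_id, time_started_unix_utc, exit_code).
conn.executemany(
    "INSERT INTO crawl_log (url_id, time_started_unix_utc, exit_code) "
    "VALUES (?, ?, ?)",
    [(1, 1000, 0), (1, 2000, 3), (2, 1500, 0)],
)


def last_exit_code(conn, url_id):
    """Return the exit code of the most recent crawl of url_id, or None."""
    row = conn.execute(
        "SELECT exit_code FROM crawl_log WHERE url_id = ? "
        "ORDER BY time_started_unix_utc DESC LIMIT 1",
        (url_id,),
    ).fetchone()
    return row[0] if row else None
```

Because the index starts with url_id, followed by the timestamp and the exit code, a query of this shape can typically be answered from the index without touching the table rows.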

`request`

(Network) Requests that belong to a crawl log entry.

The request table has the following fields:

request_id: Integer Primary Key
crawl_log_id: Integer Which entry in the crawl_log table this request belongs to.
url_id: Integer Reference to the url table, which URL was requested (this may differ from the crawled URL, e.g. when scraping a CMS API).
time_sent_unix_utc: Integer/Timestamp When the request was sent.
request_duration_ms: Integer Null How long the request took.; Null means that measuring how long the request took failed.
robotstxt_approved: Bool Whether the request was approved by a robots.txt file.
exit_code: Integer/Enumeration The crawl exit code of this single request.
server_last_modified_unix_utc: Integer/Timestamp Null When the server claimed that the file was last modified.; Null means that the header information about the last modification time was missing.
http_status_code: Integer/Enumeration Null The HTTP Status Code that this request resulted in.; Null means that no HTTP response was received.
http_etag: Varchar Null The content of the ETag header.; Null means that no ETag header was received.

`file`

File metadata obtained from a request.

The file table has the following fields:

file_id: Integer Primary Key
crawl_log_id: Integer Reference to the crawl_log table of which crawl log entry resulted in this file. If a request is linked the request must be from the same crawl log entry.
request_id: Integer Null Reference to the request table, which request this file metadata came from.; Null means that the file didn't result from a network request (e.g. it was read from a dump).
url_id: Integer Reference to the url table, which URL this file is associated with.
last_modified_unix_utc: Integer/Timestamp Null When the file was last modified according to file metadata, archive metadata or response headers, the data here should be from the most reliable source.; Null means that the file didn't contain any readable metadata on its last modification date.
file_size: Integer Null The file size in bytes.; Null means that the file size is unknown (e.g. the file wasn't fetched and no metadata was present).
mimetype_id: Integer Reference to the mimetype table, which MIME-Type (Media Type) the file has.
canonical_url_id: Integer Null Reference to the url table, which URL the file claims its canonical version to be at. (See also RFC 6596); Null means that the file didn't claim to be the non-canonical version of another resource.

Note: The file and redirect tables have the following fields in common: crawl_log_id, request_id, url_id, last_modified_unix_utc.
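
The rule that a linked request must belong to the same crawl log entry as the file can be checked with a join. The sketch below assumes an SQLite database and reduces both tables to the columns involved; the helper name `inconsistent_file_ids` and the sample rows are hypothetical, and the description above does not say whether the database itself enforces this rule.

```python
import sqlite3

# Assumption: SQLite, with both tables reduced to the columns relevant here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE request (
        request_id INTEGER PRIMARY KEY,
        crawl_log_id INTEGER NOT NULL
    );
    CREATE TABLE file (
        file_id INTEGER PRIMARY KEY,
        crawl_log_id INTEGER NOT NULL,
        request_id INTEGER
    );
    INSERT INTO request (request_id, crawl_log_id) VALUES (10, 1), (11, 2);
    INSERT INTO file (file_id, crawl_log_id, request_id) VALUES
        (100, 1, 10),   -- consistent
        (101, 1, 11),   -- request belongs to a different crawl log entry
        (102, 2, NULL); -- no request linked
""")


def inconsistent_file_ids(conn):
    """Return ids of files whose request is from another crawl log entry."""
    rows = conn.execute(
        "SELECT f.file_id FROM file AS f "
        "JOIN request AS r ON r.request_id = f.request_id "
        "WHERE r.crawl_log_id <> f.crawl_log_id "
        "ORDER BY f.file_id"
    ).fetchall()
    return [row[0] for row in rows]
```

Files without a linked request drop out of the inner join, so they are never reported. Since the redirect table shares the same fields, the same check would apply there with the table name swapped.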

`file_text`

Text content belonging to an entry in the file table.

The file_text table has the following fields:

file_id: Integer Primary Key Reference to the file table entry that holds the metadata
text: Text Unparsed text content of the file translated to UTF-8.

`redirect`

Redirect resulting from a request.

The redirect table has the following fields:

redirect_id: Integer Primary Key
crawl_log_id: Integer Reference to the crawl_log table of which crawl log entry resulted in this redirect. If a request is linked the request must be from the same crawl log entry.
request_id: Integer Null Reference to the request table, which request this redirect came from.; Null means that the redirect didn't result from a network request (e.g. it was read from a dump).
url_id: Integer Reference to the url table, which URL this redirect is associated with.
last_modified_unix_utc: Integer/Timestamp Null When the file was last modified according to file metadata, archive metadata or response headers, the data here should be from the most reliable source.; Null means that the file didn't contain any readable metadata on its last modification date.
to_url_id: Integer Reference to the url table, which URL the redirect points at.
information_source: Integer/Enumeration Where the information for this redirect came from, see Information Source for possible values.
is_permanent: Bool Whether one can expect future requests to result in the same redirect.
by_security_policy: Bool Whether the redirect was because of a security policy (e.g. an automatic http to https upgrade).; 🔧 This field is currently unused and may be removed in the future. (See Possible future changes).

Note: The file and redirect tables have the following fields in common: crawl_log_id, request_id, url_id, last_modified_unix_utc.
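
Since each redirect row links url_id to to_url_id, a chain of redirects can be followed by repeated lookups. The sketch below is one way a consumer of this table might do that, not something the schema prescribes. It assumes the relevant rows have already been loaded into a plain dict mapping url_id to to_url_id; the function name `resolve_redirects` and the `max_hops` limit are hypothetical.

```python
def resolve_redirects(redirects, url_id, max_hops=10):
    """Follow url_id -> to_url_id links until a URL without a redirect.

    redirects: dict mapping url_id to to_url_id (assumed to be loaded
    from the redirect table beforehand).
    Returns the final url_id, or None on a loop or after max_hops hops.
    """
    seen = {url_id}
    current = url_id
    for _ in range(max_hops):
        target = redirects.get(current)
        if target is None:
            return current  # no further redirect recorded
        if target in seen:
            return None  # redirect loop
        seen.add(target)
        current = target
    return None  # chain longer than max_hops
```

A stricter consumer might only follow rows where is_permanent is set, since only those can be expected to hold for future requests.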

`crawl_candidate`

A URL that was discovered in a context that makes it a potential crawling target.

The crawl_candidate table has the following fields:

url_id: Integer Primary Key Reference to the url table of the URL that this metadata is for.
last_crawl_time_unix_utc: Integer/Timestamp Null When the URL was last crawled.; Null means that the URL has only been discovered as crawlable, but not been crawled yet.
last_crawl_exit_code: Integer/Enumeration Null The crawl exit code of the last crawl.; Null means that the exit code is unavailable.
last_contentful_crawl_time_unix_utc: Integer/Timestamp Null The last time the crawler exited with a code from the contentful category of exit codes.; Null means that there never was a crawl with a contentful exit code.
last_contentful_http_etag: Text Null Set together with last_contentful_crawl_time_unix_utc, this will contain the ETag header that was returned on the last contentful crawl.; Null means that there was no last contentful crawl or the last contentful crawl didn't have an ETag header set.

Note: This table contains information that is also present in the crawl log. This is to make sure that the information doesn't get lost when the crawl log gets cleaned up.
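
How the fields of this table move together after a crawl can be sketched as below. This is an illustration under assumptions: a row is modeled as a plain dict, the function name `record_crawl` is hypothetical, and because the contentful category of exit codes is not listed here, the caller passes the `is_contentful` decision in explicitly.

```python
def record_crawl(candidate, crawl_time_unix_utc, exit_code, is_contentful,
                 http_etag=None):
    """Return a copy of a crawl_candidate row updated with a crawl result.

    is_contentful is supplied by the caller (assumption: the mapping from
    exit codes to the contentful category is defined elsewhere).
    """
    updated = dict(candidate)
    # Every crawl updates the "last crawl" fields.
    updated["last_crawl_time_unix_utc"] = crawl_time_unix_utc
    updated["last_crawl_exit_code"] = exit_code
    if is_contentful:
        # The ETag is set together with the contentful crawl time and
        # may be None when the response carried no ETag header.
        updated["last_contentful_crawl_time_unix_utc"] = crawl_time_unix_utc
        updated["last_contentful_http_etag"] = http_etag
    return updated


# Usage with made-up values: a freshly discovered URL, then two crawls.
discovered = {
    "url_id": 1,
    "last_crawl_time_unix_utc": None,
    "last_crawl_exit_code": None,
    "last_contentful_crawl_time_unix_utc": None,
    "last_contentful_http_etag": None,
}
after_first = record_crawl(discovered, 1000, 0, True, '"abc"')
after_second = record_crawl(after_first, 2000, 5, False)
```

After the second, non-contentful crawl the contentful fields still describe the first crawl, which is the behavior the field descriptions above imply.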