Crawler Database

The crawler database schema is implemented on top of the base database schema and mainly holds data surrounding the crawl log (see also: crawl data tree) and the crawl candidates.

Overview

Tables in the crawler database are:

agent: Information about crawling agents
crawl_log: Crawl log entries
request: Requests that belong to a crawl log entry
file: File metadata obtained from a request
file_text: Text content belonging to an entry in the file table
redirect: Redirect resulting from a request
crawl_candidate: A URL that was discovered in a context that makes it a potential crawling target

Tables

`agent`

The agent table has the following fields:

agent_id: Integer Primary Key
time_started_unix_utc: Integer / Timestamp When the agent started its work.
time_finished_unix_utc: Integer / Timestamp Null When the agent finished its work.; Null means that the agent is currently running or was forcefully terminated.
agent_uuid: UUID External Key
name: Text The name of the crawler as specified using the --worker-name option.
http_user_agent: Text Null The HTTP user agent that was used by the crawler.; Null means that the concept of a user agent isn't applicable to the crawler.

`crawl_log`

An entry is written to the crawl log table after a crawl command has finished, to log its outcome.

The crawl_log table stores crawl log entries and has the following fields:

crawl_log_id: Integer Primary Key
agent_id: Integer Reference to the agent table, which agent is responsible for the entry.
url_id: Integer Reference to the url table, which URL was crawled.
crawl_type: Integer / Enumeration Which crawl type resulted in this entry.
crawl_uuid: UUID External Key
time_started_unix_utc: Integer / Timestamp When the action that resulted in this entry started.
time_taken_ms: Integer / Duration How long the action took.
exit_code: Integer / Enumeration Which outcome the action had, see crawl exit code.
message: Text Null A place to store an error message, intended to help humans with debugging.

Indices:

crawl_log_quickinfo: on fields url_id, time_started_unix_utc, exit_code; For speeding up querying the last exit code of a URL
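
As a rough illustration of what the crawl_log_quickinfo index is for, the following sketch recreates the table in an in-memory SQLite database and looks up the most recent exit code of a URL. The SQLite column types, the NOT NULL constraints, and the helper name last_exit_code are assumptions made for this example, not part of the documented schema.

```python
import sqlite3

# SQLite approximation of the crawl_log table; the column types and
# constraints are assumptions of this sketch.
CRAWL_LOG_SCHEMA = """
CREATE TABLE crawl_log (
    crawl_log_id           INTEGER PRIMARY KEY,
    agent_id               INTEGER NOT NULL,
    url_id                 INTEGER NOT NULL,
    crawl_type             INTEGER NOT NULL,
    crawl_uuid             TEXT NOT NULL UNIQUE,
    time_started_unix_utc  INTEGER NOT NULL,
    time_taken_ms          INTEGER NOT NULL,
    exit_code              INTEGER NOT NULL,
    message                TEXT
);
CREATE INDEX crawl_log_quickinfo
    ON crawl_log (url_id, time_started_unix_utc, exit_code);
"""


def last_exit_code(db, url_id):
    """Return the exit code of the newest crawl log entry for url_id, or None.

    Hypothetical helper: it only reads the three indexed columns, which is
    the access pattern the index is meant to speed up.
    """
    row = db.execute(
        "SELECT exit_code FROM crawl_log"
        " WHERE url_id = ?"
        " ORDER BY time_started_unix_utc DESC"
        " LIMIT 1",
        (url_id,),
    ).fetchone()
    return None if row is None else row[0]
```

Because the query filters on url_id, orders by time_started_unix_utc and reads only exit_code, all of the data it needs is present in the index itself.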

`request`

The request table has the following fields:

request_id: Integer Primary Key
crawl_log_id: Integer Which entry in the crawl_log table this request belongs to.
url_id: Integer Reference to the url table, which URL was requested (this may be different from the crawled URL, e.g. when scraping a CMS API).
time_sent_unix_utc: Integer / Timestamp When the request was sent.
request_duration_ms: Integer / Duration Null How long the request took.; Null means that measuring the time failed.
robotstxt_approved: Bool Whether the request was approved by a robots.txt file.
exit_code: Integer / Enumeration The crawl exit code of this single request.
server_last_modified_unix_utc: Integer / Timestamp Null When the server claimed that the file was last modified.; Null means that the header information about the last modification time was missing.
http_status_code: Integer / Enumeration Null The HTTP Status Code that this request resulted in.; Null means that no HTTP response was received.
http_etag: Varchar Null The content of the ETag header.; Null means that no ETag header was received.

`file`

Note: The file and redirect tables have the following fields in common: crawl_log_id, request_id, url_id, last_modified_unix_utc.

The file table has the following fields:

file_id: Integer Primary Key
crawl_log_id: Integer Reference to the crawl_log table, which crawl log entry resulted in this file. If a request is linked, the request must be from the same crawl log entry.
request_id: Integer Null Reference to the request table, which request this file metadata came from.; Null means that the file didn't result from a network request (e.g. it was read from a dump).
url_id: Integer Reference to the url table, which URL this file is associated with.
last_modified_unix_utc: Integer / Timestamp Null When the file was last modified according to file metadata, archive metadata or response headers. The data here should be from the most reliable source.; Null means that the file didn't contain any readable metadata on its last modification date.
file_size: Integer Null The file size in bytes.; Null means that the file size is unknown (e.g. the file wasn't fetched and no metadata was present).
mimetype_id: Integer Reference to the mimetype table, which MIME-Type (Media Type) the file has.
canonical_url_id: Integer Null Reference to the url table, which URL the file claims its canonical version to be at. (See also RFC 6596); Null means that the file didn't claim to be the non-canonical version of another resource.
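
The rule that a linked request must belong to the same crawl log entry as the file can be checked with a join. The sketch below is a minimal illustration on a reduced SQLite version of the two tables (only the columns involved in the rule are created); the constant FILE_REQUEST_SCHEMA and the function mismatched_file_ids are hypothetical names introduced for this example.

```python
import sqlite3

# Reduced SQLite approximation: only the columns needed for the
# consistency rule are created. Types are assumptions of this sketch.
FILE_REQUEST_SCHEMA = """
CREATE TABLE request (
    request_id    INTEGER PRIMARY KEY,
    crawl_log_id  INTEGER NOT NULL
);
CREATE TABLE file (
    file_id       INTEGER PRIMARY KEY,
    crawl_log_id  INTEGER NOT NULL,
    request_id    INTEGER REFERENCES request (request_id)
);
"""


def mismatched_file_ids(db):
    """Return ids of files whose linked request is from another crawl log entry.

    Files with a NULL request_id (not the result of a network request)
    drop out of the inner join and are therefore never reported.
    """
    rows = db.execute(
        "SELECT f.file_id"
        " FROM file AS f"
        " JOIN request AS r ON r.request_id = f.request_id"
        " WHERE r.crawl_log_id != f.crawl_log_id"
        " ORDER BY f.file_id"
    )
    return [row[0] for row in rows]
```

The same query shape would apply to the redirect table, since it shares the crawl_log_id and request_id fields.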

`file_text`

The file_text table has the following fields:

file_id: Integer Primary Key Reference to the file table entry that holds the metadata
text: Text Unparsed text content of the file translated to UTF-8.

`redirect`

Note: The file and redirect tables have the following fields in common: crawl_log_id, request_id, url_id, last_modified_unix_utc.

The redirect table has the following fields:

redirect_id: Integer Primary Key
crawl_log_id: Integer Reference to the crawl_log table, which crawl log entry resulted in this redirect. If a request is linked, the request must be from the same crawl log entry.
request_id: Integer Null Reference to the request table, which request this redirect came from.; Null means that the redirect didn't result from a network request (e.g. it was read from a dump).
url_id: Integer Reference to the url table, which URL this redirect is associated with.
last_modified_unix_utc: Integer / Timestamp Null When the file was last modified according to file metadata, archive metadata or response headers. The data here should be from the most reliable source.; Null means that the file didn't contain any readable metadata on its last modification date.
to_url_id: Integer Reference to the url table, which URL the redirect points at.
information_source: Integer / Enumeration Where the information for this redirect came from, see Information Source for possible values.
is_permanent: Bool Whether one can expect future requests to result in the same redirect.
by_security_policy: Bool Whether the redirect was because of a security policy (e.g. an automatic http to https upgrade).; 🔧 This field is currently unused and may be removed in the future. (See Possible future changes).
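
Since every redirect row maps a url_id to a to_url_id, a consumer of this table can follow a chain of redirects to its final target. The following is a minimal sketch under several assumptions: the SQLite column types, the helper name resolve_redirects, the max_hops limit, and the choice to treat the row with the highest redirect_id as the current redirect for a URL are all made up for illustration.

```python
import sqlite3

# SQLite approximation of the redirect table; Bool fields are stored as
# integers. Types and defaults are assumptions of this sketch.
REDIRECT_SCHEMA = """
CREATE TABLE redirect (
    redirect_id             INTEGER PRIMARY KEY,
    crawl_log_id            INTEGER NOT NULL,
    request_id              INTEGER,
    url_id                  INTEGER NOT NULL,
    last_modified_unix_utc  INTEGER,
    to_url_id               INTEGER NOT NULL,
    information_source      INTEGER NOT NULL,
    is_permanent            INTEGER NOT NULL,
    by_security_policy      INTEGER NOT NULL DEFAULT 0
);
"""


def resolve_redirects(db, url_id, max_hops=10):
    """Follow redirects starting at url_id and return the final url_id.

    Returns None if the chain loops or is not resolved within max_hops
    lookups. The newest row (highest redirect_id) wins per URL, which is
    an assumption of this sketch.
    """
    seen = {url_id}
    current = url_id
    for _ in range(max_hops):
        row = db.execute(
            "SELECT to_url_id FROM redirect"
            " WHERE url_id = ?"
            " ORDER BY redirect_id DESC LIMIT 1",
            (current,),
        ).fetchone()
        if row is None:
            return current  # no further redirect recorded
        target = row[0]
        if target in seen:
            return None  # redirect loop
        seen.add(target)
        current = target
    return None  # gave up after max_hops lookups
```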

`crawl_candidate`

Note: This table contains information that is also present in the crawl log. This is to make sure that the information doesn't get lost when the crawl log gets cleaned up.

This table attaches crawling-specific metadata to URLs.

The crawl_candidate table contains the following fields:

url_id: Integer Primary Key Reference to the url table, which URL this metadata is for.
last_crawl_time_unix_utc: Integer / Timestamp Null When the URL was last crawled.; Null means that the URL has only been discovered as crawlable, but has not been crawled yet.
last_crawl_exit_code: Integer / Enumeration Null The crawl exit code of the last crawl.; Null means that the exit code is unavailable.
last_contentful_crawl_time_unix_utc: Integer / Timestamp Null The last time the crawler exited with a code from the contentful category of exit codes.; Null means that there never was a crawl with a contentful exit code.
last_contentful_http_etag: Text Null Set together with last_contentful_crawl_time_unix_utc, this will contain the ETag header that was returned on the last contentful crawl.; Null means that there was no last contentful crawl or the last contentful crawl didn't have an ETag header set.
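
The interplay between the last_crawl_* and last_contentful_* fields can be sketched as follows. This is an illustration only: the SQLite column types, the function names discover and record_crawl, and the set CONTENTFUL_EXIT_CODES (with the placeholder value 0) are assumptions; the actual contentful category is defined by the crawl exit codes, not by this example.

```python
import sqlite3

# SQLite approximation of the crawl_candidate table (types are assumed).
CANDIDATE_SCHEMA = """
CREATE TABLE crawl_candidate (
    url_id                               INTEGER PRIMARY KEY,
    last_crawl_time_unix_utc             INTEGER,
    last_crawl_exit_code                 INTEGER,
    last_contentful_crawl_time_unix_utc  INTEGER,
    last_contentful_http_etag            TEXT
);
"""

# Placeholder: which exit codes count as contentful is hypothetical here.
CONTENTFUL_EXIT_CODES = frozenset({0})


def discover(db, url_id):
    """Register a URL as a candidate; all crawl fields stay NULL."""
    db.execute(
        "INSERT OR IGNORE INTO crawl_candidate (url_id) VALUES (?)",
        (url_id,),
    )


def record_crawl(db, url_id, crawl_time, exit_code, http_etag=None):
    """Store the outcome of a crawl for url_id."""
    discover(db, url_id)
    # The last_crawl_* fields are updated on every crawl.
    db.execute(
        "UPDATE crawl_candidate"
        " SET last_crawl_time_unix_utc = ?, last_crawl_exit_code = ?"
        " WHERE url_id = ?",
        (crawl_time, exit_code, url_id),
    )
    # The last_contentful_* fields are set together, and only when the
    # exit code is in the (assumed) contentful category.
    if exit_code in CONTENTFUL_EXIT_CODES:
        db.execute(
            "UPDATE crawl_candidate"
            " SET last_contentful_crawl_time_unix_utc = ?,"
            "     last_contentful_http_etag = ?"
            " WHERE url_id = ?",
            (crawl_time, http_etag, url_id),
        )
```

In this sketch a failed crawl overwrites the last_crawl_* fields but leaves the contentful fields untouched, so the ETag of the last contentful crawl remains available.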

Possible future changes

Addition of UUID fields to the tables request, file and redirect
The redirect.by_security_policy field could be integrated into the information_source. It is currently never set to true by the crawler.

Version history

1.0.0 - De-facto stable

This schema had been de-facto stable for a while and was assigned version 1.0.0 with the introduction of database versioning.