Data: Crawler Database

The crawler database schema is implemented on top of the base database schema and mainly holds data surrounding the crawl log (see also: crawl data tree) and the crawl candidates.

Overview

Tables in the crawler database are:

agent
Information about crawling agents
crawl_log
Crawl log entries
request
Requests that belong to a crawl log entry
file
File metadata obtained from a request
file_text
Text content belonging to an entry in the file table
redirect
Redirect resulting from a request
crawl_candidate
A URL that was discovered in a context that makes it a potential crawling target

Tables

agent

The agent table has the following fields:

agent_id
Integer Primary Key
time_started_unix_utc
Integer / Timestamp When the agent started its work.
time_finished_unix_utc
Integer / Timestamp Null When the agent finished its work.
Null means that the agent is currently running or was forcefully terminated.
agent_uuid
UUID External Key
name
Text The name of the crawler as specified using the --worker-name option.
http_user_agent
Text Null The HTTP user agent that was used by the crawler.
Null means that the concept of a user agent isn't applicable to the crawler.
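
The Null semantics of the finish time can be illustrated with a small query. This is a minimal sketch that assumes the schema lives in SQLite and is accessed through Python's sqlite3 module; the column types, the sample rows and the helper name unfinished_agents are illustrative assumptions, not the project's actual DDL or API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE agent (
        agent_id INTEGER PRIMARY KEY,          -- assumed integer, as crawl_log references it as Integer
        time_started_unix_utc INTEGER NOT NULL,
        time_finished_unix_utc INTEGER,        -- name assumed to follow the _unix_utc suffix; NULL: running or terminated
        agent_uuid TEXT NOT NULL UNIQUE,       -- external key, stored as text in this sketch
        name TEXT NOT NULL,
        http_user_agent TEXT                   -- NULL: user agent concept not applicable
    )
    """
)
conn.executemany(
    "INSERT INTO agent (time_started_unix_utc, time_finished_unix_utc,"
    " agent_uuid, name, http_user_agent) VALUES (?, ?, ?, ?, ?)",
    [
        (1000, 1600, "uuid-1", "worker-a", "ExampleBot/1.0"),  # finished normally
        (2000, None, "uuid-2", "worker-b", None),              # still running or killed
    ],
)


def unfinished_agents(conn):
    """Names of agents without a finish time, oldest first."""
    rows = conn.execute(
        "SELECT name FROM agent WHERE time_finished_unix_utc IS NULL"
        " ORDER BY time_started_unix_utc"
    )
    return [row[0] for row in rows]


print(unfinished_agents(conn))  # ['worker-b']
```

Note that the query alone cannot tell a running agent from a forcefully terminated one; that distinction would need information from outside this table.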

crawl_log

The crawl_log table is written to after a crawl command has finished, in order to log its outcome.

The crawl_log table stores crawl log entries. It has the following fields:

crawl_log_id
Integer Primary Key
agent_id
Integer Reference to the agent table, which agent is responsible for the entry.
url_id
Integer Reference to the url table, which URL was crawled.
crawl_type
Integer / Enumeration Which crawl type resulted in this entry.
crawl_uuid
UUID External Key
time_started_unix_utc
Integer / Timestamp When the action that resulted in this entry started.
time_taken_ms
Integer / Duration How long the action took, in milliseconds.
exit_code
Integer / Enumeration Which outcome the action had, see crawl exit code.
message
Text Null A place to store an error message. This is intended to help humans with debugging.

Indices:

crawl_log_quickinfo
on fields url_id, time_started_unix_utc, exit_code
For speeding up querying the last exit code of a URL
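
The lookup that this index is meant to speed up can be sketched as follows. This is a minimal sketch assuming SQLite via Python's sqlite3 module; the column types, the sample exit code values and the helper name last_exit_code are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE crawl_log (
        crawl_log_id INTEGER PRIMARY KEY,
        agent_id INTEGER NOT NULL,
        url_id INTEGER NOT NULL,
        crawl_type INTEGER NOT NULL,
        crawl_uuid TEXT NOT NULL UNIQUE,
        time_started_unix_utc INTEGER NOT NULL,
        time_taken_ms INTEGER NOT NULL,
        exit_code INTEGER NOT NULL,
        message TEXT
    );
    CREATE INDEX crawl_log_quickinfo
        ON crawl_log (url_id, time_started_unix_utc, exit_code);
    """
)


def last_exit_code(conn, url_id):
    """Exit code of the most recent crawl of url_id, or None if never crawled."""
    row = conn.execute(
        "SELECT exit_code FROM crawl_log WHERE url_id = ?"
        " ORDER BY time_started_unix_utc DESC LIMIT 1",
        (url_id,),
    ).fetchone()
    return row[0] if row else None


# Placeholder rows; the exit code numbers are not real enumeration values.
conn.executemany(
    "INSERT INTO crawl_log (agent_id, url_id, crawl_type, crawl_uuid,"
    " time_started_unix_utc, time_taken_ms, exit_code)"
    " VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 7, 0, "uuid-a", 1000, 120, 0),
        (1, 7, 0, "uuid-b", 2000, 95, 3),
        (1, 8, 0, "uuid-c", 1500, 80, 0),
    ],
)
print(last_exit_code(conn, 7))  # 3
```

Because the index contains url_id, time_started_unix_utc and exit_code in that order, a query of this shape can in principle be answered from the index alone, without reading the table rows.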

request

The request table has the following fields:

request_id
Integer Primary Key
crawl_log_id
Integer Which entry in the crawl_log table this request belongs to.
url_id
Integer Reference to the url table, which URL was requested (this may be different from the crawled URL, e.g. when scraping a CMS API).
time_sent_unix_utc
Integer / Timestamp When the request was sent.
request_duration_ms
Integer / Duration Null How long the request took.
Null means that measuring the time failed.
robotstxt_approved
Bool Whether the request was approved by a robots.txt file.
exit_code
Integer / Enumeration The crawl exit code of this single request.
server_last_modified_unix_utc
Integer / Timestamp Null When the server claimed that the file was last modified.
Null means that the header information about the last modification time was missing.
http_status_code
Integer / Enumeration Null The HTTP Status Code that this request resulted in.
Null means that no HTTP response was received.
http_etag
Varchar Null The content of the ETag header.
Null means that no ETag header was received.
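
One plausible use of the stored validators is building a conditional follow-up request. The sketch below only illustrates how the Null cases of http_etag and server_last_modified_unix_utc could be handled; the function name conditional_headers is hypothetical, and whether the crawler actually sends such headers is an assumption that this page does not confirm.

```python
from email.utils import formatdate


def conditional_headers(http_etag, server_last_modified_unix_utc):
    """Build HTTP validator headers from the values of a request row.

    Both arguments may be None, mirroring the Null columns of the table.
    """
    headers = {}
    if http_etag is not None:
        # The column holds the header content as received, quotes included.
        headers["If-None-Match"] = http_etag
    if server_last_modified_unix_utc is not None:
        # usegmt=True yields the HTTP date format, e.g. "... 00:00:00 GMT".
        headers["If-Modified-Since"] = formatdate(
            server_last_modified_unix_utc, usegmt=True
        )
    return headers


print(conditional_headers('"abc"', 0))
print(conditional_headers(None, None))  # {}
```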

file

Note: The file and redirect tables have the following fields in common: crawl_log_id, request_id, url_id, last_modified_unix_utc.

The file table has the following fields:

file_id
Integer Primary Key
crawl_log_id
Integer Reference to the crawl_log table, which crawl log entry resulted in this file. If a request is linked, that request must belong to the same crawl log entry.
request_id
Integer Null Reference to the request table, which request this file metadata came from.
Null means that the file didn't result from a network request (e.g. it was read from a dump).
url_id
Integer Reference to the url table, which URL this file is associated with.
last_modified_unix_utc
Integer / Timestamp Null When the file was last modified according to file metadata, archive metadata or response headers. The value should come from the most reliable of these sources.
Null means that the file didn't contain any readable metadata on its last modification date.
file_size
Integer Null The file size in bytes.
Null means that the file size is unknown (e.g. the file wasn't fetched and no metadata was present).
mimetype_id
Integer Reference to the mimetype table, which MIME-Type (Media Type) the file has.
canonical_url_id
Integer Null Reference to the url table, which URL the file claims its canonical version to be at. (See also RFC 6596)
Null means that the file didn't claim to be the non-canonical version of another resource.
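
The rule that a linked request must belong to the same crawl log entry as the file can be checked with a join. This is a minimal sketch assuming SQLite via Python's sqlite3 module; the tables are reduced to the columns relevant for the check, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE request (
        request_id INTEGER PRIMARY KEY,
        crawl_log_id INTEGER NOT NULL
    );
    CREATE TABLE file (
        file_id INTEGER PRIMARY KEY,
        crawl_log_id INTEGER NOT NULL,
        request_id INTEGER REFERENCES request (request_id)  -- NULL: no network request
    );
    INSERT INTO request VALUES (10, 1), (11, 2);
    INSERT INTO file VALUES
        (100, 1, 10),    -- consistent: request 10 belongs to crawl log entry 1
        (101, 1, 11),    -- inconsistent: request 11 belongs to crawl log entry 2
        (102, 1, NULL);  -- no request linked, so nothing to check
    """
)

# Files whose linked request was logged under a different crawl log entry.
INCONSISTENT_FILES = """
    SELECT f.file_id
    FROM file AS f
    JOIN request AS r ON r.request_id = f.request_id
    WHERE r.crawl_log_id <> f.crawl_log_id
    ORDER BY f.file_id
"""

bad = [row[0] for row in conn.execute(INCONSISTENT_FILES)]
print(bad)  # [101]
```

Rows with a Null request_id drop out of the inner join, which matches the rule: it only constrains files that have a request linked.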

file_text

The file_text table has the following fields:

file_id
Integer Primary Key Reference to the file table entry that holds the metadata
text
Text Unparsed text content of the file translated to UTF-8.

redirect

Note: The file and redirect tables have the following fields in common: crawl_log_id, request_id, url_id, last_modified_unix_utc.

The redirect table has the following fields:

redirect_id
Integer Primary Key
crawl_log_id
Integer Reference to the crawl_log table, which crawl log entry resulted in this redirect. If a request is linked, that request must belong to the same crawl log entry.
request_id
Integer Null Reference to the request table, which request this redirect came from.
Null means that the redirect didn't result from a network request (e.g. it was read from a dump).
url_id
Integer Reference to the url table, which URL this redirect is associated with.
last_modified_unix_utc
Integer / Timestamp Null When the file was last modified according to file metadata, archive metadata or response headers. The value should come from the most reliable of these sources.
Null means that the file didn't contain any readable metadata on its last modification date.
to_url_id
Integer Reference to the url table, which URL the redirect points at.
information_source
Integer / Enumeration Where the information for this redirect came from, see Information Source for possible values.
is_permanent
Bool Whether one can expect future requests to result in the same redirect.
by_security_policy
Bool Whether the redirect happened because of a security policy (e.g. an automatic http to https upgrade).
🔧 This field is currently unused and may be removed in the future. (See Possible future changes).
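
Since both url_id and to_url_id reference the url table, redirects can form chains. The sketch below follows such a chain to its final target. It assumes SQLite via Python's sqlite3 module and a reduced table; the helper name resolve_redirects, the hop limit, and the choice to treat the row with the highest redirect_id as the most recent one are assumptions of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE redirect (
        redirect_id INTEGER PRIMARY KEY,
        url_id INTEGER NOT NULL,
        to_url_id INTEGER NOT NULL,
        is_permanent INTEGER NOT NULL
    );
    INSERT INTO redirect (url_id, to_url_id, is_permanent) VALUES
        (1, 2, 1), (2, 3, 0),  -- chain 1 -> 2 -> 3
        (5, 6, 1), (6, 5, 1);  -- loop 5 -> 6 -> 5
    """
)


def resolve_redirects(conn, url_id, max_hops=10):
    """Follow redirects starting at url_id.

    Returns the final url_id, or None on a loop or when max_hops is exceeded.
    """
    seen = {url_id}
    current = url_id
    for _ in range(max_hops):
        row = conn.execute(
            "SELECT to_url_id FROM redirect WHERE url_id = ?"
            " ORDER BY redirect_id DESC LIMIT 1",
            (current,),
        ).fetchone()
        if row is None:
            return current  # no further redirect known for this URL
        if row[0] in seen:
            return None  # redirect loop
        seen.add(row[0])
        current = row[0]
    return None  # chain longer than max_hops


print(resolve_redirects(conn, 1))  # 3
print(resolve_redirects(conn, 5))  # None
```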

crawl_candidate

Note: This table contains information that is also present in the crawl log. This is to make sure that the information doesn't get lost when the crawl log gets cleaned up.

This table attaches crawling-specific metadata to URLs.

The crawl_candidate table contains the following fields:

url_id
Integer Primary Key Reference to the url table of the URL that this metadata is for.
last_crawl_time_unix_utc
Integer / Timestamp Null When the URL was last crawled.
Null means that the URL has only been discovered as crawlable, but not been crawled yet.
last_crawl_exit_code
Integer / Enumeration Null The crawl exit code of the last crawl.
Null means that the exit code is unavailable.
last_contentful_crawl_time_unix_utc
Integer / Timestamp Null The last time the crawler exited with a code from the contentful category of exit codes.
Null means that there never was a crawl with a contentful exit code.
last_contentful_http_etag
Text Null Set together with last_contentful_crawl_time_unix_utc, this will contain the ETag header that was returned on the last contentful crawl.
Null means that there was no last contentful crawl or the last contentful crawl didn't have an ETag header set.
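
How these fields might be maintained after each crawl can be sketched as follows. This is a minimal sketch assuming SQLite via Python's sqlite3 module; the helper name record_crawl is hypothetical, the exit code numbers are placeholders, and whether an exit code counts as contentful is passed in by the caller because the category mapping is defined elsewhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE crawl_candidate (
        url_id INTEGER PRIMARY KEY,
        last_crawl_time_unix_utc INTEGER,
        last_crawl_exit_code INTEGER,
        last_contentful_crawl_time_unix_utc INTEGER,
        last_contentful_http_etag TEXT
    )
    """
)


def record_crawl(conn, url_id, crawl_time, exit_code, contentful, http_etag=None):
    """Update the candidate row for url_id after a crawl has finished."""
    # Make sure a row exists; a freshly discovered URL has all other fields NULL.
    conn.execute(
        "INSERT OR IGNORE INTO crawl_candidate (url_id) VALUES (?)", (url_id,)
    )
    # Every crawl updates the "last crawl" pair.
    conn.execute(
        "UPDATE crawl_candidate SET last_crawl_time_unix_utc = ?,"
        " last_crawl_exit_code = ? WHERE url_id = ?",
        (crawl_time, exit_code, url_id),
    )
    # Only contentful crawls touch the "last contentful" pair, and both
    # fields are set together (the ETag may legitimately be None).
    if contentful:
        conn.execute(
            "UPDATE crawl_candidate SET last_contentful_crawl_time_unix_utc = ?,"
            " last_contentful_http_etag = ? WHERE url_id = ?",
            (crawl_time, http_etag, url_id),
        )


record_crawl(conn, 1, 1000, 0, contentful=True, http_etag='"v1"')
record_crawl(conn, 1, 2000, 5, contentful=False)
row = conn.execute("SELECT * FROM crawl_candidate WHERE url_id = 1").fetchone()
print(row)  # (1, 2000, 5, 1000, '"v1"')
```

After the second, non-contentful crawl the row still remembers the time and ETag of the earlier contentful one, which is the information that would otherwise be lost when the crawl log is cleaned up.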

Possible future changes

Version history

1.0.0 - De-facto stable

This schema has been de-facto stable for a while and was assigned the 1.0.0 version with the introduction of database versioning.