Crawler schema as of release 3.0

The summary database schema is implemented on top of the base database schema and mainly houses the entity data tree. It is mainly built by the summarizer algorithm.

Overview

Tables in the crawler database are:

agent
Information about crawling agents.
crawl_log
The crawl log table is written after a crawl command has finished, to log the outcome as a crawl log entry.
request
(Network) Requests that belong to a crawl log entry.
file
File metadata obtained from a request.
file_text
Text content belonging to an entry in the file table.
redirect
Redirect resulting from a request.
crawl_candidate
A URL that was discovered in a context that makes it a potential crawling target.

Tables

agent

Information about crawling agents.

The agent table has the following fields:

agent_id
Integer Primary Key
time_started_unix_utc
Integer/Timestamp When the agent started its work.
time_finished_unix_utc
Integer/Timestamp Null When the agent finished its work.
Null means that the agent is currently running or was forcefully terminated.
agent_uuid
UUID External Key A random UUID to cross correlate the agent.
name
Text The name of the crawler as specified using the --worker-name option.
http_user_agent
Text Null The HTTP user agent that was used by the crawler.
Null means that the concept of a user agent isn't applicable to the crawler.
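
The Null semantics of time_finished_unix_utc can be illustrated with a small query. This is a minimal sketch using Python's sqlite3 module and an in-memory database; the choice of SQLite, the column types, the sample rows and the helper name unfinished_agents are assumptions for illustration, not the actual DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed column types; the real schema may declare them differently.
conn.execute(
    """
    CREATE TABLE agent (
        agent_id INTEGER PRIMARY KEY,
        time_started_unix_utc INTEGER NOT NULL,
        time_finished_unix_utc INTEGER,
        agent_uuid TEXT NOT NULL UNIQUE,
        name TEXT NOT NULL,
        http_user_agent TEXT
    )
    """
)

# Made-up sample rows: one finished agent, one without a finish time.
conn.executemany(
    "INSERT INTO agent VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1700000000, 1700000600, "uuid-a", "worker-a", "ExampleBot/1.0"),
        (2, 1700001000, None, "uuid-b", "worker-b", None),
    ],
)


def unfinished_agents(conn):
    """Names of agents that are still running or were forcefully terminated."""
    rows = conn.execute(
        "SELECT name FROM agent "
        "WHERE time_finished_unix_utc IS NULL ORDER BY agent_id"
    )
    return [name for (name,) in rows]


print(unfinished_agents(conn))
```

The table alone cannot distinguish a running agent from a terminated one, so a caller would need extra information (for example the age of time_started_unix_utc) to tell them apart.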

crawl_log

The crawl log table is written after a crawl command has finished, to log the outcome as a crawl log entry.

The crawl_log table has the following fields:

crawl_log_id
Integer Primary Key
agent_id
Integer Reference to the agent table, which agent is responsible for the entry.
url_id
Integer Reference to the url table, which URL was crawled.
crawl_type
Integer/Enumeration Which crawl type resulted in this entry.
crawl_uuid
UUID External Key
time_started_unix_utc
Integer/Timestamp When the action that resulted in this entry started.
time_taken_ms
Integer How long the action took in whole milliseconds.
exit_code
Integer/Enumeration Which outcome the action had, see crawl exit code.
message
Text Null A place to store an error message. This is intended to help humans with debugging.
Null means that no information that isn't already encoded in other metadata is available.

Indices:

crawl_log_quickinfo
On fields url_id, time_started_unix_utc and exit_code
For speeding up querying the last exit code of a URL
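
The lookup this index is meant to speed up can be sketched as follows. The example assumes SQLite (via Python's sqlite3 module), a reduced crawl_log table containing only the indexed columns, and made-up exit code values; last_exit_code is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced table: only the columns covered by the index are modelled here.
conn.executescript(
    """
    CREATE TABLE crawl_log (
        crawl_log_id INTEGER PRIMARY KEY,
        url_id INTEGER NOT NULL,
        time_started_unix_utc INTEGER NOT NULL,
        exit_code INTEGER NOT NULL
    );
    CREATE INDEX crawl_log_quickinfo
        ON crawl_log (url_id, time_started_unix_utc, exit_code);
    """
)

# Made-up rows: URL 7 was crawled twice, URL 8 once.
conn.executemany(
    "INSERT INTO crawl_log (url_id, time_started_unix_utc, exit_code) "
    "VALUES (?, ?, ?)",
    [(7, 100, 0), (7, 200, 3), (8, 150, 0)],
)


def last_exit_code(conn, url_id):
    """Exit code of the most recent crawl of url_id, or None if never crawled."""
    row = conn.execute(
        """
        SELECT exit_code FROM crawl_log
        WHERE url_id = ?
        ORDER BY time_started_unix_utc DESC
        LIMIT 1
        """,
        (url_id,),
    ).fetchone()
    return None if row is None else row[0]


print(last_exit_code(conn, 7))
```

Because all three columns used by the query are part of the index, a database engine can in principle answer it from the index alone without touching the table rows.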

request

(Network) Requests that belong to a crawl log entry.

The request table has the following fields:

request_id
Integer Primary Key
crawl_log_id
Integer Which entry in the crawl_log table this request belongs to.
url_id
Integer Reference to the url table, which URL was requested (this may be different from the crawled URL, e.g. when scraping a CMS API).
time_sent_unix_utc
Integer/Timestamp When the request was sent.
request_duration_ms
Integer Null How long the request took.
Null means that measuring how long the request took failed.
robotstxt_approved
Bool Whether the request was approved by a robots.txt file.
exit_code
Integer/Enumeration The crawl exit code of this single request.
server_last_modified_unix_utc
Integer/Timestamp Null When the server claimed that the file was last modified.
Null means that the header information about the last modification time was missing.
http_status_code
Integer/Enumeration Null The HTTP Status Code that this request resulted in.
Null means that no HTTP response was received.
http_etag
Varchar Null The content of the ETag header.
Null means that no ETag header was received.

file

File metadata obtained from a request.

The file table has the following fields:

file_id
Integer Primary Key
crawl_log_id
Integer Reference to the crawl_log table, which crawl log entry resulted in this file. If a request is linked, the request must be from the same crawl log entry.
request_id
Integer Null Reference to the request table, which request this file metadata came from.
Null means that the file didn't result from a network request (e.g. it was read from a dump).
url_id
Integer Reference to the url table, which URL this file is associated with.
last_modified_unix_utc
Integer/Timestamp Null When the file was last modified according to file metadata, archive metadata or response headers. The data here should be from the most reliable source.
Null means that the file didn't contain any readable metadata on its last modification date.
file_size
Integer Null The file size in bytes.
Null means that the file size is unknown (e.g. the file wasn't fetched and no metadata was present).
mimetype_id
Integer Reference to the mimetype table, which MIME-Type (Media Type) the file has.
canonical_url_id
Integer Null Reference to the url table, which URL the file claims its canonical version to be at. (See also RFC 6596)
Null means that the file didn't claim to be the non-canonical version of another resource.

Note: The file and redirect tables have the following fields in common: crawl_log_id, request_id, url_id, last_modified_unix_utc.
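
The rule that a linked request must belong to the same crawl log entry as the file can be checked with a join. The following is a minimal sketch, assuming SQLite, reduced versions of the request and file tables, and made-up sample rows; mismatched_files is a hypothetical helper name, and the source does not say whether the real schema enforces this rule with a constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced tables: only the columns needed for the consistency check.
conn.executescript(
    """
    CREATE TABLE request (
        request_id INTEGER PRIMARY KEY,
        crawl_log_id INTEGER NOT NULL
    );
    CREATE TABLE file (
        file_id INTEGER PRIMARY KEY,
        crawl_log_id INTEGER NOT NULL,
        request_id INTEGER
    );
    INSERT INTO request VALUES (1, 10), (2, 11);
    -- file 1: consistent, file 2: no request linked, file 3: inconsistent
    INSERT INTO file VALUES (1, 10, 1), (2, 10, NULL), (3, 10, 2);
    """
)


def mismatched_files(conn):
    """file_ids whose linked request belongs to a different crawl log entry."""
    rows = conn.execute(
        """
        SELECT f.file_id
        FROM file AS f
        JOIN request AS r ON r.request_id = f.request_id
        WHERE r.crawl_log_id <> f.crawl_log_id
        ORDER BY f.file_id
        """
    )
    return [file_id for (file_id,) in rows]


print(mismatched_files(conn))
```

Files with a Null request_id drop out of the inner join, which matches the rule: it only applies when a request is linked.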

file_text

Text content belonging to an entry in the file table.

The file_text table has the following fields:

file_id
Integer Primary Key Reference to the file table entry that holds the metadata
text
Text Unparsed text content of the file translated to UTF-8.
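
How content might be translated to UTF-8 before being stored can be sketched as below. This is an assumption-laden illustration, not the crawler's actual logic: the helper name decode_file_text, the declared_charset parameter, the UTF-8 fallback and the use of replacement characters for undecodable bytes are all hypothetical choices.

```python
from typing import Optional


def decode_file_text(raw: bytes, declared_charset: Optional[str] = None) -> str:
    """Decode raw file bytes into text that can be stored as UTF-8.

    Falls back to UTF-8 when no charset is declared or the declared
    charset label is unknown to Python. Undecodable bytes become U+FFFD.
    """
    charset = declared_charset or "utf-8"
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset label: try UTF-8 instead of failing the crawl.
        return raw.decode("utf-8", errors="replace")


text = decode_file_text("Grüße".encode("latin-1"), "latin-1")
utf8_bytes = text.encode("utf-8")  # the form a UTF-8 text column would hold
```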

redirect

Redirect resulting from a request.

The redirect table has the following fields:

redirect_id
Integer Primary Key
crawl_log_id
Integer Reference to the crawl_log table, which crawl log entry resulted in this redirect. If a request is linked, the request must be from the same crawl log entry.
request_id
Integer Null Reference to the request table, which request this redirect came from.
Null means that the redirect didn't result from a network request (e.g. it was read from a dump).
url_id
Integer Reference to the url table, which URL this redirect is associated with.
last_modified_unix_utc
Integer/Timestamp Null When the file was last modified according to file metadata, archive metadata or response headers. The data here should be from the most reliable source.
Null means that the file didn't contain any readable metadata on its last modification date.
to_url_id
Integer Reference to the url table, which URL the redirect points at.
information_source
Integer/Enumeration Where the information for this redirect came from, see Information Source for possible values.
is_permanent
Bool Whether one can expect future requests to result in the same redirect.
by_security_policy
Bool Whether the redirect was caused by a security policy (e.g. an automatic HTTP to HTTPS upgrade).
🔧 This field is currently unused and may be removed in the future. (See Possible future changes).

Note: The file and redirect tables have the following fields in common: crawl_log_id, request_id, url_id, last_modified_unix_utc.
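
One way the url_id, to_url_id and is_permanent fields could be combined is to follow permanent redirects to their final target. This is a minimal sketch under several assumptions: SQLite, a reduced redirect table, made-up rows, Bool stored as 0/1, and a hypothetical helper resolve_permanent that prefers the newest redirect row per URL. The source does not describe such a resolution step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced table: only the columns needed to follow redirects.
conn.executescript(
    """
    CREATE TABLE redirect (
        redirect_id INTEGER PRIMARY KEY,
        url_id INTEGER NOT NULL,
        to_url_id INTEGER NOT NULL,
        is_permanent INTEGER NOT NULL
    );
    -- 1 -> 2 -> 3 (permanent), 4 -> 5 (temporary), 6 <-> 7 (loop)
    INSERT INTO redirect (url_id, to_url_id, is_permanent) VALUES
        (1, 2, 1), (2, 3, 1), (4, 5, 0), (6, 7, 1), (7, 6, 1);
    """
)


def resolve_permanent(conn, url_id, max_hops=10):
    """Follow permanent redirects from url_id.

    Returns the final url_id, or None on a loop or after max_hops hops.
    """
    seen = {url_id}
    current = url_id
    for _ in range(max_hops):
        row = conn.execute(
            "SELECT to_url_id FROM redirect "
            "WHERE url_id = ? AND is_permanent = 1 "
            "ORDER BY redirect_id DESC LIMIT 1",
            (current,),
        ).fetchone()
        if row is None:
            return current  # no further permanent redirect
        current = row[0]
        if current in seen:
            return None  # redirect loop
        seen.add(current)
    return None  # chain longer than max_hops


print(resolve_permanent(conn, 1))
```

Temporary redirects are deliberately not followed here, since is_permanent signals whether future requests can be expected to produce the same redirect.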

crawl_candidate

A URL that was discovered in a context that makes it a potential crawling target.

The crawl_candidate table has the following fields:

url_id
Integer Primary Key Reference to the url table of the URL that this metadata is for.
last_crawl_time_unix_utc
Integer/Timestamp Null When the URL was last crawled.
Null means that the URL has only been discovered as crawlable, but not been crawled yet.
last_crawl_exit_code
Integer/Enumeration Null The crawl exit code of the last crawl.
Null means that the exit code is unavailable.
last_contentful_crawl_time_unix_utc
Integer/Timestamp Null The last time the crawler exited with a code from the contentful category of exit codes.
Null means that there never was a crawl with a contentful exit code.
last_contentful_http_etag
Text Null Set together with last_contentful_crawl_time_unix_utc, this will contain the ETag header that was returned on the last contentful crawl.
Null means that there was no last contentful crawl or the last contentful crawl didn't have an ETag header set.

Note: This table contains information that is also present in the crawl log. This is to make sure that the information doesn't get lost when the crawl log gets cleaned up.
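
How this table might be kept in sync with crawl outcomes can be sketched as follows. The example assumes SQLite and a hypothetical record_crawl helper; the set CONTENTFUL_EXIT_CODES and the concrete exit code numbers are placeholders, since the contentful category is defined with the crawl exit codes and not in this section.

```python
import sqlite3

# Placeholder: the real contentful category is defined by the crawl exit codes.
CONTENTFUL_EXIT_CODES = {0}

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE crawl_candidate (
        url_id INTEGER PRIMARY KEY,
        last_crawl_time_unix_utc INTEGER,
        last_crawl_exit_code INTEGER,
        last_contentful_crawl_time_unix_utc INTEGER,
        last_contentful_http_etag TEXT
    )
    """
)


def record_crawl(conn, url_id, time_unix_utc, exit_code, http_etag=None):
    """Copy the outcome of a crawl into the crawl_candidate row for url_id."""
    # Make sure a row exists; a freshly discovered URL has all-Null metadata.
    conn.execute(
        "INSERT OR IGNORE INTO crawl_candidate (url_id) VALUES (?)", (url_id,)
    )
    conn.execute(
        "UPDATE crawl_candidate "
        "SET last_crawl_time_unix_utc = ?, last_crawl_exit_code = ? "
        "WHERE url_id = ?",
        (time_unix_utc, exit_code, url_id),
    )
    # The contentful fields are set together and only on contentful outcomes.
    if exit_code in CONTENTFUL_EXIT_CODES:
        conn.execute(
            "UPDATE crawl_candidate "
            "SET last_contentful_crawl_time_unix_utc = ?, "
            "    last_contentful_http_etag = ? "
            "WHERE url_id = ?",
            (time_unix_utc, http_etag, url_id),
        )


record_crawl(conn, 7, 100, 0, '"v1"')  # contentful crawl with an ETag
record_crawl(conn, 7, 200, 5)          # placeholder non-contentful exit code
row = conn.execute(
    "SELECT * FROM crawl_candidate WHERE url_id = 7"
).fetchone()
print(row)
```

After the second call, the last_crawl fields reflect the failed crawl while the contentful fields still describe the earlier successful one, which is the information that would otherwise be lost when the crawl log is cleaned up.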