The crawler database schema is implemented on top of the base database schema and mainly holds data surrounding the crawl log (see also: crawl data tree) and the crawl candidates.
Overview
Tables in the crawler database are:
agent
- Information about crawling agents
crawl_log
- Crawl log entries
request
- Requests that belong to a crawl log entry
file
- File metadata obtained from a request
file_text
- Text content belonging to an entry in the file table
redirect
- Redirect resulting from a request
crawl_candidate
- A URL that was discovered in a context that makes it a potential crawling target
Tables
agent
The agent table has the following fields:
agent_id
- Text Primary Key
time_started_unix_utc
- Integer / Timestamp When the agent started its work.
time_finished_unix_utc
- Integer / Timestamp Null When the agent finished its work.
- Null means that the agent is currently running or was forcefully terminated.
agent_uuid
- UUID External Key
name
- Text The name of the crawler as specified using the --worker-name option.
http_user_agent
- Text Null The HTTP user agent that was used by the crawler.
- Null means that the concept of a user agent isn't applicable to the crawler.
crawl_log
The crawl log is written after a crawl command has finished, to record its outcome.
The crawl_log table stores crawl log entries. It has the following fields:
crawl_log_id
- Integer Primary Key
agent_id
- Integer Reference to the agent table, which agent is responsible for the entry.
url_id
- Integer Reference to the url table, which URL was crawled.
crawl_type
- Integer / Enumeration Which crawl type resulted in this entry.
crawl_uuid
- UUID External Key
time_started_unix_utc
- Integer / Timestamp When the action that resulted in this entry started.
time_taken_ms
- Integer / Duration How long the action took, in milliseconds.
exit_code
- Integer / Enumeration Which outcome the action had, see crawl exit code.
message
- Text Null A place to store an error message. This is intended to help humans with debugging.
Indices:
crawl_log_quickinfo
- On fields url_id, time_started_unix_utc, exit_code
- For speeding up querying the last exit code of a URL
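As a rough illustration of what the crawl_log_quickinfo index is for, the following sketch recreates the table in an in-memory SQLite database and looks up the most recent exit code of a URL. The use of SQLite and Python, the column types and constraints, the sample rows, the exit code values and the last_exit_code helper are all assumptions made for this example. None of them are taken from the actual implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Column names follow the field list above; types and NOT NULL
    -- constraints are assumptions for this sketch.
    CREATE TABLE crawl_log (
        crawl_log_id          INTEGER PRIMARY KEY,
        agent_id              INTEGER NOT NULL,
        url_id                INTEGER NOT NULL,
        crawl_type            INTEGER NOT NULL,
        crawl_uuid            TEXT NOT NULL,
        time_started_unix_utc INTEGER NOT NULL,
        time_taken_ms         INTEGER NOT NULL,
        exit_code             INTEGER NOT NULL,
        message               TEXT
    );
    CREATE INDEX crawl_log_quickinfo
        ON crawl_log (url_id, time_started_unix_utc, exit_code);
""")

# Hypothetical sample data: exit codes 0 and 3 are placeholders,
# not real crawl exit codes.
conn.executemany(
    "INSERT INTO crawl_log (agent_id, url_id, crawl_type, crawl_uuid,"
    " time_started_unix_utc, time_taken_ms, exit_code, message)"
    " VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 7, 0, "uuid-a", 1_700_000_000, 120, 0, None),
        (1, 7, 0, "uuid-b", 1_700_003_600, 95, 3, "timeout"),
        (1, 8, 0, "uuid-c", 1_700_001_000, 40, 0, None),
    ],
)


def last_exit_code(conn, url_id):
    """Return the exit code of the newest crawl log entry for url_id, or None."""
    row = conn.execute(
        "SELECT exit_code FROM crawl_log WHERE url_id = ?"
        " ORDER BY time_started_unix_utc DESC LIMIT 1",
        (url_id,),
    ).fetchone()
    return row[0] if row else None
```

Because the index starts with url_id and then time_started_unix_utc, and also contains exit_code, a query of this shape can in principle be answered from the index alone.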
request
The request table has the following fields:
request_id
- Integer Primary Key
crawl_log_id
- Integer Which entry in the crawl_log table this request belongs to.
url_id
- Integer Reference to the url table, which URL was requested (this may differ from the crawled URL, e.g. when scraping a CMS API).
time_sent_unix_utc
- Integer / Timestamp When the request was sent.
request_duration_ms
- Integer / Duration Null How long the request took.
- Null means that measuring the time failed.
robotstxt_approved
- Bool Whether the request was approved by a robots.txt file.
exit_code
- Integer / Enumeration The crawl exit code of this single request.
server_last_modified_unix_utc
- Integer / Timestamp Null When the server claimed that the file was last modified.
- Null means that the header information about the last modification time was missing.
http_status_code
- Integer / Enumeration Null The HTTP Status Code that this request resulted in.
- Null means that no HTTP response was received.
http_etag
- Varchar Null The content of the ETag header.
- Null means that no ETag header was received.
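Since the requested URL can differ from the crawled URL, it can be useful to list such requests. The sketch below does this with a join between request and crawl_log. It assumes SQLite and Python, uses a cut-down crawl_log table, and the column types, sample rows and the indirect_requests helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Only the crawl_log columns needed for the join are recreated here.
    CREATE TABLE crawl_log (
        crawl_log_id INTEGER PRIMARY KEY,
        url_id       INTEGER NOT NULL
    );
    -- Types are assumptions; nullable columns match the "Null" fields above.
    CREATE TABLE request (
        request_id                    INTEGER PRIMARY KEY,
        crawl_log_id                  INTEGER NOT NULL,
        url_id                        INTEGER NOT NULL,
        time_sent_unix_utc            INTEGER NOT NULL,
        request_duration_ms           INTEGER,
        robotstxt_approved            INTEGER NOT NULL,
        exit_code                     INTEGER NOT NULL,
        server_last_modified_unix_utc INTEGER,
        http_status_code              INTEGER,
        http_etag                     TEXT
    );
    INSERT INTO crawl_log (crawl_log_id, url_id) VALUES (1, 10);
    -- Request 1 fetches the crawled URL itself, request 2 a different URL.
    INSERT INTO request VALUES
        (1, 1, 10, 1700000000, 80,   1, 0, NULL, 200, 'etag-1'),
        (2, 1, 11, 1700000001, NULL, 1, 0, NULL, 200, NULL);
""")


def indirect_requests(conn):
    """List (request_id, crawled_url_id, requested_url_id) where the two URLs differ."""
    return conn.execute("""
        SELECT r.request_id, c.url_id, r.url_id
        FROM request AS r
        JOIN crawl_log AS c ON c.crawl_log_id = r.crawl_log_id
        WHERE r.url_id <> c.url_id
        ORDER BY r.request_id
    """).fetchall()
```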
file
Note: The file and redirect tables have the following fields in common: crawl_log_id, request_id, url_id, last_modified_unix_utc.
The file table has the following fields:
file_id
- Integer Primary Key
crawl_log_id
- Integer Reference to the crawl_log table, which crawl log entry resulted in this file. If a request is linked, the request must be from the same crawl log entry.
request_id
- Integer Null Reference to the request table, which request this file metadata came from.
- Null means that the file didn't result from a network request (e.g. it was read from a dump).
url_id
- Integer Reference to the url table, which URL this file is associated with.
last_modified_unix_utc
- Integer / Timestamp Null When the file was last modified according to file metadata, archive metadata or response headers. The value should come from the most reliable of these sources.
- Null means that the file didn't contain any readable metadata on its last modification date.
file_size
- Integer Null The file size in bytes
- Null means that the file size is unknown (e.g. the file wasn't fetched and no metadata was present).
mimetype_id
- Integer Reference to the mimetype table, which MIME type (media type) the file has.
canonical_url_id
- Integer Null Reference to the url table, which URL the file claims its canonical version to be at (see also RFC 6596).
- Null means that the file didn't claim to be the non-canonical version of another resource.
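The rule that a linked request must belong to the same crawl log entry as the file is not something a plain foreign key expresses, so it may be worth checking explicitly. The following sketch shows one possible check. It assumes SQLite and Python, recreates only the columns needed for the check, and the sample rows and the inconsistent_file_ids helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Cut-down versions of the request and file tables (assumed types).
    CREATE TABLE request (
        request_id   INTEGER PRIMARY KEY,
        crawl_log_id INTEGER NOT NULL
    );
    CREATE TABLE file (
        file_id      INTEGER PRIMARY KEY,
        crawl_log_id INTEGER NOT NULL,
        request_id   INTEGER
    );
    INSERT INTO request (request_id, crawl_log_id) VALUES (1, 100), (2, 200);
    INSERT INTO file (file_id, crawl_log_id, request_id) VALUES
        (1, 100, 1),     -- consistent: request 1 belongs to crawl log 100
        (2, 100, 2),     -- inconsistent: request 2 belongs to crawl log 200
        (3, 100, NULL);  -- no linked request, so nothing to check
""")


def inconsistent_file_ids(conn):
    """Return ids of file rows whose linked request is from another crawl log entry."""
    rows = conn.execute("""
        SELECT f.file_id
        FROM file AS f
        JOIN request AS r ON r.request_id = f.request_id
        WHERE r.crawl_log_id <> f.crawl_log_id
        ORDER BY f.file_id
    """).fetchall()
    return [file_id for (file_id,) in rows]
```

The inner join skips rows with a Null request_id, which matches the rule only applying when a request is linked. The same query shape would apply to the redirect table.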
file_text
The file_text table has the following fields:
file_id
- Integer Primary Key Reference to the file table entry that holds the metadata.
text
- Text Unparsed text content of the file, converted to UTF-8.
redirect
Note: The file and redirect tables have the following fields in common: crawl_log_id, request_id, url_id, last_modified_unix_utc.
The redirect table has the following fields:
redirect_id
- Integer Primary Key
crawl_log_id
- Integer Reference to the crawl_log table, which crawl log entry resulted in this redirect. If a request is linked, the request must be from the same crawl log entry.
request_id
- Integer Null Reference to the request table, which request this redirect came from.
- Null means that the redirect didn't result from a network request (e.g. it was read from a dump).
url_id
- Integer Reference to the url table, which URL this redirect is associated with.
last_modified_unix_utc
- Integer / Timestamp Null When the file was last modified according to file metadata, archive metadata or response headers. The value should come from the most reliable of these sources.
- Null means that the file didn't contain any readable metadata on its last modification date.
to_url_id
- Integer Reference to the url table, which URL the redirect points at.
information_source
- Integer / Enumeration Where the information for this redirect came from, see Information Source for possible values.
is_permanent
- Bool Whether one can expect future requests to result in the same redirect.
by_security_policy
- Bool Whether the redirect was because of a security policy (e.g. an automatic http to https upgrade).
- 🔧 This field is currently unused and may be removed in the future. (See Possible future changes.)
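One way a consumer of this table might use url_id and to_url_id is to follow a chain of redirects to its final target. The sketch below is only an illustration under several assumptions: SQLite and Python, a cut-down redirect table with assumed types, made-up sample rows, and a hypothetical resolve_redirects helper that picks the newest redirect row per URL and gives up on loops or overly long chains.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Cut-down redirect table; types and constraints are assumptions.
    CREATE TABLE redirect (
        redirect_id        INTEGER PRIMARY KEY,
        crawl_log_id       INTEGER NOT NULL,
        url_id             INTEGER NOT NULL,
        to_url_id          INTEGER NOT NULL,
        information_source INTEGER NOT NULL,
        is_permanent       INTEGER NOT NULL
    );
    -- URLs 1 -> 2 -> 3 form a chain, URLs 4 and 5 redirect to each other.
    INSERT INTO redirect
        (crawl_log_id, url_id, to_url_id, information_source, is_permanent)
    VALUES (1, 1, 2, 0, 1), (2, 2, 3, 0, 1), (3, 4, 5, 0, 0), (4, 5, 4, 0, 0);
""")


def resolve_redirects(conn, url_id, max_hops=10):
    """Follow redirects from url_id; return the final url_id, or None on a
    loop or when more than max_hops redirects would be needed."""
    seen = {url_id}
    for _ in range(max_hops):
        row = conn.execute(
            "SELECT to_url_id FROM redirect WHERE url_id = ?"
            " ORDER BY redirect_id DESC LIMIT 1",
            (url_id,),
        ).fetchone()
        if row is None:
            return url_id  # no further redirect recorded for this URL
        url_id = row[0]
        if url_id in seen:
            return None  # redirect loop
        seen.add(url_id)
    return None  # chain longer than max_hops
```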
crawl_candidate
Note: This table contains information that is also present in the crawl log. This is to make sure that the information doesn't get lost when the crawl log gets cleaned up.
This table attaches crawling-specific metadata to URLs.
The crawl_candidate table contains the following fields:
url_id
- Integer Primary Key Reference to the url table of the URL that this metadata is for.
last_crawl_time_unix_utc
- Integer / Timestamp Null When the URL was last crawled.
- Null means that the URL has only been discovered as crawlable, but has not been crawled yet.
last_crawl_exit_code
- Integer / Enumeration Null The crawl exit code of the last crawl.
- Null means that the exit code is unavailable.
last_contentful_crawl_time_unix_utc
- Integer / Timestamp Null The last time the crawler exited with a code from the contentful category of exit codes.
- Null means that there never was a crawl with a contentful exit code.
last_contentful_http_etag
- Text Null Set together with last_contentful_crawl_time_unix_utc, this will contain the ETag header that was returned on the last contentful crawl.
- Null means that there was no last contentful crawl or the last contentful crawl didn't have an ETag header set.
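The relationship between the last_crawl_* and last_contentful_* fields can be sketched as an update that runs after each crawl. This is a minimal sketch, not the crawler's actual logic: it assumes SQLite and Python, the column types are guesses, the record_crawl helper is hypothetical, and the set of contentful exit codes is a placeholder because the real values are defined elsewhere.

```python
import sqlite3

# Placeholder: which exit codes count as "contentful" is an assumption here.
CONTENTFUL_EXIT_CODES = {0}

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE crawl_candidate (
        url_id                              INTEGER PRIMARY KEY,
        last_crawl_time_unix_utc            INTEGER,
        last_crawl_exit_code                INTEGER,
        last_contentful_crawl_time_unix_utc INTEGER,
        last_contentful_http_etag           TEXT
    )
""")


def record_crawl(conn, url_id, time_unix_utc, exit_code, http_etag=None):
    """Update the candidate row for url_id with the outcome of one crawl."""
    # Make sure a row exists; a freshly discovered URL has all fields Null.
    conn.execute(
        "INSERT OR IGNORE INTO crawl_candidate (url_id) VALUES (?)", (url_id,)
    )
    # Every crawl updates the "last crawl" fields.
    conn.execute(
        "UPDATE crawl_candidate SET last_crawl_time_unix_utc = ?,"
        " last_crawl_exit_code = ? WHERE url_id = ?",
        (time_unix_utc, exit_code, url_id),
    )
    # Only contentful crawls update the "last contentful" fields, and the
    # ETag is set together with the contentful crawl time.
    if exit_code in CONTENTFUL_EXIT_CODES:
        conn.execute(
            "UPDATE crawl_candidate SET last_contentful_crawl_time_unix_utc = ?,"
            " last_contentful_http_etag = ? WHERE url_id = ?",
            (time_unix_utc, http_etag, url_id),
        )


# A contentful crawl followed by a failed one: the contentful fields
# keep the values from the first crawl.
record_crawl(conn, 7, 1_700_000_000, 0, http_etag="etag-v1")
record_crawl(conn, 7, 1_700_003_600, 3)
candidate = conn.execute(
    "SELECT * FROM crawl_candidate WHERE url_id = 7"
).fetchone()
```

Keeping this summary outside the crawl log is what allows it to survive a crawl log cleanup.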
Possible future changes
- Addition of UUID fields to the tables request, file and redirect
- The redirect.by_security_policy field could be integrated into the information_source field. It is currently never set to true by the crawler.
Version history
1.0.0 - De-facto stable
This schema has been de-facto stable for a while and was assigned the 1.0.0 version with the introduction of database versioning.