The summary database scheme is implemented on top of the base database schema and mainly houses the entity data tree. It is mainly built by the summarizes algorithm.
Overview
Tables in the base database are:
entity_generation
- The root of all summary data, spacetime coordinates for websites. See entity generation.
duplicate_summary
- Information about exact duplicates.
crawl_summary
- Stores the crawl summaries that is derived from the crawl log.
http_summary
- HTTP extension for the
crawl_summary
table. file_summary
-
File metadata derived from the
file
table in the crawler database. redirect_summary
-
Redirect metadata derived from the
redirect
table in the crawler database. link_summary
- Lists parsed links in files.
document_description
- Document metadata.
text_pile
- Collection of document text indexed by hash, see text pile.
token
- Enumerates all possible tokens that can make up a document in this database.
token_statistics
- Maps tokens to text piles, also referred to as the token index, this is experimental.
Tables
entity_generation
The entity_generation
table stores entity generations, it has the following fields:
entity_generation_id
- Integer Primary Key
entity_generation_uuid
- UUID External Key
url_id
-
Integer Reference to the
url
table, which URL this entity generation is about. url_fragment
- Text Null The id of the document element this entity generation is about.
- Null means that this entity generation is about the whole document
first_seen_unix_utc
- Integer / Timestamp First known existence of this generation (may change if better data is integrated)
last_seen_unix_utc
-
Integer / Timestamp Last known existence, may equal
first_seen_unix_utc
. confirmed_end_unix_utc
-
Integer / Timestamp Null The time the entity generation is known to be closed. (i.e. the
first_seen
of the next entity generation) - Null means that the entity generation is still open.
marked_duplicate
-
Bool Whether this is marked as a duplicate in the
duplicate_summary
table. This is a cache value intended to accelerate queries by requiring less joining. text_pile_id
-
Integer Null Reference to the
text_pile
table, that references the text content that this entity generation represents. - Null means that this entity generation does not have any searchable text attached.
There is a check that first_seen_unix_utc <= last_seen_unix_utc
.
Indices:
entity_generation_by_uuid
- On field
entity_generation_uuid
- Speeds up looking for entity generations by UUID.
entity_generation_by_first_seen
- On fields
url_id
,url_fragment
andfirst_seen_unix_utc
- Speeds up looking for entity generations based on when they were first seen for a given URL.
entity_generation_by_text_pile
- On fields
text_pile_id
andconfirmed_end_unix_utc
- Speeds up looking for non closed entity generations given a
text_pile_id
, this is needed for searching. entity_generation_by_active
- On fields
marked_duplicate
,confirmed_end_unix_utc
- Speeds up querying for (non) duplicates.
duplicate_summary
Note: There is no way to directly address the duplicate summaries, they are always referenced as being the attachments of an entity generation.
The duplicate_summary
table has the following fields:
subject_entity_generation_id
-
Integer Reference to the
entity_generation
table, of the entity generation that is flagged as duplicate. duplicate_of_entity_generation_id
-
Integer Reference to the
entity_generation
table, of the entity generation that the flagged entity generation is a duplicate of. duplicate_status_start_unix_utc
- Integer / Timestamp When the duplicate status started (usually the time it was detected)
duplicate_status_end_unix_utc
- Integer / Timestamp Null When the duplicate status ended
- Null means that the duplicate status hasn't ended yet.
Indices:
duplicate_summary_by_duplicate
- On fields
subject_entity_generation_id
andduplicate_status_end_unix_utc
- Speeds up looking if some entity generation is a non-expired duplicate.
crawl_summary
The crawl_summary
table has the following fields:
crawl_summary_id
- Integer Primary Key
entity_generation_id
-
Integer Reference to the [
entity_generation
table], of the entity generation that this crawl resulted in. was_robotstxt_approved
-
Bool Whether the crawl was approved by a
robots.txt
file. crawl_type
- Integer / Enumeration Which crawl type resulted in this entry.
crawl_uuid
- UUID The UUID of the corresponding crawl log entry.
agent_uuid
- UUID The UUID of the crawl agent that made the requests.
time_started_unix_utc
- Integer / Timestamp The time when the crawl action was started.
exit_code
- Integer / Enumeration The crawl exit code of the crawl action.
request_duration_ms
- Integer / Timestamp Null How long it took to execute the crawl action.
- Null means that the measurement failed.
Indices:
crawl_summary_by_crawl_uuid
- On field
crawl_uuid
- Speeds up looking for already integrated crawls via
test_has_crawl_summary_with_uuid_bulk()
.
http_summary
An entry in this table is only present if the crawl resulted in a HTTP response.
The http_summary
table has the following fields:
crawl_summary_id
-
Integer Primary Key Reference to the
crawl_summary
table of the extended crawl summary. status_code
- Integer / Enumeration The resulting HTTP status code.
etag
- Text Null The resulting HTTP ETag header.
- Null means that no ETag header was returned.
file_summary
The file_summary
table has the following fields:
entity_generation_id
-
Integer Primary KeyUnique Reference to the
entity_generation
table of the extended entity generation. (There can only be one file per entity generation) file_size
- Integer Null The file size in bytes
- Null means that the file size is not known.
mimetype_id
-
Integer Null Reference to the
mimetype
table of the files mimetype. - Null means that the mimetype is not known (i.e. when there was a conflict detected between claimed mimetype and file content)
canonical_url_id
-
Integer Null Reference to the
url
table, which URL the file claims its canonical version to be at. - Null means that the file didn't claim the be the non-canonical version of another resource.
redirect_summary
The redirect_summary
table has the following fields:
entity_generation_id
-
Integer Primary Key Reference to the
entity_generation
table of the extended entity generation. (There can only be one redirect per entity generation) to_url_id
-
Integer Reference to the
url
table, which URL the redirect points at. information_source
- Integer / Enumeration Where the information for this redirect came from, see Information Source for possible values.
by_security_policy
- Bool Whether the redirect was because of a security policy.
- 🔧 This field is currently unused and may be removed in the future.
to_url_fragment
- Text Null Fragment part of the URL that the redirect points at.
- Null means that the destination URL has no fragment part.
- 🔧 This field is currently unused as there is no equivalent in the crawler database, it could be populated from HTML redirects.
link_summary
The link_summary
table has the following fields:
link_summary_id
- Integer Primary Key
entity_generation_id
-
Integer Primary Key Reference to the
entity_generation
table of the entity generation this link is part of. link_to_url
-
Integer Reference to the
url
table, which URL the link points at. link_to_fragment
- Text Null Fragment part of the URL that the link points at.
- Null means that the destination URL has no fragment part.
rel_nofollow
-
Bool Whether the link was tagged with a
rel="nofollow"
rel_me
-
Bool Whether the link was tagged with a
rel="me"
rel_tag
-
Bool Whether the link was tagged with a
rel="tag"
in_header
- Bool Part of the location signature.
in_footer
- Bool Part of the location signature.
in_aside
- Bool Part of the location signature.
in_nav
- Bool Part of the location signature.
in_form
- Bool Part of the location signature.
in_main
- Bool Part of the location signature.
in_article
- Bool Part of the location signature.
in_section
- Bool Part of the location signature.
in_table
- Bool Part of the location signature.
in_figure
- Bool Part of the location signature.
in_address
- Bool Part of the location signature.
in_code
- Bool Part of the location signature.
in_headline
- Bool Part of the location signature.
in_list
- Bool Part of the location signature.
in_paragraph
- Bool Part of the location signature.
contains_headline
- Bool Whether the link contains a headline element.
destination_type
- Integer / Enumeration Null The link destination type of this link
- Null means that the link destination type couldn't be derived for some unknown reason.
link_locality
- Integer / Enumeration The link locality describing how far reaching this link is.
html_tag_name
- Varchar Null The name of the HTML element the link appeared in.
- Null means that the link wasn't inside an HTML document or that there is no HTML equivalent for the link container (prefer using the
destination_type
). text
- Varchar Null Human readable text that describes the link.
- Null means the link didn't have any text to describe itself.
document_description
Note: All data in this table is derived from the document on a best effort basis.
The document_description
table has the following fields:
entity_generation_id
-
Integer Primary Key Reference to the
entity_generation
table of the entity generation extended by this document description. time_created_unix_utc
- Integer / Timestamp Null When the document was created.
- Null means that this information is unavailable.
time_updated_unix_utc
- Integer / Timestamp Null When the document was last updated.
- Null means that this information is unavailable.
primary_language
- Varchar Null The language code that represents the primary document language. should be a member of the iana language subtag registry.
- Null means that this information is unavailable.
title
- Text Null The title of the document that would be appropriate to display on a browser tab.
- Null means that this information is unavailable.
primary_headline
- Text Null The main headline of the document. This should not be the site name.
- Null means that this information is unavailable.
description
- Text Null A short description of the document.
- Null means that this information is unavailable.
indexiness
- [Integer] The indexiness score of the document.
text_pile
The text_pile
table stores text piles, it has the following fields:
text_pile_id
- Integer Primary Key
blake2b512_digest
- Blob Unique The text pile digest that is used for exact duplicate detection, it also makes sure each possible text pile is only stored once.
text
- Text Main file content, empty string if not present
secondary_text
- Text Secondary file content, empty string if not present
big_headlines
- Text Bigger headlines, empty string if not present
small_headlines
- Text Smaller headlines, empty string if not present
code_text
- Text Text marked up as code, empty string if not present
quote_text
- Text Text marked up as quote, empty string if not present
token
The token
table has the following fields:
token_statistics
The token_statistics
table has the following fields:
token_id
-
Integer Reference to the
token
table of the token this is about. text_pile_id
-
Integer Reference to the
text_pile
table of the text pile this is about. occurances
- Integer How often the token occurs in the text pile.
The combination of token_id
and text_pile_id
must be unique.
🔧 This table is work in progress man may be subject to major changes in the future.
full_text_entity_index
This table is backed by the fts5
SQLite extension. It is mainly derived from the text_pile
and document_description
tables.
The full_text_entity_index
table has the following fields:
text_pile_id
-
Integer Reference to the [
text_pile
table] of the text pile this was derived from. any
-
Text Always contains the string "any", this is needed to work around limitations of the fts5
NOT
operator. title
-
Text The document title taken from the
document_description
. description
-
Text Null The description taken from the
document_description
. - Null means that no description is available.
text
-
Text The text taken from the
text_pile
secondary_text
-
Text The secondary text taken from the
text_pile
big_headlines
-
Text The big headlines taken from the
text_pile
small_headlines
-
Text The small headlines taken from the
text_pile
code_text
-
Text The code text taken from the
text_pile
quote_text
-
Text The quote text taken from the
text_pile
Version History
0.0.0 - The first versioned
This is the first versioned and documented version of this database.