Data: Summary Database

The summary database scheme is implemented on top of the base database schema and mainly houses the entity data tree. It is mainly built by the summarizes algorithm.

Overview

Tables in the base database are:

entity_generation
The root of all summary data, spacetime coordinates for websites. See entity generation.
duplicate_summary
Information about exact duplicates.
crawl_summary
Stores the crawl summaries that is derived from the crawl log.
http_summary
HTTP extension for the crawl_summary table.
file_summary
File metadata derived from the file table in the crawler database.
redirect_summary
Redirect metadata derived from the redirect table in the crawler database.
link_summary
Lists parsed links in files.
document_description
Document metadata.
text_pile
Collection of document text indexed by hash, see text pile.
token
Enumerates all possible tokens that can make up a document in this database.
token_statistics
Maps tokens to text piles, also referred to as the token index, this is experimental.

Tables

entity_generation

The entity_generation table stores entity generations, it has the following fields:

entity_generation_id
Integer Primary Key
entity_generation_uuid
UUID External Key
url_id
Integer Reference to the url table, which URL this entity generation is about.
url_fragment
Text Null The id of the document element this entity generation is about.
Null means that this entity generation is about the whole document
first_seen_unix_utc
Integer / Timestamp First known existence of this generation (may change if better data is integrated)
last_seen_unix_utc
Integer / Timestamp Last known existence, may equal first_seen_unix_utc.
confirmed_end_unix_utc
Integer / Timestamp Null The time the entity generation is known to be closed. (i.e. the first_seen of the next entity generation)
Null means that the entity generation is still open.
marked_duplicate
Bool Whether this is marked as a duplicate in the duplicate_summary table. This is a cache value intended to accelerate queries by requiring less joining.
text_pile_id
Integer Null Reference to the text_pile table, that references the text content that this entity generation represents.
Null means that this entity generation does not have any searchable text attached.

There is a check that first_seen_unix_utc <= last_seen_unix_utc.

Indices:

entity_generation_by_uuid
On field entity_generation_uuid
Speeds up looking for entity generations by UUID.
entity_generation_by_first_seen
On fields url_id, url_fragment and first_seen_unix_utc
Speeds up looking for entity generations based on when they were first seen for a given URL.
entity_generation_by_text_pile
On fields text_pile_id and confirmed_end_unix_utc
Speeds up looking for non closed entity generations given a text_pile_id, this is needed for searching.
entity_generation_by_active
On fields marked_duplicate, confirmed_end_unix_utc
Speeds up querying for (non) duplicates.

duplicate_summary

Note: There is no way to directly address the duplicate summaries, they are always referenced as being the attachments of an entity generation.

The duplicate_summary table has the following fields:

subject_entity_generation_id
Integer Reference to the entity_generation table, of the entity generation that is flagged as duplicate.
duplicate_of_entity_generation_id
Integer Reference to the entity_generation table, of the entity generation that the flagged entity generation is a duplicate of.
duplicate_status_start_unix_utc
Integer / Timestamp When the duplicate status started (usually the time it was detected)
duplicate_status_end_unix_utc
Integer / Timestamp Null When the duplicate status ended
Null means that the duplicate status hasn't ended yet.

Indices:

duplicate_summary_by_duplicate
On fields subject_entity_generation_id and duplicate_status_end_unix_utc
Speeds up looking if some entity generation is a non-expired duplicate.

crawl_summary

The crawl_summary table has the following fields:

crawl_summary_id
Integer Primary Key
entity_generation_id
Integer Reference to the [entity_generation table], of the entity generation that this crawl resulted in.
was_robotstxt_approved
Bool Whether the crawl was approved by a robots.txt file.
crawl_type
Integer / Enumeration Which crawl type resulted in this entry.
crawl_uuid
UUID The UUID of the corresponding crawl log entry.
agent_uuid
UUID The UUID of the crawl agent that made the requests.
time_started_unix_utc
Integer / Timestamp The time when the crawl action was started.
exit_code
Integer / Enumeration The crawl exit code of the crawl action.
request_duration_ms
Integer / Timestamp Null How long it took to execute the crawl action.
Null means that the measurement failed.

Indices:

crawl_summary_by_crawl_uuid
On field crawl_uuid
Speeds up looking for already integrated crawls via test_has_crawl_summary_with_uuid_bulk().

http_summary

An entry in this table is only present if the crawl resulted in a HTTP response.

The http_summary table has the following fields:

crawl_summary_id
Integer Primary Key Reference to the crawl_summary table of the extended crawl summary.
status_code
Integer / Enumeration The resulting HTTP status code.
etag
Text Null The resulting HTTP ETag header.
Null means that no ETag header was returned.

file_summary

The file_summary table has the following fields:

entity_generation_id
Integer Primary KeyUnique Reference to the entity_generation table of the extended entity generation. (There can only be one file per entity generation)
file_size
Integer Null The file size in bytes
Null means that the file size is not known.
mimetype_id
Integer Null Reference to the mimetype table of the files mimetype.
Null means that the mimetype is not known (i.e. when there was a conflict detected between claimed mimetype and file content)
canonical_url_id
Integer Null Reference to the url table, which URL the file claims its canonical version to be at.
Null means that the file didn't claim the be the non-canonical version of another resource.

redirect_summary

The redirect_summary table has the following fields:

entity_generation_id
Integer Primary Key Reference to the entity_generation table of the extended entity generation. (There can only be one redirect per entity generation)
to_url_id
Integer Reference to the url table, which URL the redirect points at.
information_source
Integer / Enumeration Where the information for this redirect came from, see Information Source for possible values.
by_security_policy
Bool Whether the redirect was because of a security policy.
🔧 This field is currently unused and may be removed in the future.
to_url_fragment
Text Null Fragment part of the URL that the redirect points at.
Null means that the destination URL has no fragment part.
🔧 This field is currently unused as there is no equivalent in the crawler database, it could be populated from HTML redirects.

The link_summary table has the following fields:

link_summary_id
Integer Primary Key
entity_generation_id
Integer Primary Key Reference to the entity_generation table of the entity generation this link is part of.
link_to_url
Integer Reference to the url table, which URL the link points at.
link_to_fragment
Text Null Fragment part of the URL that the link points at.
Null means that the destination URL has no fragment part.
rel_nofollow
Bool Whether the link was tagged with a rel="nofollow"
rel_me
Bool Whether the link was tagged with a rel="me"
rel_tag
Bool Whether the link was tagged with a rel="tag"
in_header
Bool Part of the location signature.
in_footer
Bool Part of the location signature.
in_aside
Bool Part of the location signature.
in_nav
Bool Part of the location signature.
in_form
Bool Part of the location signature.
in_main
Bool Part of the location signature.
in_article
Bool Part of the location signature.
in_section
Bool Part of the location signature.
in_table
Bool Part of the location signature.
in_figure
Bool Part of the location signature.
in_address
Bool Part of the location signature.
in_code
Bool Part of the location signature.
in_headline
Bool Part of the location signature.
in_list
Bool Part of the location signature.
in_paragraph
Bool Part of the location signature.
contains_headline
Bool Whether the link contains a headline element.
destination_type
Integer / Enumeration Null The link destination type of this link
Null means that the link destination type couldn't be derived for some unknown reason.
link_locality
Integer / Enumeration The link locality describing how far reaching this link is.
html_tag_name
Varchar Null The name of the HTML element the link appeared in.
Null means that the link wasn't inside an HTML document or that there is no HTML equivalent for the link container (prefer using the destination_type).
text
Varchar Null Human readable text that describes the link.
Null means the link didn't have any text to describe itself.

document_description

Note: All data in this table is derived from the document on a best effort basis.

The document_description table has the following fields:

entity_generation_id
Integer Primary Key Reference to the entity_generation table of the entity generation extended by this document description.
time_created_unix_utc
Integer / Timestamp Null When the document was created.
Null means that this information is unavailable.
time_updated_unix_utc
Integer / Timestamp Null When the document was last updated.
Null means that this information is unavailable.
primary_language
Varchar Null The language code that represents the primary document language. should be a member of the iana language subtag registry.
Null means that this information is unavailable.
title
Text Null The title of the document that would be appropriate to display on a browser tab.
Null means that this information is unavailable.
primary_headline
Text Null The main headline of the document. This should not be the site name.
Null means that this information is unavailable.
description
Text Null A short description of the document.
Null means that this information is unavailable.
indexiness
[Integer] The indexiness score of the document.

text_pile

The text_pile table stores text piles, it has the following fields:

text_pile_id
Integer Primary Key
blake2b512_digest
Blob Unique The text pile digest that is used for exact duplicate detection, it also makes sure each possible text pile is only stored once.
text
Text Main file content, empty string if not present
secondary_text
Text Secondary file content, empty string if not present
big_headlines
Text Bigger headlines, empty string if not present
small_headlines
Text Smaller headlines, empty string if not present
code_text
Text Text marked up as code, empty string if not present
quote_text
Text Text marked up as quote, empty string if not present

token

The token table has the following fields:

token_id
Integer Primary Key
token_text
Text Unique The text represented by this token.

token_statistics

The token_statistics table has the following fields:

token_id
Integer Reference to the token table of the token this is about.
text_pile_id
Integer Reference to the text_pile table of the text pile this is about.
occurances
Integer How often the token occurs in the text pile.

The combination of token_id and text_pile_id must be unique.

🔧 This table is work in progress man may be subject to major changes in the future.

full_text_entity_index

This table is backed by the fts5 SQLite extension. It is mainly derived from the text_pile and document_description tables.

The full_text_entity_index table has the following fields:

text_pile_id
Integer Reference to the [text_pile table] of the text pile this was derived from.
any
Text Always contains the string "any", this is needed to work around limitations of the fts5 NOT operator.
title
Text The document title taken from the document_description.
description
Text Null The description taken from the document_description.
Null means that no description is available.
text
Text The text taken from the text_pile
secondary_text
Text The secondary text taken from the text_pile
big_headlines
Text The big headlines taken from the text_pile
small_headlines
Text The small headlines taken from the text_pile
code_text
Text The code text taken from the text_pile
quote_text
Text The quote text taken from the text_pile

Version History

0.0.0 - The first versioned

This is the first versioned and documented version of this database.