Summary schema living documentation

The summary database scheme is implemented on top of the base database schema and mainly houses the entity data tree. It is mainly built by the summarizes algorithm.

This is a living document that describes the database as it is on the main branch.

Overview

Tables in the summary database are:

entity_generation: The root of all summary data, spacetime coordinates for websites. See entity generation.
duplicate_summary: Information about exact duplicates.
crawl_summary: Stores the crawl summaries that is derived from the crawl log.
http_summary: An entry in this table is only present if the crawl resulted in a HTTP response. Extension of the crawl_summary table.
file_summary: File metadata derived from the file table in the crawler database.
redirect_summary: Redirect metadata derived from the redirect table in the crawler database.
link_summary: Lists parsed links in files.
document_description: Metadata derived from the document on a best effort basis.
text_pile_v0_2: Contains the Text Pile Gen2 as generated by version 0.2 of the unobtanium-text-pile crate.; Since 0.0.0-textrank-preview.0
token_v2: Enumerates all possible tokens that can make up a document in this database.; Since 0.2.0-token-index-try-2.0
token_statistics: Maps tokens to text piles, also referred to as the token index.
token_text: Stores possible text values for the token_v2 table.; Since 0.2.0-token-index-try-2.0
token_normalization: Stores how token texts normalize under diffferent languages.; Since 0.2.0-token-index-try-2.0
token_idf: Cache for inverse doucument frequencies for every token.; Since 0.0.0-textrank-preview.2

Tables

`entity_generation`

The root of all summary data, spacetime coordinates for websites. See entity generation.

The entity_generation table has the following fields:

entity_generation_id: Integer Primary Key
entity_generation_uuid: UUID External Key Communicated to refer to this entity generatoin on external interfaces.
url_id: Integer Reference to the url table, which URL this entity generation is about.
url_fragment: Text Null The id of the document element this entity generation is about.; Null means that this entity generation is about the whole document.
first_seen_unix_utc: Integer/Timestamp First known existence of this generation (may change if better data is integrated).
last_seen_unix_utc: Integer/Timestamp Last known existence, may equal first_seen_unix_utc.
confirmed_end_unix_utc: Integer/Timestamp Null The time the entity generation is known to be closed. (i.e. the first_seen of the next entity generation); Null means that the entity generation is still open.
marked_duplicate: Bool Whether this is marked as a duplicate in the duplicate_summary table. This is a cache value intended to accelerate queries by requiring less joining.
text_pile_id: Integer Null Reference to the text_pile table, that contains the text content that this entity generation represents.; Null means that this entity generation does not have any searchable text attached.
text_pile_v0_2_id: Integer Null Reference to the text_pile_v0_2 table, that contains the text content this entity generation represents.; Null means that this entity generation does not have any searchable text attached.; Since 0.0.0-textrank-preview.0

Constraints:

Check for first_seen_unix_utc <= last_seen_unix_utc

Indices:

entity_generation_by_uuid: On field entity_generation_uuid; Speeds up looking for entity generations by UUID.
entity_generation_by_first_seen: On fields url_id, url_fragment and first_seen_unix_utc; Speeds up looking for entity generations based on when they were first seen for a given URL.
entity_generation_by_text_pile: On fields text_pile_v0_2_id and confirmed_end_unix_utc; Speeds up looking for non closed entity generations given a text_pile_v0_2_id, this is needed for searching.
entity_generation_by_active: On fields marked_duplicate and confirmed_end_unix_utc; Speeds up querying for (non) duplicates.

`duplicate_summary`

Information about exact duplicates.

The duplicate_summary table has the following fields:

subject_entity_generation_id: Integer Reference to the entity_generation table, of the entity generation that is flagged as duplicate.
duplicate_of_entity_generation_id: Integer Reference to the entity_generation table, of the entity generation that the flagged entity generation is a duplicate of.
duplicate_status_start_unix_utc: Integer/Timestamp When the duplicate status started (usually the time it was detected).
duplicate_status_end_unix_utc: Integer/Timestamp Null When the duplicate status ended; Null means that the duplicate status hasn't ended yet.

Note: There is no way to directly address the duplicate summaries, they are always referenced as being the attachments of an entity generation.

Indices:

duplicate_summary_by_duplicate: On fields subject_entity_generation_id and duplicate_status_end_unix_utc; Speeds up looking if some entity generation is a non-expired duplicate.

`crawl_summary`

Stores the crawl summaries that is derived from the crawl log.

The crawl_summary table has the following fields:

crawl_summary_id: Integer Primary Key
entity_generation_id: Integer Reference to the entity_generation table, of the entity generation that this crawl resulted in.
was_robotstxt_approved: Bool Whether the crawl was approved by a robots.txt file.
crawl_type: Integer/Enumeration Which crawl type resulted in this entry.
crawl_uuid: UUID The UUID of the corresponding crawl log entry.
agent_uuid: UUID The UUID of the crawl agent that made the requests.
time_started_unix_utc: Integer/Timestamp The time when the crawl action was started.
exit_code: Integer/Enumeration The crawl exit code of the crawl action.
time_last_modified_unix_utc: Integer/Timestamp Null When the resource was last modified according to the server; Null means that the server didn't communicate a last modification time.
request_duration_ms: Integer/Timestamp Null How long it took to execute the crawl action.; Null means that the measurement failed.

Indices:

crawl_summary_by_crawl_uuid: On field crawl_uuid; Speeds up looking for already integrated crawls via test_has_crawl_summary_with_uuid_bulk().

`http_summary`

An entry in this table is only present if the crawl resulted in a HTTP response. Extension of the crawl_summary table.

The http_summary table has the following fields:

crawl_summary_id: Integer Primary Key Reference to the crawl_summary table of the extended crawl summary.
status_code: Integer/Enumeration The resulting HTTP status code.
etag: Text Null The resulting HTTP ETag header.; Null means that no ETag header was returned.

`file_summary`

File metadata derived from the file table in the crawler database.

The file_summary table has the following fields:

entity_generation_id: Integer Primary Key Reference to the entity_generation table of the extended entity generation. (There can only be one file per entity generation)
file_size: Integer Null The file size in bytes.; Null means that the file size is not known.
mimetype_id: Integer Null Reference to the mimetype table of the files mimetype.; Null means that the mimetype is not known (i.e. when there was a conflict detected between claimed mimetype and file content)
canonical_url_id: Integer Null Reference to the url table, which URL the file claims its canonical version to be at.; Null means that the file didn't claim the be the non-canonical version of another resource.

`redirect_summary`

Redirect metadata derived from the redirect table in the crawler database.

The redirect_summary table has the following fields:

entity_generation_id: Integer Primary Key Reference to the entity_generation table of the extended entity generation. (There can only be one redirect per entity.
to_url_id: Integer Reference to the url table, which URL the redirect points at.
information_source: Integer/Enumeration Where the information for this redirect came from, see Information Source for possible values.
is_permanent: Bool Wether the Server indicated that the redirect is a stable redirection of the address that can be assumed still valid without checking.
by_security_policy: Bool Whether the redirect was because of a security policy.; 🔧 This field is currently unused and may be removed in the future.
to_url_fragment: Text Null Fragment part of the URL that the redirect points at.; Null means that the destination URL has no fragment part.; 🔧 This field is currently unused as there is no equivalent in the crawler database, it could be populated from HTML redirects.

`link_summary`

Lists parsed links in files.

The link_summary table has the following fields:

link_summary_id: Integer Primary Key
entity_generation_id: Integer Reference to the entity_generation table of the entity generation this link is part of.
link_to_url: Integer Reference to the url table, which URL the link points at.
link_to_fragment: Text Null Fragment part of the URL that the link points at.; Null means that the destination URL has no fragment part.
rel_nofollow: Bool Whether the link was tagged with a rel="nofollow"
rel_me: Bool Whether the link was tagged with a rel="me"
rel_tag: Bool Whether the link was tagged with a rel="tag"
in_header: Bool Part of the location signature.
in_footer: Bool Part of the location signature.
in_aside: Bool Part of the location signature.
in_nav: Bool Part of the location signature.
in_form: Bool Part of the location signature.
in_main: Bool Part of the location signature.
in_article: Bool Part of the location signature.
in_section: Bool Part of the location signature.
in_table: Bool Part of the location signature.
in_figure: Bool Part of the location signature.
in_address: Bool Part of the location signature.
in_code: Bool Part of the location signature.
in_headline: Bool Part of the location signature.
in_list: Bool Part of the location signature.
in_paragraph: Bool Part of the location signature.
contains_headline: Bool Whether the link contains a headline element.
destination_type: Integer/Enumeration Null The link destination type of this link.; Null means that the link destination type couldn't be derived for some unknown reason.
link_locality: Integer/Enumeration The link locality describing how far reaching this link is.
html_tag_name: Varchar Null The name of the HTML element the link appeared in.; Null means that the link wasn't inside an HTML document or that there is no HTML equivalent for the link container (prefer using the destination_type).
text: Varchar Null Human readable text that describes the link.; Null means the link didn't have any text to describe itself.

`document_description`

Metadata derived from the document on a best effort basis.

The document_description table has the following fields:

entity_generation_id: Integer Primary Key Reference to the entity_generation table of the entity generation extended by this document description.
time_created_unix_utc: Integer/Timestamp Null When the document was created.; Null means that this information is unavailable.
time_updated_unix_utc: Integer/Timestamp Null When the document was last updated.; Null means that this information is unavailable.
primary_language: Varchar Null The language code that represents the primary document language. should be a member of the iana language subtag registry.; Null means that this information is unavailable.
title: Text Null The title of the document that would be appropriate to display on a browser tab.; Null means that this information is unavailable.
primary_headline: Text Null The main headline of the document. This should not be the site name.; Null means that this information is unavailable.
description: Text Null A short description of the document.; Null means that this information is unavailable.
indexiness: Integer The indexiness score of the document.

`text_pile_v0_2`

Contains the Text Pile Gen2 as generated by version 0.2 of the unobtanium-text-pile crate.

The text_pile_v0_2 table has the following fields:

text_pile_v0_2_id: Integer Primary Key
blake2b512_digest: Blob Unique The text pile Gen2 digest that is used for exact duplicate detection, it also makes sure each possible text pile Gen2 is only stored once.
text: Text The raw text behind the text pile Gen2.
metadata: Blob The text pile Gen2 metadata encoded as postcard.
segmentation_cache: Blob The segmentation metadata of the text pile Gen2, non optional.
relevant_segments: Integer The number of segements that have their is_relevant flag set, this is used for ranking.; Since 0.0.0-textrank-preview.2

Note on field segmentation_cache: The segmentation cache in the database is deliberately non optional to make handling of text data easier on the consumer side. This is contrary to the segmentation cache being an optional component of a text pile Gen2 elsewhere.

`token_v2`

Enumerates all possible tokens that can make up a document in this database.

The token_v2 table has the following fields:

token_v2_id: Integer Primary Key
token_text_id: Integer Reference to the token_text table. The text represented by this token.
token_language_enl: Integer/Enl The langage this token is in encoded as enumerated latin, 1 means unknown.
is_normalized: Bool Wheather the token has already been processed by the normalization stage of the summarizer.; Since 0.2.0-token-index-try-2.1

Constraints:

Unique on token_v2_id and token_language_enl

`token_statistics`

Maps tokens to text piles, also referred to as the token index.

The token_statistics table has the following fields:

token_v2_id: Integer Reference to the token_v2 table of the token this is about.
text_pile_v0_2_id: Integer Reference to the text_pile_v0_2 table of the text pile this is about.
occurances: Integer How often the token occurs in the text pile.

Constraints:

Unique on token_v2_id and text_pile_v0_2_id

🔧 This table is work in progress man may be subject to major changes in the future.

`token_text`

Stores possible text values for the token_v2 table.

The token_text table has the following fields:

token_text_id: Integer Primary Key
text: Text Unique The acutal token text.

`token_normalization`

Stores how token texts normalize under diffferent languages.

The token_normalization table has the following fields:

from_token_text_id: Integer Which token text is normalized
to_token_text_id: Integer The result of the normalization
using_language_enl: Integer/Enl Which language rules were used for running the noramlization. 1 means the normalization was done using a langage indenpendent algorithm.

Constraints:

Unique on from_token_text_id and using_language_enl
Check for from_token_text_id != to_token_text_id

Note: There was a trigger on_token_text_after_insert_normalize_to_self on the token_text table that inserts a normalization to self with using_language_enl set to 1 for every new token text. (removed in 0.3.0)

`token_idf`

Cache for inverse doucument frequencies for every token.

The token_idf table has the following fields:

token_v2_id: Integer Primary Key Reference to the token_v2 table
idf: Real Floating point inverse document frequency value for this token.