The summary database schema is implemented on top of the base database schema and mainly houses the entity data tree. It is built primarily by the summarizer algorithm.
This is a living document that describes the database as it is on the main branch.
Overview
Tables in the summary database are:
- entity_generation - The root of all summary data, spacetime coordinates for websites. See entity generation.
- duplicate_summary - Information about exact duplicates.
- crawl_summary - Stores the crawl summaries that are derived from the crawl log.
- http_summary - HTTP extension for the crawl_summary table.
- file_summary - File metadata derived from the file table in the crawler database.
- redirect_summary - Redirect metadata derived from the redirect table in the crawler database.
- link_summary - Lists parsed links in files.
- document_description - Document metadata.
- text_pile_v0_2 - Contains the Text Pile Gen2 as generated by version 0.2 of the unobtanium-text-pile crate.
  - Since 0.0.0-textrank-preview.0
- token_v2 - Enumerates all possible tokens that can make up a document in this database.
  - Since 0.2.0-token-index-try-2.0
- token_statistics - Maps tokens to text piles, also referred to as the token index. This is experimental.
- token_text - Stores texts for the token_v2 table.
  - Since 0.2.0-token-index-try-2.0
- token_normalization - Stores normalizations for token texts.
  - Since 0.2.0-token-index-try-2.0
- token_idf - Stores inverse document frequency (IDF) values for tokens.
  - Since 0.0.0-textrank-preview.2
Tables
entity_generation
The entity_generation table stores entity generations.
Notes on other tables described in this section:
- duplicate_summary: There is no way to directly address the duplicate summaries; they are always referenced as attachments of an entity generation.
- http_summary: An entry in this table is only present if the crawl resulted in an HTTP response.
- document_description: All data in this table is derived from the document on a best-effort basis.
- token_statistics: 🔧 This table is a work in progress and may be subject to major changes in the future.
The entity_generation table has the following fields:
- entity_generation_id
- entity_generation_uuid
- url_id - Id in the url table, which URL this entity generation is about.
- url_fragment
- first_seen_unix_utc
- last_seen_unix_utc - Never earlier than first_seen_unix_utc.
- confirmed_end_unix_utc - ([...] first_seen of the next entity generation)
- marked_duplicate - [...] duplicate_summary table. This is a cache value intended to accelerate queries by requiring fewer joins.
- text_pile_id - Id in the text_pile table, which contains the text content that this entity generation represents.
- text_pile_v0_2_id - Id in the text_pile_v0_2 table, which contains the text content this entity generation represents.
  - Since 0.0.0-textrank-preview.0
There is a check that first_seen_unix_utc <= last_seen_unix_utc.
Indices:
- entity_generation_by_uuid - on entity_generation_uuid
- entity_generation_by_first_seen - on url_id, url_fragment and first_seen_unix_utc
- entity_generation_by_text_pile - on text_pile_id and confirmed_end_unix_utc. [...] text_pile_id, this is needed for searching.
- entity_generation_by_active - on marked_duplicate, confirmed_end_unix_utc
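The check and the indices named for entity_generation can be sketched as DDL. This is a minimal sketch, not the project's actual migration. It assumes an SQLite backend, and the column types, nullability and the non-unique indices are guesses, since the text above only names columns, the check and the index columns. `open_sketch_db` and `try_insert` are hypothetical helpers.

```python
import sqlite3

# Column types and NULL rules are assumptions; only names, the CHECK and the
# index columns come from the description above.
SCHEMA = """
CREATE TABLE entity_generation (
    entity_generation_id    INTEGER PRIMARY KEY,
    entity_generation_uuid  TEXT NOT NULL,
    url_id                  INTEGER NOT NULL,
    url_fragment            TEXT,
    first_seen_unix_utc     INTEGER NOT NULL,
    last_seen_unix_utc      INTEGER NOT NULL,
    confirmed_end_unix_utc  INTEGER,
    marked_duplicate        INTEGER NOT NULL DEFAULT 0,
    text_pile_id            INTEGER,
    text_pile_v0_2_id       INTEGER,
    CHECK (first_seen_unix_utc <= last_seen_unix_utc)
);
CREATE INDEX entity_generation_by_uuid
    ON entity_generation (entity_generation_uuid);
CREATE INDEX entity_generation_by_first_seen
    ON entity_generation (url_id, url_fragment, first_seen_unix_utc);
CREATE INDEX entity_generation_by_text_pile
    ON entity_generation (text_pile_id, confirmed_end_unix_utc);
CREATE INDEX entity_generation_by_active
    ON entity_generation (marked_duplicate, confirmed_end_unix_utc);
"""


def open_sketch_db():
    """Create an in-memory database holding only the sketched table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def try_insert(conn, uuid, first_seen, last_seen):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO entity_generation "
            "(entity_generation_uuid, url_id, first_seen_unix_utc, last_seen_unix_utc) "
            "VALUES (?, ?, ?, ?)",
            (uuid, 1, first_seen, last_seen),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this sketch, a row whose first_seen_unix_utc is later than its last_seen_unix_utc is rejected by the database itself rather than by application code.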
duplicate_summary
The duplicate_summary table has the following fields:
- subject_entity_generation_id - Id in the entity_generation table, of the entity generation that is flagged as duplicate.
- duplicate_of_entity_generation_id - Id in the entity_generation table, of the entity generation that the flagged entity generation is a duplicate of.
- duplicate_status_start_unix_utc
- duplicate_status_end_unix_utc
Indices:
- duplicate_summary_by_duplicate - on subject_entity_generation_id and duplicate_status_end_unix_utc
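Since marked_duplicate on entity_generation is described as a cache of duplicate_summary, it may help to see how such a cache could be recomputed. The following is a rough sketch under stated assumptions: an SQLite backend, a NULL duplicate_status_end_unix_utc meaning "still a duplicate", and trimmed-down tables. `open_sketch_db` and `refresh_marked_duplicate` are hypothetical names, not project functions.

```python
import sqlite3


def open_sketch_db():
    """In-memory database with only the columns this sketch needs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE entity_generation (
        entity_generation_id INTEGER PRIMARY KEY,
        marked_duplicate     INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE duplicate_summary (
        subject_entity_generation_id      INTEGER NOT NULL,
        duplicate_of_entity_generation_id INTEGER NOT NULL,
        duplicate_status_start_unix_utc   INTEGER NOT NULL,
        duplicate_status_end_unix_utc     INTEGER  -- assumption: NULL = still active
    );
    CREATE INDEX duplicate_summary_by_duplicate
        ON duplicate_summary (subject_entity_generation_id, duplicate_status_end_unix_utc);
    """)
    return conn


def refresh_marked_duplicate(conn):
    """Recompute the cached flag from the open duplicate_summary rows."""
    conn.execute("""
        UPDATE entity_generation
        SET marked_duplicate = EXISTS (
            SELECT 1
            FROM duplicate_summary AS d
            WHERE d.subject_entity_generation_id = entity_generation.entity_generation_id
              AND d.duplicate_status_end_unix_utc IS NULL
        )
    """)
```

The lookup in the subquery filters on exactly the two columns covered by duplicate_summary_by_duplicate, which is why that index fits this kind of query.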
crawl_summary
The crawl_summary table has the following fields:
- crawl_summary_id
- entity_generation_id - Id in the entity_generation table, of the entity generation that this crawl resulted in.
- was_robotstxt_approved - [...] robots.txt file.
- crawl_type
- crawl_uuid
- agent_uuid
- time_started_unix_utc
- exit_code
- time_last_modified_unix_utc
- request_duration_ms
Indices:
- crawl_summary_by_crawl_uuid - on crawl_uuid. [...] test_has_crawl_summary_with_uuid_bulk().
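An index on crawl_uuid is what a bulk "which of these crawls already have a summary?" lookup would rely on. The sketch below illustrates such a lookup; it is not the project's implementation. The SQLite backend, the reduced table definition, and the names `open_sketch_db` and `existing_crawl_uuids` are all assumptions made for illustration.

```python
import sqlite3


def open_sketch_db():
    """In-memory database with a reduced crawl_summary table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE crawl_summary (
        crawl_summary_id     INTEGER PRIMARY KEY,
        entity_generation_id INTEGER,
        crawl_uuid           TEXT NOT NULL
    );
    CREATE INDEX crawl_summary_by_crawl_uuid ON crawl_summary (crawl_uuid);
    """)
    return conn


def existing_crawl_uuids(conn, crawl_uuids, chunk_size=500):
    """Return the subset of crawl_uuids that already has a crawl summary.

    The lookup is chunked so that a single statement never exceeds
    SQLite's limit on bound parameters.
    """
    uuids = list(crawl_uuids)
    found = set()
    for start in range(0, len(uuids), chunk_size):
        chunk = uuids[start:start + chunk_size]
        placeholders = ",".join("?" for _ in chunk)
        rows = conn.execute(
            f"SELECT crawl_uuid FROM crawl_summary WHERE crawl_uuid IN ({placeholders})",
            chunk,
        )
        found.update(row[0] for row in rows)
    return found
```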
http_summary
The http_summary table has the following fields:
- crawl_summary_id - Id in the crawl_summary table of the extended crawl summary.
- status_code
- etag
file_summary
The file_summary table has the following fields:
- entity_generation_id - Id in the entity_generation table of the extended entity generation. (There can only be one file per entity generation.)
- file_size
- mimetype_id - Id in the mimetype table of the file's mimetype.
- canonical_url_id - Id in the url table, which URL the file claims its canonical version to be at.
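The "one file per entity generation" rule can be enforced by the database when the extension table uses the extended row's id as its own key. The sketch below shows that idea. It assumes an SQLite backend and guessed column types, and the source does not say how the rule is actually enforced. `open_sketch_db` and `add_file_summary` are hypothetical helpers.

```python
import sqlite3


def open_sketch_db():
    """In-memory database with a sketched file_summary table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE file_summary (
        -- Using the extended row's id as primary key allows at most one
        -- file_summary row per entity generation (an assumption, see above).
        entity_generation_id INTEGER PRIMARY KEY,
        file_size            INTEGER,
        mimetype_id          INTEGER,
        canonical_url_id     INTEGER
    );
    """)
    return conn


def add_file_summary(conn, entity_generation_id, file_size):
    """Return True if stored, False if this entity generation already has a file."""
    try:
        conn.execute(
            "INSERT INTO file_summary (entity_generation_id, file_size) VALUES (?, ?)",
            (entity_generation_id, file_size),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```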
redirect_summary
The redirect_summary table has the following fields:
- entity_generation_id - Id in the entity_generation table of the extended entity generation. (There can only be one redirect per entity generation.)
- to_url_id - Id in the url table, which URL the redirect points at.
- information_source
- is_permanent
- by_security_policy
- to_url_fragment
link_summary
The link_summary table has the following fields:
- link_summary_id
- entity_generation_id - Id in the entity_generation table of the entity generation this link is part of.
- link_to_url - Id in the url table, which URL the link points at.
- link_to_fragment
- rel_nofollow - [...] rel="nofollow"
- rel_me - [...] rel="me"
- rel_tag - [...] rel="tag"
- in_header
- in_footer
- in_aside
- in_nav
- in_form
- in_main
- in_article
- in_section
- in_table
- in_figure
- in_address
- in_code
- in_headline
- in_list
- in_paragraph
- contains_headline
- destination_type
- link_locality
- html_tag_name - ([...] destination_type).
- text
document_description
The document_description table has the following fields:
- entity_generation_id - Id in the entity_generation table of the entity generation extended by this document description.
- time_created_unix_utc
- time_updated_unix_utc
- primary_language
- title
- primary_headline
- description
- indexiness
text_pile_v0_2
The text_pile_v0_2 table stores text piles Gen2.
The text_pile_v0_2 table has the following fields:
- text_pile_v0_2_id
- blake2b512_digest
- text
- metadata
- segmentation_cache
- relevant_segments - [...] is_relevant flag set, this is used for ranking.
  - Since 0.0.0-textrank-preview.2
Note on the segmentation_cache: The segmentation cache in the database is deliberately non-optional to make handling of text data easier on the consumer side. This is contrary to the segmentation cache being an optional component of a text pile Gen2 elsewhere.
token_v2
The token_v2 table has the following fields:
- token_v2_id
- token_text_id - Id in the token_text table. The text represented by this token.
- token_language_enl - [...] 1 means unknown.
- is_normalized
  - Since 0.2.0-token-index-try-2.1
The combination of token_text_id and token_language_enl uniquely identifies a token.
token_statistics
The token_statistics table has the following fields:
- token_v2_id - Id in the token_v2 table of the token this is about.
- text_pile_v0_2_id - Id in the text_pile_v0_2 table of the text pile this is about.
- occurances
The combination of token_id and text_pile_id must be unique.
token_text
The token_text table has the following fields:
[...]
token_normalization
The token_normalization table has the following fields:
- from_token_text_id
- to_token_text_id
- using_language_enl - [...] 1 means the normalization was done using a language-independent algorithm.
token_idf
The token_idf table has the following fields:
- token_v2_id - Id in the token_v2 table
- idf
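To make the relationship between token_statistics and token_idf concrete, here is a sketch that fills an idf column from the token index. It uses the textbook definition idf = ln(N / df), where N is the number of text piles in the index and df is the number of piles containing the token. Which IDF variant the project actually uses is not stated, so treat the formula, the SQLite backend, the column types and the helper names `open_sketch_db` and `rebuild_token_idf` as assumptions.

```python
import math
import sqlite3


def open_sketch_db():
    """In-memory database with sketched token_statistics and token_idf tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE token_statistics (
        token_v2_id       INTEGER NOT NULL,
        text_pile_v0_2_id INTEGER NOT NULL,
        occurances        INTEGER NOT NULL,  -- spelled as in the schema
        UNIQUE (token_v2_id, text_pile_v0_2_id)
    );
    CREATE TABLE token_idf (
        token_v2_id INTEGER PRIMARY KEY,
        idf         REAL NOT NULL
    );
    """)
    return conn


def rebuild_token_idf(conn):
    """Recompute token_idf as ln(N / df) and return the values as a dict."""
    # N: number of distinct text piles that appear in the token index.
    total = conn.execute(
        "SELECT COUNT(DISTINCT text_pile_v0_2_id) FROM token_statistics"
    ).fetchone()[0]
    # df: number of distinct text piles each token occurs in.
    rows = conn.execute(
        "SELECT token_v2_id, COUNT(DISTINCT text_pile_v0_2_id) "
        "FROM token_statistics GROUP BY token_v2_id"
    ).fetchall()
    idf = {token_id: math.log(total / df) for token_id, df in rows}
    conn.execute("DELETE FROM token_idf")
    conn.executemany(
        "INSERT INTO token_idf (token_v2_id, idf) VALUES (?, ?)", idf.items()
    )
    return idf
```

A token that occurs in every text pile gets an idf of 0 under this definition, so it contributes nothing to a ranking that weights tokens by idf.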