Text Pile

The Text Pile is a data structure for storing the unformatted text content that is extracted from pages during the summarizing step.

The purpose of the text pile is to provide a data source for full-text search that can also be used to generate preview snippets from any relevant point in the text.

The text pile has the following fields:

text: What has been determined to be the main file content, roughly one line per paragraph; also includes headlines, code, quotes, etc.
secondary_text: Like text, but for everything that is not marked as the main file content.
big_headlines: Bigger headlines (down to level 2 or 3, depending on the derivation algorithm); headlines are newline-separated.
small_headlines: Headlines below the level of what goes into big_headlines, same format.
code_text: Text that was marked up as code.
quote_text: Text that is marked up as being quoted.

Note that text and secondary_text together contain all of the page's content; the other fields are effectively duplicates of specific subsections, kept for the purpose of weighting them differently.

If there is no text for a given field it must contain an empty string.

Leading and trailing space characters must be stripped from each field.
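
The field rules above could be sketched in Python roughly as follows. This is only an illustration: the class name TextPile and the normalization in __post_init__ are hypothetical and not taken from an actual implementation, and "space characters" is interpreted here as all leading and trailing whitespace, which is an assumption.

```python
from dataclasses import dataclass, fields


@dataclass
class TextPile:
    """Hypothetical container mirroring the six fields of a text pile."""

    text: str = ""
    secondary_text: str = ""
    big_headlines: str = ""
    small_headlines: str = ""
    code_text: str = ""
    quote_text: str = ""

    def __post_init__(self) -> None:
        for f in fields(self):
            value = getattr(self, f.name)
            # A field without text must hold an empty string, never None.
            if value is None:
                value = ""
            # Assumption: strip every kind of leading/trailing whitespace.
            setattr(self, f.name, value.strip())


# Example: surrounding whitespace is removed, missing text becomes "".
pile = TextPile(text="  First paragraph\nSecond paragraph\n", code_text=None)
```

Normalizing in one place keeps every consumer (search index, snippet generation, digest) working on identical field values.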

Digest

To quickly detect and eliminate exact duplicates, each Text Pile has a digest that is calculated using the BLAKE2b algorithm with a 512-bit output.

Each field of the text pile is fed to the hash function, with a delimiter of three line breaks (\n\n\n) between every two consecutive fields. If a field is empty, it is fed as an empty string, so the delimiters around it are still emitted.

The fields are fed to the hash function in the following order:

text
secondary_text
big_headlines
small_headlines
code_text
quote_text
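
The digest calculation described above might look like the following Python sketch, using the standard library's hashlib.blake2b. The function name text_pile_digest, the dict-based input, and the UTF-8 encoding are assumptions made for illustration; the text above does not specify the byte encoding or the output representation.

```python
import hashlib

# Field order as listed above.
FIELD_ORDER = (
    "text",
    "secondary_text",
    "big_headlines",
    "small_headlines",
    "code_text",
    "quote_text",
)
DELIMITER = b"\n\n\n"  # three line breaks between consecutive fields


def text_pile_digest(pile: dict) -> str:
    """Return the BLAKE2b-512 digest of a text pile as a hex string."""
    hasher = hashlib.blake2b(digest_size=64)  # 64 bytes = 512 bits
    for index, name in enumerate(FIELD_ORDER):
        if index > 0:
            hasher.update(DELIMITER)
        # Empty or missing fields contribute an empty string.
        # Assumption: fields are encoded as UTF-8 before hashing.
        hasher.update(pile.get(name, "").encode("utf-8"))
    return hasher.hexdigest()


example = {"text": "Hello world", "big_headlines": "Intro"}
digest = text_pile_digest(example)
```

Because empty fields still produce their delimiters, the same text placed in two different fields yields two different digests.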