The Text Pile is a data structure for storing unformatted content text that is extracted from pages during the summarizing step.
The purpose of the text pile is to provide a data source for the full-text search that can also be used to generate preview snippets from any relevant point in the text.
The text pile has the following fields:
text
- What has been determined to be the main file content, roughly one line per paragraph. Also includes headlines, code, quotes, etc.
secondary_text
- Like text, but for everything that is not marked as the main file content.
big_headlines
- Bigger headlines (down to level 2 or 3, depending on the derivation algorithm). Headlines are newline-separated.
small_headlines
- Headlines below the level of what goes into big_headlines, same format.
code_text
- Text that was marked up as code.
quote_text
- Text that is marked up as being quoted.
Note that text and secondary_text together contain all of the page's content. The other fields are effectively duplicates of specific subsections, kept for the purpose of weighting them differently.
If there is no text for a given field it must contain an empty string.
Leading and trailing space characters must be stripped.
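The field rules above can be sketched in Python as follows. This is only an illustration: the class name TextPile, the use of a dataclass, and the choice to strip all leading and trailing whitespace (rather than only literal space characters) from each field as a whole are assumptions, not part of the specification.

```python
from dataclasses import dataclass, fields


@dataclass
class TextPile:
    """Hypothetical container mirroring the fields listed above."""

    text: str = ""
    secondary_text: str = ""
    big_headlines: str = ""
    small_headlines: str = ""
    code_text: str = ""
    quote_text: str = ""

    def __post_init__(self) -> None:
        for f in fields(self):
            value = getattr(self, f.name)
            # A field without text must hold an empty string, never None.
            # Assumption: "space characters" is read as any whitespace.
            setattr(self, f.name, (value or "").strip())


pile = TextPile(
    text="  Intro paragraph.\nSecond paragraph.\n",
    big_headlines="Title\nChapter One",
    code_text=None,
)
```

After construction, every field is a string with no surrounding whitespace, and fields that were not supplied (or were supplied as None) are empty strings.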
Digest
To quickly detect and eliminate exact duplicates, each Text Pile has a digest that is calculated using the Blake2b 512-bit algorithm.
The hash is fed each field of the text pile, with a delimiter made up of three line breaks (\n\n\n) between each two consecutive fields.
If a field is empty it is fed as an empty string.
The fields are fed to the hash function in the following order:
text
secondary_text
big_headlines
small_headlines
code_text
quote_text
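A minimal sketch of the digest calculation, using Python's hashlib.blake2b with a 64-byte (512-bit) digest. The function name text_pile_digest, the representation of a text pile as a mapping, the UTF-8 encoding of the text, and the hexadecimal output format are assumptions made for illustration. The text above does not specify them.

```python
import hashlib
from typing import Mapping

# Field order as listed above.
FIELD_ORDER = (
    "text",
    "secondary_text",
    "big_headlines",
    "small_headlines",
    "code_text",
    "quote_text",
)
DELIMITER = "\n\n\n"  # three line breaks between each two fields


def text_pile_digest(pile: Mapping[str, str]) -> str:
    """Return the Blake2b-512 digest of a text pile as a hex string."""
    hasher = hashlib.blake2b(digest_size=64)  # 64 bytes = 512 bits
    for index, name in enumerate(FIELD_ORDER):
        if index > 0:
            hasher.update(DELIMITER.encode("utf-8"))
        # An empty or missing field is fed as an empty string.
        hasher.update(pile.get(name, "").encode("utf-8"))
    return hasher.hexdigest()


quoted = {"text": "Hello", "quote_text": "World"}
as_code = {"text": "Hello", "code_text": "World"}
print(text_pile_digest(quoted) == text_pile_digest(dict(quoted)))
print(text_pile_digest(quoted) == text_pile_digest(as_code))
```

Because empty fields are still fed (as empty strings) and the delimiter is always written between fields, the same text placed in a different field yields a different digest. Feeding the hasher incrementally is equivalent to hashing the six fields joined with the delimiter.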