Text Pile Gen2

The Text Pile Gen2 (also known as Text Pile v2 or "NG") is a data structure for storing text content along with limited formatting and optional segmentation information. It is extracted from pages during the summarizing step.

It is implemented in the unobtanium-text-pile crate.

It replaces the Text Pile Gen1, which was used in unobtanium release 3.0.0 and older in place of the data structure described on this page.

The Text Pile NG stores the following data:

- The raw page text in a cleaned-up form
- Metadata:
  - Semantic markers (roughly: HTML tag semantics)
  - The language indicated in the document
- Optional segmentation information ("segmentation cache"):
  - Segment lengths
  - A relevance marker for each segment
  - Sentence lengths
  - The language guessed by the segmenter

The segmentation information is intended to be updated along with the segmentation algorithm.
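
The layout described above could be sketched roughly as follows. This is only an illustration, not the actual definition from the unobtanium-text-pile crate: all type and field names (`TextPileSketch`, `SemanticMarker`, `SegmentationCache`, `segments`) are hypothetical, lengths are assumed to be byte lengths, and the relevance marker is assumed to be a simple boolean.

```rust
/// Hypothetical semantic marker: tags a byte range of the text with a role.
#[derive(Debug, Clone, PartialEq)]
pub struct SemanticMarker {
    pub start: usize,
    pub len: usize,
    /// Assumption: a short label such as "heading" or "code".
    pub kind: String,
}

/// Hypothetical segmentation cache: derived data that may be recomputed.
#[derive(Debug, Clone, PartialEq)]
pub struct SegmentationCache {
    /// Length of each segment, assumed to be in bytes.
    pub segment_lengths: Vec<usize>,
    /// One relevance marker per segment (assumed to be a boolean here).
    pub segment_relevance: Vec<bool>,
    /// Length of each sentence, assumed to be in bytes.
    pub sentence_lengths: Vec<usize>,
    /// Language guessed by the segmenter, if any.
    pub guessed_language: Option<String>,
}

/// Hypothetical overall layout: text, metadata, optional cache.
#[derive(Debug, Clone, PartialEq)]
pub struct TextPileSketch {
    /// The cleaned-up page text, in original order.
    pub text: String,
    /// Metadata: semantic markers.
    pub markers: Vec<SemanticMarker>,
    /// Metadata: language indicated in the document.
    pub declared_language: Option<String>,
    /// Optional segmentation cache.
    pub segmentation: Option<SegmentationCache>,
}

impl TextPileSketch {
    /// Splits the text into segments using the cached lengths.
    /// Returns `None` if there is no cache or if the cached lengths
    /// do not fit the text (out of range or not on a char boundary).
    pub fn segments(&self) -> Option<Vec<&str>> {
        let cache = self.segmentation.as_ref()?;
        let mut out = Vec::with_capacity(cache.segment_lengths.len());
        let mut pos = 0usize;
        for len in &cache.segment_lengths {
            let end = pos.checked_add(*len)?;
            out.push(self.text.get(pos..end)?);
            pos = end;
        }
        Some(out)
    }
}
```

Storing only lengths next to the full text keeps the original text recoverable in order, while the cache can be dropped and rebuilt whenever the segmentation algorithm changes.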

Digest

The digest of the Text Pile NG is calculated using the BLAKE2b algorithm with a 512-bit output. It is used to detect exact duplicates.

The digest is calculated over the shortest possible postcard serialization of the text and metadata sections of the Text Pile NG.

The segmentation cache is not included because it is derived data and may change over time as the segmentation algorithm is updated.
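
The following minimal sketch illustrates only the scoping rule, namely that the digest covers text and metadata but not the segmentation cache. It is not the real implementation: to stay dependency-free it swaps in the standard library's `DefaultHasher` (a 64-bit hash) for BLAKE2b-512 and hashes the fields directly instead of a postcard serialization. The type `DigestDemoPile` and its fields are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical, simplified pile used only to demonstrate digest scoping.
pub struct DigestDemoPile {
    /// Cleaned-up page text (part of the digest).
    pub text: String,
    /// Language indicated in the document (metadata, part of the digest).
    pub declared_language: Option<String>,
    /// Stand-in for the segmentation cache (NOT part of the digest).
    pub segment_lengths: Option<Vec<usize>>,
}

impl DigestDemoPile {
    /// Stand-in digest: NOT BLAKE2b and NOT postcard-based.
    /// It only shows which fields feed into the digest.
    pub fn digest(&self) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.text.hash(&mut hasher);
        self.declared_language.hash(&mut hasher);
        // The segmentation cache is deliberately skipped: it is derived
        // data and may change when the segmentation algorithm changes.
        hasher.finish()
    }
}
```

With this scoping, two piles that differ only in their segmentation cache produce the same digest, so re-running an updated segmenter does not break duplicate detection.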

Comparison with the previous Text Pile

The old Text Pile was optimized for use with the SQLite FTS5 extension. It consisted of multiple strings, and text was appended to one of them depending on its semantics.

This had the following drawbacks:

- It couldn't recall the original text in its original order for previews
- There was no segmentation cache, so the expensive segmentation algorithm had to be run over and over
- Language metadata, which is important for segmentation, was not preserved
- Document semantics were preserved only to a limited extent
- A lot of storage was duplicated, as text could be appended to multiple fields
