Data: Text Pile Gen2

The Text Pile Gen2 (also known as Text Pile v2 or "NG") is a data structure for storing text content along with limited formatting and optional segmentation information. It is extracted from pages during the summarizing step.

It is implemented in the unobtanium-text-pile crate.

It replaces the Text Pile Gen1, which was used in unobtanium release 3.0.0 and older in place of the data structure described on this page.

The Text Pile NG stores the following data:

The segmentation information is intended to be updated along with the segmentation algorithm.

Digest

The digest of the Text Pile NG is calculated using the BLAKE2b algorithm with a 512-bit output. It is used to detect exact duplicates.

The digest is calculated over the shortest possible postcard serialization of the text and metadata sections of the Text Pile NG.

The segmentation cache is not included because it is derived data and may change over time as the segmentation algorithm is updated.
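The effect of leaving the segmentation cache out of the digest can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `TextPile` struct and its field names are hypothetical and do not reflect the actual layout in the unobtanium-text-pile crate, the length-prefixed byte encoding is a simple stand-in for the postcard serialization, and `DefaultHasher` from the standard library is a stand-in for BLAKE2b (it is 64-bit and not cryptographic). Only the principle carries over: the digest input covers the text and metadata and nothing else.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical, simplified layout; the real crate's fields may differ.
#[allow(dead_code)]
#[derive(Clone)]
struct TextPile {
    text: String,             // text content with limited formatting
    metadata: Vec<String>,    // placeholder for the metadata section
    segmentation: Vec<usize>, // derived cache, e.g. segment boundaries
}

/// Appends a string as an 8-byte little-endian length followed by its bytes.
/// Stand-in encoding, not the postcard wire format.
fn push_str(out: &mut Vec<u8>, s: &str) {
    out.extend_from_slice(&(s.len() as u64).to_le_bytes());
    out.extend_from_slice(s.as_bytes());
}

/// Builds the bytes the digest is computed over: text and metadata only.
fn digest_input(pile: &TextPile) -> Vec<u8> {
    let mut out = Vec::new();
    push_str(&mut out, &pile.text);
    out.extend_from_slice(&(pile.metadata.len() as u64).to_le_bytes());
    for entry in &pile.metadata {
        push_str(&mut out, entry);
    }
    // `pile.segmentation` is deliberately not written.
    out
}

/// Stand-in hash for illustration only; the document specifies BLAKE2b.
fn digest(pile: &TextPile) -> u64 {
    let mut hasher = DefaultHasher::new();
    digest_input(pile).hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let a = TextPile {
        text: "hello world".to_string(),
        metadata: vec!["en".to_string()],
        segmentation: vec![],
    };

    // Re-running segmentation changes only the cache, so the digest is stable.
    let mut b = a.clone();
    b.segmentation = vec![5];
    assert_eq!(digest(&a), digest(&b));

    // Changing the text changes the digest input.
    let mut c = a.clone();
    c.text.push('!');
    assert_ne!(digest_input(&a), digest_input(&c));
}
```

Because the cache never enters the digest input, a pile that is re-segmented by a newer algorithm still compares equal to its earlier version for duplicate detection.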

Comparison with the previous Text Pile

The old Text Pile was optimized for use with the SQLite FTS5 extension and had multiple strings; depending on its semantics, text was appended to one of them.

This had the following drawbacks:

See Also