Concept: Duplicate Types

In general, duplicates are different files that share the same main content, meaning the content that is relevant for indexing.

Exact Duplicates

Exact duplicates are detected using the Text Pile Digest: their Text Pile is a byte-for-byte duplicate of the original's. Link destinations and formatting (unless relevant for the Text Pile) are ignored.
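
As a rough illustration of digest-based detection, the sketch below hashes the extracted text and compares the results. It is not unobtanium's implementation: the Text Pile is modeled as a plain string, SHA-256 is an assumed hash function, and the names `text_pile_digest` and `is_exact_duplicate` are hypothetical.

```python
import hashlib


def text_pile_digest(text_pile: str) -> str:
    """Return a hex digest of the extracted text.

    SHA-256 is an assumption for this sketch; the hash actually used
    for the Text Pile Digest is not specified here.
    """
    return hashlib.sha256(text_pile.encode("utf-8")).hexdigest()


def is_exact_duplicate(pile_a: str, pile_b: str) -> bool:
    """Two documents are exact duplicates if their digests match."""
    return text_pile_digest(pile_a) == text_pile_digest(pile_b)


# Hypothetical Text Piles: pages A and B may differ in markup and link
# targets, but their extracted text is identical. Page C differs in text.
page_a_pile = "Welcome to the docs\nInstall with one command."
page_b_pile = "Welcome to the docs\nInstall with one command."
page_c_pile = "Welcome to the docs\nInstall with two commands."

a_matches_b = is_exact_duplicate(page_a_pile, page_b_pile)
a_matches_c = is_exact_duplicate(page_a_pile, page_c_pile)
```

Comparing digests rather than full texts keeps the check cheap, since only a fixed-size value has to be stored and looked up per document.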

Self-Duplicates

Self-Duplicates are a special case of exact duplicates in which the duplicated content came from the same URL earlier. If the match is exactly the previous version, unobtanium knows that the main content of the page hasn't changed, even if the server didn't explicitly say so.
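
The distinction between these cases can be sketched as a small classifier over a crawl history. This is only an illustration under assumptions: the history is modeled as a list of `(url, digest)` pairs in crawl order, digests are opaque strings, and `DuplicateKind` and `classify` are hypothetical names rather than part of unobtanium.

```python
from enum import Enum


class DuplicateKind(Enum):
    NONE = "none"                      # digest never seen before
    EXACT = "exact"                    # digest seen under a different URL
    SELF = "self"                      # digest seen earlier under the same URL
    SELF_UNCHANGED = "self_unchanged"  # digest equals the previous version


def classify(url, digest, history):
    """Classify a freshly crawled (url, digest) against earlier crawls.

    history: list of (url, digest) tuples, oldest first (assumed layout).
    """
    own_digests = [d for u, d in history if u == url]
    if own_digests and own_digests[-1] == digest:
        # Same as the immediately preceding version of this URL:
        # the main content has not changed.
        return DuplicateKind.SELF_UNCHANGED
    if digest in own_digests:
        # Matches an older version of the same URL.
        return DuplicateKind.SELF
    if any(d == digest for _, d in history):
        # Matches content that came from some other URL.
        return DuplicateKind.EXACT
    return DuplicateKind.NONE


# Hypothetical crawl history with placeholder digests.
example_history = [
    ("https://example.org/a", "d1"),
    ("https://example.org/b", "d2"),
    ("https://example.org/a", "d3"),
]
```

With this history, crawling `https://example.org/a` again and getting digest `d3` falls into the unchanged case, while getting `d1` is a self-duplicate of an older version.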