Manual: Terminology

This page describes terms that are used across Unobtanium and have a specific meaning within it.

Terms referring to things outside of Unobtanium

Homepage
A webpage that is the entry point and main navigation hub for a website. Each site may only have one homepage. It is usually linked from the header of each page via a name or icon representing the site.
Webpage
A single document consisting of one main file and possibly some auxiliary files, which can usually be fetched over HTTP(S).
Website/Site
A website is a collection of one or more pages that belong together in some sense (even if it is just a collection of random pages). That collection is usually the set of all pages sharing the same origin, though this is not always the case. For example, the same site may be accessible via both HTTP and HTTPS, a public Unix system may host a separate website under each /~user/ path, and a big web presence available in many languages may be considered one site per language.
URL-Origin/Origin
By a URL's origin, Unobtanium means the combination of the protocol, hostname and port fields. An origin that omits the port and an origin that explicitly specifies the standard port for the URL's protocol are considered equivalent.
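
The origin equivalence rule can be illustrated with a small Python sketch. It is not Unobtanium's actual implementation: the function names `origin_of` and `same_origin` and the `DEFAULT_PORTS` table are assumptions made for this example, and only HTTP and HTTPS are covered.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Standard ports per protocol. Assumption for this sketch: only HTTP and
# HTTPS are relevant; a real crawler may know more protocols.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_of(url: str) -> Tuple[str, str, Optional[int]]:
    """Return (protocol, hostname, port), filling in the standard port if omitted."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    # urlsplit already lowercases the scheme and the hostname.
    return (parts.scheme, parts.hostname or "", port)


def same_origin(a: str, b: str) -> bool:
    """Two URLs share an origin if protocol, hostname and effective port match."""
    return origin_of(a) == origin_of(b)
```

With this sketch, `https://example.org/a` and `https://example.org:443/b` share an origin, while `http://example.org/` and `https://example.org/` do not, which is why a site reachable over both protocols spans two origins.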

Terms referring to things inside Unobtanium

Entity
An entity refers to a (search-)queryable entry in the Unobtanium database. Each entity is uniquely identified by its entity-generation-UUID.
Entity-Generation
An entity-generation refers to the generation metadata attached to an entity describing its location (URL) and lifetime. If the result from querying a URL changes in a way that is significant to Unobtanium, a new entity-generation is created.
Since "entity-generation" is pretty long, it is abbreviated as "eg" in some places.
Exact Duplicate
A page with exactly the same main-content as another page.
Fluke-Event
An unlikely and temporary error where immediately retrying is a valid strategy.
Patience
Patience is a metric in the crawler, represented by a countdown, that makes the crawler eventually give up on unreachable or broken origins.
Self-Duplicate
Self-duplicates are exact duplicates that came from the same URL. They are used as an indicator of whether a page has changed.
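
How the Patience and Fluke-Event entries above might interact can be sketched in Python. Everything here is hypothetical and only illustrates the idea of a countdown: the names `OriginPatience`, `FlukeEvent`, `OriginFailure` and `crawl_once`, the starting value of 3, and the choice to count a fluke that persists through all retries as a regular failure are assumptions, not Unobtanium's actual crawler logic.

```python
from dataclasses import dataclass


class FlukeEvent(Exception):
    """Hypothetical: an unlikely, temporary error worth retrying immediately."""


class OriginFailure(Exception):
    """Hypothetical: the origin looks unreachable or broken."""


@dataclass
class OriginPatience:
    """Countdown of how many more failures are tolerated for one origin."""
    remaining: int = 3  # assumed starting value

    @property
    def exhausted(self) -> bool:
        return self.remaining <= 0

    def record_failure(self) -> None:
        # Each failure counts the patience down, never below zero.
        self.remaining = max(0, self.remaining - 1)


def crawl_once(fetch, patience: OriginPatience, max_fluke_retries: int = 1):
    """Call fetch() once for an origin; return its result, or None on failure."""
    if patience.exhausted:
        return None  # the crawler has given up on this origin
    for _ in range(max_fluke_retries + 1):
        try:
            return fetch()
        except FlukeEvent:
            continue  # fluke: retry immediately without losing patience
        except OriginFailure:
            break
    # Either a real failure, or flukes persisted through every retry
    # (assumption: that is treated like a regular failure).
    patience.record_failure()
    return None
```

In this sketch a single fluke costs no patience as long as the immediate retry succeeds, while each real failure moves the origin one step closer to being abandoned.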