Note: The crawl loop is built into the unobtanium-crawler summarize command.
The summarizing algorithm takes a crawl database and turns it into the summary database; in doing so it processes all the files that will later be available for search.
This is done by iterating over all crawl log entries and deciding what to do with each of them.
For easier understanding, the algorithm is described here as if every file were processed individually; in reality, processing is batched.
The algorithm is implemented in crawler/src/summarizer as part of the crawler.
Overview
- Fetch file information from the crawl database
- Cross-check with the summary database to only integrate entities that have not been integrated yet
- Turn the raw information from the crawler into scrape results
- Detect self-duplicates and generate candidates for exact duplicate detection
- Turn self-duplicates into crawl summaries
- Generate metadata for all non-self-duplicates:
  - Crawl summaries
  - Entity generations to create
  - Entity generations to close
  - Link summaries
  - File summaries
- Store all data derived from both self-duplicates and non-self-duplicates into the summary database
- Flag exact duplicates using the duplicate candidates from earlier
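The following is a minimal sketch of how such a pass could be structured. It assumes hypothetical in-memory stand-ins (CrawlDb, SummaryDb, summarize_pass) and is not the actual API in crawler/src/summarizer:

```rust
// Hypothetical sketch of the summarize pass; CrawlDb, SummaryDb and the
// field names are illustrative stand-ins, not the real schema or API.
use std::collections::HashSet;

struct FileInfo {
    crawl_log_entry_uuid: String,
}

struct CrawlDb {
    file_infos: Vec<FileInfo>,
}

struct SummaryDb {
    // UUIDs of crawl_log_entries that already have a crawl summary.
    summarized_crawl_log_uuids: HashSet<String>,
}

fn summarize_pass(crawl_db: &CrawlDb, summary_db: &mut SummaryDb) {
    for file_info in &crawl_db.file_infos {
        // Skip files that were already summarized in a previous run.
        if summary_db
            .summarized_crawl_log_uuids
            .contains(&file_info.crawl_log_entry_uuid)
        {
            continue;
        }
        // 1. Derive scrape results (text_pile, document_description, link_summaries).
        // 2. Self-duplicate detection and entity generation handling.
        // 3. Generate and store crawl, link and file summaries.
        summary_db
            .summarized_crawl_log_uuids
            .insert(file_info.crawl_log_entry_uuid.clone());
    }
    // 4. After the batch: flag exact duplicates among the collected candidates.
}

fn main() {
    let crawl_db = CrawlDb {
        file_infos: vec![FileInfo { crawl_log_entry_uuid: "example-uuid".into() }],
    };
    let mut summary_db = SummaryDb { summarized_crawl_log_uuids: HashSet::new() };
    summarize_pass(&crawl_db, &mut summary_db);
}
```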
Steps in Detail
Fetch file_info from the crawl database.
Fetch the corresponding crawl_log_entry.
Test whether the summary database already has a crawl summary matching the crawl_log_entry.uuid; if it does, the file has already been summarized in a previous run and the algorithm skips it.
Self-Duplicate Detection
Note: At this point a lot of independent things are started concurrently. This is mostly done to keep the batch iteration count low.
Derive the file_summary (text_pile + document_description) and link_summaries. See Deriving Content.
If the summary database already contains a self-duplicate Entity generation:
- Use its UUID as the entity_generation_uuid
- Add it to the mapping from file_id to entity_generation_uuids
Otherwise, not a self-duplicate:
- Generate a new entity_generation_uuid
- Remember the URL and text pile digest for duplicate detection later
- Remember the file_summary and link_summaries for adding to the database later
Endif
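A minimal sketch of this branch, assuming a hypothetical Pending bookkeeping struct and a pre-computed lookup result from the summary database; none of these names come from the actual code:

```rust
// Sketch of the self-duplicate branch. `Pending` and the lookup result are
// hypothetical stand-ins for the batching bookkeeping described above.
use std::collections::HashMap;

#[derive(Default)]
struct Pending {
    file_id_to_entity_generation_uuid: HashMap<u64, String>,
    // (url, text_pile digest) pairs remembered for exact-duplicate detection later.
    duplicate_candidates: Vec<(String, String)>,
}

fn assign_entity_generation_uuid(
    file_id: u64,
    url: &str,
    text_pile_digest: &str,
    // UUID of a matching self-duplicate Entity generation, if the summary
    // database already contains one.
    existing_generation: Option<String>,
    pending: &mut Pending,
) -> String {
    match existing_generation {
        // Self-duplicate: reuse the existing UUID and record the mapping.
        Some(uuid) => {
            pending
                .file_id_to_entity_generation_uuid
                .insert(file_id, uuid.clone());
            uuid
        }
        // Not a self-duplicate: mint a new UUID and remember the URL and
        // text pile digest for duplicate detection later. (The file_summary
        // and link_summaries are likewise remembered for storage later.)
        None => {
            let uuid = format!("entity-generation-{file_id}"); // stand-in for a real UUID
            pending
                .duplicate_candidates
                .push((url.to_string(), text_pile_digest.to_string()));
            uuid
        }
    }
}
```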
Generate Entity Generations
Fetch request_info for the file.
If not a self-duplicate:
Generate a new entity_generation from the data collected so far:
- url - taken from file_info
- uuid - the entity_generation_uuid generated earlier
- first_seen - taken from when the request was started, from request_info
- last_seen - same as first_seen
- time_end_confirmed - set to None
- marked_duplicate - set to false (innocent until proven guilty)
Store entity_generation into the database.
Note: From this point on, other records in the database can be connected to this record via the entity_generation_uuid.
Close any old Entity generation for the same URL, based on the first_seen time.
Endif
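For illustration, the record assembled above could look roughly like this; the struct is a hypothetical stand-in for the real database row, with timestamps simplified to integers:

```rust
// Hypothetical shape of an entity_generation record; field names follow the
// description above, timestamps are simplified to unix seconds.
struct EntityGeneration {
    uuid: String,                    // the entity_generation_uuid generated earlier
    url: String,                     // taken from file_info
    first_seen: u64,                 // when the request was started (from request_info)
    last_seen: u64,                  // same as first_seen on creation
    time_end_confirmed: Option<u64>, // no confirmed end time yet
    marked_duplicate: bool,          // innocent until proven guilty
}

fn new_entity_generation(uuid: String, url: String, request_started: u64) -> EntityGeneration {
    EntityGeneration {
        uuid,
        url,
        first_seen: request_started,
        last_seen: request_started,
        time_end_confirmed: None,
        marked_duplicate: false,
    }
}
```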
Store file_summary and link_summaries into the database.
Integrate Crawl Information
If there is a request associated with the file:
The crawl_summary is generated from the information in request_info and crawl_log_entry.
- crawl_time - taken from request_info (the time the request was sent)
- crawl_uuid - taken from request_info
- agent_uuid - resolved from the crawl_log_entry
- crawl_type - taken from the crawl_log_entry
- was_robotstxt_approved - taken from request_info
- server_last_modified - taken from request_info
- exit_code - taken from request_info
- request_duration_ms - taken from request_info
- http - taken from request_info
Store the crawl_summary into the database.
Endif
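For illustration, the crawl_summary fields could be represented roughly as follows; the types are guesses and the real schema may differ:

```rust
// Hypothetical shape of a crawl_summary record; types are illustrative guesses.
struct CrawlSummary {
    crawl_time: u64,                   // request_info: time the request was sent
    crawl_uuid: String,                // request_info
    agent_uuid: String,                // resolved from the crawl_log_entry
    crawl_type: String,                // crawl_log_entry
    was_robotstxt_approved: bool,      // request_info
    server_last_modified: Option<u64>, // request_info
    exit_code: i32,                    // request_info
    request_duration_ms: u64,          // request_info
    http: Option<String>,              // request_info (HTTP response details)
}
```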
Finding Duplicates
To find duplicates, query the database for entity generations that match the following criteria:
- Must not have a confirmed end time.
- Must be from the same origin as the original entity_generation (as a proxy for being on the same website)
- Must have the same text_pile digest as the text_pile of the original.
From the results plus the original entity_generation, the one with the shortest URL is picked as the original (other criteria are possible if better ones are available); all other entity generations in the list are marked as duplicates.
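A minimal sketch of this selection step, assuming the matching entity generations have already been fetched; the function name and types are illustrative:

```rust
// Sketch of picking the original among exact-duplicate candidates: the
// generation with the shortest URL wins, all others get marked as duplicates.
struct Candidate {
    entity_generation_uuid: String,
    url: String,
}

/// `candidates` must contain the original entity_generation plus all matches
/// from the query above. Returns (kept_uuid, uuids_to_mark_as_duplicates).
fn split_original_and_duplicates(mut candidates: Vec<Candidate>) -> (String, Vec<String>) {
    assert!(!candidates.is_empty());
    candidates.sort_by_key(|c| c.url.len());
    let original = candidates.remove(0);
    let duplicates = candidates
        .into_iter()
        .map(|c| c.entity_generation_uuid)
        .collect();
    (original.entity_generation_uuid, duplicates)
}
```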
Deriving Content
The text_pile, link_summaries, and document_description are assembled by fetching the file content from the database and running an appropriate scraping algorithm.
Todo: Link scraping algorithms here.
The document_description indexiness is calculated from the link_summaries and document_description using the indexiness algorithm.
The file_summary is generated from the file_info, document_description, and text_pile.
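A rough sketch of how these pieces fit together; the scrape and indexiness functions are placeholders for the algorithms referenced above, and the struct shapes are assumptions:

```rust
// Sketch of the content-derivation step. The scraper and indexiness functions
// are placeholder stubs; the real algorithms live elsewhere in the crawler.
struct ScrapeResult {
    text_pile: String,
    document_description: String,
    link_summaries: Vec<String>,
}

struct FileSummary {
    url: String,
    document_description: String,
    indexiness: f64,
    text_pile: String,
}

fn derive_content(file_url: &str, raw_content: &[u8]) -> (FileSummary, Vec<String>) {
    // 1. Run the scraping algorithm appropriate for the file type.
    let scraped = scrape(raw_content);
    // 2. Calculate indexiness from the document description and link summaries.
    let indexiness = indexiness(&scraped.document_description, &scraped.link_summaries);
    // 3. Build the file_summary from file info, document description and text pile.
    let file_summary = FileSummary {
        url: file_url.to_string(),
        document_description: scraped.document_description,
        indexiness,
        text_pile: scraped.text_pile,
    };
    (file_summary, scraped.link_summaries)
}

// Placeholder stubs so the sketch compiles; not the real implementations.
fn scrape(_raw: &[u8]) -> ScrapeResult {
    ScrapeResult {
        text_pile: String::new(),
        document_description: String::new(),
        link_summaries: Vec::new(),
    }
}
fn indexiness(_description: &str, _links: &[String]) -> f64 {
    0.0
}
```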