Algorithm: Summarizing

The summarizing algorithm takes a crawl database and turns it into the summary database. In doing so it processes all files that will be available for search later.

This is done by iterating over all crawl log entries and then deciding what to do with each of them.

For easier understanding, the algorithm is described as if every file were processed individually; in reality the work is batched.

The algorithm is implemented in summarizer.rs as part of the crawler.
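
As a rough orientation, the outer loop can be pictured like the following sketch. The type names, the fields, and the summarize_file helper are illustrative assumptions; the real summarizer.rs batches the work instead of handling one file per iteration.

    struct CrawlLogEntry {
        uuid: u64,
    }

    struct CrawlDb {
        entries: Vec<CrawlLogEntry>,
    }

    struct SummaryDb;

    // Per-file work, covered step by step in the sections below.
    fn summarize_file(_entry: &CrawlLogEntry, _summary_db: &mut SummaryDb) {}

    fn summarize(crawl_db: &CrawlDb, summary_db: &mut SummaryDb) {
        // Described as one file at a time; the real code processes batches.
        for entry in &crawl_db.entries {
            summarize_file(entry, summary_db);
        }
    }

    fn main() {
        let crawl_db = CrawlDb { entries: vec![CrawlLogEntry { uuid: 1 }] };
        let mut summary_db = SummaryDb;
        summarize(&crawl_db, &mut summary_db);
    }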

Steps

Fetch file_info from the crawl database.

Fetch the corresponding crawl_log_entry.

Check whether the summary database already has a crawl summary matching the crawl_log_entry.uuid. If so, the file has already been summarized in a previous run and the algorithm skips it.
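
A minimal sketch of the skip check, assuming a hypothetical SummaryDb lookup keyed by the crawl_log_entry UUID (the real schema and API may differ):

    use std::collections::HashSet;

    // UUIDs of crawl_log_entries that already have a crawl summary.
    struct SummaryDb {
        summarized: HashSet<String>,
    }

    impl SummaryDb {
        fn has_crawl_summary(&self, crawl_log_uuid: &str) -> bool {
            self.summarized.contains(crawl_log_uuid)
        }
    }

    // A file is skipped when it was already summarized in a previous run.
    fn should_skip(summary_db: &SummaryDb, crawl_log_uuid: &str) -> bool {
        summary_db.has_crawl_summary(crawl_log_uuid)
    }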

Self-Duplicate Detection

Note: At this point a lot of independent things will be started concurrently. This is mostly done to keep the batch iteration count low.

Derive the file_summary (text_pile + document_description) and link_summaries. See Deriving Content.
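
To illustrate the note above, the independent derivations can be started concurrently along these lines. The function bodies and the use of scoped threads are assumptions for illustration; the real implementation batches database work rather than spawning threads per file.

    // Each derivation only needs the raw file content, so the pieces can
    // be produced independently of each other.
    fn derive_pieces(raw: &str) -> (String, String, Vec<String>) {
        std::thread::scope(|s| {
            let text_pile = s.spawn(|| raw.to_string());
            let document_description = s.spawn(|| raw.lines().take(1).collect::<String>());
            let link_summaries = s.spawn(|| {
                raw.lines()
                    .filter(|line| line.starts_with("http"))
                    .map(|line| line.to_string())
                    .collect::<Vec<_>>()
            });
            (
                text_pile.join().unwrap(),
                document_description.join().unwrap(),
                link_summaries.join().unwrap(),
            )
        })
    }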

If the summary database already contains a self-duplicate entity generation:

Otherwise, not a self-duplicate:

Endif

Generate Entity Generations

Fetch request_info for the file.

If not a self-duplicate:

Generate a new entity_generation from the data collected so far (a sketch follows the field list below):

url
Taken from file_info.
uuid
The entity_generation_uuid generated earlier.
first_seen
The time the request was started, taken from request_info.
last_seen
Same as first_seen.
time_end_confirmed
Set to None.
marked_duplicate
Set to false (innocent until proven guilty).
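
A minimal sketch of assembling the record from the fields listed above; the type name, the field types (e.g. millisecond timestamps), and the parameter names are assumptions:

    struct EntityGeneration {
        url: String,
        uuid: String,
        first_seen: u64,                 // assumed: unix timestamp in milliseconds
        last_seen: u64,
        time_end_confirmed: Option<u64>,
        marked_duplicate: bool,
    }

    fn new_entity_generation(url: &str, entity_generation_uuid: &str, request_started: u64) -> EntityGeneration {
        EntityGeneration {
            url: url.to_string(),                       // from file_info
            uuid: entity_generation_uuid.to_string(),   // generated earlier
            first_seen: request_started,                // request start time from request_info
            last_seen: request_started,                 // same as first_seen
            time_end_confirmed: None,
            marked_duplicate: false,                    // innocent until proven guilty
        }
    }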

Store entity_generation into the database.

Note: From this point on, records in the database can be linked via the entity_generation_uuid.

Close any old entity generation based on the URL and the first_seen time.
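
One plausible reading of "closing" is sketched below: an earlier generation for the same URL that is still open (time_end_confirmed is None) gets its time_end_confirmed set to the new generation's first_seen. This is an assumption about what "close" means here, not the confirmed behaviour of summarizer.rs.

    struct EntityGeneration {
        url: String,
        first_seen: u64,
        time_end_confirmed: Option<u64>,
    }

    // Assumed semantics: close every older, still-open generation of the URL.
    fn close_old_generations(existing: &mut [EntityGeneration], url: &str, new_first_seen: u64) {
        for generation in existing.iter_mut() {
            let is_older_open_generation = generation.url == url
                && generation.first_seen < new_first_seen
                && generation.time_end_confirmed.is_none();
            if is_older_open_generation {
                generation.time_end_confirmed = Some(new_first_seen);
            }
        }
    }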

Endif

Store file_summary and link_summaries into the database.

Integrate Crawl Information

If there is a request associated with the file:

The crawl_summary is generated from the information in request_info and the crawl_log_entry (a sketch follows the field list below).

crawl_time
the time the request was sent, taken from request_info
crawl_uuid
taken from request_info
agent_uuid
resolved from the crawl_log_entry
crawl_type
taken from the crawl_log_entry
was_robotstxt_approved
taken from request_info
server_last_modified
taken from request_info
exit_code
taken from request_info
request_duration_ms
taken from request_info
http
taken from request_info
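
A minimal sketch of assembling crawl_summary from the fields listed above; all type names and field types are assumptions, and the agent_uuid resolution is reduced to a plain field read:

    struct RequestInfo {
        crawl_uuid: String,
        time_sent: u64,
        was_robotstxt_approved: bool,
        server_last_modified: Option<u64>,
        exit_code: i32,
        request_duration_ms: u64,
        http: Option<u16>,              // assumed: HTTP status code
    }

    struct CrawlLogEntry {
        agent_uuid: String,
        crawl_type: String,
    }

    struct CrawlSummary {
        crawl_time: u64,
        crawl_uuid: String,
        agent_uuid: String,
        crawl_type: String,
        was_robotstxt_approved: bool,
        server_last_modified: Option<u64>,
        exit_code: i32,
        request_duration_ms: u64,
        http: Option<u16>,
    }

    fn build_crawl_summary(request: &RequestInfo, log_entry: &CrawlLogEntry) -> CrawlSummary {
        CrawlSummary {
            crawl_time: request.time_sent,
            crawl_uuid: request.crawl_uuid.clone(),
            agent_uuid: log_entry.agent_uuid.clone(),   // resolved from the crawl_log_entry
            crawl_type: log_entry.crawl_type.clone(),
            was_robotstxt_approved: request.was_robotstxt_approved,
            server_last_modified: request.server_last_modified,
            exit_code: request.exit_code,
            request_duration_ms: request.request_duration_ms,
            http: request.http,
        }
    }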

Store the crawl_summary into the database.

Endif

Finding duplicates

To find duplicates, query the database for entity generations that match the following criteria:

From the results plus the original entity_generation, the one with the shortest URL is picked as the original (other criteria are possible if better ones become available); all other entity generations in the list are marked as duplicates.
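
A minimal sketch of the marking step, assuming the matching generations have already been queried; only the shortest-URL rule described above is implemented:

    struct EntityGeneration {
        url: String,
        marked_duplicate: bool,
    }

    fn mark_duplicates(candidates: &mut [EntityGeneration]) {
        // Index of the candidate with the shortest URL; ties keep the first hit.
        let original_idx = candidates
            .iter()
            .enumerate()
            .min_by_key(|(_, generation)| generation.url.len())
            .map(|(idx, _)| idx);
        if let Some(original_idx) = original_idx {
            for (idx, generation) in candidates.iter_mut().enumerate() {
                generation.marked_duplicate = idx != original_idx;
            }
        }
    }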

Deriving Content

The text_pile, link_summaries, and document_description are assembled by fetching the file content from the database and running an appropriate scraping algorithm.

Todo: Link scraping algorithms here.

The document_description indexiness is calculated from the link_summaries and document_description using the indexiness algorithm.

The file_summary is generated from the file_info, document_description, and text_pile.
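
A minimal sketch of the overall derivation flow; the scraping and indexiness functions are placeholders for the algorithms referenced above, and all type names and fields are assumptions:

    struct FileInfo {
        url: String,
    }

    struct DocumentDescription {
        text: String,
        indexiness: f32,
    }

    struct FileSummary {
        url: String,
        text_pile: String,
        description: DocumentDescription,
    }

    // Placeholder scraper; the real scraping algorithms are linked separately.
    fn scrape(content: &str) -> (String, Vec<String>, DocumentDescription) {
        let text_pile = content.to_string();
        let link_summaries = Vec::new();
        let description = DocumentDescription { text: String::new(), indexiness: 0.0 };
        (text_pile, link_summaries, description)
    }

    // Placeholder for the indexiness algorithm.
    fn indexiness(description: &DocumentDescription, link_summaries: &[String]) -> f32 {
        let _ = (description, link_summaries);
        0.0
    }

    fn derive_content(file_info: &FileInfo, content: &str) -> (FileSummary, Vec<String>) {
        let (text_pile, link_summaries, mut description) = scrape(content);
        description.indexiness = indexiness(&description, &link_summaries);
        let file_summary = FileSummary {
            url: file_info.url.clone(),
            text_pile,
            description,
        };
        (file_summary, link_summaries)
    }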