Note: The crawl loop is built into the unobtanium-crawler summarize command.
The summarizing algorithm takes a crawl database and turns it into the summary database; in doing so it processes all the files that will later be available for search.
This is done by iterating over all crawl log entries and deciding what to do with each of them.
For easier understanding, the algorithm is described here as if every file were processed individually; in reality, processing is batched.
The algorithm is implemented in crawler/src/summarizer as part of the crawler.
Overview
- Fetch file information from the crawl database
- Cross-check with the summary database to only integrate entities that have not been integrated yet
- Turn the raw information from the crawler into scrape results
- Detect self-duplicates and generate candidates for exact duplicate detection
- Turn self-duplicates into crawl summaries
- Generate metadata for all non-self-duplicates:
  - Crawl summaries
  - Entity generations to create
  - Entity generations to close
  - Link summaries
  - File summaries
- Store all data derived from both self-duplicates and non-self-duplicates into the summary database
- Flag exact duplicates using the duplicate candidates from earlier
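The following is a minimal sketch of how such a pass could be structured. It assumes hypothetical in-memory stand-ins (CrawlDb, SummaryDb, summarize_pass) and is not the actual API in crawler/src/summarizer:

```rust
// Hypothetical sketch of the summarize pass; CrawlDb, SummaryDb and the
// field names are illustrative stand-ins, not the real schema or API.
use std::collections::HashSet;

struct FileInfo {
    crawl_log_entry_uuid: String,
}

struct CrawlDb {
    file_infos: Vec<FileInfo>,
}

struct SummaryDb {
    // UUIDs of crawl_log_entries that already have a crawl summary.
    summarized_crawl_log_uuids: HashSet<String>,
}

fn summarize_pass(crawl_db: &CrawlDb, summary_db: &mut SummaryDb) {
    for file_info in &crawl_db.file_infos {
        // Skip files that were already summarized in a previous run.
        if summary_db
            .summarized_crawl_log_uuids
            .contains(&file_info.crawl_log_entry_uuid)
        {
            continue;
        }
        // 1. Derive scrape results (text_pile, document_description, link_summaries).
        // 2. Self-duplicate detection and entity generation handling.
        // 3. Generate and store crawl, link and file summaries.
        summary_db
            .summarized_crawl_log_uuids
            .insert(file_info.crawl_log_entry_uuid.clone());
    }
    // 4. After the batch: flag exact duplicates among the collected candidates.
}

fn main() {
    let crawl_db = CrawlDb {
        file_infos: vec![FileInfo { crawl_log_entry_uuid: "example-uuid".into() }],
    };
    let mut summary_db = SummaryDb { summarized_crawl_log_uuids: HashSet::new() };
    summarize_pass(&crawl_db, &mut summary_db);
}
```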
Steps in Detail
Fetch file_info from the crawl database.
Fetch the corresponding crawl_log_entry.
Test whether the summary database already has a crawl summary matching the crawl_log_entry.uuid; if it does, the file has already been summarized in a previous run and the algorithm skips it.
Self-Duplicate Detection
Note: At this point a lot of independent things are started concurrently. This is mostly done to keep the batch iteration count low.
Derive the file_summary (text_pile + document_description) and link_summaries. See Deriving Content.
If the summary database already contains a self-duplicate Entity generation:
- Use its UUID as the entity_generation_uuid
- Add it to the mapping from file_id to entity_generation_uuids
Otherwise, not a self-duplicate:
- Generate a new entity_generation_uuid
- Remember the URL and text pile digest for duplicate detection later
- Remember the file_summary and link_summaries for adding to the database later
Endif
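A minimal sketch of this branch, assuming a hypothetical Pending bookkeeping struct and a pre-computed lookup result from the summary database; none of these names come from the actual code:

```rust
// Sketch of the self-duplicate branch. `Pending` and the lookup result are
// hypothetical stand-ins for the batching bookkeeping described above.
use std::collections::HashMap;

#[derive(Default)]
struct Pending {
    file_id_to_entity_generation_uuid: HashMap<u64, String>,
    // (url, text_pile digest) pairs remembered for exact-duplicate detection later.
    duplicate_candidates: Vec<(String, String)>,
}

fn assign_entity_generation_uuid(
    file_id: u64,
    url: &str,
    text_pile_digest: &str,
    // UUID of a matching self-duplicate Entity generation, if the summary
    // database already contains one.
    existing_generation: Option<String>,
    pending: &mut Pending,
) -> String {
    match existing_generation {
        // Self-duplicate: reuse the existing UUID and record the mapping.
        Some(uuid) => {
            pending
                .file_id_to_entity_generation_uuid
                .insert(file_id, uuid.clone());
            uuid
        }
        // Not a self-duplicate: mint a new UUID and remember the URL and
        // text pile digest for duplicate detection later. (The file_summary
        // and link_summaries are likewise remembered for storage later.)
        None => {
            let uuid = format!("entity-generation-{file_id}"); // stand-in for a real UUID
            pending
                .duplicate_candidates
                .push((url.to_string(), text_pile_digest.to_string()));
            uuid
        }
    }
}
```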
Generate Entity Generations
Fetch request_info for the file.
If not a self-duplicate:
Generate a new entity_generation from the data collected so far:
- url - taken from file_info
- uuid - the entity_generation_uuid generated earlier
- first_seen - taken from when the request was started, from request_info
- last_seen - same as first_seen
- time_end_confirmed - set to None
- marked_duplicate - set to false (innocent until proven guilty)
Store entity_generation into the database.
Note: From this point on, other records in the database can be connected to this record via the entity_generation_uuid.
Close any old Entity generation for the same URL, based on the first_seen time.
Endif
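For illustration, the record assembled above could look roughly like this; the struct is a hypothetical stand-in for the real database row, with timestamps simplified to integers:

```rust
// Hypothetical shape of an entity_generation record; field names follow the
// description above, timestamps are simplified to unix seconds.
struct EntityGeneration {
    uuid: String,                    // the entity_generation_uuid generated earlier
    url: String,                     // taken from file_info
    first_seen: u64,                 // when the request was started (from request_info)
    last_seen: u64,                  // same as first_seen on creation
    time_end_confirmed: Option<u64>, // no confirmed end time yet
    marked_duplicate: bool,          // innocent until proven guilty
}

fn new_entity_generation(uuid: String, url: String, request_started: u64) -> EntityGeneration {
    EntityGeneration {
        uuid,
        url,
        first_seen: request_started,
        last_seen: request_started,
        time_end_confirmed: None,
        marked_duplicate: false,
    }
}
```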
Store file_summary and link_summaries into the database.
Integrate Crawl Information
If there is a request associated with the file:
The crawl_summary is generated from the information in request_info and crawl_log_entry.
- crawl_time - taken from request_info (the time the request was sent)
- crawl_uuid - taken from request_info
- agent_uuid - resolved from the crawl_log_entry
- crawl_type - taken from the crawl_log_entry
- was_robotstxt_approved - taken from request_info
- server_last_modified - taken from request_info
- exit_code - taken from request_info
- request_duration_ms - taken from request_info
- http - taken from request_info
Store the crawl_summary into the database.
Endif
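For illustration, the crawl_summary fields could be represented roughly as follows; the types are guesses and the real schema may differ:

```rust
// Hypothetical shape of a crawl_summary record; types are illustrative guesses.
struct CrawlSummary {
    crawl_time: u64,                   // request_info: time the request was sent
    crawl_uuid: String,                // request_info
    agent_uuid: String,                // resolved from the crawl_log_entry
    crawl_type: String,                // crawl_log_entry
    was_robotstxt_approved: bool,      // request_info
    server_last_modified: Option<u64>, // request_info
    exit_code: i32,                    // request_info
    request_duration_ms: u64,          // request_info
    http: Option<String>,              // request_info (HTTP response details)
}
```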
Finding Duplicates
To find duplicates, query the database for entity generations that match the following criteria:
- Must not have a confirmed end time.
- Must be from the same origin as the original entity_generation (as a proxy for being on the same website)
- Must have the same text_pile digest as the text_pile of the original.
From the results plus the original entity_generation, the one with the shortest URL is picked as the original (other criteria are possible if better ones are available); all other entity generations in the list are marked as duplicates.
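A minimal sketch of this selection step, assuming the matching entity generations have already been fetched; the function name and types are illustrative:

```rust
// Sketch of picking the original among exact-duplicate candidates: the
// generation with the shortest URL wins, all others get marked as duplicates.
struct Candidate {
    entity_generation_uuid: String,
    url: String,
}

/// `candidates` must contain the original entity_generation plus all matches
/// from the query above. Returns (kept_uuid, uuids_to_mark_as_duplicates).
fn split_original_and_duplicates(mut candidates: Vec<Candidate>) -> (String, Vec<String>) {
    assert!(!candidates.is_empty());
    candidates.sort_by_key(|c| c.url.len());
    let original = candidates.remove(0);
    let duplicates = candidates
        .into_iter()
        .map(|c| c.entity_generation_uuid)
        .collect();
    (original.entity_generation_uuid, duplicates)
}
```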
Deriving Content
The text_pile, link_summaries, and document_description are assembled by fetching the file content from the database and running an appropriate scraping algorithm.
Todo: Link scraping algorithms here.
The document_description indexiness is calculated from the link_summaries and document_description using the indexiness algorithm.
The file_summary is generated from the file_info, document_description, and text_pile.
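A rough sketch of how these pieces fit together; the scrape and indexiness functions are placeholders for the algorithms referenced above, and the struct shapes are assumptions:

```rust
// Sketch of the content-derivation step. The scraper and indexiness functions
// are placeholder stubs; the real algorithms live elsewhere in the crawler.
struct ScrapeResult {
    text_pile: String,
    document_description: String,
    link_summaries: Vec<String>,
}

struct FileSummary {
    url: String,
    document_description: String,
    indexiness: f64,
    text_pile: String,
}

fn derive_content(file_url: &str, raw_content: &[u8]) -> (FileSummary, Vec<String>) {
    // 1. Run the scraping algorithm appropriate for the file type.
    let scraped = scrape(raw_content);
    // 2. Calculate indexiness from the document description and link summaries.
    let indexiness = indexiness(&scraped.document_description, &scraped.link_summaries);
    // 3. Build the file_summary from file info, document description and text pile.
    let file_summary = FileSummary {
        url: file_url.to_string(),
        document_description: scraped.document_description,
        indexiness,
        text_pile: scraped.text_pile,
    };
    (file_summary, scraped.link_summaries)
}

// Placeholder stubs so the sketch compiles; not the real implementations.
fn scrape(_raw: &[u8]) -> ScrapeResult {
    ScrapeResult {
        text_pile: String::new(),
        document_description: String::new(),
        link_summaries: Vec::new(),
    }
}
fn indexiness(_description: &str, _links: &[String]) -> f64 {
    0.0
}
```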