The summarizing algorithm takes a crawl database and turns it into the summary database; in doing so it processes all files that will be available for search later.
This is done by iterating over all crawl log entries and deciding what to do with each of them.
For easier understanding, the algorithm is described here as if every file were processed individually; in reality it is batched.
The algorithm is implemented in summarizer.rs as part of the crawler.
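As a rough orientation, the driver can be pictured as the following minimal sketch. All names (CrawlLogEntry, summarize_one, batch_size) are illustrative and not taken from summarizer.rs; only the batched iteration comes from the description above.

```rust
/// Illustrative stand-in for a crawl log entry.
struct CrawlLogEntry {
    uuid: u128,
}

/// Hypothetical batched driver: the per-file steps described below are run
/// for every entry, but grouped into batches to keep iteration counts low.
fn summarize_batched(entries: Vec<CrawlLogEntry>, batch_size: usize) {
    for batch in entries.chunks(batch_size.max(1)) {
        for entry in batch {
            summarize_one(entry);
        }
    }
}

fn summarize_one(entry: &CrawlLogEntry) {
    // Placeholder for the per-file steps described in this document.
    let _ = entry.uuid;
}
```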
Steps
Fetch file_info from the crawl database.
Fetch the corresponding crawl_log_entry.
Test whether the summary database already has a crawl summary matching the crawl_log_entry.uuid; if yes, that file has already been summarized in a previous run and the algorithm skips it.
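A minimal sketch of this fetch-and-skip step, assuming in-memory maps stand in for the two databases; the field and accessor names (file_id, crawl_log_uuid, crawl_summaries) are illustrative, not taken from the real schema.

```rust
use std::collections::{HashMap, HashSet};

struct FileInfo {
    file_id: u64,
    crawl_log_uuid: u128, // link to the crawl_log_entry (assumed layout)
}

struct CrawlLogEntry {
    uuid: u128,
}

struct CrawlDb {
    file_infos: HashMap<u64, FileInfo>,
    log_entries: HashMap<u128, CrawlLogEntry>,
}

struct SummaryDb {
    // Crawl summaries already stored, keyed by crawl_log_entry.uuid.
    crawl_summaries: HashSet<u128>,
}

/// Returns None when the file was already summarized in a previous run.
fn fetch_unsummarized<'a>(
    crawl_db: &'a CrawlDb,
    summary_db: &SummaryDb,
    file_id: u64,
) -> Option<(&'a FileInfo, &'a CrawlLogEntry)> {
    let file_info = crawl_db.file_infos.get(&file_id)?;
    let entry = crawl_db.log_entries.get(&file_info.crawl_log_uuid)?;
    if summary_db.crawl_summaries.contains(&entry.uuid) {
        return None; // already summarized, skip this file
    }
    Some((file_info, entry))
}
```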
Self-Duplicate Detection
Note: At this point a lot of independent things will be started concurrently. This is mostly done to keep the batch iteration count low.
Derive the file_summary (text_pile + document_description) and link_summaries. See Deriving Content.
If the summary database already contains a self-duplicate Entity generation:
- Use its UUID as the entity_generation_uuid
- Add it to the mapping from file_id to entity_generation_uuids
Otherwise (not a self-duplicate):
- Generate a new entity_generation_uuid
- Remember the URL and text pile digest for duplicate detection later
- Remember the file_summary and link_summaries for adding to the database later
Endif
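A sketch of this decision, under the assumption that an existing self-duplicate generation is matched by URL plus text pile digest and that an in-memory map stands in for the summary database lookup; all type, field, and function names are illustrative.

```rust
use std::collections::HashMap;

/// Data remembered for a file that is not a self-duplicate.
struct PendingSummary {
    file_summary: String,
    link_summaries: Vec<String>,
    url: String,
    text_pile_digest: [u8; 32],
}

/// In-memory bookkeeping for one summarizer run (hypothetical).
struct SelfDuplicateIndex {
    /// Known generations, keyed by (URL, text pile digest); the exact
    /// self-duplicate criterion is an assumption here.
    known: HashMap<(String, [u8; 32]), u128>,
    /// Mapping from file_id to entity_generation_uuid.
    file_to_generation: HashMap<u64, u128>,
    /// Summaries held back for a later batched database write.
    pending: HashMap<u128, PendingSummary>,
}

impl SelfDuplicateIndex {
    fn assign(&mut self, file_id: u64, summary: PendingSummary) -> u128 {
        let key = (summary.url.clone(), summary.text_pile_digest);
        if let Some(&existing) = self.known.get(&key) {
            // Self-duplicate: reuse the existing entity_generation_uuid.
            self.file_to_generation.insert(file_id, existing);
            existing
        } else {
            // Not a self-duplicate: new UUID, remember URL + digest for
            // duplicate detection and the summaries for storage later.
            let uuid = new_uuid();
            self.known.insert(key, uuid);
            self.file_to_generation.insert(file_id, uuid);
            self.pending.insert(uuid, summary);
            uuid
        }
    }
}

/// Stand-in for a real UUID generator (e.g. the uuid crate).
fn new_uuid() -> u128 {
    use std::time::{SystemTime, UNIX_EPOCH};
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_nanos()
}
```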
Generate Entity Generations
Fetch request_info for the file.
If not a self-duplicate:
Generate a new entity_generation from the data collected so far:
- url - taken from file_info
- uuid - the entity_generation_uuid generated earlier
- first_seen - taken from when the request was started, from request_info
- last_seen - same as first_seen
- time_end_confirmed - set to None
- marked_duplicate - set to false (innocent until proven guilty)
Store the entity_generation into the database.
Note: From this point on, things in the database can be connected via the entity_generation_uuid.
Close any old Entity generation based on the URL and the first_seen time.
Endif
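A sketch of how the entity_generation could be assembled and old generations closed. The field names follow the list above; the time representation and the in-memory slice standing in for the database are assumptions.

```rust
struct RequestInfo {
    time_sent: u64, // when the request was started (epoch millis, assumed)
}

struct FileInfo {
    url: String,
}

struct EntityGeneration {
    uuid: u128,
    url: String,
    first_seen: u64,
    last_seen: u64,
    time_end_confirmed: Option<u64>,
    marked_duplicate: bool,
}

fn build_entity_generation(
    entity_generation_uuid: u128,
    file_info: &FileInfo,
    request_info: &RequestInfo,
) -> EntityGeneration {
    EntityGeneration {
        uuid: entity_generation_uuid,
        url: file_info.url.clone(),
        // first_seen is taken from when the request was started.
        first_seen: request_info.time_sent,
        // last_seen starts out equal to first_seen.
        last_seen: request_info.time_sent,
        // No confirmed end time yet.
        time_end_confirmed: None,
        // Innocent until proven guilty: duplicates are marked later.
        marked_duplicate: false,
    }
}

/// Close older open generations for the same URL by confirming their end
/// time at the new generation's first_seen (database write omitted).
fn close_old_generations(generations: &mut [EntityGeneration], url: &str, first_seen: u64) {
    for g in generations.iter_mut() {
        if g.url == url && g.time_end_confirmed.is_none() && g.first_seen < first_seen {
            g.time_end_confirmed = Some(first_seen);
        }
    }
}
```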
Store the file_summary and link_summaries into the database.
Integrate Crawl Information
If there is a request associated with the file:
The crawl_summary is generated from the information in request_info and crawl_log_entry:
- crawl_time - taken from the request_info time sent
- crawl_uuid - taken from request_info
- agent_uuid - resolved from the crawl_log_entry
- crawl_type - taken from the crawl_log_entry
- was_robotstxt_approved - taken from request_info
- server_last_modified - taken from request_info
- exit_code - taken from request_info
- request_duration_ms - taken from request_info
- http - taken from request_info
Store the crawl_summary into the database.
Endif
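A sketch of crawl_summary assembly; the field names mirror the list above, while the concrete types, the shape of the http payload, and the way the agent is identified in the crawl_log_entry are assumptions.

```rust
struct RequestInfo {
    time_sent: u64,
    crawl_uuid: u128,
    was_robotstxt_approved: bool,
    server_last_modified: Option<u64>,
    exit_code: i32,
    request_duration_ms: u64,
    http: Option<String>, // serialized HTTP details, type assumed
}

struct CrawlLogEntry {
    agent_name: String, // how the agent is referenced here is an assumption
    crawl_type: String,
}

struct CrawlSummary {
    crawl_time: u64,
    crawl_uuid: u128,
    agent_uuid: u128,
    crawl_type: String,
    was_robotstxt_approved: bool,
    server_last_modified: Option<u64>,
    exit_code: i32,
    request_duration_ms: u64,
    http: Option<String>,
}

fn build_crawl_summary(
    request_info: &RequestInfo,
    crawl_log_entry: &CrawlLogEntry,
    // agent_uuid is resolved from the crawl_log_entry; the resolver is
    // passed in here as a placeholder.
    resolve_agent_uuid: impl Fn(&str) -> u128,
) -> CrawlSummary {
    CrawlSummary {
        crawl_time: request_info.time_sent,
        crawl_uuid: request_info.crawl_uuid,
        agent_uuid: resolve_agent_uuid(&crawl_log_entry.agent_name),
        crawl_type: crawl_log_entry.crawl_type.clone(),
        was_robotstxt_approved: request_info.was_robotstxt_approved,
        server_last_modified: request_info.server_last_modified,
        exit_code: request_info.exit_code,
        request_duration_ms: request_info.request_duration_ms,
        http: request_info.http.clone(),
    }
}
```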
Finding Duplicates
To find duplicates, query the database for entity generations that match the following criteria:
- Must not have a confirmed end time.
- Must be from the same origin as the original entity_generation (as a proxy for being on the same website).
- Must have the same text_pile digest as the text_pile of the original.
From the results plus the original entity_generation, the one with the shortest URL is picked as the original (other criteria are possible if better ones become available); all other entity generations in the list are marked as duplicates.
Deriving Content
The text_pile, link_summaries and document_description are assembled by fetching the file content from the database and running an appropriate scraping algorithm.
Todo: Link scraping algorithms here.
The document_description indexiness is calculated from the link_summaries and document_description using the indexiness algorithm.
The file_summary is generated from the file_info, document_description, and text_pile.
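A sketch of how the derived pieces could be combined; all types are illustrative, and the indexiness function is only a placeholder since the actual indexiness algorithm is not described here.

```rust
struct FileInfo {
    url: String,
}

struct DocumentDescription {
    title: Option<String>,
    indexiness: f64,
}

struct LinkSummary {
    target_url: String,
    link_text: String,
}

struct FileSummary {
    url: String,
    title: Option<String>,
    text_pile: String,
    indexiness: f64,
}

/// Placeholder for the real indexiness algorithm, which is not described here.
fn indexiness(_description: &DocumentDescription, _links: &[LinkSummary]) -> f64 {
    0.0
}

fn derive_file_summary(
    file_info: &FileInfo,
    mut description: DocumentDescription,
    text_pile: String,
    link_summaries: &[LinkSummary],
) -> FileSummary {
    // The indexiness is calculated from the link_summaries and the
    // document_description.
    description.indexiness = indexiness(&description, link_summaries);
    // The file_summary combines file_info, document_description and text_pile.
    FileSummary {
        url: file_info.url.clone(),
        title: description.title,
        text_pile,
        indexiness: description.indexiness,
    }
}
```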