Batching up database queries is something you'll see a lot in unobtanium, mostly in the summarizer, which should, in the best case, process as many pages as possible as fast as possible.
The speedup boils down to SQL parsing and query overhead, which stay the same whether a query is used once or 1024 times: let the database do that work once, then let the main logic get some work done before calling out to the database again.
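As a minimal sketch of the idea (the `pages` table and column names are made up for illustration): instead of running the same single-row query N times, one query with a dynamic `IN`-list is built and parsed once per batch.

```rust
// Per-item version: parsed and round-tripped once per URL.
fn single_query() -> &'static str {
    "SELECT url, hash FROM pages WHERE url = ?1"
}

// Batched version: one placeholder per URL in the batch, so the
// parse and query overhead is paid once for the whole batch.
fn bulk_query(n: usize) -> String {
    let placeholders: Vec<String> = (1..=n).map(|i| format!("?{i}")).collect();
    format!(
        "SELECT url, hash FROM pages WHERE url IN ({})",
        placeholders.join(", ")
    )
}

fn main() {
    println!("{}", single_query());
    println!("{}", bulk_query(4)); // ... WHERE url IN (?1, ?2, ?3, ?4)
}
```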
The methods on the database explicitly made for batching are suffixed `_bulk`; otherwise they have the same name as the regular version (if there is one). These methods usually take a `Vec` of data and return a `HashMap` of data.
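A hedged sketch of that naming convention follows; the `Database` type and the `get_hash` methods are illustrative stand-ins, not unobtanium's actual API.

```rust
use std::collections::HashMap;

// Stub database; in unobtanium this would wrap the real storage layer.
struct Database {
    hashes: HashMap<String, u64>, // url -> content hash
}

impl Database {
    /// Single-item lookup: one query per call.
    fn get_hash(&self, url: &str) -> Option<u64> {
        self.hashes.get(url).copied()
    }

    /// Batched counterpart, suffixed `_bulk`: takes a `Vec` of keys and
    /// returns a `HashMap` containing only the keys the database knows.
    fn get_hash_bulk(&self, urls: Vec<String>) -> HashMap<String, u64> {
        urls.into_iter()
            .filter_map(|url| self.hashes.get(&url).copied().map(|h| (url, h)))
            .collect()
    }
}

fn main() {
    let db = Database {
        hashes: HashMap::from([("https://example.com/a".to_string(), 42)]),
    };
    assert_eq!(db.get_hash("https://example.com/a"), Some(42));

    let found = db.get_hash_bulk(vec![
        "https://example.com/a".to_string(),
        "https://example.com/b".to_string(),
    ]);
    assert_eq!(found.len(), 1); // only the known URL comes back
}
```

Returning a `HashMap` instead of a `Vec` means callers don't have to rely on result ordering and can see at a glance which inputs had no match.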
When something is implemented as batched, the code alternates between building the list of arguments for the next database queries, possibly filtering them, and doing the queries. Where it makes sense, checks are in place that abort the algorithm early if all work has been filtered out (e.g. in the summarizer after checking for already integrated files).
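The following sketch shows that build/filter/query alternation with an early abort; the function names and types are hypothetical, not taken from unobtanium's code.

```rust
use std::collections::{HashMap, HashSet};

fn summarize_batch(pages: Vec<String>, integrated: &HashSet<String>) {
    // 1. Build the argument list for the next database query,
    //    filtering out work that is already done.
    let candidates: Vec<String> = pages
        .into_iter()
        .filter(|url| !integrated.contains(url))
        .collect();

    // 2. Abort early if filtering removed everything: no point in
    //    paying query overhead for an empty batch.
    if candidates.is_empty() {
        return;
    }

    // 3. One bulk query for the whole remaining batch.
    let hashes: HashMap<String, u64> = db_get_hashes_bulk(&candidates);

    // ... next build/filter/query round would start here ...
    let _ = hashes;
}

// Stub standing in for a real `_bulk` database method.
fn db_get_hashes_bulk(urls: &[String]) -> HashMap<String, u64> {
    urls.iter().map(|u| (u.clone(), 0)).collect()
}

fn main() {
    let integrated = HashSet::from(["https://example.com/old".to_string()]);
    // Everything is filtered out, so the function returns before querying.
    summarize_batch(vec!["https://example.com/old".to_string()], &integrated);
}
```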
Implementation Considerations
When batching, make sure you know your data flow and be aware that batching may interfere with how data is selected.
This may make checks against the current batch necessary, in addition to checks against the database.
A case where batching led to an undesired edge case is the summarizing algorithm: its test for self-duplicates relies on testing against all already integrated data, but fails to check against the data still in the pipeline, which causes problems whenever the crawl results are shorter than the batch size.
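A minimal sketch of the kind of check that closes this gap, assuming a hypothetical `db_is_integrated` lookup: each item is tested against the current batch as well as the database.

```rust
use std::collections::HashSet;

// Stub standing in for the real "already integrated?" database check.
fn db_is_integrated(_hash: u64) -> bool {
    false // pretend nothing has been integrated yet
}

fn filter_duplicates(hashes: Vec<u64>) -> Vec<u64> {
    let mut seen_in_batch = HashSet::new();
    hashes
        .into_iter()
        .filter(|h| {
            // Checking only the database misses duplicates that are
            // still in the pipeline; the in-batch set catches those.
            !db_is_integrated(*h) && seen_in_batch.insert(*h)
        })
        .collect()
}

fn main() {
    // Two identical pages in one batch: only the first survives,
    // even though neither is in the database yet.
    assert_eq!(filter_duplicates(vec![1, 2, 1]), vec![1, 2]);
}
```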