Part of why the internet works so well most of the time is that it isn't perfect. Unfortunately, that also means it sometimes won't work, for no immediately apparent reason.
Given that the index on unobtanium.rocks has outgrown 100k pages, it is almost guaranteed that some requests won't make it while crawling, whether because of networking problems or outages.
How a Human Would Handle the Internet not Working
Assume that, as a human, one goes to a website and … it doesn't load.
Since thinking is expensive, one probably hits the reload button out of intuition; if it works now, everything is fine, we got what we came for.
If it still doesn't work, we give up and maybe switch to problem solving or try another website, provided we think our connection is working.
What the Crawler does
The unobtanium crawler is pretty stupid: it doesn't know how to solve networking problems, so the best it can do is categorise the problems by whether someone else could have solved them in the meantime.
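As a rough sketch, that categorisation could look like the following Python; the error kinds and the exception types mapped to them are illustrative assumptions, not the crawler's actual code.

```python
from enum import Enum, auto

class ErrorKind(Enum):
    """Rough categories for failed requests (illustrative, not the real taxonomy)."""
    FLUKE = auto()       # transient: a single retry might already succeed
    PERSISTENT = auto()  # looks like a real problem with the origin
    FATAL = auto()       # something on our own side is broken, e.g. the database

def classify(error: Exception) -> ErrorKind:
    """Map an exception to a category; the concrete types here are stand-ins."""
    if isinstance(error, (TimeoutError, ConnectionResetError)):
        return ErrorKind.FLUKE
    if isinstance(error, RuntimeError):  # stand-in for "our own infrastructure failed"
        return ErrorKind.FATAL
    return ErrorKind.PERSISTENT
```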
Fluke Events
For things like a timeout or a connection problem, it could well be that the Internet just swallowed the wrong packet, and retrying is worth a shot; this is called a fluke event.
If the retry works, the crawler discards the error, notes a "fluke recovery" and continues with the successful result.
If not, the crawler notes a "fluke miss" and records that an error happened.
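In code, the single-retry logic could look roughly like this sketch, reusing the hypothetical classify helper from above; the fetch callable and the stats counter are assumptions for illustration, not the crawler's real interface.

```python
from collections import Counter

def fetch_with_fluke_retry(fetch, url: str, stats: Counter):
    """Retry a request exactly once when the first failure looks like a fluke."""
    try:
        return fetch(url)
    except Exception as first_error:
        if classify(first_error) is not ErrorKind.FLUKE:
            raise  # not a fluke, let the patience logic deal with it
        try:
            result = fetch(url)               # the one retry
            stats["fluke_recovery"] += 1      # discard the error, keep the result
            return result
        except Exception:
            stats["fluke_miss"] += 1          # the retry failed too
            raise first_error                 # record that an error happened
```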
Patience
For errors that indicate a larger problem if they happen too often, the crawler keeps a patience counter that starts at some positive number and is counted down whenever such an error happens.
For fatal errors, like the database not working, the patience counter is immediately short-circuited to 0.
When the counter hits 0, the crawler gives up crawling the origin that is throwing the errors until the crawler is restarted for the next run.
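A minimal sketch of such a counter, assuming one object per origin and a made-up starting budget of 5:

```python
class Patience:
    """Per-origin patience counter; the starting budget of 5 is a made-up example."""
    def __init__(self, budget: int = 5):
        self.remaining = budget

    def record_error(self, fatal: bool = False) -> None:
        if fatal:
            self.remaining = 0    # short-circuit straight to giving up
        else:
            self.remaining -= 1   # one step closer to giving up on this origin

    @property
    def exhausted(self) -> bool:
        return self.remaining <= 0  # give up on this origin until the next run
```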
Interaction between Fluke Events and Patience
Fluke events and patience interact in a complementary way.
Fluke events prevent little problems from running the patience counter low, while the patience counter makes sure the crawler isn't wasting an endless number of requests on retries when there is a larger problem.
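Put together, a crawl loop for one origin might wire the two mechanisms up like this. It builds on the sketches above and assumes, as one plausible reading, that any error surviving the fluke retry also drains patience.

```python
def crawl_origin(fetch, urls, stats: Counter) -> None:
    """Sketch of how fluke retries and patience combine for one origin."""
    patience = Patience()
    for url in urls:
        if patience.exhausted:
            break  # give up on this origin until the next run
        try:
            page = fetch_with_fluke_retry(fetch, url, stats)
            # ... hand the page over to the indexer ...
        except Exception as error:
            # Only errors that survived the fluke retry reach the patience counter.
            patience.record_error(fatal=classify(error) is ErrorKind.FATAL)
```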
These mechanisms are both likely to be tweaked in the future to make them better at dealing with the Internet's small and big problems.