Data: Crawl exit code

Exit codes enumerate the outcome of a crawl request at a high level; each code has a fixed integer value. Referring to them by name is preferred; the integer representation exists mainly for compact storage in databases.

Note: Some of these exit codes date from a time when unobtanium had a very different, more database-coupled architecture. They are unused now but remain listed to keep their numbers reserved.

| Id   | Name                           | Description |
|-----:|--------------------------------|-------------|
| -3   | database_error                 | The crawl couldn't be finished because of a database error. |
| -2   | cancelled                      | The crawl was cancelled midway (it would also be okay to discard those). |
| -1   | someone_stole_my_work          | A race condition was detected in a task queue. |
| 20   | file_ingested                  | A file was crawled and its content ingested. |
| 29   | file_of_unknown_type           | A file was found at the requested location, but the crawler doesn't know what to do with it. |
| 31   | permanent_redirect             | The server indicated a redirect and hinted that it isn't going away soon. |
| 32   | redirect                       | The server responded with a redirect; this may change in the future. |
| 34   | file_did_not_change            | The server communicated that the file did not change since the last request. |
| 40   | server_blamed_client           | The server's response indicated that the client sent an invalid request; this includes authentication errors. |
| 41   | file_gone                      | The server indicated that the requested resource isn't available and won't come back. |
| 42   | did_not_understand_answer      | The crawler couldn't read the server's answer because of a protocol error or encoding issue. |
| 44   | file_not_found                 | The server communicated that there is no resource at the requested location; this may change in the future. |
| 49   | rate_limited                   | The server communicated that the crawler was going too fast. |
| 50   | server_internal_error          | The server couldn't answer because of an internal error. |
| 100  | connection_failed              | Could not connect to the server. |
| 101  | request_timeout                | The server took too long to respond to the request. |
| 102  | error_reading_response         | A problem occurred while reading the server response (e.g. unexpected connection termination). |
| 170  | blocked_by_robots_txt          | At the request of the server's robots.txt, the resource wasn't crawled. |
| 171  | blocked_at_request_of_remote   | The server requested that the given resource not be crawled. |
| 172  | blocked_origin_by_local_policy | Crawling was blocked by a local policy on the origin level. |
| 173  | blocked_url_by_local_policy    | Crawling was blocked by a local policy on the URL level. |
| 180  | not_canonical                  | The content was discarded because it marked itself as not being canonical. |
| 181  | duplicate                      | The resource was found to be a duplicate by the crawler or a post-processing stage. |
| -999 | unknown_error                  | Placeholder for errors that don't have an exit code assigned yet. |
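
Mirrored in code, the enumeration might look like the following Python sketch (illustrative only; unobtanium's actual implementation language and type name are assumptions, with member names kept exactly as documented above):

```python
from enum import IntEnum


class CrawlExitCode(IntEnum):
    """High-level outcome of a crawl request; integer values are fixed."""

    database_error = -3
    cancelled = -2
    someone_stole_my_work = -1
    file_ingested = 20
    file_of_unknown_type = 29
    permanent_redirect = 31
    redirect = 32
    file_did_not_change = 34
    server_blamed_client = 40
    file_gone = 41
    did_not_understand_answer = 42
    file_not_found = 44
    rate_limited = 49
    server_internal_error = 50
    connection_failed = 100
    request_timeout = 101
    error_reading_response = 102
    blocked_by_robots_txt = 170
    blocked_at_request_of_remote = 171
    blocked_origin_by_local_policy = 172
    blocked_url_by_local_policy = 173
    not_canonical = 180
    duplicate = 181
    unknown_error = -999


# Names are preferred in code for readability; the integer is what a
# database would store compactly.
assert CrawlExitCode(20) is CrawlExitCode.file_ingested
assert CrawlExitCode.file_ingested.name == "file_ingested"
```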

Redirects

The codes redirect and permanent_redirect represent redirects.

They may carry meaningful metadata about the destination of the redirect.
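
As an illustration, a crawl result could carry the destination next to the exit code (a minimal sketch building on the CrawlExitCode enum above; the CrawlResult shape and its field names are assumptions, not unobtanium's actual data model):

```python
from dataclasses import dataclass
from typing import Optional

REDIRECT_CODES = frozenset({CrawlExitCode.redirect, CrawlExitCode.permanent_redirect})


@dataclass
class CrawlResult:
    exit_code: CrawlExitCode
    redirect_target: Optional[str] = None  # hypothetical metadata field


def redirect_destination(result: CrawlResult) -> Optional[str]:
    """Return the redirect destination if this result is a redirect and carries one."""
    return result.redirect_target if result.exit_code in REDIRECT_CODES else None
```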

Blocked

The codes blocked_by_robots_txt, blocked_at_request_of_remote, blocked_origin_by_local_policy, and blocked_url_by_local_policy represent crawls where either no request was sent or the returned data was discarded, because the resource indicated that it didn't want to be indexed.
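
Expressed as a predicate (a sketch building on the enum above; the set and function names are assumptions):

```python
BLOCKED_CODES = frozenset({
    CrawlExitCode.blocked_by_robots_txt,
    CrawlExitCode.blocked_at_request_of_remote,
    CrawlExitCode.blocked_origin_by_local_policy,
    CrawlExitCode.blocked_url_by_local_policy,
})


def was_blocked(code: CrawlExitCode) -> bool:
    """True if no request was sent, or the data was discarded, due to a do-not-index signal."""
    return code in BLOCKED_CODES
```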

Could be a Fluke

The codes unknown_error, connection_failed, request_timeout, and error_reading_response can all have temporary networking problems as a possible cause and are therefore worth retrying once immediately.
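
The retry-once rule might be implemented like this (a sketch building on the enum and the CrawlResult sketch above; the crawl callable is a hypothetical stand-in for the real crawling entry point):

```python
from typing import Callable

MAYBE_FLUKE_CODES = frozenset({
    CrawlExitCode.unknown_error,
    CrawlExitCode.connection_failed,
    CrawlExitCode.request_timeout,
    CrawlExitCode.error_reading_response,
})


def crawl_with_fluke_retry(url: str, crawl: Callable[[str], CrawlResult]) -> CrawlResult:
    """Crawl once; retry exactly once more if the failure may have been a networking fluke."""
    result = crawl(url)
    if result.exit_code in MAYBE_FLUKE_CODES:
        result = crawl(url)  # one immediate retry, then accept whatever comes back
    return result
```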

See the Fluke Event concept for more information.

Contentful

The following codes are considered contentful:

These have in common that from such a request the crawler explicitly learned something about a resource or its absence.

file_did_not_change is excluded, as the crawler explicitly didn't learn anything new from that response.