Exit codes enumerate the outcome of a crawl request at a high level. Each code has a fixed integer value; referring to codes by name should be preferred, the integer representation mainly exists for compact storage in databases.

Note: Some of these exit codes date back to a time when unobtanium had a very different, more database-coupled architecture. They are unused now but remain listed here to keep their numbers reserved.
| Id | Name | Description |
|---|---|---|
| -3 | database_error | The crawl couldn't be finished because of a database error |
| -2 | cancelled | The crawl was cancelled mid-way (it would also be okay to discard those) |
| -1 | someone_stole_my_work | A race condition was detected in a task queue |
| 20 | file_ingested | A file was crawled and its content ingested |
| 29 | file_of_unknown_type | A file was found at the requested location, but the crawler doesn't know what to do with it |
| 31 | permanent_redirect | The server indicated a redirect and hinted that it isn't going away soon |
| 32 | redirect | The server responded with a redirect; this may change in the future |
| 34 | file_did_not_change | The server communicated that the file did not change since the last request |
| 40 | server_blamed_client | The server's response indicated that the client sent a bad request; this includes authentication errors |
| 41 | file_gone | The server indicated that the requested resource isn't available and won't come back |
| 42 | did_not_understand_answer | The crawler couldn't understand the server's answer because of a protocol error or encoding issue |
| 44 | file_not_found | The server communicated that there is no resource at the requested location; this may change in the future |
| 49 | rate_limited | The server communicated that the crawler was going too fast |
| 50 | server_internal_error | The server couldn't answer because of an internal error |
| 100 | connection_failed | Could not connect to the server |
| 101 | request_timeout | The server took too long to respond to the request |
| 102 | error_reading_response | A problem occurred while reading the server response (e.g. unexpected connection termination) |
| 170 | blocked_by_robots_txt | The resource wasn't crawled, at the request of the server's robots.txt |
| 171 | blocked_at_request_of_remote | The server requested that the given resource not be crawled |
| 172 | blocked_origin_by_local_policy | Crawling was blocked by a local policy on the origin level |
| 173 | blocked_url_by_local_policy | Crawling was blocked by a local policy on the URL level |
| 174 | blocked_by_challenge | The server returned a challenge/captcha page of some sort |
| 180 | not_canonical | The content was discarded because it marked itself as not being canonical |
| 181 | duplicate | The resource was found to be a duplicate by the crawler or a post-processing stage |
| -999 | unknown_error | Placeholder for errors that don't have an exit code assigned yet |
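To illustrate the name-over-integer convention, here is a minimal sketch of the enumeration in Python. The class name `ExitCode` is an assumption; the member names and values are taken from the table above.

```python
from enum import IntEnum

class ExitCode(IntEnum):
    # Member names follow the table above; the class name is illustrative.
    database_error = -3
    cancelled = -2
    someone_stole_my_work = -1
    file_ingested = 20
    file_of_unknown_type = 29
    permanent_redirect = 31
    redirect = 32
    file_did_not_change = 34
    server_blamed_client = 40
    file_gone = 41
    did_not_understand_answer = 42
    file_not_found = 44
    rate_limited = 49
    server_internal_error = 50
    connection_failed = 100
    request_timeout = 101
    error_reading_response = 102
    blocked_by_robots_txt = 170
    blocked_at_request_of_remote = 171
    blocked_origin_by_local_policy = 172
    blocked_url_by_local_policy = 173
    blocked_by_challenge = 174
    not_canonical = 180
    duplicate = 181
    unknown_error = -999

# Refer to codes by name in code; store only the integer in the database.
stored = int(ExitCode.file_ingested)   # 20
decoded = ExitCode(20)                 # ExitCode.file_ingested
print(decoded.name)                    # "file_ingested"
```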
## Redirects
The codes redirect and permanent_redirect represent redirects.
They may have meaningful metadata about the destination of the redirect attached.
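The exact shape of that metadata is not specified here. As a rough sketch, a crawl result record could carry an optional redirect target; the `CrawlResult` type and its field names are hypothetical, and `ExitCode` is the enumeration sketched above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CrawlResult:
    exit_code: ExitCode                     # ExitCode as sketched after the table
    redirect_target: Optional[str] = None   # only set for redirect / permanent_redirect

def next_location(result: CrawlResult) -> Optional[str]:
    """Return the redirect destination if this outcome was a redirect."""
    if result.exit_code in (ExitCode.redirect, ExitCode.permanent_redirect):
        return result.redirect_target
    return None
```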
## Blocked

The codes blocked_by_robots_txt, blocked_at_request_of_remote, blocked_origin_by_local_policy, and blocked_url_by_local_policy represent crawls where either no request was sent or the returned data was discarded because the resource indicated that it didn't want to be indexed.
The blocked_by_challenge code indicates that the crawler ran into a challenge/captcha page that is likely intended to keep bad bots out; this is unfortunate and not a clear signal. (The crawler should be stopped by robots.txt before it ever runs into a challenge page.)
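Assuming the `ExitCode` enumeration sketched above, this grouping can be made explicit with a small helper; the set and function names are assumptions.

```python
BLOCKED_CODES = {
    ExitCode.blocked_by_robots_txt,
    ExitCode.blocked_at_request_of_remote,
    ExitCode.blocked_origin_by_local_policy,
    ExitCode.blocked_url_by_local_policy,
}

def is_blocked(code: ExitCode) -> bool:
    # blocked_by_challenge is intentionally not included: it is not a clear signal.
    return code in BLOCKED_CODES
```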
## Could be a Fluke

The codes unknown_error, connection_failed, request_timeout, and error_reading_response can all have temporary networking problems as a possible cause and are therefore worth retrying immediately once.
See the Fluke Event concept for more information.
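Assuming a callable that performs a single crawl attempt and returns a result carrying an `exit_code`, the immediate retry could look like the sketch below; all names are illustrative, and the actual retry handling belongs to the Fluke Event concept.

```python
FLUKE_CODES = {
    ExitCode.unknown_error,
    ExitCode.connection_failed,
    ExitCode.request_timeout,
    ExitCode.error_reading_response,
}

def crawl_with_fluke_retry(url, crawl_once):
    """Run one crawl attempt and retry exactly once if the outcome may have been a fluke."""
    result = crawl_once(url)
    if result.exit_code in FLUKE_CODES:
        result = crawl_once(url)  # one immediate retry; give up afterwards
    return result
```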
## Contentful

The following codes are considered contentful:

- file_ingested
- file_of_unknown_type
- permanent_redirect
- redirect
- file_not_found
- file_gone
- not_canonical
- duplicate
These have in common that the crawler explicitly learned something about a resource or its absence from the request.
The file_did_not_change code is excluded because the crawler explicitly didn't learn anything new from such a request.
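Expressed as a set over the enumeration sketched earlier (the set and function names are again assumptions):

```python
CONTENTFUL_CODES = {
    ExitCode.file_ingested,
    ExitCode.file_of_unknown_type,
    ExitCode.permanent_redirect,
    ExitCode.redirect,
    ExitCode.file_not_found,
    ExitCode.file_gone,
    ExitCode.not_canonical,
    ExitCode.duplicate,
}

def is_contentful(code: ExitCode) -> bool:
    # file_did_not_change is deliberately excluded: nothing new was learned from it.
    return code in CONTENTFUL_CODES
```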