Exit codes enumerate the outcome of a crawl request at a high level. Each code has a fixed integer value. Referring to codes by name is preferred; the integer representation exists mainly for compact storage in databases.
Note: Some of these exit codes date from a time when unobtanium had a very different, more database-coupled architecture. They are unused now but remain listed to keep their numbers reserved.
Id | Name | Description |
---|---|---|
-3 | database_error | The crawl couldn't be finished because of a database error |
-2 | cancelled | The crawl was cancelled mid-way (it would also be okay to discard these) |
-1 | someone_stole_my_work | A race condition was detected in a task queue |
20 | file_ingested | A file was crawled and its content ingested |
29 | file_of_unknown_type | A file was found at the requested location, but the crawler doesn't know what to do with it |
31 | permanent_redirect | The server indicated a redirect and hinted that it isn't going away soon |
32 | redirect | The server responded with a redirect; this may change in the future |
34 | file_did_not_change | The server communicated that the file did not change since the last request |
40 | server_blamed_client | The server response indicated that the client sent a bad request; this includes authentication errors |
41 | file_gone | The server indicated that the requested resource isn't available and won't come back |
42 | did_not_understand_answer | The crawler couldn't interpret the server's answer because of a protocol error or encoding issue |
44 | file_not_found | The server communicated that there is no resource at the requested location; this may change in the future |
49 | rate_limited | The server communicated that the crawler was going too fast |
50 | server_internal_error | The server couldn't answer because of an internal error |
100 | connection_failed | Could not connect to the server |
101 | request_timeout | The server took too long to respond to the request |
102 | error_reading_response | A problem occurred while reading the server response (e.g. unexpected connection termination) |
170 | blocked_by_robots_txt | At the request of the server's robots.txt, the resource wasn't crawled |
171 | blocked_at_request_of_remote | The server requested that the given resource not be crawled |
172 | blocked_origin_by_local_policy | Crawling was blocked by a local policy at the origin level |
173 | blocked_url_by_local_policy | Crawling was blocked by a local policy at the URL level |
180 | not_canonical | The content was discarded because it marked itself as not being canonical |
181 | duplicate | The resource was found to be a duplicate by the crawler or a post-processing stage |
-999 | unknown_error | Placeholder for errors that don't have an exit code assigned yet |
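For illustration, a minimal sketch of how these codes could be declared so that call sites use names while storage keeps the fixed integers. The `ExitCode` type, its `repr`, and the `as_i16` helper are assumptions for this sketch, not unobtanium's actual definition:

```rust
/// Hypothetical declaration mirroring the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(i16)]
pub enum ExitCode {
    DatabaseError = -3,
    Cancelled = -2,
    SomeoneStoleMyWork = -1,
    FileIngested = 20,
    FileOfUnknownType = 29,
    PermanentRedirect = 31,
    Redirect = 32,
    FileDidNotChange = 34,
    ServerBlamedClient = 40,
    FileGone = 41,
    DidNotUnderstandAnswer = 42,
    FileNotFound = 44,
    RateLimited = 49,
    ServerInternalError = 50,
    ConnectionFailed = 100,
    RequestTimeout = 101,
    ErrorReadingResponse = 102,
    BlockedByRobotsTxt = 170,
    BlockedAtRequestOfRemote = 171,
    BlockedOriginByLocalPolicy = 172,
    BlockedUrlByLocalPolicy = 173,
    NotCanonical = 180,
    Duplicate = 181,
    UnknownError = -999,
}

impl ExitCode {
    /// Integer form of the code, e.g. for compact database storage.
    pub fn as_i16(self) -> i16 {
        self as i16
    }
}
```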
## Redirects

The codes `redirect` and `permanent_redirect` represent redirects. They may have meaningful metadata about the destination of the redirect attached.
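As a sketch of what that attached metadata could look like, assuming the hypothetical `ExitCode` from above; the `CrawlOutcome` struct and its field names are illustrative assumptions:

```rust
/// Hypothetical crawl result carrying optional redirect metadata.
pub struct CrawlOutcome {
    pub exit_code: ExitCode,
    /// For Redirect / PermanentRedirect: the destination the server pointed to.
    pub redirect_target: Option<String>,
}
```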
## Blocked

The codes `blocked_by_robots_txt`, `blocked_at_request_of_remote`, `blocked_origin_by_local_policy`, and `blocked_url_by_local_policy` represent crawls where either no request was sent or the returned data was discarded because the resource indicated that it didn't want to be indexed.
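A simple predicate over the hypothetical `ExitCode` type makes this grouping explicit:

```rust
impl ExitCode {
    /// True for outcomes where crawling or indexing was refused.
    pub fn is_blocked(self) -> bool {
        matches!(
            self,
            ExitCode::BlockedByRobotsTxt
                | ExitCode::BlockedAtRequestOfRemote
                | ExitCode::BlockedOriginByLocalPolicy
                | ExitCode::BlockedUrlByLocalPolicy
        )
    }
}
```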
## Could be a Fluke

The codes `unknown_error`, `connection_failed`, `request_timeout`, and `error_reading_response` can all have temporary networking problems as a possible cause and are therefore worth retrying immediately once.
See the Fluke Event concept for more information.
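A retry-once policy over these codes might look like the following sketch; the `crawl` callback and the wrapper function are assumptions, not unobtanium's actual API:

```rust
impl ExitCode {
    /// True for outcomes that may be caused by a temporary networking
    /// problem and deserve one immediate retry.
    pub fn could_be_fluke(self) -> bool {
        matches!(
            self,
            ExitCode::UnknownError
                | ExitCode::ConnectionFailed
                | ExitCode::RequestTimeout
                | ExitCode::ErrorReadingResponse
        )
    }
}

/// Hypothetical wrapper that retries a crawl exactly once on a possible fluke.
fn crawl_with_fluke_retry(url: &str, crawl: impl Fn(&str) -> ExitCode) -> ExitCode {
    let first = crawl(url);
    if first.could_be_fluke() {
        crawl(url) // one immediate retry, as suggested above
    } else {
        first
    }
}
```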
## Contentful

The following codes are considered contentful:

- `file_ingested`
- `file_of_unknown_type`
- `permanent_redirect`
- `redirect`
- `file_not_found`
- `file_gone`
- `not_canonical`
- `duplicate`

These have in common that from this request the crawler explicitly learned something about a resource or its absence.
The code `file_did_not_change` is excluded, as the crawler explicitly didn't learn anything new from it.
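Expressed as a predicate on the hypothetical `ExitCode` type:

```rust
impl ExitCode {
    /// True when the request taught the crawler something about a resource
    /// or its absence; FileDidNotChange is deliberately excluded.
    pub fn is_contentful(self) -> bool {
        matches!(
            self,
            ExitCode::FileIngested
                | ExitCode::FileOfUnknownType
                | ExitCode::PermanentRedirect
                | ExitCode::Redirect
                | ExitCode::FileNotFound
                | ExitCode::FileGone
                | ExitCode::NotCanonical
                | ExitCode::Duplicate
        )
    }
}
```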