Exit codes enumerate the outcome of a crawl request at a high level. Each code has a fixed integer value; referring to codes by name should be preferred, the integer representation mainly exists for compact storage in databases.

Note: Some of these exit codes date back to a time when unobtanium had a very different, more database-coupled architecture. They are unused now but remain listed here to keep their numbers reserved.
| Id | Name | Description |
|---|---|---|
| -3 | database_error | The crawl couldn't be finished because of a database error |
| -2 | cancelled | The crawl was cancelled mid-way (it would also be okay to discard those) |
| -1 | someone_stole_my_work | A race condition was detected in a task queue |
| 20 | file_ingested | A file was crawled and its content ingested |
| 29 | file_of_unknown_type | A file was found at the requested location, but the crawler doesn't know what to do with it |
| 31 | permanent_redirect | The server indicated a redirect and hinted that it isn't going away soon |
| 32 | redirect | The server responded with a redirect; this may change in the future |
| 34 | file_did_not_change | The server communicated that the file did not change since the last request |
| 40 | server_blamed_client | The server's response indicated that the client sent a bad request; this includes authentication errors |
| 41 | file_gone | The server indicated that the requested resource isn't available and won't come back |
| 42 | did_not_understand_answer | The crawler couldn't understand the server's answer because of a protocol error or encoding issue |
| 44 | file_not_found | The server communicated that there is no resource at the requested location; this may change in the future |
| 49 | rate_limited | The server communicated that the crawler was going too fast |
| 50 | server_internal_error | The server couldn't answer because of an internal error |
| 100 | connection_failed | Could not connect to the server |
| 101 | request_timeout | The server took too long to respond to the request |
| 102 | error_reading_response | A problem occurred while reading the server response (e.g. unexpected connection termination) |
| 170 | blocked_by_robots_txt | The resource wasn't crawled, at the request of the server's robots.txt |
| 171 | blocked_at_request_of_remote | The server requested that the given resource not be crawled |
| 172 | blocked_origin_by_local_policy | Crawling was blocked by a local policy on the origin level |
| 173 | blocked_url_by_local_policy | Crawling was blocked by a local policy on the URL level |
| 174 | blocked_by_challenge | The server returned a challenge/captcha page of some sort |
| 180 | not_canonical | The content was discarded because it marked itself as not being canonical |
| 181 | duplicate | The resource was found to be a duplicate by the crawler or a post-processing stage |
| -999 | unknown_error | Placeholder for errors that don't have an exit code assigned yet |
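To illustrate the name-over-integer convention, here is a minimal sketch of the enumeration in Python. The class name `ExitCode` is an assumption; the member names and values are taken from the table above.

```python
from enum import IntEnum

class ExitCode(IntEnum):
    # Member names follow the table above; the class name is illustrative.
    database_error = -3
    cancelled = -2
    someone_stole_my_work = -1
    file_ingested = 20
    file_of_unknown_type = 29
    permanent_redirect = 31
    redirect = 32
    file_did_not_change = 34
    server_blamed_client = 40
    file_gone = 41
    did_not_understand_answer = 42
    file_not_found = 44
    rate_limited = 49
    server_internal_error = 50
    connection_failed = 100
    request_timeout = 101
    error_reading_response = 102
    blocked_by_robots_txt = 170
    blocked_at_request_of_remote = 171
    blocked_origin_by_local_policy = 172
    blocked_url_by_local_policy = 173
    blocked_by_challenge = 174
    not_canonical = 180
    duplicate = 181
    unknown_error = -999

# Refer to codes by name in code; store only the integer in the database.
stored = int(ExitCode.file_ingested)   # 20
decoded = ExitCode(20)                 # ExitCode.file_ingested
print(decoded.name)                    # "file_ingested"
```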
## Redirects
The codes redirect and permanent_redirect represent redirects.
They may have meaningful metadata about the destination of the redirect attached.
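The exact shape of that metadata is not specified here. As a rough sketch, a crawl result record could carry an optional redirect target; the `CrawlResult` type and its field names are hypothetical, and `ExitCode` is the enumeration sketched above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CrawlResult:
    exit_code: ExitCode                     # ExitCode as sketched after the table
    redirect_target: Optional[str] = None   # only set for redirect / permanent_redirect

def next_location(result: CrawlResult) -> Optional[str]:
    """Return the redirect destination if this outcome was a redirect."""
    if result.exit_code in (ExitCode.redirect, ExitCode.permanent_redirect):
        return result.redirect_target
    return None
```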
## Blocked

The codes blocked_by_robots_txt, blocked_at_request_of_remote, blocked_origin_by_local_policy, and blocked_url_by_local_policy represent crawls where either no request was sent or the returned data was discarded because the resource indicated that it didn't want to be indexed.
The blocked_by_challenge code indicates that the crawler ran into a challenge/captcha page that is likely intended to keep bad bots out; this is unfortunate and not a clear signal. (The crawler should be stopped by robots.txt before it ever runs into a challenge page.)
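Assuming the `ExitCode` enumeration sketched above, this grouping can be made explicit with a small helper; the set and function names are assumptions.

```python
BLOCKED_CODES = {
    ExitCode.blocked_by_robots_txt,
    ExitCode.blocked_at_request_of_remote,
    ExitCode.blocked_origin_by_local_policy,
    ExitCode.blocked_url_by_local_policy,
}

def is_blocked(code: ExitCode) -> bool:
    # blocked_by_challenge is intentionally not included: it is not a clear signal.
    return code in BLOCKED_CODES
```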
## Could be a Fluke

The codes unknown_error, connection_failed, request_timeout, and error_reading_response can all have temporary networking problems as a possible cause and are therefore worth retrying immediately once.
See the Fluke Event concept for more information.
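Assuming a callable that performs a single crawl attempt and returns a result carrying an `exit_code`, the immediate retry could look like the sketch below; all names are illustrative, and the actual retry handling belongs to the Fluke Event concept.

```python
FLUKE_CODES = {
    ExitCode.unknown_error,
    ExitCode.connection_failed,
    ExitCode.request_timeout,
    ExitCode.error_reading_response,
}

def crawl_with_fluke_retry(url, crawl_once):
    """Run one crawl attempt and retry exactly once if the outcome may have been a fluke."""
    result = crawl_once(url)
    if result.exit_code in FLUKE_CODES:
        result = crawl_once(url)  # one immediate retry; give up afterwards
    return result
```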
## Contentful

The following codes are considered contentful:

- file_ingested
- file_of_unknown_type
- permanent_redirect
- redirect
- file_not_found
- file_gone
- not_canonical
- duplicate
These have in common that the crawler explicitly learned something about a resource or its absence from the request.
The file_did_not_change code is excluded because the crawler explicitly didn't learn anything new from such a request.
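Expressed as a set over the enumeration sketched earlier (the set and function names are again assumptions):

```python
CONTENTFUL_CODES = {
    ExitCode.file_ingested,
    ExitCode.file_of_unknown_type,
    ExitCode.permanent_redirect,
    ExitCode.redirect,
    ExitCode.file_not_found,
    ExitCode.file_gone,
    ExitCode.not_canonical,
    ExitCode.duplicate,
}

def is_contentful(code: ExitCode) -> bool:
    # file_did_not_change is deliberately excluded: nothing new was learned from it.
    return code in CONTENTFUL_CODES
```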