The crawl log is an append-only table in the crawler database that records which URL was crawled when, how long the crawl took, and what the outcome was.
The crawl log has the following fields:
crawl_log_id
- Integer id of the crawl log entry, only unique within the scope of the database
- Also see crawl_uuid.
agent_id
- Integer id of the agent that made the request
- TODO: document agents and add a link here
url
- The primary url that was requested.
crawl_type
- Integer indicating which kind of crawl happened
crawl_uuid
- UUID to identify the crawl across databases
time_started
- The time at which processing of this crawl request started, usually as Unix time (UTC).
time_taken_ms
- Time the crawl of this URL took, in milliseconds. This covers not only the request but also the processing.
exit_code
- Integer exit code that documents the rough outcome of the crawl.
message
- A text message to the developer or administrator for when things went wrong.
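
To make the field list concrete, here is a minimal sketch of how such a table could look in an in-memory SQLite database. The column types, the `append_entry` helper, the use of seconds for `time_started`, and the exit code value `0` are all assumptions made for illustration; the actual crawler schema may differ.

```python
import sqlite3
import time
import uuid

# Column types are assumptions; the documentation only names the fields.
SCHEMA = """
CREATE TABLE crawl_log (
    crawl_log_id  INTEGER PRIMARY KEY,  -- unique within this database only
    agent_id      INTEGER NOT NULL,
    url           TEXT    NOT NULL,
    crawl_type    INTEGER NOT NULL,
    crawl_uuid    TEXT    NOT NULL,     -- identifies the crawl across databases
    time_started  INTEGER NOT NULL,     -- assumed: Unix time in seconds, UTC
    time_taken_ms INTEGER NOT NULL,     -- request plus processing
    exit_code     INTEGER NOT NULL,
    message       TEXT                  -- optional note for developers/admins
)
"""


def append_entry(conn, agent_id, url, crawl_type, time_started,
                 time_taken_ms, exit_code, message=None):
    """Hypothetical helper: append one entry and return its crawl_uuid.

    Rows are only ever inserted, never updated, which keeps the log append-only.
    """
    crawl_uuid = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO crawl_log (agent_id, url, crawl_type, crawl_uuid, "
        "time_started, time_taken_ms, exit_code, message) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (agent_id, url, crawl_type, crawl_uuid, time_started,
         time_taken_ms, exit_code, message),
    )
    return crawl_uuid


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# exit_code=0 is a placeholder value; the exit codes are not documented here.
entry_uuid = append_entry(
    conn,
    agent_id=1,
    url="https://example.com/",
    crawl_type=0,
    time_started=int(time.time()),
    time_taken_ms=120,
    exit_code=0,
)

# Look the entry up by crawl_uuid, since crawl_log_id is only locally unique.
row = conn.execute(
    "SELECT url, time_taken_ms FROM crawl_log WHERE crawl_uuid = ?",
    (entry_uuid,),
).fetchone()
```

Looking entries up by `crawl_uuid` rather than `crawl_log_id` reflects the note above that only the UUID identifies a crawl across databases.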
Crawl Types
Id | Name | Description |
---|---|---|
0 | file_crawl | File crawl for general-purpose indexing |
1 | robotstxt_fetch | Fetching of robots.txt |
2 | metadata_crawl | HEAD-only request; no content was indexed |
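
When reading the log from application code, the integer in crawl_type could be mapped to its name with a small enum. This is only a sketch: the class name `CrawlType` and the helper `crawl_type_name` are hypothetical and not part of the crawler itself; the ids and names are taken from the table above.

```python
from enum import IntEnum


class CrawlType(IntEnum):
    """Hypothetical mirror of the crawl type ids documented above."""
    FILE_CRAWL = 0       # file crawl for general-purpose indexing
    ROBOTSTXT_FETCH = 1  # fetching of robots.txt
    METADATA_CRAWL = 2   # HEAD-only request, nothing indexed


def crawl_type_name(value: int) -> str:
    """Return the documented name for a crawl_type value.

    Unknown ids are reported rather than raising, so that log entries
    written by a newer crawler version can still be displayed.
    """
    try:
        return CrawlType(value).name.lower()
    except ValueError:
        return f"unknown({value})"
```

For example, `crawl_type_name(1)` yields the name from the table for id 1, while an id that is not in the table falls through to the `unknown(...)` form.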