Data: Crawl Log

The crawl log is an append-only table in the crawler database. It stores which URL was crawled, when it was crawled, how long the crawl took, and what the outcome was.

The crawl log has the following fields:

crawl_log_id
Integer id of the crawl log entry. It is unique only within the scope of the database.
See also crawl_uuid.
agent_id
Integer id of the agent that made the request.
TODO: document agents and add a link here
url
The primary URL that was requested.
crawl_type
Integer indicating which kind of crawl happened. See Crawl Types for the possible values.
crawl_uuid
UUID to identify the crawl across databases
time_started
The time at which processing of this crawl request started, usually as Unix time (UTC).
time_taken_ms
Time taken to crawl this URL, in milliseconds. This covers not only the request itself but also the processing.
exit_code
Integer exit code that documents the rough outcome of the crawl.
message
A text message for the developer or administrator, describing what went wrong.
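
To make the field list concrete, here is a minimal sketch of such a table using Python's built-in sqlite3 module. The database engine, the column types, the helper name log_crawl, storing time_started as integer Unix seconds, and the use of exit code 0 for a successful crawl are all assumptions made for illustration; only the field names come from the list above.

```python
import sqlite3
import time
import uuid

# Assumed column types; the field list above only names the columns.
SCHEMA = """
CREATE TABLE crawl_log (
    crawl_log_id  INTEGER PRIMARY KEY AUTOINCREMENT,  -- unique per database only
    agent_id      INTEGER NOT NULL,
    url           TEXT    NOT NULL,
    crawl_type    INTEGER NOT NULL,
    crawl_uuid    TEXT    NOT NULL UNIQUE,            -- identifies the crawl across databases
    time_started  INTEGER NOT NULL,                   -- assumed: Unix time in seconds, UTC
    time_taken_ms INTEGER NOT NULL,
    exit_code     INTEGER NOT NULL,
    message       TEXT
)
"""


def log_crawl(conn, agent_id, url, crawl_type, time_started,
              time_taken_ms, exit_code, message=None):
    """Append one entry and return its crawl_uuid (hypothetical helper).

    The table is append-only, so this sketch only ever inserts rows.
    """
    crawl_uuid = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO crawl_log (agent_id, url, crawl_type, crawl_uuid, "
        "time_started, time_taken_ms, exit_code, message) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (agent_id, url, crawl_type, crawl_uuid, time_started,
         time_taken_ms, exit_code, message),
    )
    conn.commit()
    return crawl_uuid


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Exit code 0 is assumed to mean success in this sketch.
first = log_crawl(conn, 1, "https://example.com/robots.txt", 1,
                  int(time.time()), 120, 0)
second = log_crawl(conn, 1, "https://example.com/", 0,
                   int(time.time()), 850, 0)
```

Generating crawl_uuid at insert time keeps entries identifiable when logs from several databases are merged, where crawl_log_id values may collide.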

Crawl Types

Id  Name             Description
0   file_crawl       File crawl for general-purpose indexing
1   robotstxt_fetch  Fetching of robots.txt
2   metadata_crawl   HEAD-only request; no content was indexed
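
Code that reads the crawl log could map these integer ids to names with an enumeration. The following is a sketch only: the class name CrawlType and the helper crawl_type_name are hypothetical, and the fallback for unlisted ids is an assumption rather than documented behavior.

```python
from enum import IntEnum


class CrawlType(IntEnum):
    """Crawl type ids as listed in the table above."""
    FILE_CRAWL = 0       # file crawl for general-purpose indexing
    ROBOTSTXT_FETCH = 1  # fetching of robots.txt
    METADATA_CRAWL = 2   # HEAD-only request, no content indexed


def crawl_type_name(value: int) -> str:
    """Return the table name for an id, or 'unknown(<id>)' if it is not listed."""
    try:
        return CrawlType(value).name.lower()
    except ValueError:
        # Assumed fallback so that newer ids do not break older readers.
        return f"unknown({value})"
```

For example, crawl_type_name(1) gives "robotstxt_fetch", while an id that is not in the table, such as 7, gives "unknown(7)".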