The unobtanium-crawler collects data from the web and summarizes it.
It maintains both the crawler and summary databases.
Synopsis
The crawler is split into several subcommands.
unobtanium-crawler crawl [OPTIONS]...
unobtanium-crawler summarize [OPTIONS]...
unobtanium-crawler optimize-db --database <database-file>
unobtanium-crawler regenerate-token-index [OPTIONS]...
unobtanium-crawler delete [SUBCOMMAND]
    old-crawl-log-entries [OPTIONS]...
unobtanium-crawler debug [SUBCOMMAND]
    indexiness [OPTIONS]...
    query-crawl-log [OPTIONS]...
    sqlite-version
crawl
The crawl subcommand starts the Crawl Loop for a given crawl configuration.
Note: For testing, the crawler can be configured with command-line options only; however, this setup isn't recommended for long-term deployments.
Accepted options are:
-c, --database <file>
    The crawler database file to store the crawl results in.
    If the file doesn't exist yet, it will be created.
-u, --user-agent <user_agent>
    Set the user agent.
    Overrides the user_agent setting from the configuration file.
-w, --worker-name <name>
    Set the worker name to be logged to the database.
    Default is ant.
-m, --max-commands <number>
    The maximum number of commands to process in this run.
    Overrides the max_commands_per_run setting from the configuration file.
-d, --default-delay <milliseconds>
    The default wait time between requests.
    Overrides the default_delay_ms setting from the configuration file.
--schedule <url>
    Manually schedule a seed URL.
    Overrides the seeds setting from the configuration file.
--config <path>
    Specify a path to the configuration file.
--policy-file <path>
    Specify a path to a policy configuration file. It can contain additional policies using the same notation as the configuration file.
--force-recrawl
    Ignore when pages were last crawled and schedule them for recrawling immediately.
    Overrides the recrawl_interval setting from the configuration file.
--ignore-db-name-from-config
    Ignore the database_name setting from the configuration file.
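For example, a short test crawl driven entirely by command-line options might look like this (the database path, seed URL, and user agent string are illustrative):

    unobtanium-crawler crawl --database crawl.db --schedule https://example.org/ \
        --user-agent "unobtanium-test/0.1" --max-commands 100 --default-delay 1000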
summarize
The summarize subcommand takes a crawler database and integrates it into a summary database using the Summarizing algorithm.
Accepted options are:
-c, --crawler-db <file>
    Database file of the crawler database.
-s, --summary-db <file>
    Database file of the summary database.
    If the file doesn't exist yet, it will be created.
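A typical invocation might look like this (file names are illustrative):

    unobtanium-crawler summarize --crawler-db crawl.db --summary-db summary.db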
optimize-db
Runs SQLite's internal analyze and optimize commands on the given crawler or summary database.
Accepted options are:
-c, --database <file>
    The database file to optimize.
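For example (file name illustrative):

    unobtanium-crawler optimize-db --database crawl.db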
regenerate-token-index
Regenerates the experimental token-based index for use with the token: filter.
Accepted options are:
-s, --summary-database <file>
    The summary database to generate the token index for.
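For example (file name illustrative):

    unobtanium-crawler regenerate-token-index --summary-database summary.db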
delete old-crawl-log-entries
Deletes old entries from the crawl log in the crawler database, along with their associated data.
Accepted options are:
-c, --crawler-db <file>
    The crawler database to delete crawl log entries from.
--keep-latest <n>
    How many of the latest entries for each page to keep.
--apply
    Actually apply the deletion instead of running a simulation.
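Because the deletion is only performed with --apply, a cautious workflow is to run the simulation first and, once the output looks right, repeat the command with --apply (file name and count are illustrative):

    unobtanium-crawler delete old-crawl-log-entries --crawler-db crawl.db --keep-latest 3
    unobtanium-crawler delete old-crawl-log-entries --crawler-db crawl.db --keep-latest 3 --apply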
debug indexiness
Prints a breakdown of the indexiness calculation for a given page.
Accepted options are:
-c, --database <file>
    The crawler database to fetch the source data for the indexiness calculation from.
-u, --url <url>
    The URL to run the calculation for. If the URL was crawled multiple times, the latest instance is used.
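For example (file name and URL illustrative):

    unobtanium-crawler debug indexiness --database crawl.db --url https://example.org/some-page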
debug query-crawl-log
Queries entries from the crawl log in the crawler database.
Accepted options are:
-c, --crawler-db <file>
    The crawler database to query crawl log entries from.
--uuid <uuid>
    Query by crawl log entry UUID.
--host <host>
    Filter the results by hostname.
--url <url>
    Filter the results by URL.
--exit-code <exit-code>
    Filter the results by exit code; both the name and the numeric ID are accepted.
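For example, to list crawl log entries for a single host (file name and hostname illustrative):

    unobtanium-crawler debug query-crawl-log --crawler-db crawl.db --host example.org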
debug sqlite-version
Prints the SQLite version that this version of unobtanium is using.
This command takes no options.