Manual: unobtanium-crawler

The unobtanium-crawler collects data from the web and summarizes it.

It maintains both the crawler and summary databases.

Synopsis

The crawler is split into several subcommands.

crawl

The crawl subcommand starts the Crawl Loop for a given crawl configuration.

Note: For testing, the crawler can be configured with command-line options only; however, this setup isn't recommended for long-term deployments.

Accepted options are:

-c, --database <file>
The crawler database file to store the crawl results in.
If the file doesn't exist yet, it will be created.
-u, --user-agent <user_agent>
Set the user agent.
Overrides the user_agent setting from the configuration file.
-w, --worker-name <name>
Set the worker name to be logged to the database.
Default is ant.
-m, --max-commands <number>
The maximum number of commands to process in this run.
Overrides the max_commands_per_run setting from the configuration file.
-d, --default-delay <milliseconds>
The default wait time between requests.
Overrides the default_delay_ms setting from the configuration file.
--schedule <url>
Manually schedule a seed URL.
Overrides the seeds setting from the configuration file.
--config <path>
Specify a path to the configuration file.
--force-recrawl
Ignore when pages were last crawled and schedule them for recrawling immediately.
Overrides the recrawl_interval setting from the configuration file.
--ignore-db-name-from-config
Ignore the database_name setting from the configuration file.
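As an illustration, a crawl run combining several of the options above might look like this. The file name, worker name, and seed URL are placeholders, not defaults of the tool:

```shell
# Start the crawl loop against a local database, overriding a few
# configuration-file settings from the command line.
# crawler.db and https://example.org/ are placeholder values.
unobtanium-crawler crawl \
    --database crawler.db \
    --worker-name worker-1 \
    --max-commands 100 \
    --schedule https://example.org/
```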

summarize

The summarize subcommand takes a crawler database and integrates it into a summary database using the Summarizing algorithm.

Accepted options are:

-c, --crawler-db <file>
Database file of the crawler database.
-s, --summary-db <file>
Database file of the summary database.
If the file doesn't exist yet, it will be created.
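A typical invocation could look as follows; both file names are placeholders:

```shell
# Integrate crawl results from crawler.db into summary.db,
# creating summary.db if it does not exist yet.
unobtanium-crawler summarize \
    --crawler-db crawler.db \
    --summary-db summary.db
```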

optimize-db

Runs the SQLite internal analyze and optimize commands on the given crawler or summary database.

Accepted options are:

-c, --database <file>
The database file to optimize.
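For example (placeholder file name):

```shell
# Run SQLite's internal analyze/optimize on a database file.
unobtanium-crawler optimize-db --database crawler.db
```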

regenerate-token-index

Regenerates the experimental token-based index for use with the token: query.

Accepted options are:

-s, --summary-database <file>
The summary database to generate the token index for.
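For example (placeholder file name):

```shell
# Rebuild the experimental token index for a summary database.
unobtanium-crawler regenerate-token-index --summary-database summary.db
```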

delete old-crawl-log-entries

Delete old entries from the crawl log in the crawler database along with their associated data.

Accepted options are:

-c, --crawler-db <file>
The crawler database to delete crawl log entries from.
--keep-latest <n>
How many of the latest entries for each page to keep.
--apply
Actually apply the deletion instead of running a simulation.
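Since the deletion only runs as a simulation unless --apply is given, a cautious workflow is to run the command twice. The file name and count are placeholders:

```shell
# First pass: simulate, keeping the latest 3 entries per page.
unobtanium-crawler delete old-crawl-log-entries \
    --crawler-db crawler.db \
    --keep-latest 3

# Second pass: actually apply the deletion.
unobtanium-crawler delete old-crawl-log-entries \
    --crawler-db crawler.db \
    --keep-latest 3 \
    --apply
```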

debug indexiness

Prints a breakdown of the indexiness calculation for a given page.

Accepted options are:

-c, --database <file>
The crawler database to fetch the source data for the indexiness calculation from.
-u, --url <url>
The URL to run the calculation for. If there are multiple instances, the latest one will be used.
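For example (file name and URL are placeholders):

```shell
# Print the indexiness breakdown for one crawled page.
unobtanium-crawler debug indexiness \
    --database crawler.db \
    --url https://example.org/
```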

debug query-crawl-log

Query entries from the crawl log in the crawler database.

Accepted options are:

-c, --crawler-db <file>
The crawler database to query crawl log entries from.
--uuid <uuid>
Query by crawl log entry UUID.
--host <host>
Filter the results by hostname.
--url <url>
Filter the results by URL.
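For example, querying by hostname (placeholder values):

```shell
# List crawl log entries for a given host.
unobtanium-crawler debug query-crawl-log \
    --crawler-db crawler.db \
    --host example.org
```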

debug sqlite-version

Prints the SQLite version that this version of unobtanium is using.

This command takes no options.