The unobtanium-crawler collects data from the web and summarizes it.
It maintains both the crawler and summary databases.
Synopsis
The crawler is split into several subcommands.
unobtanium-crawler crawl [OPTIONS]...
unobtanium-crawler summarize [OPTIONS]...
unobtanium-crawler optimize-db --database <database-file>
unobtanium-crawler regenerate-token-index [OPTIONS]...
unobtanium-crawler delete [SUBCOMMAND]
    old-crawl-log-entries [OPTIONS]...
unobtanium-crawler debug [SUBCOMMAND]
    indexiness [OPTIONS]...
    query-crawl-log [OPTIONS]...
    sqlite-version
crawl
The crawl subcommand starts the Crawl Loop for a given crawl configuration.
Note: For testing, the crawler can be configured with command-line options only; however, this setup isn't recommended for long-term deployments.
Accepted options are:
-c, --database <file>
    The crawler database file to store the crawl results in.
    If the file doesn't exist yet, it will be created.
-u, --user-agent <user_agent>
    Set the user agent.
    Overrides the user_agent setting from the configuration file.
-w, --worker-name <name>
    Set the worker name to be logged to the database.
    Default is ant.
-m, --max-commands <number>
    The maximum number of commands to process in this run.
    Overrides the max_commands_per_run setting from the configuration file.
-d, --default-delay <milliseconds>
    The default wait time between requests.
    Overrides the default_delay_ms setting from the configuration file.
--schedule <url>
    Manually schedule a seed URL.
    Overrides the seeds setting from the configuration file.
--config <path>
    Specify a path to the configuration file.
--policy-file <path>
    Specify a path to a policy configuration file. It can contain additional policies using the same notation as the configuration file.
--force-recrawl
    Ignore when pages were last crawled and schedule them for recrawling immediately.
    Overrides the recrawl_interval setting from the configuration file.
--ignore-db-name-from-config
    Ignore the database_name setting from the configuration file.
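For example, a short test crawl driven entirely by command-line options might look like this (the database path, seed URL, and user agent string are illustrative):

    unobtanium-crawler crawl --database crawl.db --schedule https://example.org/ \
        --user-agent "unobtanium-test/0.1" --max-commands 100 --default-delay 1000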
summarize
The summarize subcommand takes a crawler database and integrates it into a summary database using the Summarizing algorithm.
Accepted options are:
-c, --crawler-db <file>
    Database file of the crawler database.
-s, --summary-db <file>
    Database file of the summary database.
    If the file doesn't exist yet, it will be created.
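A typical invocation might look like this (file names are illustrative):

    unobtanium-crawler summarize --crawler-db crawl.db --summary-db summary.db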
optimize-db
Runs SQLite's internal analyze and optimize commands on the given crawler or summary database.
Accepted options are:
-c, --database <file>
    The database file to optimize.
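For example (file name illustrative):

    unobtanium-crawler optimize-db --database crawl.db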
regenerate-token-index
Regenerates the experimental token-based index for use with the token: filter.
Accepted options are:
-s, --summary-database <file>
    The summary database to generate the token index for.
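For example (file name illustrative):

    unobtanium-crawler regenerate-token-index --summary-database summary.db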
delete old-crawl-log-entries
Deletes old entries from the crawl log in the crawler database, along with their associated data.
Accepted options are:
-c, --crawler-db <file>
    The crawler database to delete crawl log entries from.
--keep-latest <n>
    How many of the latest entries for each page to keep.
--apply
    Actually apply the deletion instead of running a simulation.
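Because the deletion is only performed with --apply, a cautious workflow is to run the simulation first and, once the output looks right, repeat the command with --apply (file name and count are illustrative):

    unobtanium-crawler delete old-crawl-log-entries --crawler-db crawl.db --keep-latest 3
    unobtanium-crawler delete old-crawl-log-entries --crawler-db crawl.db --keep-latest 3 --apply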
debug indexiness
Prints a breakdown of the indexiness calculation for a given page.
Accepted options are:
-c, --database <file>
    The crawler database to fetch the source data for the indexiness calculation from.
-u, --url <url>
    The URL to run the calculation for. If the URL was crawled multiple times, the latest instance is used.
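For example (file name and URL illustrative):

    unobtanium-crawler debug indexiness --database crawl.db --url https://example.org/some-page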
debug query-crawl-log
Queries entries from the crawl log in the crawler database.
Accepted options are:
-c, --crawler-db <file>
    The crawler database to query crawl log entries from.
--uuid <uuid>
    Query by crawl log entry UUID.
--host <host>
    Filter the results by hostname.
--url <url>
    Filter the results by URL.
--exit-code <exit-code>
    Filter the results by exit code; both the name and the numeric ID are accepted.
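For example, to list crawl log entries for a single host (file name and hostname illustrative):

    unobtanium-crawler debug query-crawl-log --crawler-db crawl.db --host example.org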
debug sqlite-version
Prints the SQLite version that this version of unobtanium is using.
This command takes no options.