The unobtanium-crawler
collects data from the web and summarizes it.
It maintains both the crawler and summary databases.
Synopsis
The crawler is split up into many subcommands.
unobtanium-crawler crawl [OPTIONS]...
unobtanium-crawler summarize [OPTIONS]...
unobtanium-crawler optimize-db --database <database-file>
unobtanium-crawler regenerate-token-index [OPTIONS]...
unobtanium-crawler delete [SUBCOMMAND]
    old-crawl-log-entries [OPTIONS]...
unobtanium-crawler debug [SUBCOMMAND]
    indexiness [OPTIONS]...
    query-crawl-log [OPTIONS]...
    sqlite-version
crawl
The crawl subcommand starts the Crawl Loop for a given crawl configuration.
Note: For testing, the crawler can be configured with command-line options only; however, this setup isn't recommended for long-term deployments.
Accepted options are:
-c, --database <file>
    The crawler database file to store the crawl results in. If the file doesn't exist yet, it will be created.
-u, --user-agent <user_agent>
    Set the user agent. Overrides the user_agent setting from the configuration file.
-w, --worker-name <name>
    Set the worker name to be logged to the database. Default is ant.
-m, --max-commands <number>
    The maximum number of commands to process in this run. Overrides the max_commands_per_run setting from the configuration file.
-d, --default-delay <milliseconds>
    The default wait time between requests. Overrides the default_delay_ms setting from the configuration file.
--schedule <url>
    Manually schedule a seed URL. Overrides the seeds setting from the configuration file.
--config <path>
    Specify a path to the configuration file.
--force-recrawl
    Ignore when pages were last crawled and schedule them for recrawling immediately. Overrides the recrawl_interval setting from the configuration file.
--ignore-db-name-from-config
    Ignore the database_name setting from the configuration file.
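To illustrate how the options above interact, here is a minimal sketch of a delay-respecting crawl loop. The command queue, the fetch function, and all names are hypothetical stand-ins; unobtanium-crawler's real Crawl Loop is not documented here.

```python
import time

def crawl_loop(commands, fetch, default_delay_ms=1000, max_commands=None):
    """Sketch of a crawl loop honoring a default delay between requests.

    `commands` is an iterable of URLs; `fetch` performs one request.
    Both are illustrative placeholders, not unobtanium internals.
    """
    results = []
    last_request = None
    for i, url in enumerate(commands):
        if max_commands is not None and i >= max_commands:
            break  # cap the run, as --max-commands / max_commands_per_run would
        if last_request is not None:
            # Wait out whatever remains of the default delay between requests.
            elapsed = time.monotonic() - last_request
            remaining = default_delay_ms / 1000 - elapsed
            if remaining > 0:
                time.sleep(remaining)
        last_request = time.monotonic()
        results.append(fetch(url))
    return results
```

The delay is measured from the start of the previous request, so slow fetches do not add to the wait.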
summarize
The summarize subcommand takes a crawler database and integrates it into a summary database using the Summarizing algorithm.
Accepted options are:
-c, --crawler-db <file>
    Database file of the crawler database.
-s, --summary-db <file>
    Database file of the summary database. If the file doesn't exist yet, it will be created.
optimize-db
Runs the SQLite internal analyze and optimize commands on the given crawler or summary database.
Accepted options are:
-c, --database <file>
    The database file to optimize.
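The SQLite-internal commands referred to above are the ANALYZE statement and PRAGMA optimize. A rough equivalent of such an optimize step, using Python's standard sqlite3 module, looks like this (the wrapper function is illustrative, not unobtanium's implementation):

```python
import sqlite3

def optimize_database(path):
    """Run SQLite's statistics-gathering and optimization steps on a
    database file, roughly what an optimize-db style command performs."""
    conn = sqlite3.connect(path)
    try:
        conn.execute("ANALYZE")          # gather table/index statistics
        conn.execute("PRAGMA optimize")  # apply SQLite's recommended optimizations
        conn.commit()
    finally:
        conn.close()
```

ANALYZE populates the sqlite_stat tables that the query planner consults, so subsequent queries against large crawler or summary databases can pick better indexes.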
regenerate-token-index
Regenerates the experimental token-based index for use with the token: query.
Accepted options are:
-s, --summary-database <file>
    The summary database to generate the token index for.
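The layout of the token index isn't documented, but the classic structure behind a token: query is an inverted index mapping each token to the pages that contain it. A minimal sketch, with tokenization rules and data shapes chosen purely for illustration:

```python
import re
from collections import defaultdict

def build_token_index(pages):
    """Build an inverted index mapping each token to the set of page
    URLs containing it. `pages` maps URL -> page text. A real token
    index would likely live in a summary-database table, but the
    structure is the same."""
    index = defaultdict(set)
    for url, text in pages.items():
        # Illustrative tokenization: lowercase alphanumeric runs.
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(url)
    return index

def token_query(index, token):
    """Return the sorted URLs matching a token: query."""
    return sorted(index.get(token.lower(), set()))
```

Regenerating the index is then a full rebuild from the stored page text, which is why it is exposed as a separate maintenance subcommand.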
delete old-crawl-log-entries
Delete old entries from the crawl log in the crawler database along with their associated data.
Accepted options are:
-c, --crawler-db <file>
    The crawler database to delete crawl log entries from.
--keep-latest <n>
    How many of the latest entries for each page to keep.
--apply
    Actually apply the deletion instead of running a simulation.
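A "keep the latest n entries per page" deletion is typically expressed with a window function. The sketch below uses an invented crawl_log schema (id, url, crawled_at) purely for illustration, and mirrors the dry-run-by-default behavior of --apply:

```python
import sqlite3

def delete_old_entries(conn, keep_latest, apply=False):
    """Keep only the newest `keep_latest` crawl-log rows per URL.
    With apply=False, only report how many rows would be deleted.
    The crawl_log schema here is illustrative, not unobtanium's real one."""
    doomed = conn.execute(
        """
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY url ORDER BY crawled_at DESC
                   ) AS rank
            FROM crawl_log
        ) WHERE rank > ?
        """,
        (keep_latest,),
    ).fetchall()
    if apply:
        conn.executemany("DELETE FROM crawl_log WHERE id = ?", doomed)
        conn.commit()
    return len(doomed)
```

ROW_NUMBER() ranks each page's entries newest-first, so everything past rank `keep_latest` is a candidate for deletion. Window functions require SQLite 3.25 or later.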
debug indexiness
Prints a breakdown of the indexiness calculation for a given page.
Accepted options are:
-c, --database <file>
    The crawler database to fetch the source data for the indexiness calculation from.
-u, --url <url>
    The URL to run the calculation for; if there are multiple instances, the latest one will be used.
debug query-crawl-log
Query entries from the crawl log in the crawler database.
Accepted options are:
-c, --crawler-db <file>
    The crawler database to query crawl log entries from.
--uuid <uuid>
    Query by crawl log entry UUID.
--host <host>
    Filter the results by hostname.
--url <url>
    Filter the results by URL.
debug sqlite-version
Prints the SQLite version that this version of unobtanium is using.
This command takes no options.
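For comparison when debugging version-dependent behavior (for example, the window functions used by log cleanup), the SQLite version linked into a Python build can be read from the standard sqlite3 module:

```python
import sqlite3

# Version of the SQLite library this Python build links against.
# This need not match the version unobtanium-crawler reports.
print(sqlite3.sqlite_version)        # dotted version string
print(sqlite3.sqlite_version_info)   # the same as a tuple of ints
```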