CLI Reference

This document contains the help content for the warcat command-line program.

`warcat`

WARC archive tool

Usage: warcat [OPTIONS] <COMMAND>

Subcommands:

export — Decodes a WARC file to messages in an easier-to-process format such as JSON
import — Encodes a WARC file from messages in the format produced by the export subcommand
list — Provides a listing of the WARC records
get — Returns a single WARC record
extract — Extracts resources for casual viewing of the WARC contents
verify — Perform specification and integrity checks on WARC files
self — Self-installer and uninstaller

Options:

-q, --quiet — Disable any progress messages.

Does not affect logging.
--log-level <LOG_LEVEL> — Filter log messages by level

Default value: off

Possible values: trace, debug, info, warn, error, off
--log-file <LOG_FILE> — Write log messages to the given file instead of standard error
--log-json — Write log messages as JSON sequences instead of a console logging format

`warcat export`

Decodes a WARC file to messages in an easier-to-process format such as JSON

Usage: warcat export [OPTIONS]

Options:

--input <INPUT> — Path to a WARC file

Default value: -
--compression <COMPRESSION> — Specify the compression format of the input WARC file

Default value: auto

Possible values:
- auto: Automatically detect the format by the filename extension
- none: No compression
- gzip: Gzip format (such as “.warc.gz” files)
- zstandard: Zstandard format (such as “.warc.zst” files)
--output <OUTPUT> — Path for the output messages

Default value: -
--format <FORMAT> — Format for the output messages

Default value: json-seq

Possible values:
- json-seq: JSON sequences (RFC 7464). Each message is a JSON object delimited by a Record Separator (U+001E) and a Line Feed (U+000A)
- jsonl: JSON Lines. Each message is a JSON object terminated by a Line Feed (U+000A)
- cbor-seq: CBOR sequences (RFC 8742). Messages are a series of consecutive CBOR data items
--no-block — Do not output block messages
--extract — Output extract messages
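
The json-seq framing described above can be read with a few lines of Python. This is a minimal sketch, not part of warcat: the function name `parse_json_seq` and the sample bytes are hypothetical, and the real messages will have a different shape than the placeholder objects used here.

```python
import json

RS = b"\x1e"  # Record Separator (U+001E) that precedes each JSON text in RFC 7464


def parse_json_seq(data: bytes) -> list:
    """Parse a JSON text sequence where each item is RS + JSON text + LF."""
    messages = []
    for chunk in data.split(RS):
        chunk = chunk.strip()  # drop the trailing Line Feed
        if not chunk:
            continue  # skip the empty piece before the first RS
        messages.append(json.loads(chunk))
    return messages


# Hypothetical sample bytes; real export messages will look different.
sample = b'\x1e{"example": 1}\n\x1e{"example": 2}\n'
print(parse_json_seq(sample))
```

In practice the bytes would come from the file given to `--output`, or from standard output when the default `-` is used.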

`warcat import`

Encodes a WARC file from messages in the format produced by the export subcommand

Usage: warcat import [OPTIONS]

Options:

--input <INPUT> — Path to the input messages

Default value: -
--format <FORMAT> — Format for the input messages

Default value: json-seq

Possible values:
- json-seq: JSON sequences (RFC 7464). Each message is a JSON object delimited by a Record Separator (U+001E) and a Line Feed (U+000A)
- jsonl: JSON Lines. Each message is a JSON object terminated by a Line Feed (U+000A)
- cbor-seq: CBOR sequences (RFC 8742). Messages are a series of consecutive CBOR data items
--output <OUTPUT> — Path of the output WARC file

Default value: -
--compression <COMPRESSION> — Compression format of the output WARC file

Default value: auto

Possible values:
- auto: Automatically detect the format by the filename extension
- none: No compression
- gzip: Gzip format (such as “.warc.gz” files)
- zstandard: Zstandard format (such as “.warc.zst” files)
--compression-level <COMPRESSION_LEVEL> — Level of compression for the output

Default value: high

Possible values:
- balanced: A balance between compression ratio and resource consumption
- high: Use a reasonably increased amount of resources to achieve a better compression ratio
- low: Fast and low resource usage, but lower compression ratio
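
For a script that prepares input for this subcommand, the two JSON framings listed above could be produced as follows. This is a sketch under assumptions: the helper names `to_json_seq` and `to_jsonl` are hypothetical, and the placeholder objects do not reflect the actual message schema, which should be taken from real `warcat export` output.

```python
import json


def to_json_seq(messages) -> bytes:
    """Frame each message as RS (U+001E) + JSON text + LF, per RFC 7464."""
    return b"".join(
        b"\x1e" + json.dumps(message).encode("utf-8") + b"\n"
        for message in messages
    )


def to_jsonl(messages) -> bytes:
    """Frame each message as JSON text + LF (JSON Lines)."""
    # json.dumps escapes newlines inside strings, so each message stays on one line.
    return b"".join(
        json.dumps(message).encode("utf-8") + b"\n" for message in messages
    )


# Placeholder objects only; not the real message schema.
placeholder = [{"example": 1}, {"example": "two\nlines"}]
print(to_json_seq(placeholder))
print(to_jsonl(placeholder))
```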

`warcat list`

Provides a listing of the WARC records

Usage: warcat list [OPTIONS]

Options:

--input <INPUT> — Path of the WARC file

Default value: -
--compression <COMPRESSION> — Compression format of the input WARC file

Default value: auto

Possible values:
- auto: Automatically detect the format by the filename extension
- none: No compression
- gzip: Gzip format (such as “.warc.gz” files)
- zstandard: Zstandard format (such as “.warc.zst” files)
--output <OUTPUT> — Path to output listings

Default value: -
--format <FORMAT> — Format of the output

Default value: json-seq

Possible values:
- json-seq: JSON sequences (RFC 7464). Each message is a JSON object delimited by a Record Separator (U+001E) and a Line Feed (U+000A)
- jsonl: JSON Lines. Each message is a JSON object terminated by a Line Feed (U+000A)
- cbor-seq: CBOR sequences (RFC 8742). Messages are a series of consecutive CBOR data items
- csv: Comma-separated values
--field <FIELD> — Fields to include in the listing.

The option accepts names of fields that occur in a WARC header.

The pseudo-name :position represents the position in the file. :file represents the path of the file.

Default value: :position,WARC-Record-ID,WARC-Type,Content-Type,WARC-Target-URI
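
When calling this subcommand from a script, the argument list might be assembled as in the sketch below. The helper `build_list_command` and the file name are hypothetical. Joining the field names with commas in a single `--field` value is an assumption based on the shape of the default value shown above; check `warcat list --help` before relying on it.

```python
import shlex


def build_list_command(warc_path, fields, output_format="jsonl"):
    """Build an argument list for `warcat list` using options from this page."""
    return [
        "warcat",
        "list",
        "--input",
        warc_path,
        "--format",
        output_format,
        # Assumption: a comma-joined value is accepted, mirroring the default.
        "--field",
        ",".join(fields),
    ]


argv = build_list_command(
    "example.warc.gz",  # hypothetical file name
    [":position", "WARC-Type", "WARC-Target-URI"],
)
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, capture_output=True)` on a system where warcat is installed.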

`warcat get`

Returns a single WARC record

Usage: warcat get <COMMAND>

Subcommands:

export — Output export messages
extract — Extract a resource

`warcat get export`

Output export messages

Usage: warcat get export [OPTIONS] --position <POSITION>

Options:

--input <INPUT> — Path of the WARC file

Default value: -
--compression <COMPRESSION> — Compression format of the input WARC file

Default value: auto

Possible values:
- auto: Automatically detect the format by the filename extension
- none: No compression
- gzip: Gzip format (such as “.warc.gz” files)
- zstandard: Zstandard format (such as “.warc.zst” files)
--position <POSITION> — Position where the record is located in the input WARC file
--id <ID> — The ID of the record to extract
--output <OUTPUT> — Path for the output messages

Default value: -
--format <FORMAT> — Format for the output messages

Default value: json-seq

Possible values:
- json-seq: JSON sequences (RFC 7464). Each message is a JSON object delimited by a Record Separator (U+001E) and a Line Feed (U+000A)
- jsonl: JSON Lines. Each message is a JSON object terminated by a Line Feed (U+000A)
- cbor-seq: CBOR sequences (RFC 8742). Messages are a series of consecutive CBOR data items
--no-block — Do not output block messages
--extract — Output extract messages

`warcat get extract`

Extract a resource

Usage: warcat get extract [OPTIONS] --position <POSITION>

Options:

--input <INPUT> — Path of the WARC file

Default value: -
--compression <COMPRESSION> — Compression format of the input WARC file

Default value: auto

Possible values:
- auto: Automatically detect the format by the filename extension
- none: No compression
- gzip: Gzip format (such as “.warc.gz” files)
- zstandard: Zstandard format (such as “.warc.zst” files)
--position <POSITION> — Position where the record is located in the input WARC file
--id <ID> — The ID of the record to extract
--output <OUTPUT> — Path for the output file

Default value: -

`warcat extract`

Extracts resources for casual viewing of the WARC contents.

Files are extracted to a directory structure similar to the archived URL.

This operation does not automatically permit offline viewing of archived websites; no content conversion or link-rewriting is performed.

Usage: warcat extract [OPTIONS]

Options:

--input <INPUT> — Path to the WARC file

Default value: -
--compression <COMPRESSION> — Compression format of the input WARC file

Default value: auto

Possible values:
- auto: Automatically detect the format by the filename extension
- none: No compression
- gzip: Gzip format (such as “.warc.gz” files)
- zstandard: Zstandard format (such as “.warc.zst” files)
--output <OUTPUT> — Path to the output directory

Default value: ./
--continue-on-error — Ignore errors and continue processing
--include <INCLUDE> — Select only records with a field.

Rule format is “NAME” or “NAME:VALUE”.
--include-pattern <INCLUDE_PATTERN> — Select only records matching a regular expression.

Rule format is “NAME:VALUEPATTERN”.
--exclude <EXCLUDE> — Do not select records with a field.

Rule format is “NAME” or “NAME:VALUE”.
--exclude-pattern <EXCLUDE_PATTERN> — Do not select records matching a regular expression.

Rule format is “NAME:VALUEPATTERN”.
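
The rule formats above can be illustrated with a small sketch. This is not warcat's implementation: the functions `parse_rule` and `rule_matches` are hypothetical, and the matching semantics shown (case-insensitive field names, exact value comparison, and an unanchored regular expression search for patterns) are assumptions made for illustration.

```python
import re


def parse_rule(rule: str):
    """Split "NAME" or "NAME:VALUE" at the first colon; value is None if absent."""
    name, separator, value = rule.partition(":")
    return name, (value if separator else None)


def rule_matches(rule: str, fields: dict, pattern: bool = False) -> bool:
    """Return True if a record's header fields satisfy the rule (assumed semantics)."""
    name, value = parse_rule(rule)
    for field_name, field_value in fields.items():
        if field_name.lower() != name.lower():
            continue
        if value is None:
            return True  # "NAME" form: the field only has to exist
        if pattern:
            return re.search(value, field_value) is not None
        return field_value == value
    return False


# Hypothetical header fields of one record.
record = {
    "WARC-Type": "response",
    "WARC-Target-URI": "https://example.com/a.png",
}
print(rule_matches("WARC-Type:response", record))
print(rule_matches(r"WARC-Target-URI:\.png$", record, pattern=True))
```

Splitting at the first colon keeps values that themselves contain colons, such as URLs, intact.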

`warcat verify`

Perform specification and integrity checks on WARC files

Usage: warcat verify [OPTIONS]

Options:

--input <INPUT> — Path to the WARC file

Default value: -
--compression <COMPRESSION> — Compression format of the input WARC file

Default value: auto

Possible values:
- auto: Automatically detect the format by the filename extension
- none: No compression
- gzip: Gzip format (such as “.warc.gz” files)
- zstandard: Zstandard format (such as “.warc.zst” files)
--output <OUTPUT> — Path to output problems

Default value: -
--format <FORMAT> — Format of the output

Default value: json-seq

Possible values:
- json-seq: JSON sequences (RFC 7464). Each message is a JSON object delimited by a Record Separator (U+001E) and a Line Feed (U+000A)
- jsonl: JSON Lines. Each message is a JSON object terminated by a Line Feed (U+000A)
- cbor-seq: CBOR sequences (RFC 8742). Messages are a series of consecutive CBOR data items
- csv: Comma-separated values
--exclude-check <EXCLUDE_CHECK> — Do not perform the given check

Possible values: mandatory-fields, known-record-type, content-type, concurrent-to, block-digest, payload-digest, ip-address, refers-to, refers-to-target-uri, refers-to-date, target-uri, truncated, warcinfo-id, filename, profile, segment, record-at-time-compression
--database <DATABASE> — Database filename for storing temporary intermediate data
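
A script that runs this subcommand could validate check names against the list above before building the command, as in the sketch below. The helper `build_verify_command` and the file name are hypothetical. Repeating `--exclude-check` once per check is an assumption; consult `warcat verify --help` for the accepted form.

```python
# Check names copied from the "Possible values" list for --exclude-check.
KNOWN_CHECKS = {
    "mandatory-fields", "known-record-type", "content-type", "concurrent-to",
    "block-digest", "payload-digest", "ip-address", "refers-to",
    "refers-to-target-uri", "refers-to-date", "target-uri", "truncated",
    "warcinfo-id", "filename", "profile", "segment",
    "record-at-time-compression",
}


def build_verify_command(warc_path, excluded_checks=()):
    """Build an argument list for `warcat verify`, rejecting unknown check names."""
    unknown = sorted(set(excluded_checks) - KNOWN_CHECKS)
    if unknown:
        raise ValueError("unknown check name(s): " + ", ".join(unknown))
    argv = ["warcat", "verify", "--input", warc_path]
    for check in excluded_checks:
        # Assumption: the option may be given once per excluded check.
        argv += ["--exclude-check", check]
    return argv


print(build_verify_command("example.warc.gz", ["payload-digest", "ip-address"]))
```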

`warcat self`

Self-installer and uninstaller

Usage: warcat self <COMMAND>

Subcommands:

install — Launch the interactive self-installer
uninstall — Launch the interactive uninstaller

`warcat self install`

Launch the interactive self-installer

Usage: warcat self install [OPTIONS]

Options:

--quiet — Install automatically without user interaction

`warcat self uninstall`

Launch the interactive uninstaller

Usage: warcat self uninstall [OPTIONS]

Options:

--quiet — Uninstall automatically without user interaction
