CLI Reference#

This document contains the help content for the warcat command-line program.

warcat#

WARC archive tool

Usage: warcat [OPTIONS] <COMMAND>

Subcommands:#

  • export — Decodes a WARC file to messages in an easier-to-process format such as JSON

  • import — Encodes a WARC file from messages in a format produced by the export subcommand

  • list — Provides a listing of the WARC records

  • get — Returns a single WARC record

  • extract — Extracts resources for casual viewing of the WARC contents

  • verify — Performs specification and integrity checks on WARC files

  • self — Self-installer and uninstaller

Options:#

  • -q, --quiet — Disable any progress messages.

    Does not affect logging.

  • --log-level <LOG_LEVEL> — Filter log messages by level

    Default value: off

    Possible values: trace, debug, info, warn, error, off

  • --log-file <LOG_FILE> — Write log messages to the given file instead of standard error

  • --log-json — Write log messages as JSON sequences instead of a console logging format
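The usage line places these global options before the subcommand. As a minimal sketch, a script could assemble them as below; the helper name `global_options` and the file name `warcat.log` are made up for illustration and are not part of warcat.

```python
from typing import List, Optional

# Values listed above for --log-level.
LOG_LEVELS = ("trace", "debug", "info", "warn", "error", "off")


def global_options(
    quiet: bool = False,
    log_level: str = "off",
    log_file: Optional[str] = None,
    log_json: bool = False,
) -> List[str]:
    """Build the global option arguments that precede the subcommand.

    Hypothetical helper for illustration; not part of warcat.
    """
    if log_level not in LOG_LEVELS:
        raise ValueError(f"unknown log level: {log_level!r}")
    args: List[str] = []
    if quiet:
        args.append("--quiet")
    if log_level != "off":  # "off" is the documented default, so omit it
        args += ["--log-level", log_level]
    if log_file is not None:
        args += ["--log-file", log_file]
    if log_json:
        args.append("--log-json")
    return args


# Example: debug logging as JSON into a file, followed by a subcommand.
argv = [
    "warcat",
    *global_options(log_level="debug", log_file="warcat.log", log_json=True),
    "verify",
]
```

The resulting list can be passed to `subprocess.run` on a system where warcat is installed.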

warcat export#

Decodes a WARC file to messages in an easier-to-process format such as JSON

Usage: warcat export [OPTIONS]

Options:#

  • --input <INPUT> — Path to a WARC file

    Default value: -

  • --compression <COMPRESSION> — Specify the compression format of the input WARC file

    Default value: auto

    Possible values:

    • auto: Automatically detect the format by the filename extension

    • none: No compression

    • gzip: Gzip format (such as “.warc.gz” files)

    • zstandard: Zstandard format (such as “.warc.zst” files)

  • --output <OUTPUT> — Path for the output messages

    Default value: -

  • --format <FORMAT> — Format for the output messages

    Default value: json-seq

    Possible values:

    • json-seq: JSON sequences (RFC 7464). Each message is a JSON object delimited by a Record Separator (U+001E) and a Line Feed (U+000A)

    • jsonl: JSON Lines. Each message is a JSON object terminated by a Line Feed (U+000A)

    • cbor-seq: CBOR sequences (RFC 8742). Messages are a series of consecutive CBOR data items

  • --no-block — Do not output block messages

  • --extract — Output extract messages
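The default json-seq output can be read with a few lines of code. The sketch below splits the byte stream on the Record Separator and decodes each piece as JSON. The sample bytes are made up to show the framing only; real export messages have their own structure, which is not described in this reference.

```python
import json
from typing import Any, Iterator

RS = b"\x1e"  # Record Separator (U+001E)


def parse_json_seq(data: bytes) -> Iterator[Any]:
    """Yield each JSON text from an RFC 7464 JSON sequence."""
    for chunk in data.split(RS):
        chunk = chunk.strip()  # drop the trailing Line Feed
        if chunk:  # skip the empty piece before the first RS
            yield json.loads(chunk)


# Made-up sample showing only the framing, not real warcat messages.
sample = b'\x1e{"n": 1}\n\x1e{"n": 2}\n'
messages = list(parse_json_seq(sample))
```

This version holds the whole input in memory, which is fine for small files; a streaming reader would be a better fit for large archives.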

warcat import#

Encodes a WARC file from messages in a format produced by the export subcommand

Usage: warcat import [OPTIONS]

Options:#

  • --input <INPUT> — Path to the input messages

    Default value: -

  • --format <FORMAT> — Format for the input messages

    Default value: json-seq

    Possible values:

    • json-seq: JSON sequences (RFC 7464). Each message is a JSON object delimited by a Record Separator (U+001E) and a Line Feed (U+000A)

    • jsonl: JSON Lines. Each message is a JSON object terminated by a Line Feed (U+000A)

    • cbor-seq: CBOR sequences (RFC 8742). Messages are a series of consecutive CBOR data items

  • --output <OUTPUT> — Path of the output WARC file

    Default value: -

  • --compression <COMPRESSION> — Compression format of the output WARC file

    Default value: auto

    Possible values:

    • auto: Automatically detect the format by the filename extension

    • none: No compression

    • gzip: Gzip format (such as “.warc.gz” files)

    • zstandard: Zstandard format (such as “.warc.zst” files)

  • --compression-level <COMPRESSION_LEVEL> — Level of compression for the output

    Default value: high

    Possible values:

    • balanced: A balance between compression ratio and resource consumption

    • high: Use a reasonably increased amount of resources to achieve a better compression ratio

    • low: Fast and low resource usage, but lower compression ratio
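When producing input for this subcommand from a script, the framing of the two JSON formats can be written as follows. This is a sketch of the framing only: the placeholder objects are not valid warcat messages, and real input is expected to have the structure that the export subcommand produces.

```python
import json
from typing import Any, Iterable


def encode_json_seq(messages: Iterable[Any]) -> bytes:
    """json-seq: each JSON text is preceded by RS (U+001E) and followed by LF."""
    return b"".join(
        b"\x1e" + json.dumps(m).encode("utf-8") + b"\n" for m in messages
    )


def encode_jsonl(messages: Iterable[Any]) -> bytes:
    """jsonl: each JSON object is terminated by LF."""
    return b"".join(json.dumps(m).encode("utf-8") + b"\n" for m in messages)


# Placeholder objects; not valid warcat messages.
placeholder = [{"n": 1}, {"n": 2}]
seq_bytes = encode_json_seq(placeholder)
jsonl_bytes = encode_jsonl(placeholder)
```

`json.dumps` escapes line breaks inside strings, so each encoded message stays on a single line in both formats.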

warcat list#

Provides a listing of the WARC records

Usage: warcat list [OPTIONS]

Options:#

  • --input <INPUT> — Path of the WARC file

    Default value: -

  • --compression <COMPRESSION> — Compression format of the input WARC file

    Default value: auto

    Possible values:

    • auto: Automatically detect the format by the filename extension

    • none: No compression

    • gzip: Gzip format (such as “.warc.gz” files)

    • zstandard: Zstandard format (such as “.warc.zst” files)

  • --output <OUTPUT> — Path to output listings

    Default value: -

  • --format <FORMAT> — Format of the output

    Default value: json-seq

    Possible values:

    • json-seq: JSON sequences (RFC 7464). Each message is a JSON object delimited by a Record Separator (U+001E) and a Line Feed (U+000A)

    • jsonl: JSON Lines. Each message is a JSON object terminated by a Line Feed (U+000A)

    • cbor-seq: CBOR sequences (RFC 8742). Messages are a series of consecutive CBOR data items

    • csv: Comma-separated values

  • --field <FIELD> — Fields to include in the listing.

    The option accepts names of fields that occur in a WARC header.

    The pseudo-name :position represents the position in the file. :file represents the path of the file.

    Default value: :position,WARC-Record-ID,WARC-Type,Content-Type,WARC-Target-URI
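A sketch of assembling a list invocation with a custom field selection follows. It assumes that --field accepts a comma-separated list, which is inferred from the form of the default value shown above rather than stated outright; the helper name `list_command` and the file name are made up for illustration.

```python
from typing import List, Sequence

# Formats listed above for `warcat list --format`.
LIST_FORMATS = ("json-seq", "jsonl", "cbor-seq", "csv")

# The documented default field selection.
DEFAULT_FIELDS = (
    ":position",
    "WARC-Record-ID",
    "WARC-Type",
    "Content-Type",
    "WARC-Target-URI",
)


def list_command(
    input_path: str,
    fields: Sequence[str] = DEFAULT_FIELDS,
    fmt: str = "json-seq",
) -> List[str]:
    """Build an argument list for `warcat list` (hypothetical helper).

    Assumption: --field takes the field names joined by commas.
    """
    if fmt not in LIST_FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    return [
        "warcat", "list",
        "--input", input_path,
        "--format", fmt,
        "--field", ",".join(fields),
    ]


# Example: position, file path, and target URI as CSV.
argv = list_command(
    "example.warc.gz",
    fields=(":position", ":file", "WARC-Target-URI"),
    fmt="csv",
)
```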

warcat get#

Returns a single WARC record

Usage: warcat get <COMMAND>

Subcommands:#

  • export — Output export messages

  • extract — Extract a resource

warcat get export#

Output export messages

Usage: warcat get export [OPTIONS] --position <POSITION>

Options:#

  • --input <INPUT> — Path of the WARC file

    Default value: -

  • --compression <COMPRESSION> — Compression format of the input WARC file

    Default value: auto

    Possible values:

    • auto: Automatically detect the format by the filename extension

    • none: No compression

    • gzip: Gzip format (such as “.warc.gz” files)

    • zstandard: Zstandard format (such as “.warc.zst” files)

  • --position <POSITION> — Position where the record is located in the input WARC file

  • --id <ID> — The ID of the record to extract

  • --output <OUTPUT> — Path for the output messages

    Default value: -

  • --format <FORMAT> — Format for the output messages

    Default value: json-seq

    Possible values:

    • json-seq: JSON sequences (RFC 7464). Each message is a JSON object delimited by a Record Separator (U+001E) and a Line Feed (U+000A)

    • jsonl: JSON Lines. Each message is a JSON object terminated by a Line Feed (U+000A)

    • cbor-seq: CBOR sequences (RFC 8742). Messages are a series of consecutive CBOR data items

  • --no-block — Do not output block messages

  • --extract — Output extract messages

warcat get extract#

Extract a resource

Usage: warcat get extract [OPTIONS] --position <POSITION>

Options:#

  • --input <INPUT> — Path of the WARC file

    Default value: -

  • --compression <COMPRESSION> — Compression format of the input WARC file

    Default value: auto

    Possible values:

    • auto: Automatically detect the format by the filename extension

    • none: No compression

    • gzip: Gzip format (such as “.warc.gz” files)

    • zstandard: Zstandard format (such as “.warc.zst” files)

  • --position <POSITION> — Position where the record is located in the input WARC file

  • --id <ID> — The ID of the record to extract

  • --output <OUTPUT> — Path for the output file

    Default value: -
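The auto value of --compression is described as detecting the format from the filename extension. The sketch below shows what such a lookup might look like. It is not warcat's actual implementation: the suffix matching, the case-insensitive comparison, and the fallback to none are all assumptions.

```python
def guess_compression(filename: str) -> str:
    """Map a filename to one of the documented --compression values.

    Illustrative only; warcat's own detection rules may differ.
    """
    name = filename.lower()
    if name.endswith(".gz"):  # such as ".warc.gz"
        return "gzip"
    if name.endswith(".zst"):  # such as ".warc.zst"
        return "zstandard"
    return "none"  # assumed fallback when no known extension matches


guesses = {
    name: guess_compression(name)
    for name in ("example.warc.gz", "example.warc.zst", "example.warc", "-")
}
```

The default input of - carries no extension, so this sketch reports none for it; what warcat itself does in that case is not stated in this reference, and passing --compression explicitly avoids relying on detection.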

warcat extract#

Extracts resources for casual viewing of the WARC contents.

Files are extracted to a directory structure similar to the archived URL.

This operation does not automatically permit offline viewing of archived websites; no content conversion or link-rewriting is performed.

Usage: warcat extract [OPTIONS]

Options:#

  • --input <INPUT> — Path to the WARC file

    Default value: -

  • --compression <COMPRESSION> — Compression format of the input WARC file

    Default value: auto

    Possible values:

    • auto: Automatically detect the format by the filename extension

    • none: No compression

    • gzip: Gzip format (such as “.warc.gz” files)

    • zstandard: Zstandard format (such as “.warc.zst” files)

  • --output <OUTPUT> — Path to the output directory

    Default value: ./

  • --continue-on-error — Ignore errors and continue processing

  • --include <INCLUDE> — Select only records with a field.

    Rule format is “NAME” or “NAME:VALUE”.

  • --include-pattern <INCLUDE_PATTERN> — Select only records matching a regular expression.

    Rule format is “NAME:VALUEPATTERN”.

  • --exclude <EXCLUDE> — Do not select records with a field.

    Rule format is “NAME” or “NAME:VALUE”.

  • --exclude-pattern <EXCLUDE_PATTERN> — Do not select records matching a regular expression.

    Rule format is “NAME:VALUEPATTERN”.
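The rule formats above can be illustrated with a small matcher. This is a sketch under stated assumptions, not warcat's implementation: it splits each rule at the first colon, compares values exactly and case-sensitively, and applies patterns with Python's `re.search` (unanchored). The actual matching semantics and regular expression dialect used by warcat may differ.

```python
import re
from typing import Dict, Optional, Tuple


def parse_rule(rule: str) -> Tuple[str, Optional[str]]:
    """Split "NAME" or "NAME:VALUE" at the first colon (assumed behavior)."""
    name, sep, value = rule.partition(":")
    return (name, value if sep else None)


def matches(rule: str, fields: Dict[str, str]) -> bool:
    """"NAME" matches when the field exists; "NAME:VALUE" also needs an equal value."""
    name, value = parse_rule(rule)
    if name not in fields:
        return False
    return value is None or fields[name] == value


def matches_pattern(rule: str, fields: Dict[str, str]) -> bool:
    """"NAME:VALUEPATTERN" matches when the pattern is found in the field value."""
    name, pattern = parse_rule(rule)
    if pattern is None or name not in fields:
        return False
    return re.search(pattern, fields[name]) is not None


# Made-up header fields for illustration.
record = {
    "WARC-Type": "response",
    "WARC-Target-URI": "https://example.com/a.png",
}
```

Splitting at the first colon keeps values that themselves contain colons, such as URIs, intact.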

warcat verify#

Performs specification and integrity checks on WARC files

Usage: warcat verify [OPTIONS]

Options:#

  • --input <INPUT> — Path to the WARC file

    Default value: -

  • --compression <COMPRESSION> — Compression format of the input WARC file

    Default value: auto

    Possible values:

    • auto: Automatically detect the format by the filename extension

    • none: No compression

    • gzip: Gzip format (such as “.warc.gz” files)

    • zstandard: Zstandard format (such as “.warc.zst” files)

  • --output <OUTPUT> — Path to output problems

    Default value: -

  • --format <FORMAT> — Format of the output

    Default value: json-seq

    Possible values:

    • json-seq: JSON sequences (RFC 7464). Each message is a JSON object delimited by a Record Separator (U+001E) and a Line Feed (U+000A)

    • jsonl: JSON Lines. Each message is a JSON object terminated by a Line Feed (U+000A)

    • cbor-seq: CBOR sequences (RFC 8742). Messages are a series of consecutive CBOR data items

    • csv: Comma-separated values

  • --exclude-check <EXCLUDE_CHECK> — Do not perform the given check

    Possible values: mandatory-fields, known-record-type, content-type, concurrent-to, block-digest, payload-digest, ip-address, refers-to, refers-to-target-uri, refers-to-date, target-uri, truncated, warcinfo-id, filename, profile, segment, record-at-time-compression

  • --database <DATABASE> — Database filename for storing temporary intermediate data
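Since --exclude-check only accepts the names listed above, a wrapper script may want to validate names before invoking warcat. A minimal sketch follows; the helper name `unknown_checks` is made up for illustration.

```python
from typing import Iterable, List

# Check names copied from the "Possible values" list for --exclude-check.
CHECKS = frozenset({
    "mandatory-fields", "known-record-type", "content-type", "concurrent-to",
    "block-digest", "payload-digest", "ip-address", "refers-to",
    "refers-to-target-uri", "refers-to-date", "target-uri", "truncated",
    "warcinfo-id", "filename", "profile", "segment",
    "record-at-time-compression",
})


def unknown_checks(names: Iterable[str]) -> List[str]:
    """Return the names that are not documented check names, in input order."""
    return [name for name in names if name not in CHECKS]


# Example: one valid name and one misspelled name.
rejected = unknown_checks(["block-digest", "payload-digset"])
```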

warcat self#

Self-installer and uninstaller

Usage: warcat self <COMMAND>

Subcommands:#

  • install — Launch the interactive self-installer

  • uninstall — Launch the interactive uninstaller

warcat self install#

Launch the interactive self-installer

Usage: warcat self install [OPTIONS]

Options:#

  • --quiet — Install automatically without user interaction

warcat self uninstall#

Launch the interactive uninstaller

Usage: warcat self uninstall [OPTIONS]

Options:#

  • --quiet — Uninstall automatically without user interaction