CLI Reference#
This document contains the help content for the warcat command-line program.
warcat#
WARC archive tool
Usage: warcat [OPTIONS] <COMMAND>
Subcommands:#
export: Decodes a WARC file to messages in an easier-to-process format such as JSON
import: Encodes a WARC file from messages in the format produced by the export subcommand
list: Provides a listing of the WARC records
get: Returns a single WARC record
extract: Extracts resources for casual viewing of the WARC contents
verify: Performs specification and integrity checks on WARC files
self: Self-installer and uninstaller
Options:#
-q, --quiet: Disable any progress messages. Does not affect logging.
--log-level <LOG_LEVEL>: Filter log messages by level
  Default value: off
  Possible values: trace, debug, info, warn, error, off
--log-file <LOG_FILE>: Write log messages to the given file instead of standard error
--log-json: Write log messages as JSON sequences instead of a console logging format
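The usage line places global options before the subcommand. As a minimal sketch of that ordering, the hypothetical helper below (`warcat_argv` is not part of warcat) assembles an argument list from the global options documented above. The file names are placeholders.

```python
from typing import List, Optional, Sequence

# Log levels as listed under --log-level above.
LOG_LEVELS = ("trace", "debug", "info", "warn", "error", "off")


def warcat_argv(
    command: Sequence[str],
    *,
    quiet: bool = False,
    log_level: str = "off",
    log_file: Optional[str] = None,
    log_json: bool = False,
) -> List[str]:
    """Build an argument list with global options placed before the subcommand."""
    if log_level not in LOG_LEVELS:
        raise ValueError(f"unknown log level: {log_level!r}")
    argv = ["warcat"]
    if quiet:
        argv.append("--quiet")
    argv += ["--log-level", log_level]
    if log_file is not None:
        argv += ["--log-file", log_file]
    if log_json:
        argv.append("--log-json")
    argv += list(command)
    return argv


# "example.warc.gz" and "warcat.log" are placeholder file names.
argv = warcat_argv(
    ["list", "--input", "example.warc.gz"],
    quiet=True,
    log_level="info",
    log_file="warcat.log",
)
print(argv)
```

The resulting list could be passed to `subprocess.run(argv, check=True)` on a machine where warcat is installed.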
warcat export#
Decodes a WARC file to messages in an easier-to-process format such as JSON
Usage: warcat export [OPTIONS]
Options:#
--input <INPUT>: Path to a WARC file
  Default value: -
--compression <COMPRESSION>: Specify the compression format of the input WARC file
  Default value: auto
  Possible values:
    auto: Automatically detect the format by the filename extension
    none: No compression
    gzip: Gzip format (such as “.warc.gz” files)
    zstandard: Zstandard format (such as “.warc.zst” files)
--output <OUTPUT>: Path for the output messages
  Default value: -
--format <FORMAT>: Format for the output messages
  Default value: json-seq
  Possible values:
    json-seq: JSON sequences (RFC 7464). Each message is a JSON object delimited by a Record Separator (U+001E) and a Line Feed (U+000A)
    jsonl: JSON Lines. Each message is a JSON object terminated by a Line Feed (U+000A)
    cbor-seq: CBOR sequences (RFC 8742). Messages are a series of consecutive CBOR data items
--no-block: Do not output block messages
--extract: Output extract messages
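The two JSON-based formats differ only in framing, so a reader for either is short. The sketch below splits on the delimiters described above and makes no assumption about the fields inside each message. The sample data is hypothetical and does not reflect the structure of real warcat messages.

```python
import json
from typing import Iterator

RS = b"\x1e"  # Record Separator (U+001E), which starts each RFC 7464 record


def iter_json_seq(data: bytes) -> Iterator[object]:
    """Yield each JSON value found in an RFC 7464 JSON text sequence."""
    for chunk in data.split(RS):
        text = chunk.strip()
        if not text:
            continue  # skip the empty piece before the first separator
        yield json.loads(text)


def iter_json_lines(data: bytes) -> Iterator[object]:
    """Yield each JSON value found in JSON Lines data."""
    for line in data.splitlines():
        if line.strip():
            yield json.loads(line)


# Hypothetical sample data, used only to exercise the framing.
sample_seq = b'\x1e{"n": 1}\n\x1e{"n": 2}\n'
sample_lines = b'{"n": 1}\n{"n": 2}\n'

print(list(iter_json_seq(sample_seq)))
print(list(iter_json_lines(sample_lines)))
```

For large exports, reading the output incrementally rather than loading it into one `bytes` object would be preferable; the in-memory form is used here for brevity.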
warcat import#
Encodes a WARC file from messages in the format produced by the export subcommand
Usage: warcat import [OPTIONS]
Options:#
--input <INPUT>: Path to the input messages
  Default value: -
--format <FORMAT>: Format for the input messages
  Default value: json-seq
  Possible values:
    json-seq: JSON sequences (RFC 7464). Each message is a JSON object delimited by a Record Separator (U+001E) and a Line Feed (U+000A)
    jsonl: JSON Lines. Each message is a JSON object terminated by a Line Feed (U+000A)
    cbor-seq: CBOR sequences (RFC 8742). Messages are a series of consecutive CBOR data items
--output <OUTPUT>: Path of the output WARC file
  Default value: -
--compression <COMPRESSION>: Compression format of the output WARC file
  Default value: auto
  Possible values:
    auto: Automatically detect the format by the filename extension
    none: No compression
    gzip: Gzip format (such as “.warc.gz” files)
    zstandard: Zstandard format (such as “.warc.zst” files)
--compression-level <COMPRESSION_LEVEL>: Level of compression for the output
  Default value: high
  Possible values:
    balanced: A balance between compression ratio and resource consumption
    high: Use a reasonably increased amount of resources to achieve a better compression ratio
    low: Fast and low resource usage, but lower compression ratio
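When a script needs to pass an explicit `--compression` value instead of relying on `auto`, the choice can be derived from the filename. The sketch below illustrates extension-based selection using only the two extensions named above. It is not warcat's own detection logic, and the case-insensitive match and the fallback to `none` are assumptions.

```python
from typing import List

# Extensions taken from the examples in the option description above.
EXTENSION_TO_COMPRESSION = {
    ".gz": "gzip",
    ".zst": "zstandard",
}

# Levels as listed under --compression-level above.
COMPRESSION_LEVELS = ("balanced", "high", "low")


def guess_compression(filename: str) -> str:
    """Return a --compression value based on the filename extension."""
    lowered = filename.lower()  # assumption: match extensions case-insensitively
    for extension, name in EXTENSION_TO_COMPRESSION.items():
        if lowered.endswith(extension):
            return name
    return "none"  # assumption: anything else is treated as uncompressed


def import_argv(messages: str, warc: str, level: str = "high") -> List[str]:
    """Build a `warcat import` argument list with an explicit compression."""
    if level not in COMPRESSION_LEVELS:
        raise ValueError(f"unknown compression level: {level!r}")
    return [
        "warcat", "import",
        "--input", messages,
        "--output", warc,
        "--compression", guess_compression(warc),
        "--compression-level", level,
    ]


# "messages.json-seq" and "rebuilt.warc.zst" are placeholder file names.
print(import_argv("messages.json-seq", "rebuilt.warc.zst", level="balanced"))
```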
warcat list#
Provides a listing of the WARC records
Usage: warcat list [OPTIONS]
Options:#
--input <INPUT>: Path of the WARC file
  Default value: -
--compression <COMPRESSION>: Compression format of the input WARC file
  Default value: auto
  Possible values:
    auto: Automatically detect the format by the filename extension
    none: No compression
    gzip: Gzip format (such as “.warc.gz” files)
    zstandard: Zstandard format (such as “.warc.zst” files)
--output <OUTPUT>: Path to output listings
  Default value: -
--format <FORMAT>: Format of the output
  Default value: json-seq
  Possible values:
    json-seq: JSON sequences (RFC 7464). Each message is a JSON object delimited by a Record Separator (U+001E) and a Line Feed (U+000A)
    jsonl: JSON Lines. Each message is a JSON object terminated by a Line Feed (U+000A)
    cbor-seq: CBOR sequences (RFC 8742). Messages are a series of consecutive CBOR data items
    csv: Comma-separated values
--field <FIELD>: Fields to include in the listing. The option accepts names of fields that occur in a WARC header.
  The pseudo-name :position represents the position in the file. :file represents the path of the file.
  Default value: :position,WARC-Record-ID,WARC-Type,Content-Type,WARC-Target-URI
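The default value for `--field` is written as a comma-separated list, which suggests that several field names can be passed in one value. The sketch below builds such a value from a list of names. The helper name `field_option` is hypothetical, and the comma-joined form is an assumption inferred from the default shown above.

```python
from typing import Iterable, List

# The documented default field list for `warcat list`.
DEFAULT_FIELDS = [
    ":position",
    "WARC-Record-ID",
    "WARC-Type",
    "Content-Type",
    "WARC-Target-URI",
]


def field_option(fields: Iterable[str]) -> List[str]:
    """Return ["--field", "A,B,C"] for the given header or pseudo-field names."""
    names = [name.strip() for name in fields]
    if not names:
        raise ValueError("at least one field is required")
    for name in names:
        if not name or "," in name:
            raise ValueError(f"invalid field name: {name!r}")
    return ["--field", ",".join(names)]


# "example.warc.gz" is a placeholder file name.
argv = ["warcat", "list", "--input", "example.warc.gz", "--format", "csv"]
argv += field_option([":file", ":position", "WARC-Type"])
print(argv)
```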
warcat get#
Returns a single WARC record
Usage: warcat get <COMMAND>
Subcommands:#
export: Output export messages
extract: Extract a resource
warcat get export#
Output export messages
Usage: warcat get export [OPTIONS] --position <POSITION>
Options:#
--input <INPUT>: Path of the WARC file
  Default value: -
--compression <COMPRESSION>: Compression format of the input WARC file
  Default value: auto
  Possible values:
    auto: Automatically detect the format by the filename extension
    none: No compression
    gzip: Gzip format (such as “.warc.gz” files)
    zstandard: Zstandard format (such as “.warc.zst” files)
--position <POSITION>: Position where the record is located in the input WARC file
--id <ID>: The ID of the record to extract
--output <OUTPUT>: Path for the output messages
  Default value: -
--format <FORMAT>: Format for the output messages
  Default value: json-seq
  Possible values:
    json-seq: JSON sequences (RFC 7464). Each message is a JSON object delimited by a Record Separator (U+001E) and a Line Feed (U+000A)
    jsonl: JSON Lines. Each message is a JSON object terminated by a Line Feed (U+000A)
    cbor-seq: CBOR sequences (RFC 8742). Messages are a series of consecutive CBOR data items
--no-block: Do not output block messages
--extract: Output extract messages
warcat get extract#
Extract a resource
Usage: warcat get extract [OPTIONS] --position <POSITION>
Options:#
--input <INPUT>
  Default value: -
--compression <COMPRESSION>: Compression format of the input WARC file
  Default value: auto
  Possible values:
    auto: Automatically detect the format by the filename extension
    none: No compression
    gzip: Gzip format (such as “.warc.gz” files)
    zstandard: Zstandard format (such as “.warc.zst” files)
--position <POSITION>: Position where the record is located in the input WARC file
--id <ID>: The ID of the record to extract
--output <OUTPUT>: Path for the output file
  Default value: -
warcat extract#
Extracts resources for casual viewing of the WARC contents.
Files are extracted to a directory structure similar to the archived URL.
This operation does not automatically permit offline viewing of archived websites; no content conversion or link-rewriting is performed.
Usage: warcat extract [OPTIONS]
Options:#
--input <INPUT>: Path to the WARC file
  Default value: -
--compression <COMPRESSION>: Compression format of the input WARC file
  Default value: auto
  Possible values:
    auto: Automatically detect the format by the filename extension
    none: No compression
    gzip: Gzip format (such as “.warc.gz” files)
    zstandard: Zstandard format (such as “.warc.zst” files)
--output <OUTPUT>: Path to the output directory
  Default value: ./
--continue-on-error: Whether to ignore errors
--include <INCLUDE>: Select only records with a field. Rule format is “NAME” or “NAME:VALUE”.
--include-pattern <INCLUDE_PATTERN>: Select only records matching a regular expression. Rule format is “NAME:VALUEPATTERN”.
--exclude <EXCLUDE>: Do not select records with a field. Rule format is “NAME” or “NAME:VALUE”.
--exclude-pattern <EXCLUDE_PATTERN>: Do not select records matching a regular expression. Rule format is “NAME:VALUEPATTERN”.
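The include and exclude rules share one shape: a field name, optionally followed by a colon and a value or pattern. The sketch below formats rules in that shape and attaches them to an `extract` invocation. The helper `rule`, the file names, and the sample pattern are illustrative only. The regular expression dialect and whether patterns match the whole value or a substring are not stated above, so the pattern should be treated as a guess.

```python
from typing import List, Optional


def rule(name: str, value: Optional[str] = None) -> str:
    """Format a rule as "NAME" or "NAME:VALUE" (VALUE may be a pattern)."""
    if not name or ":" in name:
        raise ValueError(f"invalid field name: {name!r}")
    return name if value is None else f"{name}:{value}"


# "example.warc.gz" and "extracted/" are placeholder paths.
argv: List[str] = [
    "warcat", "extract",
    "--input", "example.warc.gz",
    "--output", "extracted/",
    # Keep only records whose WARC-Type field is "response".
    "--include", rule("WARC-Type", "response"),
    # Skip records whose target URI looks like an image (hypothetical pattern).
    "--exclude-pattern", rule("WARC-Target-URI", r"\.(png|jpg)$"),
]
print(argv)
```

Passing the arguments as a list, rather than as a single shell string, avoids having to quote characters such as `|` and `$` in the pattern.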
warcat verify#
Performs specification and integrity checks on WARC files
Usage: warcat verify [OPTIONS]
Options:#
--input <INPUT>: Path to the WARC file
  Default value: -
--compression <COMPRESSION>: Compression format of the input WARC file
  Default value: auto
  Possible values:
    auto: Automatically detect the format by the filename extension
    none: No compression
    gzip: Gzip format (such as “.warc.gz” files)
    zstandard: Zstandard format (such as “.warc.zst” files)
--output <OUTPUT>: Path to output problems
  Default value: -
--format <FORMAT>: Format of the output
  Default value: json-seq
  Possible values:
    json-seq: JSON sequences (RFC 7464). Each message is a JSON object delimited by a Record Separator (U+001E) and a Line Feed (U+000A)
    jsonl: JSON Lines. Each message is a JSON object terminated by a Line Feed (U+000A)
    cbor-seq: CBOR sequences (RFC 8742). Messages are a series of consecutive CBOR data items
    csv: Comma-separated values
--exclude-check <EXCLUDE_CHECK>: Do not perform check
  Possible values: mandatory-fields, known-record-type, content-type, concurrent-to, block-digest, payload-digest, ip-address, refers-to, refers-to-target-uri, refers-to-date, target-uri, truncated, warcinfo-id, filename, profile, segment, record-at-time-compression
--database <DATABASE>: Database filename for storing temporary intermediate data
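A wrapper script can catch misspelled check names before invoking warcat by validating them against the documented list. The sketch below does that and emits one `--exclude-check` flag per name. Repeating the flag for multiple checks is an assumption, since the text above does not say how several checks are passed, and `exclude_check_args` is a hypothetical helper.

```python
from typing import Iterable, List

# Check names as listed under --exclude-check above.
KNOWN_CHECKS = frozenset({
    "mandatory-fields", "known-record-type", "content-type", "concurrent-to",
    "block-digest", "payload-digest", "ip-address", "refers-to",
    "refers-to-target-uri", "refers-to-date", "target-uri", "truncated",
    "warcinfo-id", "filename", "profile", "segment",
    "record-at-time-compression",
})


def exclude_check_args(checks: Iterable[str]) -> List[str]:
    """Return --exclude-check arguments, rejecting undocumented check names."""
    args: List[str] = []
    for check in checks:
        if check not in KNOWN_CHECKS:
            raise ValueError(f"unknown check: {check!r}")
        # Assumption: the flag is repeated once per excluded check.
        args += ["--exclude-check", check]
    return args


# "example.warc.gz" is a placeholder file name.
argv = ["warcat", "verify", "--input", "example.warc.gz", "--format", "jsonl"]
argv += exclude_check_args(["payload-digest", "ip-address"])
print(argv)
```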
warcat self#
Self-installer and uninstaller
Usage: warcat self <COMMAND>
Subcommands:#
install: Launch the interactive self-installer
uninstall: Launch the interactive uninstaller
warcat self install#
Launch the interactive self-installer
Usage: warcat self install [OPTIONS]
Options:#
--quiet: Install automatically without user interaction
warcat self uninstall#
Launch the interactive uninstaller
Usage: warcat self uninstall [OPTIONS]
Options:#
--quiet: Uninstall automatically without user interaction