# Documentation for libzpaq # # Copyright (C) 2012, Dell Inc. Written by Matt Mahoney. # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal # in the Software without restriction, including without limitation the rights # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell # copies of the Software, and to permit persons to whom the Software is # furnished to do so without restriction. # This Software is provided "as is" without warranty. # # To create man page: pod2man libzpaq.3.pod > libzpaq.3 # To create HTML documentation: pod2html libzpaq.3.pod > libzpaq.html =pod =head1 NAME libzpaq - ZPAQ compression API =head1 SYNOPSIS #include "libzpaq.h" namespace libzpaq { extern void error(const char* msg); class Reader { public: virtual int get() = 0; virtual int read(char* buf, int n); // optional virtual ~Reader() {} }; class Writer { public: virtual void put(int c) = 0; virtual void write(const char* buf, int n); // optional virtual ~Writer() {} }; class SHA1 { public: SHA1(); void put(int c); double size() const; uint64_t usize() const const char* result(); }; class Compressor { public: Compressor(); void setOutput(Writer* out); void writeTag(); void startBlock(int level); void startBlock(const char* hcomp); void startSegment(const char* filename = 0, const char* comment = 0); void setInput(Reader* i); void postProcess(const char* pcomp = 0, int length = 0); bool compress(int n = -1); void endSegment(const char* sha1string = 0); void endBlock(); }; class Decompresser { public: Decompresser(); void setInput(Reader* in); bool findBlock(double* memptr = 0); void hcomp(Writer* out); bool findFilename(Writer* = 0); void readComment(Writer* = 0); void setOutput(Writer* out); void setSHA1(SHA1* sha1ptr); bool decompress(int n = -1); bool pcomp(Writer* out); void readSegmentEnd(char* sha1string = 0); }; void compress(Reader* in, Writer* out, int level); void decompress(Reader* in, Writer* out); } =head1 DESCRIPTION I is a C++ API for compressing or decompressing files or objects in memory comforming to the ZPAQ level 1 and 2 standards (see I). This document describes version 5.00 of the software. The software may be used without restriction under a modified MIT license. ZPAQ provides a high level of data compression in a streaming (single pass) self-describing format that supports single or multiple named objects (such as archives) with optional integrity checking. The library provides 3 default compression levels but supports custom algorithms. The performance of the default levels is shown in the table below for the 14 file Calgary corpus as a tar file. Compression and decompression times are in seconds on a 2 GHz T3200 running on one of two cores. Memory required to compress or decompress is in MB. Some popular formats are shown for comparison. Program Format Size Time (C, D) Memory ----------- ------ --------- ----------- ------ Uncompresed .tar 3,152,896 compress .tar.Z 1,319,521 1.6 0.2 .1 MB gzip -9 .tar.gz 1,022,810 0.7 0.1 .1 MB bzip2 -9 .tar.bz2 860,097 0.6 0.4 5 MB 7zip .tar.7z 824,573 1.5 0.1 195 MB zpaq 1 (fast) .tar.zpaq 806,959 2 2 38 MB zpaq 2 (mid) .tar.zpaq 699,191 8 8 112 MB zpaq 3 (max) .tar.zpaq 644,190 20 20 246 MB A ZPAQ stream consists of one or more blocks, possibly mixed with other data, that can be decompressed independently in any order. Each block consists of one or more segments that must be decompressed in order from the beginning of the block. Each block header contains a description of the decompression algorithm. Each segment consists of an optional filename string, an optional comment string, self delimiting compressed data, and an optional SHA-1 checksum. If ZPAQ blocks are mixed with other data, they must be preceded by an identifying 13 byte tag which does not otherwise appear in that data. ZPAQ compression is based on the PAQ context mixing model. An array of components predict the probability of the next bit of input, either independently or depending on the predictions of earlier components. The final prediction is arithmetic coded. Each component inputs a context computed from earlier input by a program written in ZPAQL byte code which runs on a virtual machine. Both the component array description and the ZPAQL code are encoded in a string called HCOMP in each block header. Data can also be stored uncompressed. A block may optionally specify a post-processor, a program (also in ZPAQL) which takes the decoded data as input and outputs the decompressed output. This program, if present, is encoded as a string called PCOMP which is compressed in the first segment prior to the compressed data. The first decoded byte from the first segment is a flag indicating whether a PCOMP string is present. The user is responsible for correctly pre-processing the data so that post-processing restores the original data. =head2 API Organization The I API consists of 2 files. =over =item libzpaq.h Header file to include in your application. =item libzpaq.cpp Source code file to link to your application. =back An application would have the line C<#include "libzpaq.h"> and link to libzpaq.cpp. The API provides two classes, C and C which write or read respectively each of the syntactic elements of a ZPAQ stream. The two functions C and C provide simple interfaces for the most common uses. In either case, the user must create classes derived from the abstract base classes C and C and define methods C and C which the code will use to read and write bytes. The user must also define a callback error handler. By default, libzpaq(3) uses just-in-time (JIT) acceleration by translating ZPAQL code to x86-32 or x86-64 internally and executing it. This feature can be disabled by compiling with -DNOJIT. If enabled, it requires an x86 processor capable of executing SSE2 instructions. SSE2 is supported by most Intel processors since 2001 and AMD since 2003. Run time checks (assertions) can be enabled with -DDEBUG for debugging purposes. All of the API code is contained in the namespace C. =head2 Callback Functions The following three functions must be defined by the user. =over =item C This function must be defined by the user to handle errors from libzpaq. The library will call the function with an English language message passed to C. Errors may result from bad input during decompression, out of memory, or illegal arguments or calling sequences to libzpaq functions. Errors should be considered unrecoverable. =item C The user must create a class derived from Reader with an implementation for C that reads one byte of input and returns its value in the range 0...255, or returns EOF (-1) at end of input. Objects of the derived type would then be passed to functions that require a C. =item C The user must create a class derived from Writer with an implemenation of C which is expected to take a byte value C in the range 0...255 and write it to output. Objects of the derived type would then be passed to functions that require a C. =back The following two functions are optional. Defining them can improve performance slightly. =over =item C If defined, this function should input up to C bytes into the array C and return the number actually read, in the range 0..n. A return value of 0 indicates end of input. If C is not defined, then the default implementation will call C n times. =item C If defined, this function should output the elements C through C in order. If not defined, then the default implementation will call C n times. =back =head2 Simple Compression In the remainder of this document, all classes and functions are assumed to be in namespace C. =over =item C C compresses from C to C until C returns EOF. It writes a single segment in a single block with empty filename, comment, and checksum fields. C must be 1, 2, or 3, to select models I, I, or I respectively. Higher modes compress smaller but take longer to compress and subsequently decompress. =item C C decompresses any valid ZPAQ stream from C to C until C returns EOF. Any non-ZPAQ data in the input is ignored. Any ZPAQ blocks following non-ZPAQ must be preceded by a marker tag to be recognized. Each block is decoded according to the instructions in the block header. The contents of the filename, comment, and checksum fields are ignored. Data with bad checksums will be decoded anyway. If there is more than one segment, then all of the output data will be concatenated. =back =head2 class SHA1 The SHA1 class is used to compute SHA-1 checksums for compression and verify them for decompression. It is believed to be computationally infeasible to find two different strings with the same hash value. Its member functions are as follows: =over =item C The constructor creates a new SHA1 object representing the hash of an empty string. =item C Appends one byte c (0...255) to the string whose hash is represented. =item C Returns the length (so far) of the string whose hash is represented. The largest possible value returned is 2^61 - 1 = 2305843009213693951.0, but values larger than 2^53 = 9007199254740992.0 will not be exact on systems using IEEE 64 bit floating point representation of type C. The initial value is 0.0. =item C Returns the length (so far) as a 64 bit unsigned integer. =item C Computes the 20 byte SHA-1 hash and resets the string back to a size of 0.0. The returned pointer points to an array inside the SHA1 object whose contents remain unchanged until the next call to C. =back =head2 class Compressor The C class has member functions to write each of the syntactic elements of a ZPAQ stream and to specify their values. It will compress using either built-in or user supplied models. =over =item C The constructor creates a Compression object. No input source, output destination, or compression model is specified. =item C Specifies a destination for output. Must be specified before calling any function that writes data. =item C Writes a 13 byte marker tag which can be used to identify the start of a block following non-ZPAQ data. =item C Writes a block header and specifies a compression model. If linked with F, then C must be 1, 2, or 3 to specify I, I, or I respectively. Higher numbers compress smaller but more slowly. These models are compatible with both the ZPAQ level 1 and 2 standards. =item C Writes a block header and specifies the HCOMP portion of the compression model. The first two bytes of the string should encode the length of the rest of the string as a 16 bit unsigned number with the least significant bit first. The meaning of the rest of the string is defined in the ZPAQ level 2 standard. If the number of components (C) is 0, then the block is saved in ZPAQ level 2 format, which cannot be read by older ZPAQ level 1 decoders. Otherwise the block is saved in ZPAQ level 1 format, which is compatible with all decoders. =item C Writes a segment header. C and C are NUL terminated strings. If specified, then their values are stored. Normally, C would be a file name when compressing to an archive or omitted otherwise. If a file is split among segments, then by convention only the first segment is named. C is normally the uncompressed size as a decimal number which is displayed when listing the contents of an archive. Omitting it does not affect decompression. =item C Specifies the optional PCOMP string used for post-processing. It must be called from within the first segment of each block prior to compressing any data, but not from within any other segment. If C is 0 or no argument is passed, then the decompresser will not post-process the data. The effect is to compress a 0 byte to indicate to the decompresser that no PCOMP string is present. If C is not 0, then I bytes of the string I are passed. If I is 0 or omitted, then the first two bytes must encode the length of the rest of the string as a 16 bit unsigned number with the least significant byte first. The format of the remainder of the string is described in the ZPAQ level 2 standard. The effect is to compress a 1 byte to indicate the presence of PCOMP, followed by the two length bytes and the string as passed. For example, either C or C would compress the 5 bytes 1, 2, 0, 5, 8. The user is responsible for pre-processing the input prior to compression so that PCOMP restores the original data. =item C Specifies the input source for compression. It must be set prior to the first call to C. =item C Compress n bytes of data, or until EOF is input, whichever comes first. If n < 0 or omitted, then compress until EOF. Returns true if there is more input available, or false if EOF was read. =item C Stop compressing and write the end of a segment. If C is specified, it should be a 20 byte string as returned by C on the input data for this segment I pre-processing. =item C Finish writing the current block. =back In order to create a valid ZPAQ stream, the components must be written in the following order: for each block do { if any non-ZPAQ data then { write non-ZPAQ data writeTag() } startBlock() for each segment do { startSegment() if first segment in block then { postProcess() } while (compress(n)) ; endSegment() } endBlock() } =head2 class Decompresser The class Decompresser has member functions to read each of the syntactic elements of a ZPAQ stream. =over =item C The constructor creates a Decompresser object. No input source or output destination is specified. =item C Specifies where the ZPAQ stream will be read from. Must be called before any function that reads the stream. =item C Scan the input to find the start of the next block. If a block does not start immediately, then the block must be preceded by a marker tag (written with C) or it will not be found. If C is not 0, then write the approximate memory requirement (in bytes) to decompress to C<*memptr>). The memory will be allocated by the first call to C. It returns true if a block is found, or false if it reads to EOF without finding a block. =item C Write the HCOMP string of the current block to C. It will be in a format suitable for passing to C. The first 2 bytes will encode the length of the rest of the string as a 16 bit unsigned integer with the least significant byte first. The format of the remainder of the string is described in the ZPAQ level 1 specification. =item C Find the start of the next segment. If another segment is found within the current block then return true. If the end of the block is found first, then return false. If a segment is found, the filename field is not empty, and C is not 0, then write the filename (without a terminating NUL byte) to C. =item C Read or skip past the comment field following the filename field in the segment header. If C is not 0 and the comment field is not empty, then write the comment (without a terminating NUL byte) to C. =item C Specify the destination for decompression. It must be set before any data can be decompressed. =item C Specify the address of a SHA1 object for computing the checksum of the decompressed data (after post-processing). As each byte C is output, it is also passed to Cput(c)>. In order to compute the correct checksum, the SHA1 object should be in its initial state, either newly created, or by calling C, before the first call to C. When the end of the segment is reached, the value returned by Cresult()> should match the stored checksum, if any. =item C Decode n bytes or until the end of segment, whichever comes first. Return false if the end of segment is reached first. If n < 0 or not specified, then decompress to the end of segment and return false. C is the number of bytes prior to post-processing. If the data is post-processed, then the size of the output may be different. =item C Write the PCOMP string, if any, for the current block to C. If there is no PCOMP string (no post-processor) then return false. Otherwise write the string to C in a format suitable for passing to C and return true. If written, then the first 2 bytes will encode the length of the rest of the string as a 16 bit unsigned integer with the least significant bit first. The format of the rest of the string is descibed in the ZPAQ level 1 standard. C is only valid after the first call to C in the current block. To read the PCOMP string without decompressing any data, then call C first. It is not necessary to call C in this case. =item C Skip any compressed data in the current segment that has not yet been decompressed and advance to the end of the segment. Then if C is not 0 then write into the 21 byte array that it points to. If a checksum is present, then write a 1 into C and write the stored checksum in C. Otherwise write a 0 in C. Note that it is not permitted to call decompress() if any compressed data has been skipped in any earlier segments in the same block. =back A valid sequence of calls is as follows: while (findBlock()) { while (findFilename()) { readComment(); if first segment in block then { (optional) decompress(0) pcomp() } while (decompress(n)) ; (optional) readSegmentEnd(); } } =head1 EXAMPLES The following program F lists the contents of a ZPAQ archive read from standard input. #include #include #include "libzpaq.h" // Implement Reader and Writer interfaces for file I/O class File: public libzpaq::Reader, public libzpaq::Writer { FILE* f; public: File(FILE* f_): f(f_) {} int get() {return getc(f);} void put(int c) {putc(c, f);} int read(char* buf, int n) {return fread(buf, 1, n, f);} void write(const char* buf, int n) {fwrite(buf, 1, n, f);} }; // Implement error handler namespace libzpaq { void error(const char* msg) { fprintf(stderr, "Error: %s\n", msg); exit(1); } } // List the contents of an archive. For each block, show // the memory required to decompress. For each segment, // show the filename and comment. void list(FILE* input, FILE* output) { libzpaq::Decompresser d; File in(input), out(output); double memory; d.setInput(&in); for (int block=1; d.findBlock(&memory); ++block) { printf("Block %d needs %1.0f MB\n", block, memory/1e6); while (d.findFilename(&out)) { // print filename printf("\t"); d.readComment(&out); // print comment printf("\n"); d.readSegmentEnd(); // skip compressed data } } } int main() { list(stdin, stdout); return 0; } The program could be compiled as follows: g++ listzpaq.cpp libzpaq.cpp The following code compresses a list of files into one block written to stdout. Each file is compressed to a separate segment. For each segment, the filename, comment, and SHA-1 checksum are stored. The comment, as conventional, is the file size as a decimal string. // Compress one file to one segment void compress_file(libzpaq::Compressor& c, const char* filename, bool first_segment) { // Open input file FILE* f; f=fopen(filename, "rb"); if (!f) return; // Compute SHA-1 checksum and file size libzpaq::SHA1 sha1; int ch; while ((ch=getc(f))!=EOF) sha1.put(ch); // Write file size as a comment. // The size can have at most 19 digits. char comment[20]; sprintf(comment, "%1.0f", sha1.size()); // Compress segment rewind(f); File in(f); c.startSegment(filename, comment); if (first_segment) c.postProcess(); c.setInput(&in); c.compress(); c.endSegment(sha1.result()); // Close input file fclose(f); } // Compress a list of argc files in argv[0...argc-1] into one // ZPAQ block to stdout at level 2. void compress_list(int argc, char** argv) { libzpaq::Compressor c; File out(stdout); c.setOutput(&out); c.startBlock(2); for (int i=0; i and C can be passed an argument n to display progress every n bytes, for example: for (int i=1; d.decompress(1000000); ++i) fprintf(stderr, "Decompressed %d MB\n", i); To compress or decompress to and from objects in memory, derive appropriate classes from C and C. For example, it is possible to compress or decompress to a C using the following class. struct String: public libzpaq::Writer { std::string s; void put(int c) {s+=char(c);} }; This class is also useful for reading the filename and comment fields during decompression as follows: String filename, comment; while (d.findFilename(&filename)) { d.readComment(&comment); // ... =head1 AVAILABILITY I, I, and the ZPAQ level 1 and 2 specifications are available from L. =head1 SEE ALSO C C =cut