PZIP 3 "1998-08-11"
NAME \" @(#)pzip.3 (gsf@research.att.com) 1998-08-11
pzip - fixed record partition compress/decompress library
SYNOPSIS
#include <pzip.h>
.Ss "DATA TYPES"
Pz_t;
Pzpart_t;
Pzdisc_t;

int (*Pzerror_f)(Pz_t*, Pzdisc_t*, int, const char*, ...);
int (*Pzcheck_f)(Pz_t*, unsigned char*, Pzdisc_t*);
.Ss "CONSTANTS"
PZ_VERSION

PZ_WINDOW

PZ_MAGIC_1
PZ_MAGIC_2

PZ_TEST
.Ss "FLAGS"
PZ_READ
PZ_WRITE
PZ_FORCE
PZ_STAT
PZ_STREAM
PZ_CRC
PZ_DUMP
PZ_VERBOSE
PZ_TEST_1
PZ_TEST_2
PZ_TEST_3
PZ_TEST_4

SFPZ_VERIFY
SFPZ_CRC
.Ss "OPENING/CLOSING STREAMS"
Pz_t* pzopen(Pzdisc_t* disc, const char* path, int flags);
int pzclose(Pz_t* pz);
.Ss "INPUT/OUTPUT OPERATIONS"
int pzdeflate(Pz_t* pz, Sfio_t* out);
int pzinflate(Pz_t* pz, Sfio_t* out);

int pzgethdr(Pz_t* pz);
int pzputhdr(Pz_t* pz, Sfio_t* out);

.Ss "STREAM INFORMATION"
Pzpart_t* pzgetpart(Pz_t* pz, const char* name);
Pzpart_t* pzsetpart(Pz_t* pz, Pzpart_t* part);

void pzdump(Pz_t* pz, Sfio_t* out);
.Ss "SFIO DECOMPRESS DISCIPLINE"
int sfdcpzip(Sfio_t* stream, const char* options, int flags);
DESCRIPTION

pzip is a library of functions that support compression/decompression of data files of fixed length rows (records) and columns (fields). Each pzip stream represents a file of compressed or decompressed data that may be decompressed or compressed to an sfio (3) output file stream.

The pzip format outperforms gzip in both space and time on data in which many (> 50%) of the columns change at a low rate. Columns with a low rate of change are called low frequency; columns with a high rate of change are called high frequency.
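The low/high frequency distinction above can be made concrete by counting row-to-row changes per column. The following is a minimal sketch, not code from the pzip library: the function names and the percentage threshold are assumptions chosen for illustration.

```c
#include <stddef.h>

/*
 * Count how often the byte in column `col` differs from the same column
 * in the previous row. `data` holds `nrows` fixed-length rows of `row`
 * bytes each. (Hypothetical helper, not part of pzip.)
 */
static size_t column_changes(const unsigned char *data, size_t nrows,
                             size_t row, size_t col)
{
    size_t changes = 0;

    for (size_t r = 1; r < nrows; r++)
        if (data[r * row + col] != data[(r - 1) * row + col])
            changes++;
    return changes;
}

/*
 * Return 1 if the column changes in fewer than `percent` percent of the
 * row-to-row transitions, 0 otherwise. The threshold is an illustrative
 * assumption, not a pzip parameter.
 */
static int is_low_frequency(const unsigned char *data, size_t nrows,
                            size_t row, size_t col, unsigned percent)
{
    if (nrows < 2)
        return 1;
    return column_changes(data, nrows, row, col) * 100 < (nrows - 1) * percent;
}
```

A column that never changes across the sample is classified as low frequency at any positive threshold; a column that changes on every row is classified as high frequency at any threshold up to 100.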

The pzip compress format is itself gzipped using the sfdcgzip () sfio discipline. Decompressed data is reorganized according to the user-specified partition-file (see .L Pzdisc_t.partition below) before being passed to gzip. Low frequency columns are run-length encoded and high frequency column groups are transposed to column-major order. The gzip tables are flushed, via the sfsync () discipline function, between each column partition group. This has a positive space/time effect on the gzip string match and Huffman tables.
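The row-major to column-major transposition of high frequency columns described above can be sketched as follows. This is an illustration of the general technique only; the function name and buffer layout are assumptions and do not reproduce the library's internal code.

```c
#include <stddef.h>

/*
 * Copy the columns listed in map[0..nmap-1] from `nrows` row-major rows
 * of `row` bytes each into `out` in column-major order: all values of
 * column map[0], then all values of column map[1], and so on.
 * `out` must have room for nmap * nrows bytes.
 * (Hypothetical helper, not part of pzip.)
 */
static void transpose_columns(const unsigned char *in, size_t nrows,
                              size_t row, const size_t *map, size_t nmap,
                              unsigned char *out)
{
    for (size_t m = 0; m < nmap; m++)
        for (size_t r = 0; r < nrows; r++)
            *out++ = in[r * row + map[m]];
}
```

After the transposition, successive values of the same column are adjacent in the output buffer, which is the layout handed to the compressor in this sketch.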

pzip format files self-identify by encoding partition information in a header.

.Ss "DATA TYPES"

.Ss " Pz_t"

This is the stream handle returned by pzopen (). None of the fields should be changed by the application. The fields are:
.Tp .L "const char* id" The library identification string used by the libast errorf function to identify the source of error and warning messages.
.Tp .L "Pzdisc_t* disc" A pointer to the user discipline structure from pzopen ().
.Tp .L "unsigned int flags" The inclusive-or of the flags described below.
.Tp .L "int major" The pzip file format major number. For .L PZ_READ it is the value when the file was compressed, otherwise it is the major number of the current implementation. In general the file header and implementation major numbers must match.
.Tp .L "int minor" The pzip file format minor number. For .L PZ_READ it is the value when the file was compressed, otherwise it is the minor number of the current implementation. In general, all implementations and data files with the same file header major number are compatible. The minor number allows the implementation to make runtime adjustments.
.Tp .L "size_t win" The high frequency window size. This is the adjusted value that is divisible by both the row size and the number of high frequency columns.
.Tp .L "const char* path" The input file pathname.
.Tp .L "Sfio_t* io" The input file sfio stream.
.Tp .L "Vmalloc_t* vm" The vmalloc (3) region handle for the entire stream. All allocations associated with the stream, except for sfio initiated allocations, are done in this region. The user may also allocate and free individual chunks of memory from this region, but must not call vmclear (). The region is freed by pzclose ().
.Tp .L "size_t npart" The number of partitions.
.Tp .L "Pzpart_t* part" The current active partition.
.Tp .L "Pzpart_t* parts" A table of all partitions.
.Tp .L "unsigned char* buf" The high frequency window buffer with .L win elements.
.Tp .L "unsigned char* wrk" The PZ_WRITE high frequency window buffer with .L win elements.
.Tp .L "unsigned char* pat" The low frequency pattern buffer with .L row elements.
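The adjustment of the window size mentioned for the win field can be sketched as rounding the requested size down to a common multiple of the row size and the number of high frequency columns. This is one plausible reading of "divisible by both"; the helper names are hypothetical and the library may compute the value differently.

```c
#include <stddef.h>

/* Greatest common divisor by Euclid's algorithm. */
static size_t gcd_size(size_t a, size_t b)
{
    while (b) {
        size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/*
 * Return the largest value not exceeding `window` that is divisible by
 * both `row` and `nhigh`. Returns 0 if either divisor is 0 or if the
 * window is smaller than their least common multiple.
 * (Assumed rounding rule for illustration, not taken from the pzip source.)
 */
static size_t adjust_window(size_t window, size_t row, size_t nhigh)
{
    size_t lcm;

    if (row == 0 || nhigh == 0)
        return 0;
    lcm = row / gcd_size(row, nhigh) * nhigh;
    return window - window % lcm;
}
```

For example, with the default 4194304 byte window, a row size of 100 and 30 high frequency columns, the least common multiple is 300 and the adjusted window is 4194300.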

.Ss " Pzpart_t"

.L Pzpart_t defines one partition. The fields are:
.Tp .L "char* name" The partition name.
.Tp .L "int index" The partition index. May be used to access a partition given its index: .L "Pz_t.parts[Pzpart_t.index-1]."
.Tp .L "size_t row" The partition fixed row size.
.Tp .L "size_t col" The number of rows that can fit into the high frequency window column buffer.
.Tp .L "size_t* map" An array with .L nmap elements that lists the high frequency column indexes in order.
.Tp .L "size_t* grp" An array with .L ngrp elements that lists the sizes of each high frequency column partition group in the same order as .LR map .
.Tp .L "size_t nmap" The number of elements in .LR map .
.Tp .L "size_t ngrp" The number of elements in .LR grp .
.Tp .L "unsigned char* low" An array with .L row elements. .LI low[ i ] is .L 1 if column i is low frequency, otherwise it is .L 0 (and column i is high frequency).
.Tp .L "int* value" If there are no fixed-value columns then .L value is .LR NULL . Otherwise it is an array with .L row elements. .LI value[ i ] is non-negative if column i has a fixed value (and .LI value[ i ] is the fixed column value).
.Tp .L "size_t* fix" An array with .L nfix elements that lists fixed value columns.
.Tp .L "size_t nfix" The number of elements in .LR fix .
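The relationship between map and grp can be illustrated by a lookup that walks both arrays together. The sketch below takes plain arrays rather than a Pzpart_t so that it stands alone; the function name is hypothetical, and it assumes that the grp sizes partition map into consecutive runs, as the field descriptions above suggest.

```c
#include <stddef.h>

/*
 * Return the index of the high frequency partition group that contains
 * column `col`, or -1 if `col` is not listed in `map`. grp[g] is taken
 * to be the number of consecutive map entries that belong to group g.
 * (Hypothetical helper, not part of pzip.)
 */
static int group_of_column(const size_t *map, size_t nmap,
                           const size_t *grp, size_t ngrp, size_t col)
{
    size_t m = 0;

    for (size_t g = 0; g < ngrp; g++)
        for (size_t k = 0; k < grp[g] && m < nmap; k++, m++)
            if (map[m] == col)
                return (int)g;
    return -1;
}
```

With map = {3, 4, 5, 9, 10} and grp = {3, 2}, columns 3 through 5 fall in group 0, columns 9 and 10 fall in group 1, and any other column is not high frequency.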

.Ss " Pzdisc_t"

.L Pzdisc_t defines a stream discipline structure to the pzopen () function. The discipline fields are:
.Tp .L "unsigned long version" Must be initialized to .LR PZ_VERSION .
.Tp .L "const char* comment" An optional string that is placed in the pzip output file header; this string can be viewed by the pzip (1) command. Ignored for .L PZ_READ and .L PZ_STAT streams.
.Tp .L "const char* options" An optional string of run-time options of the form name=value. Currently only fixed value columns may be specified. The syntax is begin[-end]='value' where begin is the beginning column offset (starting at 0), end is the ending column offset for an inclusive range, and value is the fixed column value. Decompression time is improved when high frequency columns are given fixed values.
.Tp .L "const char* partition" The name of the partition-file that contains a sequence of lines that specify the data row size and the high frequency column partition groups. This entry must be specified for .L PZ_WRITE and is ignored for .L PZ_READ streams. Comments start with # and continue to the end of the line. The first non-comment line specifies the row size. The remaining lines operate on column offset ranges of the form begin[-end], where begin is the beginning column offset (starting at 0) and end is the ending column offset for an inclusive range. The operations are:

.Tp range [ ... ] Places all columns in the specified range list in the same high frequency partition group. Each high frequency partition group is processed as a separate block by gzip.
.Tp range='value' Specifies that each column in range has the fixed character value value. C-style character escapes are valid for value.

.Tp .L "const char* lib" The library name used by pathfind (3) to locate partition files. The default is .LR '"pzip"' , and the default partition file suffix is .LR .prt .
.Tp .L "size_t window" Low frequency columns are processed one row at a time; high frequency columns are processed across many rows at a time. The space/time tradeoff is controlled by the number of high frequency columns that can be processed in one step. .L window sets this limit. The high frequency columns are transposed from row-major order to column-major order, which may cause inefficient paging behavior on some systems when the window size is too large. The default of 4M (4194304) provides reasonable behavior across most paging implementations. Note that compression requires two .L window buffers whereas decompression requires one. .L window is shortened to be divisible by both the row size and the number of high frequency columns.
.Tp .L "int (*errorf)(Pz_t* pz, Pzdisc_t* disc, int lev, const char* fmt, ...)" An optional function that is called to emit error and warning messages. It is most often set to the libast errorf ():
.EX disc.errorf = (Pzerror_f)errorf;
.Tp .L "int (*eventf)(Pz_t* pz, int event, void* value, Pzdisc_t* disc)" An optional function that is called when events occur during stream processing. .L event is set to the event and .L value is an event specific value. The events are:

.Tp .L PZ_OPEN Called just before pzopen () returns successfully. .L value is .LR 0 . A .L -1 return value causes pzopen () to fail.
.Tp .L PZ_CLOSE Called before pzclose () releases any resources. .L value is .LR 0 . The return value is used as the pzclose () return value.
.Tp .L PZ_CHECK Called as each row in a .L PZ_WRITE stream is processed. .L value is a pointer to the row data before compression, and .L eventf may modify the contents up to the row size. The return value determines the disposition of the row: .L -1 terminates all processing; .L 0 ignores the row; otherwise the row is processed as usual.
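The begin[-end] column range syntax shared by the options string and the partition file can be parsed along the following lines. This is a minimal sketch under stated assumptions: the function name is hypothetical, and rejecting a range whose end precedes its beginning is a choice made here, not documented pzip behavior.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Parse "begin" or "begin-end" at `s` (offsets start at 0, `end` is
 * inclusive). On success store the bounds, point *next at the first
 * unparsed character and return 0. Return -1 if no digits are found
 * or if end < begin (the latter check is an assumption of this sketch).
 */
static int parse_range(const char *s, size_t *begin, size_t *end,
                       const char **next)
{
    char *e;
    unsigned long b;
    unsigned long n;

    if (*s < '0' || *s > '9')
        return -1;
    b = strtoul(s, &e, 10);
    n = b;
    if (*e == '-') {
        const char *t = e + 1;

        if (*t < '0' || *t > '9')
            return -1;
        n = strtoul(t, &e, 10);
        if (n < b)
            return -1;
    }
    *begin = (size_t)b;
    *end = (size_t)n;
    *next = e;
    return 0;
}
```

Given the text 12-15='x', the sketch yields begin 12 and end 15 and leaves *next at the = character, so the caller can go on to read the fixed value; a single offset such as 7 yields a range whose begin and end are both 7.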

.Ss "CONSTANTS"

.Tp .L PZ_VERSION This is a macro value of type .L "long int" that defines the current version number of the pzip library interface. The form is an eight digit date YYYYMMDD.
.Tp .L PZ_WINDOW The default window size, 4M (4194304).
.Tp .L PZ_MAGIC_1 The first character of the two character pzip header magic number.
.Tp .L PZ_MAGIC_2 The second character of the two character pzip header magic number.
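Because the version number is a date in YYYYMMDD form, it can be split into its parts with integer arithmetic, and later versions compare numerically greater than earlier ones. The helper below is a hypothetical illustration; the sample value 19980811 is simply the date of this manual page, and the actual value of PZ_VERSION is not assumed.

```c
/*
 * Split a YYYYMMDD version number into year, month and day.
 * (Hypothetical helper, not part of pzip.)
 */
static void version_date(unsigned long version, int *year, int *month,
                         int *day)
{
    *year = (int)(version / 10000);
    *month = (int)(version / 100 % 100);
    *day = (int)(version % 100);
}
```

For the sample value 19980811 this gives year 1998, month 8 and day 11.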

.Ss "BIT FLAGS"

A number of bit flags control stream operations. They are set by the .L flags argument to pzopen (). The flags are:
.Tp .L PZ_READ The input file is opened for decompression to the output file.
.Tp .L PZ_WRITE The input file is opened for compression to the output file.
.Tp .L PZ_FORCE For .LR PZ_READ , if the input file is not in pzip format, then pzinflate () operates in transparent mode. If the input file is in gzip format then gzip inflate is applied. Otherwise .L PZ_READ input files must be in pzip format.
.Tp .L PZ_STAT The input file must be in pzip format; the handle may be used to retrieve header information, but pzinflate () is disabled.
.Tp .L PZ_STREAM The .L path argument to pzopen () is treated as an sfio .L SF_READ stream. This is a hack used by sfdcpzip ().
.Tp .L PZ_CRC Enables decompress crc checking. CRC checking is a performance wart in the otherwise respectable libz (3) gzip library implementation. Decompression crc checking can increase decompression user time by as much as a factor of 2. pzip uses a version of libz that disables decompression crc checking and replaces it with a few sanity checks. The pzip format also has its own checks.
.Tp .L PZ_DUMP Calls pzdump () just before a successful return from pzopen ().
.Tp .L PZ_VERBOSE Enables a verbose trace of internal actions.
.Tp .L PZ_TEST_1 Enables the implementation defined test #1. .LR PZ_TEST_2 , .LR PZ_TEST_3 , and .L PZ_TEST_4 are also provided.
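Flags are combined with inclusive-or and tested with bitwise-and. The sketch below uses stand-in EX_* values, which are hypothetical and chosen only for illustration (the real PZ_* values come from <pzip.h>), and mirrors the order in which the pzopen () description considers PZ_READ and PZ_WRITE.

```c
/*
 * Stand-in bit values for illustration only. These are NOT the real
 * PZ_* values, which are defined by <pzip.h>.
 */
enum {
    EX_READ  = 1 << 0,
    EX_WRITE = 1 << 1,
    EX_FORCE = 1 << 2,
    EX_CRC   = 1 << 3
};

/*
 * Return 'r' if the read bit is set, otherwise 'w' if the write bit is
 * set, otherwise 0. Read is tested first, then write.
 * (Hypothetical helper, not part of pzip.)
 */
static int stream_mode(int flags)
{
    if (flags & EX_READ)
        return 'r';
    if (flags & EX_WRITE)
        return 'w';
    return 0;
}
```

A caller would build a flag word such as EX_READ|EX_FORCE|EX_CRC; modifier bits such as the force and crc stand-ins do not affect the mode reported by this helper.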

.Ss "OPENING/CLOSING STREAMS"

Pz_t* pzopen(Pzdisc_t* disc, const char* path, int flags);

This function opens a stream on .LR path . It returns a new stream handle on success and .L NULL on error. .L disc and .L flags are described above. If .L flags contains .L PZ_READ then pzinflate () may be called to decompress .LR path ; otherwise, if .L flags contains .L PZ_WRITE , pzdeflate () may be called to compress .LR path .

int pzclose(Pz_t* pz);
This function closes the stream handle .L pz returned by a previous call to pzopen (). It returns .L 0 on success and .L -1 on error. All resources allocated on behalf of the stream are released.

.Ss "INPUT/OUTPUT OPERATIONS"

int pzdeflate(Pz_t* pz, Sfio_t* out);
This function compresses the entire .L PZ_WRITE pzip stream .L pz to the output sfio stream .LR out . It returns .L 0 on success and .L -1 on error.

int pzinflate(Pz_t* pz, Sfio_t* out);
This function decompresses the entire .L PZ_READ pzip stream .L pz to the output sfio stream .LR out . It returns .L 0 on success and .L -1 on error.

int pzgethdr(Pz_t* pz);
This function reads the header from the .L PZ_READ pzip stream .L pz and fills in the appropriate fields in .LR pz . It returns .L 0 on success and .L -1 on error.

int pzputhdr(Pz_t* pz, Sfio_t* out);
This function writes the header from the .L PZ_WRITE pzip stream .L pz to the sfio output stream .LR out . It returns .L 0 on success and .L -1 on error.

.Ss "STREAM INFORMATION"

Pzpart_t* pzgetpart(Pz_t* pz, const char* name);
This function returns a partition given its name. 0 is returned if the partition is not found. The default partition name is the empty string ("").

Pzpart_t* pzsetpart(Pz_t* pz, Pzpart_t* part);
This function sets the current active partition to part. The previous active partition is returned. The current active partition is initialized to the default partition ("").

void pzdump(Pz_t* pz, Sfio_t* out);
This function writes the header information from the pzip stream .L pz to the sfio output stream .L out in partition file format. A file containing this information is suitable for the .L Pzdisc_t.partition field. It returns .L 0 on success and .L -1 on error.

.Ss "SFIO DECOMPRESS DISCIPLINE"

int sfdcpzip(Sfio_t* sp, const char* options, int flags);
This function pushes a pzip decompress sfio discipline on the sfio stream .LR sp by calling pzopen () with .L PZ_READ|PZ_STREAM on .L sp and setting .L Pzdisc_t.options to .LR options . Because of the extra information involved, pzip compression is not supported by the discipline. This is better handled by the pzip (1) command interface. This means that .L sp must be an .L SF_READ sfio stream. sfdcpzip () stacks the gzip decompress discipline sfdcgzip () on .L sp if necessary. 0 is returned on success.
AUTHOR
Glenn Fowler, gsf@research.att.com.