dirfile-encoding (5) - Linux Man Pages

NAME

dirfile-encoding --- dirfile database encoding schemes

DESCRIPTION

The Dirfile Standards indicate that RAW fields defined in the database are accompanied by binary files containing the field data in the specified simple data type. In certain situations, it may be advantageous to convert the binary files in the database into a more convenient form. This is accomplished by encoding the binary file into the alternate form. A common use case for encoding a binary file is compressing it to save disk space. An encoding scheme modifies only the data; database metadata is unaffected.

Support for encoding schemes is optional. An implementation need not support any particular encoding scheme, or may only support certain operations with it, but should expect to encounter unknown encoding schemes and fail gracefully in such situations.

Additionally, how a particular encoding is implemented is not specified by the Dirfile Standards, but, for purposes of interoperability, all dirfile implementations are encouraged to support the encoding implementation used by the GetData dirfile reference implementation, elaborated below.

An encoding scheme is local to the particular format file fragment in which it is indicated. This allows a single dirfile to have binary files which are stored using multiple encodings, by having them defined in multiple fragments.

The rest of this manual page discusses specifics of the encoding framework implemented in the GetData library, and does not constitute part of the Dirfile Standards.

THE GETDATA ENCODING FRAMEWORK

The GetData library provides an encoding framework which abstracts binary file I/O, allowing for generic support for a wide variety of encoding schemes. Functions which may make use of the encoding framework are:

dirfile_add(3), dirfile_add_raw(3), dirfile_add_spec(3), dirfile_alter_encoding(3), dirfile_alter_endianness(3), dirfile_alter_frameoffset(3), dirfile_alter_entry(3), dirfile_alter_raw(3), dirfile_alter_spec(3), dirfile_move(3), dirfile_rename(3), getdata(3), get_nframes(3), and putdata(3).

Most of the encodings supported by GetData are implemented through external libraries which handle the actual file I/O and data translation. All such libraries are optional; a build of the library which omits an external library will lack support for the associated encoding scheme. In this case, GetData will still properly identify the encoding scheme, but attempts to use GetData for file I/O via the encoding will fail with the GD_E_UNSUPPORTED error code.

GetData discovers the encoding scheme of a particular RAW field by noting the filename extension of files associated with the field. Binary files which form an unencoded dirfile have no file extension. The file extensions used by the other encodings are noted below. Encoding discovery proceeds by searching for files with the known list of file extensions (in an unspecified order) and stopping when the first successful match is made. Because of this, when a field has multiple data files with different, supported file extensions which could legitimately be associated with it, the encoding scheme discovered by GetData is not well defined.
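
The discovery procedure described above can be sketched in a few lines of Python. This is only an illustration and not GetData's implementation: the function name discover_encoding, the scheme labels, and the search order are assumptions made for the example, while the file extensions are the ones listed in this manual page.

```python
import os

# Extensions as listed in this manual page. The scheme labels and the
# order of the entries are assumptions made for this sketch; GetData's
# real search order is unspecified.
KNOWN_EXTENSIONS = {
    "": "none",
    ".txt": "text",
    ".bz2": "bzip2",
    ".gz": "gzip",
    ".xz": "lzma",
    ".lzma": "lzma",
    ".slm": "slim",
}


def discover_encoding(dirfile_path, field_name):
    """Return the scheme label of the first data file found, or None."""
    for extension, scheme in KNOWN_EXTENSIONS.items():
        candidate = os.path.join(dirfile_path, field_name + extension)
        if os.path.isfile(candidate):
            # Stop at the first match; later candidates are never examined.
            return scheme
    return None
```

Because the search stops at the first match, a directory holding both data.gz and data.bz2 would yield whichever extension happens to be tried first, which is why the result is not well defined in that situation.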

In addition to raw (unencoded) data, GetData supports five other encoding schemes: text encoding, bzip2 encoding, gzip encoding, lzma encoding, and slim encoding, all discussed below.

Text Encoding

The Text Encoding is unique among GetData encoding schemes in that it requires no external library. As a result, all builds of the library contain full support for this encoding. It is meant to serve as a reference encoding and example of the encoding framework for work on other encoding schemes.

The Text Encoding replaces the binary data files with 7-bit ASCII files containing a decimal text encoding of the data, one sample per line. All operations are supported by the Text Encoding. The file extension of the Text Encoding is .txt.
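
As a rough illustration of the on-disk layout described above, the following Python sketch writes and reads integer samples as decimal ASCII text, one sample per line. The helper names are hypothetical, and the sketch handles integer data only; it does not attempt to reproduce how GetData formats or parses other data types.

```python
def write_text_samples(path, samples):
    """Write integer samples as decimal ASCII text, one per line."""
    with open(path, "w", encoding="ascii") as stream:
        for sample in samples:
            stream.write(f"{sample}\n")


def read_text_samples(path):
    """Read back integer samples written one per line, skipping blank lines."""
    with open(path, "r", encoding="ascii") as stream:
        return [int(line) for line in stream if line.strip()]
```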

BZip2 Encoding

The BZip2 Encoding compresses raw binary files using the Burrows-Wheeler block-sorting text compression algorithm and Huffman coding, as implemented in the bzip2 format. GetData's BZip2 Encoding scheme is implemented through the bzip2 compression library written by Julian Seward. GetData's BZip2 Encoding framework currently lacks write capabilities; as a result, the BZip2 Encoding does not support functions which modify binary data.

GetData caches an uncompressed megabyte of data at a time to speed access times. A call to get_nframes(3) requires decompression of the entire binary file to determine its uncompressed size, and may take some time to complete. The file extension of the BZip2 Encoding is .bz2.
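
The cost of counting frames can be seen in a small Python sketch using the standard bz2 module. It is not GetData code: the function names are hypothetical, and the chunk size of 2^20 bytes is an assumption standing in for the "megabyte" mentioned above. The point is that the stream must be decompressed from start to finish before its uncompressed length is known.

```python
import bz2

# Assumed chunk size for this sketch: 2**20 bytes of uncompressed data.
CHUNK_BYTES = 1 << 20


def uncompressed_size(path):
    """Decompress the whole file chunk by chunk and count the bytes."""
    total = 0
    with bz2.open(path, "rb") as stream:
        while True:
            block = stream.read(CHUNK_BYTES)
            if not block:
                break
            total += len(block)
    return total


def frame_count(path, bytes_per_sample, samples_per_frame):
    """Number of complete frames held in the compressed file."""
    return uncompressed_size(path) // (bytes_per_sample * samples_per_frame)
```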

GZip Encoding

The GZip Encoding compresses raw binary files using Lempel-Ziv coding (LZ77) as implemented in the gzip format. GetData's GZip Encoding scheme is implemented through the zlib compression library written by Jean-loup Gailly and Mark Adler. GetData's GZip Encoding framework currently lacks write capabilities; as a result, the GZip Encoding does not support functions which modify binary data.

To speed the operation of get_nframes(3), the GZip Encoding takes the uncompressed size of the file from the gzip footer, which contains the file's uncompressed size in bytes, modulo 2^32. As a result, using a field with an (uncompressed) binary file size larger than 4 GiB as the reference field will result in the wrong number of frames being reported. The file extension of the GZip Encoding is .gz.
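
The footer lookup and its 4 GiB limitation can be illustrated with a short Python sketch. It assumes a single-member gzip file, whose last four bytes hold the uncompressed size as a little-endian 32-bit integer; the function names are hypothetical and this is not GetData's own code.

```python
import os
import struct


def gzip_isize(path):
    """Read the size field stored in the last four bytes of a gzip file."""
    with open(path, "rb") as stream:
        stream.seek(-4, os.SEEK_END)
        (isize,) = struct.unpack("<I", stream.read(4))
    return isize


def reported_size(true_size):
    """Size the footer would report for a file of true_size bytes."""
    return true_size % 2**32
```

For example, a binary file of 2^32 + 10 bytes would be reported as 10 bytes long, so a frame count derived from the footer would be far too small.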

LZMA Encoding

The LZMA Encoding compresses raw binary files using the Lempel-Ziv Markov Chain Algorithm (LZMA) as implemented in the xz container format. GetData's LZMA Encoding scheme is implemented through the lzma library, part of the XZ Utils suite written by Lasse Collin, Ville Koskinen, and Igor Pavlov. GetData's LZMA Encoding framework currently lacks write capabilities; as a result, the LZMA Encoding does not support functions which modify binary data.

As with the BZip2 Encoding, GetData caches an uncompressed megabyte of data at a time to speed access times. A call to get_nframes(3) requires decompression of the entire binary file to determine its uncompressed size, and may take some time to complete. The file extension of the LZMA Encoding is .xz or .lzma.
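
The two file extensions correspond to two container formats: the newer .xz container and the legacy .lzma container. The Python sketch below reads either one with the standard lzma module, which detects the container automatically when decompressing. The function name is hypothetical, and the sketch makes no claim about how GetData itself treats the two extensions.

```python
import lzma


def read_lzma_encoded(path):
    """Return the decompressed contents of an .xz or legacy .lzma file."""
    # When reading, lzma.open() defaults to automatic format detection,
    # so both container formats are accepted.
    with lzma.open(path, "rb") as stream:
        return stream.read()
```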

Slim Encoding

The Slim Encoding compresses raw binary files using the slimlib compression library written by Joseph Fowler. The slimlib library was developed at Princeton University to compress dirfile-like data. GetData's Slim Encoding framework currently lacks write capabilities; as a result, the Slim Encoding does not support functions which modify binary data. The file extension of the Slim Encoding is .slm.

AUTHOR

This manual page was written by D. V. Wiebe <dvw [at] ketiltrout.net>.

SEE ALSO

dirfile(5), dirfile-format(5), bzip2(1), gzip(1), zlib(3).