An Algorithm for Online Data Compression


October 1998

Sergey Ignatchenko

Compressing data on the fly involves different tradeoffs than when you have all the time and space in the world.


Lossless data compression algorithms have a wide range of applications. The most common, file compression, is nearly unrestricted in terms of memory usage and operation time. The most essential algorithm metric for file compression is the compression ratio. But in the case of online data compression, the situation is quite different. For the purpose of this article, I define online data compression as the compression of stream-based material where the compressor/decompressor pair must be able to process data at a rate equivalent to the input rate. Moreover, in some cases there must be a constant upper bound on the delay between compressor input and decompressor output.

Online data compression imposes a number of restrictions on compression algorithms. First, these algorithms must meet strict requirements for low memory usage and high processing speed. Second, data flow in interactive systems usually consists of small data packets (several hundred bytes), and each of these packets should be delivered as quickly as possible. For the compression algorithm described in this article, this second requirement means that the packets must be self-contained. That is, if packet sequence {A, B} is compressed into packet sequence {A', B'}, then the information contained in packet A' should be sufficient to decompress packet A. Thus, at the end of compression of packet A, all information needed for decompression of packet A must be placed in packet A'. This procedure is usually referred to as a flush. Frequent flushes reduce the compression ratio for all compression algorithms, but some algorithms are more vulnerable to them than others. For online compression, flush frequency depends on the size of the data packet.
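
To make the flush requirement concrete, the sketch below shows per-packet compression from the caller's side. The PacketCompressor interface is a hypothetical stand-in for illustration only (the actual C interface is defined in lzhl.h, shown later in Listing 1); the point is simply that one call yields one self-contained compressed packet.

#include <cstddef>

// Hypothetical stream-compressor interface, for illustration only;
// not the actual LZH-Light API from Listing 1.
struct PacketCompressor {
    // Compresses one packet and flushes, so that dst[0..return value)
    // is self-contained and decompressible on its own.
    size_t compressAndFlush( void* dst, const void* src, size_t srcSz );
};

// One raw packet in, one self-contained compressed packet out.
size_t makePacket( PacketCompressor& comp, void* dst,
                   const void* src, size_t srcSz )
{
    return comp.compressAndFlush( dst, src, srcSz );
}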

In this article I describe the stream-based algorithm LZH-Light. LZH-Light is similar to the well-known Deflate [1], but has significant advantages over Deflate in terms of online data compression. I will outline those advantages shortly, then explain how the LZH-Light algorithm works.

Why LZH-Light?

At present, a large number of different lossless compression algorithms exist. Deflate is one of the most popular, and zlib [2] (http://www.cdrom.com/pub/infozip/zlib/) is one of its most widely used implementations. The Deflate algorithm is a combination of the LZ77 algorithm and Huffman-like encoding, and it is widely used for offline compression. However, its usefulness for online compression is limited by the following drawbacks. (The figures given in parentheses are for zlib.)

  • Deflate suffers a considerable degradation of compression ratio (up to 30-40 percent) and compression speed (down to 50 percent) for small packets.
  • Deflate requires a relatively large amount of memory (260 Kb for the compressor and 43 Kb for the decompressor) by default. Although memory usage can be reduced to 10 Kb for the compressor and 13 Kb for the decompressor, the compression ratio in this case is decreased by as much as 55 percent.
  • Deflate's compression speed is relatively low by default (150 Kb/sec on the Pentium-100 processor). Although compression speed can be increased to 900 Kb/sec, the compression ratio in this case is reduced by as much as 25 percent.

The LZH-Light algorithm was developed to overcome these problems with online data compression. Like Deflate, it is a combination of the LZ77 algorithm and Huffman-like encoding. The advantages of LZH-Light are:

  • LZH-Light's compression ratio depends only slightly upon flush frequency.
  • LZH-Light uses a very small, fixed amount of memory (16 Kb for the compressor and 8 Kb for the decompressor by default; minimum values are 3.7 Kb and 2.3 Kb, respectively).
  • LZH-Light has been optimized to achieve maximum speed (where the compression speed exceeds 1 Mb/sec on the Pentium-100 processor) with a satisfactory compression ratio.
  • Like Deflate, LZH-Light (both in its data format and in its current implementation) is independent of the CPU type, the operating system, and the character set. Thus, it can be used for data transfer between different platforms.

Although the LZH-Light algorithm yields lower offline compression ratios than the Deflate algorithm, LZH-Light can be more suitable for online compression.

LZH-Light Description

My description of the LZH-Light technique consists of two parts: data formats, and the algorithms that support those formats. This description is not complete; it presents basic principles, not a detailed specification. The full source code for the current LZH-Light implementation, which can be used as reference material, is available on the CUJ ftp site (see page 3 for downloading instructions).

Data Formats

The LZH-Light algorithm compresses data in two stages: the LZ77 stage and the Huffman-like stage. At the first (LZ77) stage, raw source data, interpreted as a byte stream, is converted into an intermediate stream of symbols. Each symbol in this stream belongs to the range 0-273. In addition, some symbols may have one or two attached suffixes, each suffix consisting of one or several bits. In the second (Huffman-like) stage, the intermediate stream is converted into a compressed output bit stream. This bit stream is LZH-Light's compression output. The compression/decompression process is depicted in Figure 1.

The LZH-Light algorithm contains a parameter that affects the data format. This parameter is LZBUFSIZE, the size of the sliding window for LZ77's compressor and decompressor. On one hand, the ability to change LZBUFSIZE allows the programmer to control memory usage. On the other hand, the existence of such a parameter means that LZH-Light is actually not a single algorithm, but a family of algorithms with incompatible data formats.

The Intermediate Data Format

The intermediate stream of symbols produced by LZ77's compression (S in Figure 1) will be reproduced on the output side (S' in Figure 1) of the Huffman-like decompressor, and fed as input into the LZ77 decompressor. This intermediate stream of symbols consists of simple instructions for the LZ77 decompressor.

  • A ch symbol within the range 0-255 means that the symbol ch will be placed in the decompressor's output byte stream.
  • A ch symbol within the range 256-271 means that a certain string, previously placed in the output stream, will again be written to the output stream. This ch symbol has two attached bit suffixes. The first bit suffix (in conjunction with the ch code) represents the string length. The second bit suffix represents the backward distance from the current position in the output stream to the beginning of this string.

The length of such a string may range from 4 to 521, and the backward distance may range from 0 to LZBUFSIZE. (If the backward distance is less than the string's length, then the referenced string overlaps the current position.) Tables 1 and 2 show how the ch code, the first bit suffix, and the second bit suffix combine to represent the length and the backward distance.

  • A ch symbol equal to 272 (HUFFRECALC) is used for synchronizing the compressor's and the decompressor's coding tables (more on this later). This symbol also has a bit suffix.
  • A ch symbol equal to 273 (END_BLOCK) indicates the end of the current compressed data block.

Thus, the LZ77 compressor's task is to build an intermediate stream S, such that the LZ77 decompressor, which applies the previously described instructions to S' (identical to S), will output a raw byte stream identical to the input raw byte stream of the compressor.

The LZ77 compressor tries to encode (using symbols 256-271) every substring from the input raw stream that has occurred earlier. Such a substring must be at least four bytes long, and its previous occurrence must be no more than LZBUFSIZE symbols before the current position.
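
To make these instructions concrete, the following sketch shows the LZ77 decompressor's main loop. The helper routines are assumed rather than reproduced: readSymbol() undoes the Huffman-like coding, while readLength() and readDistance() read the bit suffixes and apply Tables 1 and 2.

// Sketch of the LZ77 decompressor loop.  The helpers are assumed:
// readSymbol() undoes the Huffman-like coding; readLength() and
// readDistance() read the bit suffixes and apply Tables 1 and 2;
// putByte() appends a byte to the output stream and to the sliding
// window.  window/pos model the LZBUFSIZE-byte history.
extern int  readSymbol();
extern int  readLength( int ch );          // 4..521
extern int  readDistance();                // 0..LZBUFSIZE
extern void putByte( unsigned char c );
extern void rebuildCodingTable();
extern unsigned char window[];
extern int  pos;                           // current write position
extern const int LZBUFSIZE;

void decompressBlock()
{
    for( ;; ) {
        int ch = readSymbol();
        if( ch < 256 ) {                       // literal byte
            putByte( (unsigned char)ch );
        } else if( ch < 272 ) {                // copy an earlier string
            int len   = readLength( ch );
            int dist  = readDistance();
            int start = pos;
            // byte-by-byte copying makes the overlapping case
            // (dist < len) come out right
            for( int i = 0; i < len; ++i )
                putByte( window[ (start + LZBUFSIZE - dist + i)
                                 % LZBUFSIZE ] );
        } else if( ch == 272 ) {               // HUFFRECALC
            rebuildCodingTable();              // stay in sync
        } else {                               // 273: END_BLOCK
            break;
        }
    }
}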

Huffman-Like Encoding Format

The second stage of LZH-Light compression is similar to Huffman coding. At every given point during compression, a Huffman-like coding table exists. This table defines how many and which bits are used to encode each symbol of the intermediate stream. (Recall that these symbols may range from 0 to 273.) If a symbol of the intermediate stream has bit suffixes, these suffixes do not undergo Huffman-like encoding; they are placed into the output stream following the bits representing the symbol. The decompressor maintains its own copy of this table. At intervals, the compressor updates the table and places information about these updates in its output stream. The decompressor responds to this information, so that the decompressor's table remains identical to the compressor's table at every corresponding point in the decompression process.

The Huffman-like encoding table used in LZH-Light differs from the classic Huffman coding table. All symbols of the intermediate stream (0-273) are divided into 16 groups. With each group, indicated by group number grp (0-15), is associated a number of bits nGroupBit(grp) ranging from 0 to 8. Each group contains at most 2^nGroupBit(grp) symbols. A symbol ch of group grp is encoded by first placing the four bits of grp into the output stream, followed by the nGroupBit(grp) bits of the number representing the symbol's position within its assigned group. For example, if ch is the third symbol in group 1, and nGroupBit for group 1 is 2, the Huffman-like encoder sends the binary sequence 000110 to the output stream. The first four bits (0001) are the group number and the last two bits (10, equivalent to 2 in decimal) indicate that ch is the third symbol in its group.
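
In code, emitting one symbol takes just two bit-writes, as in the sketch below; putBits() and the per-symbol tables are assumed helpers maintained alongside the coding table.

// Sketch of Huffman-like symbol encoding.  putBits( v, n ) is an
// assumed helper that writes the n low-order bits of v to the output
// stream.  symbolToGroup[], symbolToPos[] and nGroupBit[] describe
// the current coding table.
extern void putBits( unsigned value, int nBits );
extern int  symbolToGroup[274];   // group (0-15) of each symbol
extern int  symbolToPos[274];     // position of the symbol in its group
extern int  nGroupBit[16];        // bits per group (0-8)

void encodeSymbol( int ch )       // ch is in the range 0-273
{
    int grp = symbolToGroup[ch];
    putBits( grp, 4 );                           // 4-bit group number
    putBits( symbolToPos[ch], nGroupBit[grp] );  // position in group
}

For the example above (the third symbol of group 1, with nGroupBit(1) equal to 2), encodeSymbol() emits 0001 followed by 10.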

This approach has both advantages and disadvantages compared to classic Huffman coding:

  • LZH-Light coding allows symbol extraction in exactly two reading operations (unlike bit-by-bit extraction in Huffman coding), which improves decompression speed. (See the sketch following this list.)
  • Worst-case encoding for LZH-Light is 12 bits/symbol, which is much better than the worst case for classic Huffman coding (theoretically up to 272 bits/symbol; in practice up to 25-35 bits/symbol).
  • Since classic Huffman coding is the best possible coding method of this type, LZH-Light coding yields a worse compression ratio. However, the difference is small (less than 1 percent in the average case), provided symbols are correctly divided into groups.
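
The two reading operations mentioned above translate directly into code; getBits() is an assumed helper that returns the next n bits of the input stream (returning 0 when n is 0).

// Sketch of symbol extraction in exactly two bit-reads.
// getBits( n ) is an assumed helper returning the next n input bits
// (and 0 when n is 0).  groupBase[grp] is the index in the sorted
// symbol table at which group grp starts.
extern unsigned getBits( int nBits );
extern int groupBase[16];
extern int nGroupBit[16];
extern int symbolTable[274];

int decodeSymbol()
{
    int grp = (int)getBits( 4 );               // first read: the group
    int pos = (int)getBits( nGroupBit[grp] );  // second read: position
    return symbolTable[ groupBase[grp] + pos ];
}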

Huffman-Like Table Synchronization

For correct operation of Huffman-like coding, it is necessary to synchronize the coding tables of both the compressor and the decompressor. The LZH-Light algorithm achieves this as follows.

  • The coding tables for the compressor and the decompressor are initialized to identical values, according to Table 3.
  • Both the compressor and the decompressor maintain a table of statistics, stat. This table holds a running count of every symbol SYM encountered, where SYM ranges from 0 to 273. Table entry stat(SYM) is incremented every time the character equal to SYM appears in the intermediate stream. The statistics tables in the compressor and the decompressor are identical at corresponding points of the compression and decompression process.
  • To change the coding table, the compressor does the following:
      1. The compressor places bits corresponding to the HUFFRECALC symbol (272) into the output stream.
      2. The compressor forms a symbol table (symbolTable) containing symbols 0-273, which are sorted by the key {stat(SYM), SYM} in descending order. Using SYM as a part of the key ensures that the key is unique.
      3. The compressor divides the symbol table into 16 groups so that group 0 contains the first 2^nBits(0) symbols from symbolTable, group 1 contains the next 2^nBits(1) symbols, and so on. It is the compressor that actually defines the values nBits(0), nBits(1), ..., taking into account the following restrictions: the number of symbols in each group shall not exceed 256, and each successive group contains at least as many symbols as the previous group.
      4. For every group grp (grp >= 0, grp <= 15) the compressor computes the difference in the number of symbols between this group and the previous group:

      d(grp) = grp ? nBits(grp) - nBits(grp-1) : nBits(grp)
      

      The compressor places the bits corresponding to d(grp) into the output stream in accordance with Table 4.

      5. For every symbol SYM, the compressor performs the operation stat(SYM) /= 2 to reduce the influence of older symbols compared to the newer symbols.

  • When the decompressor encounters the symbol HUFFRECALC in the intermediate stream, it forms its own symbolTable by sorting its own copy of stat(SYM). (Both the compressor and decompressor keep track of the statistics of symbols encountered in the stream.) The decompressor then divides the symbolTable into groups as specified by the bits following the symbol HUFFRECALC. Next, the decompressor performs the operation stat(SYM) /= 2 for each symbol SYM.
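
Because the statistics tables on both sides are identical, the routine that rebuilds symbolTable can be written once and shared between the compressor and the decompressor. A sketch of that routine, assuming a global stat[] array:

#include <algorithm>

// Sketch of the table rebuild both sides perform at HUFFRECALC time:
// sort the symbols by the unique key {stat(SYM), SYM} in descending
// order, then halve all statistics to age out older data.
extern unsigned stat[274];
extern int symbolTable[274];

static bool keyGreater( int a, int b )
{
    if( stat[a] != stat[b] )
        return stat[a] > stat[b];   // more frequent symbols first
    return a > b;                   // SYM breaks ties: the key is unique
}

void rebuildSymbolTable()
{
    for( int SYM = 0; SYM < 274; ++SYM )
        symbolTable[SYM] = SYM;
    std::sort( symbolTable, symbolTable + 274, keyGreater );
    for( int SYM = 0; SYM < 274; ++SYM )
        stat[SYM] /= 2;             // reduce the influence of old symbols
}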

The LZH-Light Algorithms

Strictly speaking, a description of the LZH-Light data format provides sufficient information for implementing a compressor and a decompressor that are LZH-Light-compliant. Nevertheless, studying the algorithms used in the current implementation may be helpful. I describe some of these algorithms in this section.

Searching for Duplicate Strings

The LZ77 compression stage relies on detecting duplicate strings in the input stream. The LZ77 compressor uses a hash table to find such strings. This table (hashTable) is constructed with a hash function that operates on substrings of length LZMATCH. Each cell hashTable[hashValue] contains a position index, indicating where a substring of length LZMATCH with a hash value of hashValue was recently encountered.

For each input byte, the compressor calculates the hash value for the string composed of the previous LZMATCH-1 symbols from the input stream plus the input byte. In other words, the hash function operates as a sliding window of length LZMATCH over the input stream. If the hash table contains an entry for this hash value and this entry was encountered no earlier than LZBUFSIZE symbols before, then the compressor calculates the match length between the current string and the previously encountered string (a match length less than LZMATCH can occur as a result of a hash table collision).

Obviously, the data model being used is less accurate than tree-like data models (and less accurate than Deflate's data model), and yields a worse compression ratio. Nevertheless, this effect can be considerably reduced. One of the biggest contributors to LZH-Light's lower accuracy is hash-table collisions. Consider the input stream ABCDEFGH...ABCDEFGH. Suppose that while analyzing the second substring "ABCDEFGH" the compressor was prevented from finding a match due to a hash-table collision. In this case the probability of finding a match for any of the substrings "BCDEFGH", "CDEFGH", or "DEFGH" is still very high. Thus, to increase the compression ratio the compressor tries to match not only in the forward direction — starting with the symbols at the position indicated by the hash table — but also in the reverse direction to match as many symbols as possible preceding this position. (In the current implementation this algorithm is found in the section enclosed in #ifdef LZBACKWARDMATCH ... #endif.)
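
A sketch of the backward extension, in the spirit of the LZBACKWARDMATCH code; the minPos bound (the first input position not yet covered by already-emitted codes) is an assumption of this sketch.

// Sketch of backward match extension.  After a forward match between
// the current position cur and an earlier position prev is found,
// try to also claim symbols preceding both positions.  minPos is the
// first input position not yet covered by already-emitted codes (a
// sketch assumption: extension must not reclaim emitted symbols).
int extendBackward( const unsigned char* buf,
                    int prev, int cur, int minPos )
{
    int extra = 0;
    while( prev - extra > 0
        && cur - extra > minPos
        && buf[prev - extra - 1] == buf[cur - extra - 1] )
        ++extra;
    return extra;       // the match grows by 'extra' symbols backward
}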

Another way to increase the compression ratio is by using the so-called lazy match (see [1]). After a match of length N beginning with the symbol ch is found, the compressor repeats the entire search process (including hashing) starting at the next input byte after ch. If the compressor finds a new match with length M that is longer than N, it abandons the match found at ch. It then places ch into the intermediate stream without modification, followed by the code indicating the match of length M. Otherwise, the compressor places the code indicating the match of length N into the intermediate stream.
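
In outline, one encoding step with lazy matching looks like the sketch below; findMatch() is an assumed helper returning the best match length (and distance) at a given position, and handling of the no-match case is omitted.

// Sketch of the lazy-match heuristic.  findMatch() is an assumed
// helper that returns the best match length at a position (0 if
// there is none) and stores the corresponding distance.
extern int  findMatch( int pos, int* dist );
extern void emitLiteral( unsigned char ch );
extern void emitMatch( int len, int dist );
extern unsigned char buf[];

void encodeStep( int& pos )
{
    int distN, distM;
    int N = findMatch( pos, &distN );       // match starting at ch
    int M = findMatch( pos + 1, &distM );   // match one byte later
    if( M > N ) {
        emitLiteral( buf[pos] );   // ch goes out unmodified...
        emitMatch( M, distM );     // ...followed by the longer match
        pos += 1 + M;
    } else {
        emitMatch( N, distN );     // keep the original match
        pos += N;
    }
}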

I obtained interesting results from studying different methods of hash-value calculation. The classical method of hash-value calculation for a given string s of length LEN can be expressed as follows:

//rotate x left by shift bits
//(32-bit unsigned assumed)
#define ROTL( x, shift ) \
    (((x) << (shift)) | ((x) >> (32 - (shift))))

unsigned hash = 0;
for( int i=0; i < LEN ; ++i )
    hash = ROTL(hash, SHIFT) ^ s[i];
hash %= TABLE_SIZE;

(SHIFT is chosen more or less arbitrarily, and the best results are obtained when TABLE_SIZE is a prime number.)

zlib uses an alternative method:

#define SHIFT \
    ((TABLE_BITS+LEN-1)/LEN)
unsigned hash = 0;
for( int i=0; i < LEN ; ++i )
    hash = (hash << SHIFT) ^ s[i];
//TABLE_SIZE is 1 << TABLE_BITS
hash &= (TABLE_SIZE - 1);

Using the classical method ensures a smaller number of hash-table collisions (and, therefore, a higher compression ratio), but it is much slower than zlib's method. The slowdown is in fact caused by the % operation. Analysis reveals a third method, which is similar to the classical method but much faster:

unsigned hash = 0;
for( int i=0; i < LEN ; ++i )
    hash = ROTL(hash, SHIFT) ^ s[i];
//TABLE_SIZE is 1 << TABLE_BITS
hash = (hash * A + B) >>
    (sizeof(unsigned)*8 - TABLE_BITS);

Here the last calculation is the formula of a linear congruential pseudo-random number generator, where A and B are specially selected constants. This method ensures a number of hash-table collisions slightly lower than that of the classical method, while being much faster. (This is true at least on x86 processors, where the * operation is up to three times faster than the % operation.)

Group Selection for Huffman-Like Encoding

To achieve the highest possible compression ratio using groups of symbols, the sum of stat(SYM) over the symbols in group 0 should equal the sum of stat(SYM) over the symbols in group 1, and so on. In practice strict equality is not possible, but the following algorithm achieves a rather accurate approximation:

    1. Calculate total number of symbols in stat:

    total = 0;
    for(SYM = 0; SYM <= 273; ++SYM)
        total += stat(SYM);
    

    2. Set used = 0; pos = 0;
    3. For each group grp from 0 to 14:

      a. Calculate the desired size of grp:

        desiredGroup =
            (total - used)/(16 - grp);
      

      b. For nBits from 0 to 8, calculate the number of symbols that would be contained in group grp:

        nSym[nBits] = 0;
        for(i = 0; i < (1 << nBits); ++i)
          nSym[nBits] += stat(symbolTable[pos+i]);
      

      c. Choose the value of nBits such that abs(nSym[nBits] - desiredGroup) is minimal.
      d. Set the size of grp to that value of nBits.
      e. Update used and pos:

        used += nSym[nBits];
        pos += 1 << nBits;
      

In a practical implementation, the algorithm is more complex due to restrictions imposed on groups (see the section "Huffman-Like Table Synchronization" above) and optimizations.
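
Putting the steps together (and ignoring the restrictions mentioned above), the group selection can be sketched as follows; note that stat is consulted through symbolTable, which is assumed to be already sorted in descending order, and that nSym[] here holds stat weights, as in step 3b.

#include <cstdlib>

// Sketch of group selection, ignoring the restrictions from the
// "Huffman-Like Table Synchronization" section.  symbolTable[] is
// assumed to be already sorted by descending {stat(SYM), SYM}.
extern unsigned stat[274];
extern int symbolTable[274];
int nBitsOfGroup[16];

void selectGroups()
{
    unsigned total = 0;
    for( int SYM = 0; SYM <= 273; ++SYM )
        total += stat[SYM];

    unsigned used = 0;
    int pos = 0;
    for( int grp = 0; grp < 15; ++grp ) {
        unsigned desired = (total - used) / (16 - grp);

        // the stat weight each candidate group size would cover
        unsigned nSym[9];
        for( int nBits = 0; nBits <= 8; ++nBits ) {
            nSym[nBits] = 0;
            for( int i = 0; i < (1 << nBits) && pos + i <= 273; ++i )
                nSym[nBits] += stat[ symbolTable[pos + i] ];
        }

        // choose the size whose weight is closest to the desired one
        int best = 0;
        for( int nBits = 1; nBits <= 8; ++nBits )
            if( std::labs( (long)nSym[nBits] - (long)desired )
              < std::labs( (long)nSym[best]  - (long)desired ) )
                best = nBits;

        nBitsOfGroup[grp] = best;
        used += nSym[best];
        pos  += 1 << best;
    }
    // group 15 receives all remaining symbols
}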

Frequency of Changes to Coding Table

In accordance with the described data format, the LZH-Light compressor chooses when to change the tables used in Huffman-like encoding. At present this is one of the least developed parts of the current LZH-Light implementation: tables are changed automatically after every HUFFRECALCLEN (by default 4096) symbols of the intermediate stream.

Current Implementation of LZH-Light

Although the current implementation of the LZH-Light algorithm is written in C++, it has a C interface, which is defined in lzhl.h (see Listing 1). This allows use of this implementation with both C and C++ programs.

To increase the algorithm's operation speed, all algorithm customizations are performed at compile time with the help of macro definitions (see the _lzhl.h file in the source archives). As mentioned previously, the only parameter that affects the LZH-Light data format is LZBUFSIZE. All other parameters affect only compressor operation and are used to achieve an appropriate balance of compression speed, compression ratio, and memory usage. Since a large number of LZH-Light algorithm configurations can be obtained using the macro definitions from _lzhl.h, comprehensive testing and benchmarking were performed only for several recommended configurations (see Table 5).

The program in Listing 2 can be used for testing various configurations of the LZH-Light algorithm. This program provides sample compression/decompression of a single file. It is not optimal; its only purpose is to verify proper operation for the chosen configuration.

Efficiency and Performance

The measurements of algorithm efficiency for LZH-Light and the zlib implementation of Deflate were carried out under the following conditions:

  • data: I used the so-called Calgary Corpus, which is a standard set of files typically used in compression tests. I concatenated these files into one 3,175 Kb file.
  • computer: Pentium 100 with 32 Mb RAM, running under Windows 95
  • compiler: Microsoft Visual C++ v5.0.

Results of the measurements are shown in Figure 2, Figure 3, and Figure 4.

The most interesting results of testing are:

  • The LZH-Light algorithm is much more efficient (with both a higher speed and a higher compression ratio) than Deflate when low memory usage (less than 64 Kb for the compressor and the decompressor combined) is required.
  • The LZH-Light algorithm can work efficiently with very low memory usage (as little as 6 Kb for the compressor and the decompressor combined), achieving a high compression speed (up to 1 Mb/sec on the Pentium-100 processor) and a relatively high compression ratio (up to 2.1 on standard Calgary Corpus data).
  • LZH-Light is much more efficient (has a lower memory usage, a higher speed, and a higher compression ratio) than Deflate when using small packets (with the average packet size 512 bytes or less).
  • LZH-Light is more efficient (has a lower memory usage and a higher compression ratio) than Deflate when high compression speed is required.
  • LZH-Light lacks an escape mechanism for incompressible data (such a mechanism would be too expensive for small data packets), yet it adds very little (1.0-1.5 percent) to the size of incompressible data blocks.

Uses for LZH-Light

The primary field of application for LZH-Light is online compression of reliable data streams. At present, an earlier modification of this algorithm is used in a prominent Russian electronic stock market system named RTS (http://www.rtsnet.ru). Data in this system is transferred over a Wide Area Network and can be compressed efficiently. On the other hand, security considerations require encryption, which must be performed after compression. Applying the LZH-Light algorithm by itself reduced network traffic by a factor of 4-4.5. Since LZH-Light is five to six times faster than the encryption algorithm (DES), and provides a compression ratio of 4-4.5, the total processing time for a data block is reduced by a factor of 2-2.5. (Compression adds roughly one fifth of the original encryption time, while encryption then runs on data roughly one quarter the original size.) Thus, in this particular case, applying the compressor reduced both network traffic and CPU usage.

As an example of an LZH-Light application, Listing 3 shows the interface to an lzhl_tcp library. This library wraps TCP with a compression/decompression layer based on the LZH-Light algorithm. Using the library is identical to using standard TCP; the only (and obvious) limitation is that the wrapper must be used on both sides of the TCP connection.
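
Listing 3 itself defines the exact interface; the fragment below is only a hypothetical usage sketch with stand-in names, to show that the calling pattern is the same as for plain sockets.

// Hypothetical usage sketch; the lzhlTcp* names are stand-ins for
// the actual lzhl_tcp interface from Listing 3, not part of it.
typedef void* LZHL_TCP;
LZHL_TCP lzhlTcpAttach( int socketHandle );   // wrap a connected socket
int  lzhlTcpSend( LZHL_TCP, const void* buf, int len );
int  lzhlTcpRecv( LZHL_TCP, void* buf, int maxLen );
void lzhlTcpDetach( LZHL_TCP );

void example( int socketHandle )
{
    char buffer[512];
    LZHL_TCP conn = lzhlTcpAttach( socketHandle );
    lzhlTcpSend( conn, "hello", 5 );          // compressed on the way out
    int got = lzhlTcpRecv( conn, buffer,
                           (int)sizeof(buffer) );  // decompressed on arrival
    (void)got;
    lzhlTcpDetach( conn );
}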

The LZH-Light algorithm can be applied to more than communication channel compression. It is potentially useful for any application for which high compression speed and low memory usage are essential.

References

[1] P. Deutsch. RFC 1951: DEFLATE Compressed Data Format Specification, version 1.3. May 1996.

[2] P. Deutsch and J.-L. Gailly. RFC 1950: ZLIB Compressed Data Format Specification, version 3.3. May 1996.

Sergey Ignatchenko is a senior software developer working on a major Russian electronic stock trading system named RTS (http://www.rtsnet.ru). He specializes in Object-Oriented design and development of distributed systems, and can be reached at [email protected].

