Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Random Access Data Compression


September 1997/Random Access Data Compression/Sidebar

BPE Code Description


Listing 1 shows the compression program compress.c. The critical issue in the compress function is how long it takes to find the most frequent pair. A brute force search requires no additional data structures but gives ridiculously slow performance. On the other hand, a 256x256 array can be used to hold the counts, but occupies 64K of memory as unsigned char and 128K as short. The approach I use in this program is to count pairs in a hash table and then search the table for the highest count.

The compressor reads the entire input file into a buffer and compresses it in place. An outer loop tries to perform 128 global substitutions, replacing frequent pairs with unused characters in the range 128 to 255. Byte pairs are tallied in a hash table; each entry in the table consists of a left byte value, right byte value, and a count. The compressor searches the hash table for the highest count. If highest count found is large enough, the compressor replaces all occurrences of the pair in the buffer with the associated pair code. Otherwise, the break statement ends the loop. After completion of the loop the compressor writes the last pair code value, the pair table, and the packed data to the output file.

The compressor creates a complete pair table and data buffer after each substitution pass, so compression can be stopped after any pass. Since the most frequently occurring pairs are compressed first, quitting after the first few passes can greatly speed up the process while still obtaining reasonable compression. Thus you can easily tune the time/space performance by modifying the THRESHOLD value in compress.c.

This program uses dynamic allocation via malloc to avoid memory limitations that would be imposed by some systems. You can adjust the input buffer size MAXSIZE, and the hash table size HASHSIZE, as necessary. MS-DOS allows a MAXSIZE of only 64KB, but other systems may allow larger values. On MS-DOS systems, compile compress.c with a large memory model or reduce MAXSIZE. Setting HASHSIZE too small may result in hash table overflow. (It would probably be a good idea to add some error checking here.)

The decompression program expand.c (Listing 2) is very simple and fast, using only 276 bytes for variables. In this respect the decompression algorithm differs from most compression schemes, where the compressor and decompressor maintain the same large data model in synchronization. In BPE, it is the compressor that does most of the work. The compressor packages the data nicely to simplify decompression.

The first byte of the compressed data file represents the number of pairs in the pair table (which immediately follows). The number of pairs, which can range from 0 to 127, is encoded in this first byte as a value ranging from 128 to 255 respectively. This encoding scheme allows the decompressor to handle a normal text file in the same way as a compressed file. If the first byte is less than 128, then no pair table is expected and the first byte is output as is. Since the normal BPE decompression process passes non-pair-code bytes through unchanged, the program can accept both compressed and uncompressed files.

A Complete BPE Implementation

These programs can be modified to handle large files by buffering blocks of data and compressing each block separately. Each block is written with its pair table and the number of packed data bytes that follow. This technique also supports the use of data streams and can provide better compression by allowing local adaption to data in each block.

Similarly, binary files can be handled by adding bytes to the buffer until it is full, or only a small number of unused bytes remain to represent pair codes. Since these unused codes are typically scattered around the character set, the pair table can be written as three bytes per substitution, two for the pair and one for the unused byte that replaces the pair.

The original BPE code in Reference 1 implements both buffering and binary data processing into a complete general-purpose compression system. This code is available on the CUJ ftp site with this month's code listings (see p. 3 for downloading instructions). This version can handle files of any size containing any type of data. It compresses text files somewhat better than the simplified version presented in this article by exploiting all unused characters in the entire character set, not just the upper half. In comparison to typical LZW (Lempel Ziv Welch) programs, this original BPE implementation has slower compression time, faster decompression, a similar compression ratio on typical binary data and small text files, but a slightly lower ratio on large text files.

BPE's worst case data expansion amount on truly random data is only about 0.2 percent, while many other compression schemes may greatly inflate such data. If the data cannot be compressed, it is passed through unchanged except for the addition of a few header bytes for each block. o


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.