Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Random Access Data Compression


September 1997/Random Access Data Compression

Random Access Data Compression

Philip Gage

Need to compress text but still jump about within it? This simple compression scheme may do the trick.


This article describes a data compression algorithm called Byte Pair Encoding (BPE). BPE has not only excellent decompression-time performance but some unusual features as well. In particular, BPE supports direct access to compressed data, allowing a program to jump to any byte in a compressed file and retrieve or modify data at that point. I illustrate this process by presenting portable C programs for compression and decompression, and a text file browser program that demonstrates the random access technique.

BPE in a Nutshell

I described BPE at length in a previous article (see "A New Algorithm for Data Compression," CUJ, February 1994). Since space is limited, and since I want to focus on the random-access feature of BPE, I will just give a refresher of BPE here. I encourage readers who want more detail to consult the original article. Also, an expanded illustration of the BPE algorithm is available (along with this article) on the CUJ web site. See http://www.cuj.com.

The BPE algorithm is a general-purpose lossless compression method suitable for all types of data, although this article concentrates on text-file applications. The compression process is very simple. Unlike many other compression methods, BPE does not require variable-length bit packing or complex data structures. BPE loads an entire file or a block of data into memory and compresses it in multiple passes. The data is reanalyzed after each pass to account for changes as it is compressed.

BPE operates by finding the most frequently occurring pair of adjacent bytes in the data and replacing all instances of the pair with an unused byte that does not occur in the data. The compressor repeats this process until no more compression is possible, either because there are no more frequently occurring byte pairs or because there are no more unused bytes to represent pairs. The compressor then writes the resulting table of byte-pair substitutions to the output, immediately preceding the packed data. The following pseudocode illustrates the compression process:

Read file into buffer
While (compression possible)
{
   Find most frequent byte pair
   in buffer

   Add pair to table and assign it
   an unused byte

   Replace all such pairs with this
   unused byte
}
Write pair table and buffer

The decompressor first reads the pair table from the input stream, and stores the table in a buffer. The decompressor then processes each byte of compressed data, maintaining a small stack of partially expanded bytes. The next byte to be processed is popped from the stack, or if the stack is empty, read from the input stream. If the byte is a pair code, the pair of bytes is looked up in the table and pushed onto the stack; otherwise the byte is output as a normal character. The pseudocode below illustrates this decompression process.

Read and store pair table in buffer
While (stack not empty OR not end of file)
{
   If stack empty, read byte
   Else pop byte from stack
   If byte is pair code,
       push pair on stack
   Else write byte
}

Figure 1 shows the pair transformations that occur during compression of or decompression to the string "ABABCABCD." (This is the same data used for the examples on the web site.) You can read this diagram downwards for compression and upwards for decompression. This illustration shows why BPE supports random access. Each byte in the compressed data (bottom line) is an independent code that corresponds directly to one or more bytes in the original data. Thus, any section of compressed data can be decompressed on the fly. In addition, any part of the compressed data can be modified without affecting the rest of the data.

BPE in Code

I've included a sample compressor, compress.c (Listing 1) and decompressor, expand.c (Listing 2) to help illustrate how BPE works. For simplicity, the C code accompanying this article compresses only small text files and includes only minimal error handling. This code compresses typical text files to roughly half of their original size, but of course the actual amount of compression depends on the data being compressed.

This sample code relies on the fact that the ASCII character set uses only codes from 0 through 127. That frees up codes 128 through 255 for use as pair codes. Assigning pair codes in numerical order allows the byte pair table to be stored with only two bytes per pair. The byte value that represents each pair is indicated by the pair's position in the table. For a description of this code, please consult the sidebar.

The sidebar also briefly describes modifications to improve compression, to handle large files, and to handle binary data. These modifications were fully implemented in my previous article [1] . The full source code implementing these features is available on the CUJ ftp site under this month's code listings (see p. 3 for downloading details).

Random Access Decompression

Few commonly used compression algorithms allow random access to compressed data. Virtually all modern compression methods use adaptive models and generate variable-bit-length codes that must be decoded sequentially from beginning to end [2] . In this case, random access to large data sets can be provided only by dividing the data into smaller files or blocks and compressing each chunk separately. Even if an application is sophisticated enough to decompress separate blocks of data (the conventional technique), it still must begin decompressing each block at the start of that block.

There is a simple text compression trick that allows random access; it employs the unused high-order bit in ASCII characters to indicate that a preceding space character (the most common character in text files) has been removed. This technique in effect removes all single spaces and reduces runs of consecutive spaces to half length, compressing typical text by about 15 percent. Another simple approach is to encode common words as one byte using a fixed dictionary of the most frequently used words. But schemes such as these are not general-purpose and have limited usefulness.

BPE is a universal compression algorithm that supports random access for all types of data, not just text. The global substitution process of BPE produces a uniform data format that allows decompression to begin anywhere in the data. Using BPE, data from anywhere in a compressed block can be immediately decompressed without having to start at the beginning of the block. This can provide very fast access for some applications. It's like being able to grab a volume of an encyclopedia off the shelf and open it to the exact page you want without having to flip through all the earlier pages in the volume.

The text file browser program browse.c (Listing 3) illustrates how BPE can be used for direct access to compressed data. The program accepts a single command-line argument, the name of a file to view. This program accepts both normal text files and compressed files. The browser displays the current screen of text and the location in the file as a percentage, from 0 at the top to 99 at the bottom. The user then enters other percentage values to view any part of the file. Entering a number outside the 0 to 99 range exits the program.

The browser uses fseek to jump to a specified location in the file and then decompresses a screen of text on the fly. Jumping to a random location generally results in a partial first line of text being displayed. This could be handled simply by not printing characters until after the first newline character is seen. The browser could be improved by adding a Graphical User Interface (GUI) with a scroll bar to set the percentage and allow the user to scroll through a file by dragging the scroll bar with a mouse.

LINESPERPAGE determines how many lines of text are printed per screen. This value may be adjusted as needed. This program assumes that the previous text will scroll off the top of the screen. On some systems, a clear screen call may improve operation.

Modifying the browser to scroll forward an exact number of lines or pages is easy; just decompress and count newline characters. Scrolling backwards a specified number of lines is also easy using another strange BPE feature, decompression can be done backwards! This is done by processing compressed bytes backwards, storing the generated data backwards and reversing the order that the pair bytes are pushed on the stack. Thus newlines can be counted backwards from the current position while storing the unpacked data backwards in a buffer.

Possible applications

The BPE method is useful for applications requiring small, fast decompression code, such as image display, self-extracting programs, and embedded systems. It could also be useful for communication links, since worst-case data expansion is low and it operates on discrete packets of data.

BPE is a relatively fault-tolerant compression scheme. If a few bytes of input to the decompressor are corrupted by transmission errors, only a similar fraction of output bytes are affected, and only in the current block of data. Thus, BPE is may be more suitable for real-time mission-critical and data-transmission applications than most other compression algorithms.

A useful modification to the text browser technique would be to enable hypertext access to regions within the compressed file. The compression program would create an index file of byte offsets to topic names, chapter headers, and other selected phrases within the compressed file. The browser would then highlight these phrases and use the byte offsets to jump to the selected links. Note that it should also be possible to implement a text editor based on the browser technique that can modify, insert, or delete text almost as easily as a normal editor.

Finally, BPE could be useful for managing a compressed database. In this scheme, each database table would have its own BPE pair table, allowing delete, insert, and update of records by reusing the same pair table for compressing new data. Of course, the tables may require re-compression after extensive updating, to restore the compression ratio. This approach also supports binary search on compressed data.

Summary

The BPE algorithm has several significant advantages over typical compression methods:

  • simple, small, and fast decompression code
  • very low worst-case data expansion ratio
  • fault-tolerance to data transmission errors
  • easily tuned speed vs. compression performance
  • random read/write access to compressed data
  • forward or reverse decompression, beginning at any byte within the compressed data

In contrast to the example code, these techniques are not limited to the compression of small text files. BPE can be used on any type or size of data set by using buffering and unused byte tests as described in Reference [1] .

The main disadvantages of BPE are slow compression speed and a lower compression ratio than some other methods. However, slow compression with fast decompression is acceptable for many applications. Random access to compressed data adds a new twist that may outweigh the disadvantages in many cases.

References

[1] Philip Gage. "A New Algorithm for Data Compression," The C Users Journal, February 1994.

[2] Mark Nelson. The Data Compression Book (M&T Books, 1991).

Philip Gage has a BS degree in computer science and works as a software engineer in Colorado Springs, CO. He can be reached at [email protected].


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.