A Random Access Compressed File Layer

Kyle York

Providing efficient random access to compressed data is a tall order. Here is a library that does just that, and it’s patterned after POSIX file I/O.


Introduction

Even as hard drives continue to get larger and faster, there are still many situations that require data to be kept as small as possible. I recently ran into this very problem when a database plus associated indexes grew to more than 2 GB, which exhausted the available space. Database files consist mostly of whitespace, and index pages are normally only half used at best, so this application seemed a prime candidate for compression. Unfortunately, compressed data is usually not randomly accessible: to get the data at offset n, it is necessary to uncompress all data at offsets 0 through n-1. To get around this, a program must compress data in discrete blocks, keeping a list of indexes that map each block to an offset within the file. This solution quickly fails if the blocks are rewritten often (as happens during index updates), because chances are the newly written data, when compressed, will not take up exactly the same number of bytes as the original.

The solution is to create a structured file in which the space overhead is very small and overwhelmed by the savings from the compression. This article describes a file I/O library I developed for providing random access to compressed file data. The library provides functions equivalent to the POSIX [1] open, seek, read, sync, and write file I/O functions. If you are familiar with these functions, you can use the functions in this library in much the same way. To give you a rough idea of the compression/efficiency tradeoffs possible: using ZLIB, a freely available compression package, I was able to achieve compression to 65% of original size with I/O taking 3-4 times as long as normal.
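
To make the discussion concrete, here is a sketch of what the API surface looks like. The function names appear later in this article; the comprio_env and comprio_file types and the exact signatures shown are illustrative assumptions, not the library's actual declarations.

/* Sketch of the POSIX-style API. The function names come from this
   article; the types and signatures are assumptions for illustration. */
#include <stddef.h>
#include <sys/types.h>   /* off_t, ssize_t, mode_t */

typedef struct comprio_env  comprio_env;   /* tunables + I/O/compression hooks */
typedef struct comprio_file comprio_file;  /* an open compressed file          */

/* Like POSIX open(), but takes an environment as its first parameter. */
comprio_file *comprio_open(comprio_env *env, const char *path,
                           int flags, mode_t mode);
off_t   comprio_seek (comprio_file *f, off_t offset, int whence);
ssize_t comprio_read (comprio_file *f, void *buf, size_t len);
ssize_t comprio_write(comprio_file *f, const void *buf, size_t len);
int     comprio_sync (comprio_file *f);
int     comprio_close(comprio_file *f);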

The File Structure

Figure 1 illustrates the structure of the compressed file. The file is made up of fixed-size blocks and consists of a header and one or more sections. Each section is further divided into three parts, each of which occupies one or more full blocks: the BAT (block allocation table), the cluster mapping table, and the cluster data. New sections are created as necessary to contain all of a file’s data, eliminating the need to preallocate the file. In the following description, references to “file” mean the structured, compressed file, while references to “embedded file” mean the data returned by the API.

The header is divided into five sections, each of which is maintained by the subsystem to which it pertains. This allows a program to modify any of the subsystems without having to touch any of the others. Since the header and subsystems are logically attached, I describe both at the same time as follows:

The first part of the header contains basic file information: a magic number, the file version, the highest location written to the embedded file, the number of blocks required to store the header, and a dirty flag. When the file is open, a portion of it, including the header, is maintained in a RAM cache. When the file is updated, the dirty flag is set and the header is flushed to disk. When the file is sync’d, the dirty flag is reset. Since part of the file is cached during updates, it is necessary to know the integrity of the file when opening it. If the dirty flag is set when the file is opened (the file was not closed correctly), the file is considered unrecoverable. (This is solely because I have not needed a recovery tool; in practice, much of the data is recoverable, but its integrity is unknown.)
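
A plausible C rendering of this first header part, for illustration only (field names, widths, and ordering are assumptions, not the actual on-disk layout):

#include <stdint.h>

struct cdc_header_base {
    uint32_t magic;          /* identifies the file format               */
    uint32_t version;        /* file format version                      */
    uint64_t embedded_eof;   /* highest location written (embedded file) */
    uint32_t header_blocks;  /* blocks required to store the header      */
    uint32_t dirty;          /* set on update, cleared on sync           */
};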

The Block Subsystem

The block subsystem is responsible for all low-level I/O. Its header contains the size of a block (in bytes) and the total number of blocks in the file. Blocks are referenced by logical number: 0 is the first block of the header, 1 is the block that begins (block size) bytes from the beginning of the file, and so on. There are no restrictions on block size, but it is normally chosen to be the size of the blocks used by the underlying file system, or a multiple thereof, to avoid unaligned accesses. It should also be large enough to overcome the four-byte per-block overhead imposed by the block allocation table. For example, a block size of one byte means 400% overhead, whereas a block size of 512 bytes yields an overhead of only about 1%.
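
The logical-to-physical mapping is just multiplication. A minimal sketch, assuming POSIX pread underneath (the low-level I/O primitives are replaceable, as described later):

#include <stdint.h>
#include <unistd.h>
#include <sys/types.h>

/* Block n begins n * block_size bytes into the file; block 0 holds the
   start of the header. */
static off_t block_offset(uint32_t blkno, uint32_t block_size)
{
    return (off_t)blkno * block_size;
}

static int block_read(int fd, uint32_t blkno, uint32_t block_size, void *buf)
{
    ssize_t n = pread(fd, buf, block_size, block_offset(blkno, block_size));
    return (n == (ssize_t)block_size) ? 0 : -1;
}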

BAT (Block Allocation Table)

The block allocation table provides a 1:1 map to the cluster data entries, where BAT(n) refers to cluster data block n. The BAT works much like the MS-DOS File Allocation Table in that it is simply a linked list, where the link value at BAT(n) points to the next BAT entry used by the cluster. A link value of -1 signifies the end of the list. The BAT header consists of the number of blocks per BAT, the number of sections with initialized BATs, the number of BAT entries in use (for allocating new entries if the free list is empty), and the first entry in the free list.
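
Walking a chain is FAT-style pointer chasing. A sketch, where bat_get is a hypothetical accessor standing in for the table code described below:

#include <stdint.h>

#define BAT_END ((uint32_t)-1)            /* end-of-list link value */

extern uint32_t bat_get(uint32_t entry);  /* hypothetical: returns BAT(entry) */

/* Count the cluster-data blocks in a chain starting at `first`. */
static uint32_t chain_length(uint32_t first)
{
    uint32_t n = 0;
    for (uint32_t e = first; e != BAT_END; e = bat_get(e))
        n++;
    return n;
}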

Cluster Mapping Table

The cluster mapping table provides a 1:1 map from the embedded file’s clusters to the first BAT entry used by each. MAP(n) refers to the data at embedded file position n * cluster size. If the entry is the end-of-list value, data has never been written to this cluster. Otherwise, if the high bit is set, the cluster is compressed; if not, it is stored uncompressed. The MAP header consists of the number of blocks per MAP and the number of sections with initialized MAP tables (which will be either equal to, or one less than, the number of BAT sections).
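
Decoding a MAP entry then looks like this. The 32-bit width matches the four-byte-per-entry overhead cited earlier; the exact bit layout is an assumption:

#include <stdint.h>
#include <stdbool.h>

#define MAP_END      ((uint32_t)-1)  /* never written            */
#define MAP_COMP_BIT 0x80000000u     /* high bit: compressed     */

static bool map_is_unwritten(uint32_t e)  { return e == MAP_END; }
static bool map_is_compressed(uint32_t e) { return (e & MAP_COMP_BIT) != 0; }
static uint32_t map_first_bat(uint32_t e) { return e & ~MAP_COMP_BIT; }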

Cluster Data

Finally, the actual data. The smallest unit of the embedded file is called a cluster, which consists of zero or more blocks and can be stored either compressed or uncompressed. If compression does not save at least one block, the cluster is written uncompressed. The only time a cluster can require zero blocks is if no data has ever been written to it, in which case all reads return a stream of zeros. The cluster data header consists of the number of blocks per cluster.
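
The compress-or-not decision compares whole-block counts, since blocks are the allocation unit. A sketch:

#include <stdint.h>

static uint32_t blocks_for(uint32_t nbytes, uint32_t block_size)
{
    return (nbytes + block_size - 1) / block_size;   /* round up */
}

/* Store compressed only if it saves at least one block. */
static int store_compressed(uint32_t raw_len, uint32_t comp_len,
                            uint32_t block_size)
{
    return blocks_for(comp_len, block_size) < blocks_for(raw_len, block_size);
}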

The Code

Throughout the code, physically writing clusters or tables happens implicitly, generally during reads (which is to say, almost everything is cached). This simplifies the code quite a bit, and partially explains why there are no internal write functions, only flush functions. If a read request is made for data not already in the buffer, the buffer is first flushed (if dirty), then the read occurs.

The two tables (BAT and MAP) share much of the same code (see Listing 1, cdc_tbl.h). The basic table consists of: the number of blocks, a phantom value (returned when an entry that has never been written is read), the block offset from the beginning of a section, and a table-create function. Even though a table is viewed externally as a single contiguous entity, only the data from a single section is buffered.

The table-create callback is called when a new table is required. For BATs, this function simply initializes all entries to the end-of-list value. For MAP tables, it must first create the BAT for the section if it does not exist, then initialize its own entries to the end-of-list value.

Function tbl_entry_get returns a table entry. If the requested entry is from a table that has never been written, the phantom value is returned. If it is from a table section other than the one in memory, the current table section is flushed (if dirty) and the new one read.

Function tbl_entry_set first retrieves the existing table entry with a call to tbl_entry_get, then writes the new value only if it differs from the existing one or the section has never been written.
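
The listing itself is not reproduced here, but the get/set logic reduces to the following sketch (field and helper names are invented; the real code is in cdc_tbl.h):

#include <stdint.h>

struct tbl {
    uint32_t *buf;       /* one buffered section of the table   */
    uint32_t  cur_sect;  /* which section is buffered           */
    uint32_t  entries;   /* entries per section                 */
    uint32_t  phantom;   /* value for never-written entries     */
    int       dirty;     /* buffered section needs flushing     */
};

extern int  tbl_section_exists(struct tbl *t, uint32_t sect);  /* hypothetical */
extern void tbl_section_swap  (struct tbl *t, uint32_t sect);  /* flush + read */
extern void tbl_section_create(struct tbl *t, uint32_t sect);  /* create cb    */

static uint32_t tbl_entry_get(struct tbl *t, uint32_t n)
{
    uint32_t sect = n / t->entries;
    if (!tbl_section_exists(t, sect))
        return t->phantom;            /* never written: phantom value     */
    if (sect != t->cur_sect)
        tbl_section_swap(t, sect);    /* flush current if dirty, read new */
    return t->buf[n % t->entries];
}

static void tbl_entry_set(struct tbl *t, uint32_t n, uint32_t val)
{
    uint32_t sect  = n / t->entries;
    int      fresh = !tbl_section_exists(t, sect);
    if (!fresh && tbl_entry_get(t, n) == val)
        return;                       /* unchanged: don't dirty the buffer */
    if (fresh)
        tbl_section_create(t, sect);  /* invokes the table-create callback */
    t->buf[n % t->entries] = val;
    t->dirty = 1;
}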

Writing a Cluster

The cluster flush function (see Listing 2, cdc_data_flush) first attempts to compress the data buffer. The cluster is then written to disk one block at a time, either reusing previously allocated BAT entries or allocating new ones via function cdc_bat_entry_alloc.

Finally, if there are any extra links in the chain when the write is complete (i.e., a modified cluster compressed to less than its original size), they are added to the free chain. In my case, this is a rather common occurrence: the embedded file is a B-tree, and when a page is split, it usually compresses much better than before the split.
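
The whole flush path reduces to: reuse the chain, extend it if the cluster grew, free the tail if it shrank. A sketch with hypothetical helpers (cdc_bat_entry_alloc is named in the article; it is assumed here to return an entry whose link is already the end-of-list value):

#include <stdint.h>
#include <stddef.h>

#define BAT_END ((uint32_t)-1)

extern uint32_t cdc_bat_entry_alloc(void);            /* from the article     */
extern uint32_t bat_get(uint32_t e);                  /* hypothetical helpers */
extern void     bat_set(uint32_t e, uint32_t link);
extern void     bat_chain_free(uint32_t first);       /* add to free chain    */
extern void     block_write(uint32_t blkno, const void *buf);

/* Write nblocks (>= 1) blocks of (possibly compressed) cluster data. */
static void cluster_flush(uint32_t *chain_head, const uint8_t *data,
                          uint32_t nblocks, uint32_t block_size)
{
    uint32_t e = *chain_head, prev = BAT_END;

    for (uint32_t i = 0; i < nblocks; i++) {
        if (e == BAT_END) {                   /* chain too short: extend   */
            e = cdc_bat_entry_alloc();
            if (prev == BAT_END)
                *chain_head = e;              /* new first entry for MAP   */
            else
                bat_set(prev, e);
        }
        block_write(e, data + (size_t)i * block_size);
        prev = e;
        e = bat_get(e);
    }
    if (e != BAT_END) {                       /* cluster shrank: free tail */
        bat_set(prev, BAT_END);
        bat_chain_free(e);
    }
}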

Reading a Cluster

Reading a cluster is much simpler because no extra bookkeeping is involved. First the MAP table entry is read (since the MAP is 1:1 with the clusters, cluster n corresponds to MAP table entry n). If this is the end-of-list value, simply zero the cluster buffer; otherwise, read all blocks listed in the BAT chain. These are read into the compress buffer if the high bit is set in the MAP table entry, or directly into the cluster buffer if the cluster is not compressed. Finally, if it is compressed, decompress it. The decompressed size should be exactly the cluster size; otherwise, an I/O error (or worse, a compressor bug) has occurred.
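
As a sketch (helper names hypothetical, constants as in the earlier MAP sketch):

#include <stdint.h>
#include <string.h>

#define BAT_END      ((uint32_t)-1)   /* also the MAP end-of-list value  */
#define MAP_COMP_BIT 0x80000000u      /* high bit: cluster is compressed */

extern uint32_t map_get(uint32_t cluster);
extern uint32_t bat_get(uint32_t e);
extern void     block_read_buf(uint32_t blkno, void *buf);
extern int      cdc_uncompress(uint8_t *dst, uint32_t dst_len,
                               const uint8_t *src, uint32_t src_len);

static int cluster_read(uint32_t cluster, uint8_t *out, uint8_t *comp_buf,
                        uint32_t cluster_size, uint32_t block_size)
{
    uint32_t entry = map_get(cluster);
    if (entry == BAT_END) {
        memset(out, 0, cluster_size);   /* never written: stream of zeros */
        return 0;
    }
    int compressed = (entry & MAP_COMP_BIT) != 0;
    uint8_t *dst = compressed ? comp_buf : out;
    uint32_t len = 0;
    for (uint32_t e = entry & ~MAP_COMP_BIT; e != BAT_END; e = bat_get(e)) {
        block_read_buf(e, dst + len);   /* follow the BAT chain */
        len += block_size;
    }
    if (compressed)                     /* must yield exactly cluster_size */
        return cdc_uncompress(out, cluster_size, comp_buf, len);
    return 0;
}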

The API

The API is implemented in file cdc_api.c (available on the CUJ website at <www.cuj.com/code/>). It is POSIX compatible, with the exception of the open function (see below). I chose POSIX because it seemed easier to emulate an existing API than to define a new one.

comprio_open

The only difference between comprio_open and the POSIX open function is that comprio_open takes an environment as its first parameter. comprio_open is also one of the most complicated API routines because of the O_TRUNC flag: to truncate, the file is first opened, and if this succeeds, all file parameters are taken from the existing file (the parameters in the environment are ignored).

The environment allows the programmer to adjust virtually all of the working parameters (block size, BAT size, MAP table size, cluster size). It also allows for replacement of all I/O and compression primitives.

Being able to replace the I/O primitives allows a programmer to chain several different I/O libraries. For example, this library was written to be placed under a database management library, and you could easily place an encrypted I/O library under this one. By default, this library uses the standard POSIX file functions underneath.
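
One way such an environment might look is a structure of tunables and function pointers; this is entirely illustrative, not the library's actual declaration:

#include <stdint.h>
#include <sys/types.h>

struct comprio_env {
    uint32_t block_size;          /* tunables discussed below            */
    uint32_t blocks_per_cluster;
    uint32_t map_blocks;
    uint32_t bat_blocks;

    /* raw I/O hooks; defaults would wrap POSIX open/pread/pwrite/close */
    int     (*io_open) (const char *path, int flags, mode_t mode);
    ssize_t (*io_read) (int fd, void *buf, size_t len, off_t off);
    ssize_t (*io_write)(int fd, const void *buf, size_t len, off_t off);
    int     (*io_close)(int fd);

    /* compression hooks; defaults would wrap ZLIB */
    int (*compress)  (uint8_t *dst, uint32_t *dst_len,
                      const uint8_t *src, uint32_t src_len);
    int (*uncompress)(uint8_t *dst, uint32_t *dst_len,
                      const uint8_t *src, uint32_t src_len);
};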

The default compression library is ZLIB, which is free and available at <http://www.info-zip.org/pub/infozip/zlib>. ZLIB is very robust and well tested. It offers 10 levels of compression (0 = none, 9 = maximum). Although the default level is 6, I found that level 1 gives the best performance/compression tradeoff when using local storage (hard drive or CD-ROM). Level 1 is several times faster than level 6 (and an order of magnitude faster than level 9) and usually provides a space savings within 3-5% of level 6’s. If the application were using a slow data link (such as an NFS or SMB mount over a wide area network), the higher compression ratios might be justified.
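
ZLIB's one-shot API makes this kind of level comparison easy to reproduce. A minimal, self-contained probe using the documented compress2 call (compressBound is available in modern ZLIB releases); link with -lz:

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

int main(void)
{
    static Bytef src[64 * 1024];         /* mildly redundant test input */
    for (size_t i = 0; i < sizeof src; i++)
        src[i] = (Bytef)(i % 64);

    const int levels[] = { 1, 6, 9 };
    for (int k = 0; k < 3; k++) {
        uLongf dlen = compressBound(sizeof src);
        Bytef *dst = malloc(dlen);
        if (dst == NULL ||
            compress2(dst, &dlen, src, sizeof src, levels[k]) != Z_OK) {
            free(dst);
            return 1;
        }
        printf("level %d: %lu -> %lu bytes\n", levels[k],
               (unsigned long)sizeof src, (unsigned long)dlen);
        free(dst);
    }
    return 0;
}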

Adjusting the BAT and MAP table sizes is something of an art. One section of each table is kept in memory, as are two cluster buffers: one compressed and one uncompressed. The default values are: block size, 512 bytes; blocks per cluster, 64 (a cluster size of 32K); MAP table size, 1 block (128 entries); and BAT size, 32 blocks (4,096 entries).

Assuming zero compression, each MAP table entry will take (blocks/cluster) BAT entries. If either table runs out of entries, a new section must be started. In this example, one section is: 1 block (MAP table) + 32 blocks (BAT) + (128 * 32) (cluster data) = 4,129 blocks. It is better to run short of BAT entries than MAP table entries because running short of the latter requires creating a new section even though the cluster data area is not full, and the overhead for an extra MAP table is nominal (around 0.02% in this example).
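
The arithmetic is easy to verify; with the defaults, this prints the 4,129-block section size quoted above:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint32_t block_size     = 512;
    const uint32_t blocks_cluster = 64;   /* 32K clusters          */
    const uint32_t map_blocks     = 1;
    const uint32_t bat_blocks     = 32;
    const uint32_t entry_size     = 4;    /* bytes per table entry */

    uint32_t map_entries = map_blocks * block_size / entry_size;   /* 128  */
    uint32_t bat_entries = bat_blocks * block_size / entry_size;   /* 4096 */
    uint32_t data_blocks = map_entries * blocks_cluster;           /* 4096 */

    printf("MAP entries %u, BAT entries %u\n", map_entries, bat_entries);
    printf("blocks per section: %u\n", map_blocks + bat_blocks + data_blocks);
    return 0;
}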

comprio_seek

comprio_seek simply adjusts the next-read offset.

comprio_sync

comprio_sync flushes all dirty buffers. The flush order is important since flushing the data can modify both the MAP table and the BAT.

comprio_write

comprio_write takes care of alignment. There are three possible cases when writing: unaligned writes happen when the file pointer is not sitting at a cluster boundary; short writes occur when the amount of data written is less than one cluster; and the trivial case occurs when one full cluster is written at a cluster boundary. For the first two cases, the existing cluster must be read before writing can occur. This read will flush the cluster buffer if necessary. If the read is skipped (the trivial case), the flush must be done explicitly.
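
The case analysis reduces to one test per cluster touched. A sketch, with hypothetical stand-ins for the internal read and flush routines:

#include <stdint.h>
#include <string.h>

extern int cluster_read_into(uint32_t cluster, uint8_t *buf);  /* hypothetical */
extern int cluster_buf_flush(void);        /* flush buffered cluster if dirty */

/* Write len bytes at byte offset pos within one cluster. */
static int write_within_cluster(uint32_t cluster, uint32_t pos,
                                const uint8_t *src, uint32_t len,
                                uint8_t *cluster_buf, uint32_t cluster_size)
{
    if (pos == 0 && len == cluster_size) {
        /* trivial case: full aligned cluster, no read-modify-write, so
           the previously buffered cluster must be flushed explicitly */
        if (cluster_buf_flush() != 0)
            return -1;
        memcpy(cluster_buf, src, cluster_size);
    } else {
        /* unaligned or short write: read the existing cluster first
           (the read flushes the buffer if it is dirty), then splice */
        if (cluster_read_into(cluster, cluster_buf) != 0)
            return -1;
        memcpy(cluster_buf + pos, src, len);
    }
    return 0;   /* the buffer is now dirty and written out lazily */
}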

comprio_read

comprio_read must also take care of buffer alignment, but essentially does nothing more than read and copy.

Typical Performance

Using the ZLIB compression library on my database files, the compressed files were typically 65 percent as big as the originals. Compressed writes took about four times longer, and reads took about three times longer for a simple file copy operation. Here are some results for varying levels of ZLIB compression:

level   compressed size   relative speed
----------------------------------------
9       55.98%            1
6       56.27%            5.5
1       56.96%            13

where:

  • Level 9 achieves best compression, 6 is default compression, and 1 achieves best speed (13 times faster than level 9).
  • Percentages are compressed size/original size.

Potential Optimizations

As presented, the code has not been optimized in any way other than with very rudimentary caching. By profiling my applications, I have found that this housekeeping code consumes only a few percent of the total read/write time, but there are still a few things worth considering:

  • The free list is kept as a simple ordered list. This fact should be used to allow allocations of multiple contiguous blocks and keep allocated blocks in the same section as the associated MAP table.
  • No check is made for writing a cluster of all zeros. Since reads of an unwritten cluster (one with a MAP table entry of -1) return all zeros, an extra BAT entry (and data entry) could be saved in this case.
  • Cache multiple BAT sections: when allocating two BAT entries at a section boundary, some thrashing takes place. Assume section 0 is in memory, and the next allocation comes from section 1. First section 0 is written (if dirty), and then section 1 is read or initialized. If initialized, it must be flushed. The BAT entry number allocated is then written back to section 0. So, a single BAT entry allocation can require three writes and two reads.

Conclusion

Even in these days of multi-gigabyte disks, data continues to grow to fill all extra space, so basic file compression is still alive and well. By allowing random access into compressed files, this library simply allows the user to go one step further than simple file compression.

Note

[1] POSIX is a set of standards developed to support interoperability among different operating systems. One of the POSIX standards specifies a common set of APIs for accessing file systems. For more information on POSIX, see <http://anubis.dkuug.dk/JTC1/SC22/WG15/>.

Kyle York has a BA in Mathematics from the University of California in Santa Cruz. He has been programming for 20 years, most recently doing embedded development for Cisco Systems, and has published several articles dealing with MS-DOS and compression. His interests are IP multicast and tunneling protocols. In his spare time, he owns and operates a video rental store and plays with his wife Lynn and children, Alex and Adrianna. He can be reached at [email protected].

