Parallel

Algorithm Alley

By Ron Gutman, September 01, 2000

Ron presents techniques for exploiting the parallelism of bitwise operations to speed up some kinds of 64-bit computing tasks.

Sep00: Algorithm Alley

Exploiting 64-Bit Parallelism

Ron is a senior software engineer for AirFlash. He can be contacted at ron@ airflash.com.

Sixty-four bits happen. Intel's 64-bit Itanium processor is being sampled, and Java anticipates 64-bit architectures with its 64-bit integer. Besides performing 64-bit memory fetches, stores, and arithmetic, you can expect 64-bit processors to perform shift and bitwise operations 64 bits at a time. The 64-bit operations provide some parallelism -- twice as much as 32-bit CPUs offer -- to be exploited. In "The Fastest Sorting Algorithm?" (DDJ, February 2000), Stefan Nilsson presents an algorithm that uses bitwise operations to perform a merge sort. In this article, I'll present my technique for exploiting the parallelism of bitwise operations to speed up some other kinds of computing tasks.

Bit Reversal in log N Time

To illustrate, how about reversing the order of bits in a value? The obvious approach has execution times that are linear in the number of bits; that is, one iteration of a loop per bit. In contrast, the technique I describe here needs only log2(N) iterations for N bits. That means four iterations for 16 bits, five iterations for 32 bits, and six iterations for 64 bits. So, as the word size of the CPU becomes larger, the technique becomes more compelling.

Why would you reverse the order of bits in a 32-bit or 64-bit value? Bit reversal can be useful in a variety of contexts. It's useful in image processing for flipping a black-and-white image to create a mirror image. To flip an image horizontally, the pixels in a row of the image must be placed in reverse order. Fast 64-bit bit reversal can be used in the process of reversing a row of black-and-white pixels. Similarly, 64-bit bit reversal is useful for rotating a black and white image 180 degrees.

Because Java guarantees that a long integer is 64 bits long, all code examples I present here are static Java methods. The code for C functions would be similar, but if you want to take advantage of 64-bit CPU operations in C, your code would not be portable to all C platforms since not all support 64-bit values. The Java methods will work even when the underlying hardware does not provide 64-bit operations.

Implementations of Bit Reversal

The reverse method in Listing One reverses the order of bits in a 64-bit value. This simple method reverses one bit per loop iteration. (C programmers: The >>> operator is just a right shift without sign extension; bits vacated on the left are filled with zeros. This is a workaround for the absence of an unsigned integer in Java.)

Listing Two is another implementation of reverse that uses a divide-and-conquer approach. It uses a recursive method, reversen(V,n), presented as pseudocode in Example 1. The method breaks n bits of V into an upper and lower part, reverses each part, then puts the two parts back together in reverse order.

As with all the recursive calls, this algorithm is very slow. What's important is the way it subdivides the problem. For 64 bits, seven levels of recursion are needed. At the top level, n=64. Below that are levels where n=32, 16, 8, 4, 2, and 1. A few bitwise operations can perform all of the reversals of a single level in parallel. For example, the following Java code performs all of the reversals for n=8:

bits = ((bits&0x0f0f0f0f0f0f0f0fL) << 4) |

((bits&0xf0f0f0f0f0f0f0f0L) >>> 4);

At this level, 4-bit groups exchange places with adjacent 4-bit groups; some groups are shifted left, some are shifted right. All that's needed to perform the entire reversal are six statements like the one above -- one statement for each level of recursion, except the bottom level, which only serves to terminate the recursion. Listing Three shows the bit parallel version of the reverse method. In effect, the recursion has been unrolled into six inline statements. The result is very fast, about 10 times faster than the linear method of Listing One.

To be fair, a table could be used to speed up the linear approach by reversing many bits at a time with a single table lookup. But there are complications. For one, a table must be initialized. Listing Four shows an implementation that uses a table lookup to reverse 8 bits at a time. This time, the reverse method is not a static method. It must be called using an object of the BitTable class like this:

BitTable table = new BitTable();

long result = table.reverse(bits);

Forcing the use of the constructor ensures that the table is initialized.

The extra effort does not provide any speed up compared to the bit parallel approach. In fact, I timed the bit parallel approach at 30 to 50 percent faster for 64 bits.

If 128-bit processors ever come into being, then the table approach will take twice as long as for 64 bits. In contrast, the bit parallel approach will require one more line of code and take 7/6 times as long as for 64 bits; it would leave the table approach in the dust. The table approach could counter by doing 16 bits at a time, but at the cost of a much larger table. This shows how a log N approach eventually prevails over a linear approach.

Bit Counters

In a previous job, I found a need to count the number of an integer's bits set to 1. I was writing code to manage cache slots for data representing the road network for automobile navigation. A 32-bit integer served as a bitmap for the cache slots. Each slot could be allocated to a disk block. Each cached disk block could participate in one or more regions of the road network which I called "neighborhoods." A neighborhood in the cache was represented by a 32-bit bitmap. When new cache slots were needed for a new neighborhood, I wanted to choose them in such a way as to keep the most recently used neighborhoods in the cache. I used a sequence of bitwise operations on 32-bit bitmaps to produce bitmaps each representing a possible solution, that is, which slots to free up and which to leave alone. I needed to count bits in these bitmaps to determine whether enough cache slots would be freed for the new neighborhood.

Bitmaps are a common technique and, no doubt, the utility of counting the bits in them arises in contexts more easily described than my neighborhood caching. Some relational databases use bitmaps to index data that satisfy given conditions. Bit counting could be used to respond to SQL queries that ask for a count of records satisfying the conditions.

The count method in Listing One shows the linear method for counting bits. It's short, but not fast.

At first, it seems that a bit parallel approach cannot be applied here. However, the recursive approach can be easily modified to count bits; see Example 2. But how can bitwise operations total the bit counts in parallel?

The solution hinges on the order in which the six levels are handled. The six statements in the bitwise reverse method can be performed in an order reversed from that shown in Listing Three. In fact, they can be performed in any order and still yield the same results. The bit counting function, in contrast, cannot be performed in the order from the top level to the bottom level, but it can be performed in the reverse order, bottom to top. Listing Three shows how. Before the first statement of count in Listing Three executes, each bit is its own count. If a bit is 1, there is 1 bit set at that bit position. If it's 0, there are 0 bits set. The first statement aggregates the counts for individual bits into counts for pairs of bits. The second statement aggregates the counts for pairs of bits into counts for 4-bit groups. This continues until the last statement, which adds the count for the upper 32 bits with the count for the lower 32 bits to yield the grand total.

The parallel addition works because, at each stage, a count of bits cannot overflow into the next count. The count for 2 bits, for example, is at most 2 and never overflows into a 3rd bit.

Morton Keys

The mortonKey method in Listing Three computes Morton keys. (For more information on Morton keys, see my article "Space-Filling Curves in Geospatial Applications," DDJ, July 1999, where I explain how they can be used in spatial indexing. )

Morton keys are computed from the coordinates of a point. If the point lies in a two-dimensional xy plane, the Morton key is computed by interleaving the bits of the x- and y-coordinates. The mortonKey and spreadBits methods in Listing Three show how to use a bit parallel approach to compute a Morton key from two 32-bit values, x and y. The actual interleaving is performed by mortonKey with just a bitwise OR operation. The hard work is performed by spreadBits, which takes a 32-bit value and spreads its bits out into the even numbered bits of a 64-bit value. After doing this to x and y, it only takes a shift and a bitwise OR to interleave them.

Figure 1 shows the masks used to spread 16 bits farther apart with each iteration of the algorithm until they occupy the even-numbered bits of a 64-bit value.

Conclusion

I've described a technique that uses the parallelism of bitwise operations to turn some computing tasks requiring linear time into computing tasks requiring only log N time. If you discover new ways to apply this bit parallel technique, it would be fun to hear about them, so drop me an e-mail.

DDJ

Listing One

public class BitLinear  {
   public static long reverse (long bits) {
        long rl = 0;
        for (int i = 0; i < 64; i++) {
           rl = (rl << 1) + (bits & 1);
           bits = bits >>> 1;
        } 
        return rl; 
   }   
   public static int count (long bits) {
        int cnt = 0;
        while (bits != 0) {
            cnt += bits & 1;
            bits = bits >>> 1;
        } 
        return cnt;
   } 
}

Back to Article

Listing Two

public class BitRecursive
{
   // reverse leftmost n bits of V
   static long reversen (long V, int n) {
        if (n <= 1)
            return V;
        int n2 = n/2;
        // reverse rightmost n/2 bits
        long right = reversen( V & ((1L<<n2)-1), n2); 
        // reverse lefttmost n/2 bits
        long left =  reversen( V >>> n2, n2); 
        // combine in reverse order
        return (right << n2) | left;                  
   } 
   public static long reverse (long bits) {
        return reversen (bits, 64);
   } 
}

Back to Article

Listing Three

public class BitLogN {
   public static long reverse (long bits) {
       // >>> fills bits on the left with 0 (no sign extension)
       bits = ((bits&0x00000000ffffffffL) <<  32) | 
              ((bits&0xffffffff00000000L) >>> 32);
       bits = ((bits&0x0000ffff0000ffffL) <<  16) |
              ((bits&0xffff0000ffff0000L) >>> 16);
       bits = ((bits&0x00ff00ff00ff00ffL) <<   8) |
              ((bits&0xff00ff00ff00ff00L) >>>  8);
       bits = ((bits&0x0f0f0f0f0f0f0f0fL) <<   4) | 
              ((bits&0xf0f0f0f0f0f0f0f0L) >>>  4);
       bits = ((bits&0x3333333333333333L) <<   2) |
              ((bits&0xccccccccccccccccL) >>>  2);
       bits = ((bits&0x5555555555555555L) <<   1) |
              ((bits&0xaaaaaaaaaaaaaaaaL) >>>  1);
       return bits; 
   } 
   public static int count (long bits) {
       bits = (bits&0x5555555555555555L) +
             ((bits&0xaaaaaaaaaaaaaaaaL) >>>  1);
       bits = (bits&0x3333333333333333L) +
             ((bits&0xccccccccccccccccL) >>>  2);
       bits = (bits&0x0f0f0f0f0f0f0f0fL) +
             ((bits&0xf0f0f0f0f0f0f0f0L) >>>  4);
       bits = (bits&0x00ff00ff00ff00ffL) +
             ((bits&0xff00ff00ff00ff00L) >>>  8);
       bits = (bits&0x0000ffff0000ffffL) +
             ((bits&0xffff0000ffff0000L) >>> 16);
       bits = (bits&0x00000000ffffffffL) +
             ((bits&0xffffffff00000000L) >>> 32);
       return (int) bits;
   } 
   public static long mortonKey (int x, int y) {
       /* In C++, the calls to spreadBits could be made in-line    */
       /* to avoid function call overhead.                         */
       /* In C, make the function a macro (admittedly an ugly one) */
       return (spreadBits(x) << 1) | spreadBits(y);
   } 
   // For j = 1 to 31, shift bit j j positions to the left 
   static long spreadBits (int i) {
       long bits = i;
       // shift bits 16 to 31 16 bits
       bits = (bits & 0x000000000000ffffL) |
             ((bits & 0x00000000ffff0000L) << 16);
       // shift originally odd-numbered bytes 8 bits
       bits = (bits & 0x000000ff000000ffL) |
             ((bits & 0x0000ff000000ff00L) <<  8);
       // shift originally odd-numbered nibbles 4 bits
       bits = (bits & 0x000f000f000f000fL) |
             ((bits & 0x00f000f000f000f0L) <<  4);
       // shift originally odd-numbered bit pairs 2 bits
       bits = (bits & 0x0303030303030303L) |
             ((bits & 0x0c0c0c0c0c0c0c0cL) <<  2);
       // shift originally odd-numbered bit pairs 1 bits
       bits = (bits & 0x1111111111111111L) |
             ((bits & 0x2222222222222222L) <<  1);
       return bits;
   } 
}

Back to Article

Listing Four

public class BitTable {
   short[] table = new short[256];
   public BitTable() {
       BitLinear lin = new BitLinear();
       for (int i = 0; i < 256; i++) {
           table[i] = (short) (lin.reverse(i) >>> 56);
      } 
   } 
      public long reverse (long bits) {
       long rl = 0;
       rl =             table[(int)(bits & 255)]; bits = bits >>> 8;
       rl = (rl << 8) | table[(int)(bits & 255)]; bits = bits >>> 8;
       rl = (rl << 8) | table[(int)(bits & 255)]; bits = bits >>> 8;
       rl = (rl << 8) | table[(int)(bits & 255)]; bits = bits >>> 8;
       rl = (rl << 8) | table[(int)(bits & 255)]; bits = bits >>> 8;
       rl = (rl << 8) | table[(int)(bits & 255)]; bits = bits >>> 8;
       rl = (rl << 8) | table[(int)(bits & 255)]; bits = bits >>> 8;
       rl = (rl << 8) | table[(int)(bits & 255)];           
       return rl; 
   }  
}

Back to Article

1 2 3 4 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Parallel

Algorithm Alley

Exploiting 64-Bit Parallelism

Bit Reversal in log N Time

Implementations of Bit Reversal

Bit Counters

Morton Keys

Conclusion

Listing One

Listing Two

Listing Three

Listing Four

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Parallel Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Parallel

Algorithm Alley

Exploiting 64-Bit Parallelism

Bit Reversal in log N Time

Implementations of Bit Reversal

Bit Counters

Morton Keys

Conclusion

Related Reading

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Parallel Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content