Practical Lock-Free Buffers

Tools
  • Print
  • Email
Lock-free systems enjoy freedom from deadlocks
Sarmad is a Senior Software Engineer at PalmChip Pakistan (Pvt.) Limited and an M.Phil of FAST-NU Pakistan. He can be contacted at m_sarmad_asgher@yahoo.com.


Lock-free programming frees you from system failure by tolerating individual process failures. How can a process failure cause system failure? When a process fails while holding a lock, other processes in the system requiring that lock have to wait indefinitely. On the other hand, lock-freedom (or "non-blocking behavior") guarantees that the system makes progress when some process takes few steps [1]. By definition, lock-free systems enjoy freedom from deadlocks. This property of lock-free systems can be handy in applications like API monitoring. Many malware analysis tools monitor Win32 APIs. These tools [2] work by injecting code into monitored process and use Interprocess communication (IPC) to log API calls [3]. If this IPC mechanism uses locks, then it can result in deadlock. In this case, the probability of deadlock is real because execution of monitored process can halt while some thread may be logging an API call. Scenarios which increase the likelihood of deadlock include:

  • The monitored process is killed by users.
  • The monitored process exits without waiting for all the threads to terminate.
  • A thread in the monitored process is killed by the process itself.

If lock-free data structures are used for IPC, then the system can tolerate process/thread halts. In this article, I present a multi-producer/single-consumer lock-free buffer which can be used for IPC. (The complete source code and related files are available here.) In this particular situation, the consumer is the monitoring tool which logs API calls, whereas producers are the monitored processes which produce API call records. To simplify matters, I assume consumer does not fail. This lock-free buffer is different from usual circular buffers commonly used. A circular buffer uses in and out pointers to track free and used items, whereas this lock-free buffer uses a special value to track free items. This special value is never used as data. This simplifies the problem because you don't have to worry about updating in and out pointers atomically. The declaration of the buffer is:

const int MAX_SIZE=255; int buffer[MAX_SIZE]={0};

This declares a buffer of integer items. You might be wondering how integer slots can store API call records. I build on this buffer of integers and construct a buffer which can hold strings. Returning to the integer buffer, an item with a zero value indicates a free slot. All the items in the buffer are initialized to zero. Insertion is straightforward. The insert operation iterates through the items to find an empty one, then tries to put an integer value using atomic primitive CAS. (For description of CAS, see the sidebar "CAS and ABA".) Note that writing lock-free code requires care to avoid potential pitfalls see [4] for details.

int Insert(int data) { int i=0; for(i=0;i<'MAX_SIZE;i++) { if(CAS(&buffer[i], data, 0)) return i; } return -1; }

There is another handy insert operation on the buffer which tries to put the integer value at a particular index:

BOOL InsertAt(int data,int index) { if(CAS(&buffer[index], data, 0)) return TRUE; return FALSE; }

Similarly, Delete operation iterates through the buffer to find an inserted item and tries to take it out using CAS:

int Delete(int *data) { int i=0; for(i=0;i<MAX_SIZE;i++) { *data=buffer[i]; if(*data!=0) { if(CAS(&buffer[i],0, *data)) return i; } } return -1; }

This implementation is lock-free because no process owns any slots in the buffer which means a process failure does not result in loss of any slots. As the buffer slots remain intact it is always possible to put an item if the buffer is not full. The ABA problem (see the sidebar titled "CAS and ABA") commonly faced by lock-free data structures miraculously does not cause any problems. Consider a scenario: First process reads the buffer slot to be empty, but is preempted by second process which inserts an item in that slot. This is followed by a third process which takes out that item. This in turn is followed by the resumption of the first process which finds the slot to be empty. But it does not matter if first process inserts an item in that slot. It shows that ABA problem is not an issue in this case.

CAS and ABA

CAS takes three arguments: the address of a memory location, a new value, and an expected value. If and only if the memory location holds the expected value, the new value is written to it, atomically. A Boolean return value indicates whether the write occurred:

CAS(address, new value, expected value)
    if(*address != expected value)
        return false;
    }
    else{
        *address = new value;
        return true;
    }

According to Maged Micheal in "Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects" (see "The ABA Problem"):

ABA problem occurs when a thread reads a value A from a shared location, and then other threads change the location to a different value, say B, and then back to A again. Later, when the original thread checks the location, e.g., using read or CAS, the comparison succeeds, and the thread erroneously proceeds under the assumption that the location has not changed since the thread read it earlier. As a result, the thread may corrupt the object or return a wrong result.

—SA

How you can build on this integer buffer to put strings? You can put individual characters one at a time into the integer buffer. The immediate question is how to separate a character of one string from a character of another string. A simple answer is to use unique Message Number for a string. The next question is how the consumer can link these characters to form the string again. This is tricky. A simple approach can be to put index of the character in the buffer along with message number and actual character. However, this approach has a serious drawback. It will limit the length of string which can be transferred through the buffer. The approach used to overcome this limitation is to find a free slot to put the first special character along with the relative offset of the second character in the lock-free buffer. This is followed by putting the second character at the announced index along with offset of the third character and so on. This allows unlimited size strings to be transferred through the buffer. See the TransferString function in Listing 1 for complete implementation.

volatile unsigned int msgCount=0;//Number of Unique Message Numbers In Use const int MAX_NEIGHBOURS=255;// consecutive characters max distance in buffer BOOL TransferString(char *str) { unsigned int msgNumber=0; int i=0; int data; int insertedAt=-1; int nextInsertAt=-1; //Following loop tries to get a unique message number to transfer string. do { msgNumber=msgCount; if(msgNumber+1>0xFFFF) { return FALSE; } } while(!CAS(&msgCount, msgNumber+1, msgNumber)); msgNumber=msgNumber+1;//Unique Message Number //pack the message number and start of string marker into data data=msgNumber; data=data<<16; data=data|255; //Following loop finds a free slot to transfer the start of string marker while(insertedAt==-1) insertedAt=FindFreeSlot(); //Following loop finds a free slot in close vicinity of the previous slot while(nextInsertAt==-1) nextInsertAt=FindFreeSlot(insertedAt,MAX_NEIGHBOURS); //pack the information about where the next character is expected. data=data|(nextInsertAt<<8); //Following loop inserts the packed data at previously decided index while(!InsertAt(data,insertedAt)); //Calculate the absolute location in the buffer for the next character insertedAt=(insertedAt+nextInsertAt)%MAX_SIZE; //Following loop transfers all the characters in the string except null character. for(i=0;str[i]!=0;i++) { //Pack and put the messange number the current character and the offset data=msgNumber; data=data<<16; data=data|str[i]; while(nextInsertAt==-1) nextInsertAt=FindFreeSlot(insertedAt,MAX_NEIGHBOURS); data=data|(nextInsertAt<<8); while(!InsertAt(data,insertedAt)); insertedAt=(insertedAt+nextInsertAt)%MAX_NEIGHBOURS; } //Pack the message number and character and offset 0 to communicate string end. data=msgNumber; data=data<<16; while(!InsertAt(data,insertedAt)); return TRUE; } //Find free slot at max "range" away from "startIndex " and return the offset. int FindFreeSlot(int startIndex,int range) { int i; //find free slot onwards from start index and return the relative offset if found for(i=startIndex+1;i<MAX_SIZE&&i<startIndex+range;i++) { if(buffer[i]==0) return i-startIndex; } if(startIndex+range>=MAX_SIZE) { //find free slot from 0 index onward but before startIndex and in range for(i=0;i<startIndex&&i<=(startIndex+range)%MAX_SIZE;i++) { if(buffer[i]==0) return i+(MAX_SIZE-startIndex); } } return -1; } //Find first free slot int FindFreeSlot() { int i; for(i=0;i<MAX_SIZE;i++) { if(buffer[i]==0) return i; } return -1; } //Maximum Number of threads in the system (this limitation is in demo version) const int THREAD_COUNT=4; //Maximum Number of characters in a string (this limitation is in demo version) const int MAX_STRING_SIZE=255; //tempBuffer array which holds the integers deleted from the integer buffer int tempBuffer[THREAD_COUNT*MAX_STRING_SIZE]={0}; //deleteIndexes array to hold index from which item was taken out of the integer buffer int deleteIndexes[THREAD_COUNT*MAX_STRING_SIZE]={0}; // tempBufferIndex points to next free slot in tempBuffer as well as deleteIndexed. int tempBufferIndex=0; // strings is a two dimensional array to hold string transferred by each thread char strings[THREAD_COUNT][MAX_STRING_SIZE]={0}; // stringIndexes [i] tells where to put the next character in strings[i] int stringIndexes[THREAD_COUNT]={0}; // nextAt[i] tells where the next character is expected to be inserted by thread i in the //integer buffer int nextAt[THREAD_COUNT]={0}; //This function tries to retrieve the strings transferred by many threads in an infinite loop void RetrieveString() { int data; int i=0; //Initialize the nextAt array to indicate no characters currently seen for(i=0;i<THREAD_COUNT;i++) nextAt[i]=-1; //Following lines of code deletes the first item from the buffer and stores it locally deleteIndexes[tempBufferIndex]=-1; while(deleteIndexes[tempBufferIndex]==-1) deleteIndexes[tempBufferIndex]=Delete(&data); tempBuffer[tempBufferIndex]=data; //Following loop goes on infinitely while(1) { deleteIndexes[tempBufferIndex+1]=-1; while(deleteIndexes[tempBufferIndex+1]==-1) { deleteIndexes[tempBufferIndex+1]=Delete(&data); //Try to rearrange already transferred characters RearrangeString(); } //put the deleted item from integer buffer into array local to consumer. tempBuffer[tempBufferIndex+1]=data; //advance the local buffer index tempBufferIndex++; } }

//This function tries to rearrange already transferred characters void RearrangeString() { int i; unsigned int msgNumber; int data; unsigned int charMask=255; int ch; for(i=0;i<=tempBufferIndex;i++) { data=tempBuffer[i]; if(data==0) continue; msgNumber=data>>16;//unpack the message number ch=data&charMask;//unpack the character if(ch==255) {//if it is start of string marker //unpack the next character offset. Calculate the absolute index at which next character is expected nextAt[msgNumber-1]=(deleteIndexes[i]+((data&0xFF00)>>8))%MAX_SIZE; //initialize the string index stringIndexes[msgNumber-1]=0; tempBuffer[i]=0; } else if(ch==0) {//if it is end of string marker //if it is the next character in the message if(nextAt[msgNumber-1]==deleteIndexes[i]) { //null terminate the string strings[msgNumber-1][stringIndexes[msgNumber-1]]=0; tempBuffer[i]=0;//remove item from tempBuffer printf("%s\n",strings[msgNumber-1]);//print it } } else { //if it is the next character in the message if(nextAt[msgNumber-1]==deleteIndexes[i]) { //form the string by storing the next character strings[msgNumber-1][stringIndexes[msgNumber-1]]=ch; stringIndexes[msgNumber-1]++;//increament index //unpack the next character offset and calculate absolute one nextAt[msgNumber-1]=(nextAt[msgNumber-1]+((data&0xFF00)>>8))%MAX_SIZE; tempBuffer[i]=0;//remove item from tempBuffer } } } }

Listing 1

This TransferString function implementation builds on integer buffer. This implementation assumes that integer size is 4 byte on the target platform. The 4 bytes of an integer item which is put into the buffer holds the following information (see Figure 1):

  • The lowest order byte call it byte number zero holds the character to be transferred.
  • The byte number 1 immediately next to byte number zero holds the relative offset of next character from the current one in the buffer.
  • Finally, the byte number 2 and 3 hold a 2-byte message number.

Figure 1: Composition of an item of the Integer buffer.

Again, there is a single consumer. This simplifies the problem because concurrency is not an issue. I am going to discuss a demo consumer implementation here. The consumer calls Delete function on the integer lock-free buffer in an infinite loop. The consumer remembers the index from which the item was taken out of the buffer as well as the integer data itself. In the infinite loop it calls RearrangeString function which loops through those remembered items and checks three conditions:

  1. Integer value contains the first special character to indicate start of string
  2. Integer value contains the last character to indicate the end of string
  3. Integer value is the next character of the string

In the first case, consumer calculates the index of the next character in the buffer and remembers it to check later on that an item with the same message number is really the next item in the sequence. In case 2 consumer prints out the string. In all three cases the consumer forgets this remembered item so that the same character is not repeated again. This approach works even in case when two nonconsecutive characters of the same string are inserted at the same index of the buffer because it is not possible to insert an item at an in use index. This means an item inserted first at the index is removed first. This result in RearrangeString function correctly linking the item inserted first at the same index. For a complete implementation of RearrangeString function, see Listing 1.

There are 0xFFFF possible message numbers which is very limiting for this implementation to be used for IPC in malware analysis tool. To make this implementation practical the same message number can be used again by the same thread in a process to communicate as many API calls as it makes. This implies that the resulting implementation can have 0xFFFF threads communicating API calls simultaneously. In other words, this lock-free buffer can tolerate failure of 0xFFFF threads. The system is not in dead-lock even if all message numbers are consumed. The new threads can continue their execution with out logging the API calls. No requirement of memory management is another advantage of this implementation. Although the demo implementation of consumer assumes maximum number of threads at compile time, however this technique has no such limitation. There can be any number of threads in the system.

Lock-freedom has the potential to make a difference. It invites you to think about new ideas and change the status-co in lock-free paradigm. You can start by thinking about ways to avoid the limitation on number of thread failures this implementation of lock-free buffer tolerates.

Acknowledgment

Thanks to Shafiq-ur-Rehman and Bilal Hashmi for guiding me on concurrency and lock-freedom related issues.

References

[1]M. Herlihy, "Wait-free synchronization," ACM Transactions on Programming Languages and Systems, vol. 11, no. 1, pp. 124"149, Jan. 1991.

[2]FireEye Anti-Malware and Anti-Botnet Security tool is an example. (http://www.fireeye.com/products/index.html).

[3]Details of API monitoring are out of scope of this article. See Detours for an example. (hhttp://research.microsoft.com/en-us/projects/detours/)

[4]H. Sutter. "Lock-Free Code: A False Sense of Security" (Dr. Dobb's, September 2008). (www.ddj.com/cpp/210600279)

[5]Michael, M.M. "Hazard pointers: Safe memory reclamation for lock-free objects," IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 8, Aug. 2004.

Real World Parallelism Webinar Series
  • November 17, 2009
    Visual Effects for Animation - presented by DreamWorks Animation
    Speaker: Ron Henderson (Bio)

    Ron Henderson manages the FX Tools group at DreamWorks Animation, where he is responsible for developing physical simulation and procedural modeling tools. These systems have been used for key visual effects in recent films such as Kung Fu Panda and Monsters vs. Aliens (March 2009).

    Prior to joining DreamWorks in 2002 he was a senior scientist at Caltech with a joint appointment to the Applied Math and Aeronautics departments, where he worked on efficient techniques for the direct numerical simulation of fluid turbulence.

    Abstract:
    In this webinar, Ron Henderson will show examples of visual effects, from hair and feathers to smoke and fire, from a variety of DreamWorks Animation feature films. He will discuss in general terms the kinds of techniques used to achieve particular visual effects. Finally, Henderson will show a detailed breakdown of the dam-breaking scene from Madagascar: Escape 2 Africa, demonstrating how different elements of key frame animation, simulation, and rendering are combined in a real production shot.

  • December 1, 2009
    A Quick and Easy Way to Parallelize a Legacy Codebase with Intel® Threading Building Blocks (TBBs)
    Speaker: Bernard Laberge, Avid, Senior Principal Engineer (Bio)

    Bernard Laberge is a senior principal engineer in the video editors division at Avid. During his seven years with the company he has been actively involved in the replacement of the legacy video processing engines used by Avid editors with a common hardware-abstracted, component-based video processing engine currently running on the CPU with SIMD optimized code, GPU, and dedicated hardware.

    Abstract:
    Learn how to overcome the limitations of a thread-based scheduler, including dealing with the absence of recursive parallelism support and the inefficient handling of unbalanced processing load. Bernard Laberge addresses how Avid resolved the expensive refactoring of their thread-based scheduler into a task-based solution by choosing Intel® Threading Building Blocks (TBBs). He explores how Avid was able to easily integrate the Intel TBBs into their video editor applications and more than 5 million lines of code.

  • December 15, 2009
    How to Use Intel® Parallel Studio to Streamline Code Development in a Multicore Environment
    Speaker: Matt Dunbar, Director for Performance Technology, SIMULIA (Bio)

    Matt Dunbar is the director for performance technology at SIMULIA. Since joining the company in 1993, he has worked on parallelization of the Abaqus suite of products, initially for shared memory architectures and more recently for distributed memory architectures. Dunbar has also been intimately involved in selecting both the hardware and software tools used in the development of the Abaqus product line.

    Abstract:
    Resolve elusive, costly multithreading errors quickly and efficiently with Intel® Parallel Studio. While many coding problems that lead to bugs in software applications are typically straightforward logic errors, errors in managing memory and in multithreading code can sometimes take weeks to months to diagnose and fix. Matt Dunbar explores how and why taking advantage of multicore processors through multithreaded code is critical for compute-intensive applications. While spotlighting his work on SIMULIA's Abaqus finite element solver, Dunbar addresses the need for multicore execution and shares his experiences using Intel Parallel Studio to streamline code development in a multicore environment.