Checkpointing Multithreaded Programs


Dr. Dobb's Journal, August 2002

Christopher and Boleslaw are professors in the Department of Computer Science at Rensselaer Polytechnic Institute. They can be contacted at [email protected] and [email protected], respectively.


Checkpointing is the process of grabbing snapshots of running programs. The most common use of checkpointing is in fault-tolerant computing, where the goal is to minimize the loss of CPU cycles when a long-running program crashes before completion. By checkpointing a program's state at regular intervals, the amount of lost computation is limited to the interval from the last checkpoint to the time of the crash.

Another common use of checkpointing is in the synchronization of optimistic parallel discrete-event simulations, often referred to as "state saving." In these systems, out-of-order event computations are possible because the synchronization strategy employs a detection-and-recovery scheme. To support the recovery operation, the state of each simulation object must be recorded as it is modified. A number of checkpoint algorithms have been developed to reduce the amount of memory required for checkpointing, including infrequent state saving, incremental state saving, and reverse computation.

Our interest focuses on the fast, efficient checkpointing of threaded programs that execute on shared-memory platforms running Linux. What motivated us to examine this topic were problems we encountered in developing parallel simulation and computation synchronization methodologies. This led us to a new algorithm for checkpointing large-scale, shared-memory multithreaded programs. Although implemented on Linux, the core idea and algorithmic approach are general enough for any operating system.

LinuxThreads

Multithreaded programs are typically implemented on Linux using threading packages such as LinuxThreads (http://pauillac.inria.fr/~xleroy/linuxthreads/) or Next Generation POSIX Threading (http://oss.software.ibm.com/pthreads/). Developed by Xavier Leroy, the LinuxThreads package is an implementation of the POSIX 1003.1c Pthread interface that provides the appearance of kernel-level threads by realizing each thread as a separate UNIX process sharing the same address space with all other threads. Scheduling between threads is handled by the kernel scheduler, just like scheduling between UNIX processes.

A drawback of LinuxThreads is that each thread is realized as a full kernel process. This rules out threading models in which many user-level threads are bound to a particular kernel-level process or thread; the SGI, Sun, and IBM UNIX variants currently support such models. Other problems with LinuxThreads include a different process identifier for each thread and the package's internal use of user-defined signals, which prevents programs that need both threading and user-defined signals from operating cleanly.

Threads in LinuxThreads are created via the clone system call, which lets processes be created so that they can share resources at a variety of different levels. In particular, a process and its child can be configured to share (or not) virtual memory, filesystem information, file descriptors, and signal handlers.
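
As an illustration of this flexibility, the following user-level sketch creates a thread with clone that shares virtual memory, filesystem information, file descriptors, and signal handlers with its parent, roughly the combination a package such as LinuxThreads requests; the stack size and exact flag set here are our own illustrative choices, not the library's.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

static int thread_fn(void *arg)
{
    (void)arg;
    printf("child: running in the shared address space\n");
    return 0;
}

int main(void)
{
    /* share memory, filesystem info, file descriptors, and signal
       handlers; deliver SIGCHLD to the parent when the child exits */
    int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD;
    size_t stack_size = 64 * 1024;              /* illustrative size */
    char *stack = malloc(stack_size);
    if (stack == NULL)
        return 1;
    /* the stack grows downward on x86, so pass its top to clone */
    pid_t pid = clone(thread_fn, stack + stack_size, flags, NULL);
    if (pid == -1)
        return 1;
    waitpid(pid, NULL, 0);                      /* reap the child */
    return 0;
}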

When the first thread is created in LinuxThreads, a thread manager process is instantiated. The manager spawns the new thread using the clone system call, then waits for further thread-creation requests and performs other thread-management functions.

The clone system call can be used to checkpoint threaded programs, but doing so requires modifications to the LinuxThreads library. In particular, the Pthread manager would have to be modified to clone itself such that none of its previous resources are shared by threads. This newly cloned thread manager would then create new threads that share resources with it. The members of the cloned thread group and their respective parent threads would then have to coordinate the transfer of thread-specific state, such as thread stacks, across two address spaces. The stacks could be reproduced by having each parent thread call setjmp to save its stack context and having the corresponding child thread call longjmp using the stack context set by the parent's call to setjmp. Because thread stacks are realized in the heap space of the thread manager, stack copying could be avoided; however, there is a performance penalty associated with these additional system calls and the required thread synchronization.
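
The stack handoff might look like the following sketch, which is our illustration of the idea rather than code from LinuxThreads; it presumes the child's copy of the stack memory sits at the same addresses as the parent's, so the saved context remains valid:

#include <setjmp.h>

jmp_buf saved_ctx;    /* stack context captured by the parent thread */

void parent_capture(void)
{
    if (setjmp(saved_ctx) == 0) {
        /* context captured; hand saved_ctx to the cloned manager,
           which forwards it to the corresponding child thread */
    } else {
        /* a child called longjmp: execution resumes here on the
           child's identical copy of the stack */
    }
}

void child_resume(void)
{
    longjmp(saved_ctx, 1);   /* jump to the parent's saved context */
}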

An additional disadvantage of using clone for checkpointing is that the implementation would be tied to LinuxThreads, and, as noted, other thread packages are available under Linux. Moreover, the LinuxThreads package is mated to the GNU C Library, glibc, and upgrading or modifying the local version of glibc is difficult and must be done with care: Because shared libraries are involved, such modifications risk breaking every binary in the system (the new shared library may no longer be compatible with the version your binaries were linked against). Since an OS checkpoint system call is not tied to a specific thread package, it is ultimately easier to implement and support.

System Calls and Process Creation

On the x86 port of Linux, system calls are invoked via software interrupt 0x80. Internally, the OS keeps a jump table that maps each system-call number to the address of the code implementing that call.
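
For a concrete (if dated) picture of the mechanism, the following x86 sketch bypasses the C library and issues the getpid system call directly: the call number from <asm/unistd.h> is loaded into the EAX register, interrupt 0x80 is raised, and the kernel returns the result in EAX. The wrapper name is ours.

#include <stdio.h>
#include <asm/unistd.h>            /* __NR_getpid: the system-call number */

static long raw_getpid(void)
{
    long ret;
    __asm__ volatile ("int $0x80"            /* trap into the kernel */
                      : "=a" (ret)           /* result comes back in EAX */
                      : "0" (__NR_getpid));  /* call number goes in EAX */
    return ret;
}

int main(void)
{
    printf("pid via int 0x80: %ld\n", raw_getpid());
    return 0;
}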

Because the invocation of a system call is architecture specific, all top-level system-call handler routines reside in the architecture-specific (ASM) code directory. By convention, all system-call handlers carry the sys_ prefix; for example, the fork system call handler is sys_fork. These handlers then typically invoke a more general routine that is not architecture specific, prefixed with do_. In the case of process creation, the general handler for all flavors of process creation (fork, vfork, or clone) is do_fork.

Within a system call, the kernel provides access to the calling process's control block, or task structure, through the current macro, and to the CPU register state, which is passed as an argument. The current macro returns a pointer to the task structure of the process that invoked the system call; more generally, current is the process that has control of the CPU. With multiple CPUs, each CPU is running a different process, so current differs across CPUs. The process control block records scheduling priority and policy, memory management, the current processor, list pointers used to place the process in a wait or run queue, signal-handler state, filesystem information, interprocess communication information, and process-specific statistics such as CPU usage. In Linux, this is the task_struct structure. Within task_struct, process-specific memory-management data is encapsulated in its own structure, mm_struct, which describes the mapped address space of the process. Thus, switching a process from one mm_struct to another changes its execution address space. We use this feature to cleanly implement our new system call. (These structures are defined in /usr/src/linux/include/linux/sched.h.)
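
The following heavily condensed sketch, our abbreviation of the 2.4 definitions rather than the real structures, shows just the fields the checkpoint algorithm manipulates:

#include <sys/types.h>                 /* pid_t */

typedef struct { volatile int counter; } atomic_t;   /* as in 2.4 */

struct mm_struct {
    atomic_t mm_users;                 /* tasks sharing this address space */
    /* page tables, vm_area lists, the mmap semaphore, segment info ... */
};

struct task_struct {
    volatile long state;               /* runnable, sleeping, stopped ... */
    struct mm_struct *mm;              /* the task's address space */
    struct mm_struct *active_mm;       /* the mm the CPU is currently using */
    pid_t pid;
    /* scheduling policy and priority, queue links, signal state,
       filesystem and IPC information, CPU-usage statistics ... */
};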

The Checkpoint Algorithm

Our approach leverages the existing copy-on-write capability of virtual memory by introducing a new checkpoint system call. The new call is similar to the fork and clone system calls; the primary difference is that checkpoint considers all processes that are part of a multithreaded program. The algorithm works by creating a rendezvous of all threads inside the kernel. The rendezvous guarantees that the checkpoint is consistent: No copy of the address space is made until all threads have entered the system call and ceased all user-level execution.

Once all threads of a program are inside the system call, the thread with the smallest process identifier is made the parent, or master, thread. The parent thread creates a new mm_struct that is a copy of its own memory-management structure, then makes the new structure active by pointing its task_struct memory-management pointers at it. Meanwhile, the other threads wait in a barrier until the parent completes the creation and swap of memory-management structures. Once the copy is complete, each thread swaps its task_struct memory-management pointers for the ones in the parent thread. Now, all threads are actively using the new management structure.

At this point, our algorithm behaves like the clone system call. After swapping the old memory-management structure for the new one, each thread concurrently invokes the do_fork routine. This routine does the work of process creation. However, each thread invokes the do_fork routine in such a way that it shares the current memory address space. So, each new thread created uses the new memory structure that was just allocated and made active. Once all threads complete the do_fork routine, each thread then swaps the new memory-management structure back for its old one. Thus, the members of the new set of threads (children) are running in the copy-on-write shared address space of their original parent threads.

On returning from the checkpoint system call, the child threads have a return value of zero, and each parent thread receives the process identifier of the child thread it created. At this point, each parent thread could put itself to sleep, or lower its priority and slowly write its state to stable storage, from which the program could be restarted in the event of a hardware failure.

To revert to a previous checkpointed state in the multithreaded program (that is, rollback), the child threads would signal the parent threads to wake up, then kill themselves. Thus, the rollback operation is completely accomplished at the user level. The parent threads could then decide to redo the checkpoint or progress forward, depending on the needs of the application.
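
From the application's point of view, a checkpoint/rollback cycle might look like the following sketch. The checkpoint() wrapper, the choice of SIGCONT, and the helper names are hypothetical, since no such wrapper exists in stock Linux; only the fork-like return semantics come from the description above.

#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

extern pid_t checkpoint(void);       /* hypothetical wrapper for the new call */
extern void do_work_interval(void);  /* application-specific computation */

static void on_rollback(int sig)     /* empty handler so pause() returns */
{
    (void)sig;
}

void checkpointed_loop(void)
{
    signal(SIGCONT, on_rollback);
    for (;;) {
        if (checkpoint() != 0) {
            /* parent thread: we hold the snapshot; optionally write it
               to stable storage at low priority, then sleep */
            pause();
            /* a child requested a rollback: fall through and recompute */
        }
        /* child (or rolled-back parent) runs the next work interval
           in the copy-on-write address space */
        do_work_interval();
    }
}

/* called by a child thread that decides to roll back */
void request_rollback(void)
{
    kill(getppid(), SIGCONT);        /* wake the thread holding the snapshot */
    _exit(0);                        /* discard this thread's state */
}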

Global Data Structures

Listing One presents the new global data elements. The design philosophy is that correctness and robustness must be guaranteed to the greatest possible extent because this is OS-level software. In keeping with this design philosophy, we employ a multiphase approach in which a barrier synchronization among all the threads is used between each phase.

The first variable is checkpoint_waits. This array of four integers implements the various barriers between phases. The checkpoint_mm_lock is a lock for the checkpoint_mm variable, which is a pointer to the current memory-management structure being checkpointed among a group of threads. Since only one set of threads can be checkpointed at a time, checkpoint_mm_lock prevents another set of threads from initiating a checkpoint operation until the current set is complete. checkpoint_task_lock provides internal synchronization and coordination between phases. Finally, checkpoint_parent_task is the pointer to the thread that is master (that is, possesses the smallest process identifier) among all the threads involved in the checkpoint operation.

Core Algorithm

In keeping with Linux system-call convention, sys_checkpoint is the top-level handler of the system call; see Listing Two. This handler invokes the architecture-independent routine do_checkpoint, which is divided into phases: admission, create_mm, clone_threads, restore_mm, and leave. These phases correspond to the different parts of Listing Two.

The admission phase (Listing Three) determines which threads are allowed into the core parts of the checkpoint system call. The first part determines whether any other threads share the current process's memory-management structure. If none do (that is, the program is single threaded), the checkpoint system call behaves just like a fork system call by directly invoking the do_fork general handler routine. This is possible because do_fork can handle the concurrent processing of fork system calls, since shared variables are placed inside critical sections.

Because a non-NULL checkpoint_mm signals that a group of threads is already in the process of checkpointing, each arriving thread must determine whether it belongs to the current checkpoint group by comparing the checkpoint_mm variable to mm, its own memory-management structure pointer. If the process is within the checkpointing thread group, it is allowed to pass through the barrier; otherwise, it waits via the schedule_timeout internal routine for, say, 10 milliseconds, during which the Linux scheduler executes other runnable processes. This kind of barrier enables many threads bound to a single processor to be involved in a checkpoint operation. Next, if checkpoint_mm is not set, the process atomically sets the variable to the address of its memory-management structure. Once a thread is admitted (moves past the first barrier), it determines the number of other threads in its thread group using the memory-management structure's mm_users variable.

The current process's checkpoint_counter variable records the number of times the process has been checkpointed. We currently special-case the first checkpoint for programs that use LinuxThreads: Since LinuxThreads creates a thread manager, the mm_users variable is 1 greater than the number of checkpointing threads. Consequently, we reduce the count by 1 to keep an accurate tally of the threads involved in the checkpoint operation. This is crucial because the subsequent barriers block until every process moves into the barrier.

The last part of the admission phase is the election of the parent thread, followed by a task barrier. The task barrier uses independent wait variables because, with a large number of threads, there is no guarantee that the last thread has left the previous barrier before the first one enters the next. Additionally, these barriers allow only atomic evaluation of the barrier condition. We took this conservative approach to ensure robustness. While it may be possible to relax this condition, a more comprehensive analysis and testing on other processors is needed before any conclusions can be drawn about the efficacy of this synchronization approach.

The create phase in Listing Four lets the parent process allocate a new memory-management structure, then swap the new one for its original. During this allocation, the other threads wait in the checkpoint_waits barrier, which releases them only when the parent has completed the allocation and swap of memory-management structures. Once released, each process swaps the original memory-management structure for the new one as well.

A closer look at the memory-management structure allocation/swap process reveals a number of interesting details. To create a new memory-management structure, the space is not only allocated, but also copied. After the copy, a Linux-specific initialization routine is invoked (not shown in the algorithm). After that, the virtual memory page tables are duplicated in the dup_mmap routine. In Linux, this operation is encapsulated in a semaphore. Last, descriptor tables that are used by the processor to perform address translation are copied in the copy_segments routine.

The swapping of memory-management structures requires that the old structure be deactivated and the new one take its place; this is done by the activate_mm routine. With the new memory-management structure created, the threads enter the clone phase; see Listing Five. In it, each thread creates a child thread using the do_fork handler routine; the child takes the thread's place and uses the newly allocated address space, which is a copy-on-write instance of the original. Once complete, all threads synchronize in the third barrier. Next, the original memory-management structure must be restored (see Listing Six). The restore_mm phase completes this task by reverting to the original memory-management structure and reactivating it.

The last phase of the checkpoint system call is leave. In Listing Seven, the parent waits in the fourth barrier while all other processes exit. Once all threads have left, the parent resets all global variables, which lets the next set of threads enter the system call, thus restarting the algorithm.

Performance Study

The computing platform we used is a dual-processor (400-MHz Pentium II) system running Linux 2.4 with 256 MB of shared RAM. In the first series of experiments, we compared the execution time of the checkpoint system call to a user-level memory copy method of checkpointing as a function of the number of threads and the amount of data being checkpointed. We measured performance in terms of speedup relative to memory copy (that is, memory copy execution time divided by system call execution time).

Figure 1 shows that the speedup for the two-thread case varies from 25 to 67. The speedup is attributed to the efficiency of the copy-on-write semantics of the underlying virtual memory system. Interestingly, nonlinear speedup behavior is observed: There is a large drop-off in speedup when the data size changes from 8 MB to 16 MB, then a sharp increase at 32 MB followed by a sharp decrease at 64 MB. While we don't completely understand the cause of this nonlinear behavior, we hypothesize that it is due to differences in the amount of data copied between memory copy and checkpoint at the various data points. However, a more thorough performance analysis of the Linux virtual memory subsystem is required before any definitive conclusions can be drawn.

Next, Figure 2 shows the speedup results for the four- and eight-thread cases. Here there are between two and four times as many threads as processors, so each thread context-switches several times during the processing of the system call, generating much greater overhead. Because of this, there is a significant drop in speedup, particularly for small checkpoint data sizes. What is surprising, however, is that at the 32- and 64-MB data sizes, the speedup results are above four for the four-thread case and above two for the eight-thread case.

As indicated in the first series of performance results, the high speedups are attributed to the copy-on-write semantics of the underlying virtual memory system. To better understand how these raw system-call performance statistics translate into overall start-to-finish program performance, we conducted a full program performance test in which the start-to-finish execution time of a synthetic workload program was measured. The workload program consists of two threads and 64 MB of data. The synthetic threaded program performs 10 checkpoint operations using either memory copy or the checkpoint system call. In between checkpoints, the amount of modified data is varied from 4 KB to 1 MB.

The total execution time of the program using the checkpoint system call is around one second, with a small increase in total execution time as the amount of modified data grows. The memory copy execution time, by contrast, remains unchanged regardless of how much data is modified. When execution times are translated into speedup, overall program performance improves by a factor of eight for small data sizes and a factor of five for the 1-MB data size when the new system call is used. We have observed that when the amount of modified data approaches the total amount of data in the program, the execution time is the same for both memory copy and the checkpoint system call.

Acknowledgments

This work was partially supported by DARPA contract #F30602-00-2-0537 with the Air Force Research Laboratory (AFRL/IF) and by a grant from the University Research Program of Cisco Systems Inc.

DDJ

Listing One

int checkpoint_waits[4] = {0, 0, 0, 0};     /* the four phase barriers */
pid_t checkpoint_min_pid = 0x7fffffff;      /* smallest pid seen: the master */
spinlock_t checkpoint_mm_lock = SPIN_LOCK_UNLOCKED;
struct mm_struct *checkpoint_mm = NULL;     /* mm of the group checkpointing */
spinlock_t checkpoint_task_lock = SPIN_LOCK_UNLOCKED;
struct task_struct *checkpoint_parent_task = NULL;


Listing Two

sys_checkpoint( regs )
{
  /* architecture-specific entry point */
  return( do_checkpoint(regs) );
}

do_checkpoint(regs)
{
  admission(regs);
  if( create_mm( regs ) == error )
    return( error );
  clone_threads(regs);
  restore_mm();
  leave();
}


Listing Three

admission(regs)
{
  old_mm = current->mm;

  /* single-threaded program: behave exactly like fork */
  if( current->mm->mm_users == 1 )
    {
      clone_flags = SIGCHLD;
      return( do_fork(clone_flags, regs) );
    }

  spin_lock(&checkpoint_mm_lock);

  /* wait while a different thread group is mid-checkpoint */
  while( checkpoint_mm != NULL &&
         checkpoint_mm != old_mm )
    {
      spin_unlock(&checkpoint_mm_lock);
      schedule_timeout(1);
      spin_lock(&checkpoint_mm_lock);
    }

  /* first thread in claims the checkpoint for its group */
  if( checkpoint_mm == NULL )
    checkpoint_mm = old_mm;

  spin_unlock(&checkpoint_mm_lock);

  /* on the first checkpoint, discount the LinuxThreads manager thread */
  if( current->checkpoint_counter == 0 )
    {
      mm_users = current->mm->mm_users - 1;
    }
  else
    {
      mm_users = current->mm->mm_users;
    }

  spin_lock(&checkpoint_task_lock);

  /* elect the thread with the smallest pid as the parent/master */
  if( current->pid < checkpoint_min_pid )
    {
      checkpoint_min_pid = current->pid;
    }

  /* first barrier: wait until all threads in the group have arrived */
  checkpoint_waits[0]++;

  while( checkpoint_waits[0] < mm_users )
    {
      spin_unlock(&checkpoint_task_lock);
      schedule_timeout(1);
      spin_lock(&checkpoint_task_lock);
    }

  spin_unlock(&checkpoint_task_lock);
}


Listing Four

create_mm(regs)
{
  if( current->pid == checkpoint_min_pid )
    {
      /* parent: allocate and install a copy of its own mm */
      checkpoint_parent_task = current;
      parent = current;
      if(( new_mm = allocate_mm() ) == NULL )
        {
          notify_other_threads_of_error;
          reset_global_variables;
          return( error );
        }

      memcpy(new_mm, parent->mm, sizeof(*new_mm));
      dup_mmap(new_mm);                 /* duplicate the page tables */
      copy_segments(new_mm);            /* copy the descriptor tables */
      old_mm = parent->mm;
      parent->mm = new_mm;
      parent->active_mm = new_mm;
      activate_mm(old_mm, new_mm);

      /* release the waiting threads */
      spin_lock(&checkpoint_task_lock);
      checkpoint_waits[1] = 1;
      spin_unlock(&checkpoint_task_lock);
    }
  else
    {
      /* wait for the parent to finish the allocation and swap */
      spin_lock(&checkpoint_task_lock);

      while( checkpoint_waits[1] == 0 )
        {
          spin_unlock(&checkpoint_task_lock);
          if( parent_detects_error )
            {
              return( error );
            }
          schedule_timeout(1);
          spin_lock(&checkpoint_task_lock);
        }
      spin_unlock(&checkpoint_task_lock);

      /* adopt the parent's new mm */
      parent = checkpoint_parent_task;
      old_mm = current->mm;
      current->mm = parent->mm;
      current->active_mm = parent->mm;
      current->mm->mm_users++;
      activate_mm(old_mm, current->mm);
    }
}


Listing Five

clone_threads(regs)
{
  /* the child inherits the incremented count; the parent's own
     count is restored after the fork */
  current->checkpoint_counter++;
  clone_flags = (CLONE_VM | SIGCHLD);   /* child shares the new mm */
  retval = do_fork(clone_flags, regs);
  current->checkpoint_counter--;

  /* third barrier: wait until every thread has cloned its child */
  spin_lock(&checkpoint_task_lock);
  checkpoint_waits[2]++;

  while( checkpoint_waits[2] < mm_users )
    {
      spin_unlock(&checkpoint_task_lock);
      schedule_timeout(1);
      spin_lock(&checkpoint_task_lock);
    }
  spin_unlock(&checkpoint_task_lock);
}


Listing Six

restore_mm()
{
  /* swap the checkpoint mm out and the original mm back in */
  new_mm = current->mm;
  current->mm = old_mm;
  current->active_mm = old_mm;
  new_mm->mm_users--;      /* this thread no longer uses the new mm */
  activate_mm(new_mm, old_mm);
}


Listing Seven

leave()
{
  if( parent == current )
    {
      /* parent waits for all other threads to pass the final barrier */
      spin_lock(&checkpoint_task_lock);
      while( checkpoint_waits[3] != mm_users - 1 )
        {
          spin_unlock(&checkpoint_task_lock);
          schedule_timeout(1);
          spin_lock(&checkpoint_task_lock);
        }

      /* reset the globals so the next group can enter the call */
      checkpoint_min_pid = 0x7fffffff;
      memset(checkpoint_waits, 0, sizeof(checkpoint_waits));
      checkpoint_parent_task = NULL;

      spin_lock(&checkpoint_mm_lock);
      checkpoint_mm = NULL;
      spin_unlock(&checkpoint_mm_lock);
      spin_unlock(&checkpoint_task_lock);
    }
  else
    {
      spin_lock(&checkpoint_task_lock);
      checkpoint_waits[3]++;
      spin_unlock(&checkpoint_task_lock);
    }
}


