Checkpointing, CHPOX, and Linux

How Checkpointing Works

Here's a general description of how CHPOX works:

  1. After CHPOX's module (chpox_mod) is successfully compiled, it must be loaded into the kernel before any checkpointing can be done.

  2. To begin checkpointing, register the process (its children are registered automatically), either by writing the parameters directly to /proc/chpox or by using chpoxctl (a userspace tool). If needed, you can also register the linked library so that CHPOX can resume the process successfully. (In my experience, this step can be skipped, depending on the type of program.)

    You also need to define a signal number for CHPOX to use. This signal can be set per registered process tree. Typically you pick a signal that is rarely used otherwise (for example, signal number 31, SIGSYS), to avoid conflicts with the standard signals.

  3. To make a checkpoint, you can simply send the signal to the target process (normally using kill). The target process receives the signal and autonomously does the following:

    1. Save the header of the dump file, which contains information about the hardware architecture, the kernel version, the dump file creation time, and the number of processes in the file.

    2. Save the process name into the dump file. The names of the child processes are saved later if the parameters passed to CHPOX say to do so.

    3. Using the information in the process's task_struct, CHPOX finds the related virtual memory areas (with the find_vma() procedure) and dumps the contents of each one from vm_start to vm_end. To do this, CHPOX uses a modified version of VMADump (from BProc and Scyld).

    4. CHPOX also records additional information, such as open UNIX domain sockets, open files, the status of pipes, and the current working directory. All of this information is saved into a single dump file. Checkpointing can be done as often as you like; the dump file always holds the latest successful checkpoint.

    5. If the parameters passed to CHPOX say to save child processes, the steps above (except the first) are repeated for each child process.

  4. To resume from the dump file, pass it as an argument to the ld-chpox tool. To avoid conflicts, make sure the old process has completed or has been killed first; otherwise, unexpected errors can occur.

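To make the header described in step 3.1 more concrete, here is a small Python sketch that packs and unpacks the four pieces of information it is said to contain. The byte layout (HEADER_FORMAT) and the function names are hypothetical and chosen only for illustration; the real CHPOX dump format is defined by the kernel module and is not reproduced here.

```python
import platform
import struct
import time

# Hypothetical layout, NOT the real CHPOX format:
# 16-byte architecture, 64-byte kernel release,
# 8-byte creation time (seconds), 4-byte process count.
HEADER_FORMAT = "<16s64sQI"


def pack_header(process_count, created=None):
    """Build a header describing the current machine and kernel."""
    if created is None:
        created = int(time.time())
    return struct.pack(
        HEADER_FORMAT,
        platform.machine().encode()[:16],
        platform.release().encode()[:64],
        created,
        process_count,
    )


def unpack_header(raw):
    """Decode a header produced by pack_header into a dictionary."""
    arch, kernel, created, count = struct.unpack(HEADER_FORMAT, raw)
    return {
        "arch": arch.rstrip(b"\0").decode(),
        "kernel": kernel.rstrip(b"\0").decode(),
        "created": created,
        "process_count": count,
    }


if __name__ == "__main__":
    # One parent plus two children saved in the same dump file.
    print(unpack_header(pack_header(process_count=3)))
```

Recording the architecture and kernel version up front would let a resume tool compare them with the machine it is running on before touching the rest of the file, which is presumably why CHPOX stores them first.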
