Interprocess Communication & the L4 Microkernel

By Slade Maurer, November 01, 2005

Modern microkernels such as L4 address many of the problems of older microkernels--specifically, the speed the interprocess communication (IPC) implementation.

November, 2005: Interprocess Communication & the L4 Microkernel

Slade Maurer is a computer engineer working in Silicon Valley. You can contact him via e-mail at [email protected].

A microkernel is an operating-system architecture that provides a small set of basic system calls that implement the core of the OS's functionality (usually including such things as thread management and address spaces), leaving more complex abstractions like networking to other programs called servers. A microkernel architecture sits in contrast to a monolithic kernel architecture, which rolls more of the OS's functionality into a larger kernel.

L4 is a high-performance "second-generation" microkernel developed by the late Jochen Liedtke at the IBM T.J. Watson Research Center in the mid 1990s. L4 X.2 is the second version of the L4 API, replacing the original API. The Pistachio microkernel is a product of the System Architecture Group within the Department of Computer Science at the University of Karlsruhe in Germany.

In this article, I focus on the 64-bit capable platform-independent Pistachio 0.4 microkernel operating system that implements the L4 X.2 API. Sometimes I may use the term "L4 microkernel" instead of "Pistachio" because the topic under discussion may pertain to any L4 microkernel and not Pistachio specifically. Pistachio (http://www.l4ka.org/) is the first kernel to implement the L4 X.2 API and is built with the specific goal of performance and portability.

During the early 1990s, the first generation of microkernels earned a bad reputation, primarily due to the implementation of IPC, cost of memory references, and attempting to support the "legacy" monolithic kernel interface. The first generation most notably includes Mach and Chorus, which were critical in the "real world" evaluation and research of microkernel design. Pistachio has many of the wonderful features of the first generation without compromising operating-system performance.

The Pistachio microkernel is an important product of the L4ka research project, whose primary goal is to develop a new methodology for system construction so as to manage the ever-increasing operating-system complexity and minimize legacy-code dependence. The microkernel technology under development is designed to meet these goals and, at the same time, not sacrifice performance, security, or reliability.

Another goal of the L4ka project is to provide a microkernel technology that can be used to construct any general or customized operating system. An example is the implementation of GNU's The Hurd using the Pistachio microkernel. In fact, the L4 microkernel family, in general, has this goal and a very interesting example is the re-implementation of the Linux operating system using L4.

The Pistachio microkernel currently supports many instruction set architectures (ISAs). This is due in part to a research goal of L4ka—it will perform well on any architecture. The supported ISAs include: IA-32, IA-64, PowerPC, PowerPC64, Alpha, MIPS-64, AMD64, SPARC v9, and ARM.

Pistachio depends heavily on interprocess communication (IPC), as do all microkernels. IPC is used to pass messages between threads in an operating system. One thread will create a message and then forward that message to another thread using IPC. This is a function of a microkernel—arguably the primary function—and L4 microkernels handle IPC with surprising agility.

L4 microkernels provide only a small set of system services and have shed from their design all nonessential features that reduce performance and/or increase complexity. For example, the performance of IPC within the Mach microkernel suffered greatly due to message validation; therefore, L4 microkernels do not validate messages but simply forward them as quickly as possible. This greatly improves IPC performance; however, if certain validations are still required by user threads, then they must facilitate message validation themselves.

L4 User Code Implementation

The L4 X.2 Reference Manual defines two programming interfaces: generic and convenience. The Generic Programming Interface (GPI) is an ISA-independent interface used for writing highly portable code but is not designed for ease of use. The Convenience Programming Interface (CPI) is derived from the GPI and is designed to make common operations as easy as possible for programmers to use.

The Generic Programming Interface is currently defined for the C and C++ languages; however, more language bindings will be incorporated in the future. The GPI concretely defines the logical interface and generic binary interface (GBI) as pseudo C++ classes. The GBI defines binary representations of data types and data objects independently of specific ISAs, although there is a separate version for 32-bit and 64-bit processors. The GPI is well defined and architecture independent.

In the strictest sense, the Convenience Programming Interface is not part of the L4 microkernel specification. All of the data types and procedures that can be implemented in the CPI can also be implemented using the GPI. It is an interface sitting above the L4 microkernel that makes it easier for programmers to write code. The focus of the CPI is ease of use and convenience—not completeness. Although most of the features of the GPI are available for use via the CPI, not all of them are. Additionally, and most importantly, the implementer of the CPI may choose to use ISA-specific features for performance reasons. When doing so, they may use the processor-specific binary interface to squeeze whatever performance benefits can be achieved when using a specific ISA.

When using the CPI for certain operations (specifically IPC), it is highly recommended to use an interface description language (IDL) code generator. This allows the programmer to write an interface description using an IDL such as CORBA. Along with rapid application development, using an IDL code generator greatly improves code portability without sacrificing performance. The L4ka research group provides a good quality IDL code generator called "IDL4" that supports CORBA and DCE. You can download this open-source project from their web site.

Listing 1 provides an example of using CORBA/IDL to define a simple interface. It is an example taken from the IDL4 Reference Manual and describes a simple module with a few methods to give you a feel for the language. Using CORBA, a module named storage is defined that contains the readln, writeln, and get_pos methods. Each method takes parameters and returns certain values in a similar fashion to C/C++.

You can see from Listing 1 that the syntax of CORBA is similar to C/C++ with the difference that parameters have a direction attribute: in, out, or inout. The direction attribute is used to indicate whether the parameter contains input data that must be copied from the client to the server, output data that is copied from the server to the client, or a combination of both. Also, from the listing we see that naming conflicts are resolved by encapsulating interfaces in modules.

Once you have written an interface in either CORBA or DCE, you use the IDL4 code generator to produce target code. Code generation is performed in a straightforward way by passing the IDL files that describe your modules through the IDL code generator, thereby producing the target source code. Most importantly, the header files in your target language are generated. You can then use these header files in your own code. This is a very convenient way to incorporate new code into an operating system based on the L4 microkernel.

If you want to try to write your own code for L4 X.2, you can develop code for whatever supported hardware you have available by using the IDL4 code generator and possibly some target-specific source code of your own. If you do not have hardware to use, you can use the PSIM PowerPC simulation distributed with GNU's gdb to simulate your target hardware. Also, using the simulation makes it easier to get into L4 because you can run it on top of your existing operating system and hardware.

To use PSIM, you will need to download the open-source GCC tool chain, binutils, gdb, IDL4, and the Pistachio microkernel. Clear instructions are included with Pistachio 0.4 in the file doc/notes/ppc-build.txt, explaining how to build the tools and kernel so that you can use the simulation. This allows you to simulate peripheral devices as well as symmetric multiprocessors without having the hardware! There are a few caveats, and building PSIM to run Pistachio requires editing some of the source to fix some bugs, but in the end, it is certainly worth the effort.

Threads

A thread in L4 encompasses a single context of execution that is running inside an address space. Additionally, L4 abstracts hardware interrupts as microkernel threads.

Within Pistachio, threads are identified by their thread IDs. The thread ID is either a 32-bit or 64-bit value, depending on what is supported by the target processor. It can either be global or local and all threads have both global and local thread IDs. The bits within the thread ID are divided into fields depending on the type of thread. Figure 1 provides the thread ID definition for each of the thread ID types.

Global thread IDs are unique and identify a thread independent of the address space in which it is being used. This type of thread ID has a version field that is used by the thread, not the microkernel, for whatever purpose it sees fit. However, the version field must have at least a single bit that is set to 1 so as to differentiate it from a local thread ID.

Local thread IDs exist within a particular address space and are only relevant to that address space. They have two fields: the thread number and the final six bits of the vector that are set to zero to identify it as a local thread.

Three special thread IDs are defined to cover special cases: nilthread, anythread, and anylocalthread. The nilthread is simply the thread ID of a nonexisting thread. The anythread ID matches any given thread ID, including all interrupt IDs. All threads that exist in the same address space match the anylocalthread ID.

L4's virtual registers offer a fast interface to pass data between the microkernel and user threads. Virtual registers are small, per-thread data-storage mechanisms that may or may not be mapped to actual hardware registers and/or memory depending on the specific target processor being used. A mix of processor registers and memory may be used to facilitate the actual amount of storage required. There are three types of virtual registers: thread control, message, and buffer registers.

Thread control registers (TCRs) are static, nontransient registers used to quickly pass mostly static control information between user threads and the L4 X.2 microkernel. They are a category of virtual registers used for IPC and other system functions. Functions are defined in the GPI that operate on the TCRs of the currently running thread to set, get, or deliver various data. For example, the GPI uses the ProcessorNo virtual register to determine the processor number that the current thread is running on.

L4 X.2 provides a small set of facilities for thread management. The specification defines the ExchangeRegisters system call to read and/or exchange the data of certain virtual registers of two threads. A privileged system call, named ThreadControl, is used by privileged threads to create and delete threads within the system.

Address Spaces

The L4 microkernel handles physical memory management and protection in an elegant manner. It is designed to support the recursive construction of addresses spaces outside of the microkernel itself.

There is an initial address space provided by the microkernel and the initial memory server owns it. Threads that have ownership of a virtual page can then use IPC to grant ownership of the page to another thread or share the page with another thread. An unmap method is also provided to remove shared pages from the recipient's address space in a single operation.

L4 handles paging in an interesting way. Paging is not performed by the microkernel, and instead, the threads composing the larger operating system provide the paging services. This approach provides much flexibility, allowing system implementers to tune the paging system to the intended application or even to provide multiple paging implementations within the same operating system.

An fpage, or flexpage, is a region of the virtual address space. An fpage generalizes a hardware page and is made from a single or multiple hardware page mapped in a region of the virtual address space other than the pages that have been mapped by the microkernel itself. Fpages have a specific minimum size that is processor dependent and must have a size that is a power of two. Fpage alignment is a problem left to the programmer, so particular care must be taken when using fpages not equal to the target hardware's page size. They can be mapped or unmapped by a thread only as a discrete unit—not partially.

Privileged threads using the SpaceControl method defined in the GPI configure address spaces. One or more threads can exist within an address space, and this set of threads can be changed as needed by the system.

More than one address space can be mapped, modified, and unmapped. The Pistachio microkernel does not require a fixed relationship between address spaces and threads. This means that any thread can be a member of any address space and can be moved from one address space to another.

Pistachio is specifically designed to support symmetric multiprocessing (SMP) and nonuniform memory architecture (NUMA) systems. The system calls supporting memory management are flexible and provide an orthogonal interface. The strictly orthogonal interface preserves the parallelism of concurrently running threads in SMP systems. The flexibility of the interface provides NUMA systems with thread portability between address spaces that is critical to meet the aggressive performance goals of many multiprocessing applications.

Messages

A message is data that is transferred from one thread to another using IPC. A message is passed using message registers (MRs), which are a category of virtual registers. The sending thread writes a message into the receiving thread's MRs and then the receiving thread reads the message from its MRs.

A message is composed of up to three sections: the tag, untyped-word, and typed-item. The message tag is mandatory as it contains message-control information and the user-defined message label. The message label may be used to encode a request key or define a method to be invoked upon message reception. The optional untyped-word section is used to pass arbitrary data that is untyped from the perspective of the microkernel and is simply copied to the receiving thread. The optional typed-item section contains a sequence of items that have their type encoded in the lowermost four bits of the first word.

Each thread has 64 MRs and a message can use some or all of these registers to transfer a message. The MRs are read-once registers, meaning that once they are read, their values are undefined until they are written again. This read-once property allows the MRs to be implemented not only in memory and special registers, but also by using general-purpose registers that the target processor may have available.

Moving data that requires more storage space than the MRs provide can be accomplished by using the typed-item message section. There are three types of items: StringItem, MapItem, and GrantItem. These types can be used to pass large data sets to another thread. They can be strung together such that several different items are passed within the same message.

MapItem and GrantItem are used to share HASH(0x80b824) of an address space between two threads. In this way, MapItem and GrantItem are a way to pass data by reference. When a thread receives a MapItem or GrantItem without error, it will have mapped a region of the sender's address space into its address space.

To share an address space region with another thread, the sharing threads use the GPI or CPI to perform the following steps. First, the sender creates an fpage using the GPI or CPI to specify an fpage representing the shared region's starting address, size, and permission bits. Then the sender creates a MapItem that contains the fpage. Next, the MapItem is encapsulated in a message, potentially along with other items, and it is sent to the receiving thread. To receive the MapItem, the receiver has to specify a receive window in its local address space to which the fpage will be mapped. Once the message is received and assuming no errors, the receiver is able to share the memory region with the sending thread.

Properly sharing memory within Pistachio requires a solid understanding of not only Pistachio, but also the configuration of the address space of the sharing threads as well as performance implications resulting from idiosyncrasies of the target hardware. Information on sharing memory the right way can be found in the L4 User Manual API Version X.2.

StringItems are used to pass data by value. A StringItem is a potentially unaligned sequence of bytes of up to 4 MB. The string is copied from the sending thread's buffer to the receiving thread's buffer during a send message operation. StringItems are used on the receiving thread's side to specify the receive buffer to be used.

There are two types of strings that can be sent. The first type is called a "simple string" and it is a contiguous sequence of bytes. The other type is called a "compound string," defined as a noncontiguous string that is composed of more than one contiguous substring. The substrings can be anywhere in the user's address space; however, they may not overlap. A compound string will be gathered together and sent as one logically contiguous string such that on the receive side, the string can be treated as one logical buffer.

When receiving a message containing StringItems, the receiving thread must specify the receive buffer that is used to hold the received string. This receive buffer is called a "string buffer." The receiving thread can specify several string buffers and each buffer is used to copy a string described by the StringItems in a message upon message reception. The receiver's string buffers are specified by using buffer registers (BRs), which are a category of virtual registers that can only be addressed directly—not via pointers. They keep their data values until they are explicitly modified. Additionally, there is a register called the "acceptor buffer register" that is used to specify which typed-items are accepted when a message is received.

IPC System Call

The basic operation within Pistachio that is used for interprocess communication and synchronization is IPC, which uses messages to pass data between two threads. As we have already discussed, the message can be used to copy data or to share memory. IPC can be used for communication within an address space as well as between two different address spaces.

IPC is synchronous, meaning that when a message is sent from one thread to another, both threads must call the corresponding IPC operation. The sender will block until either the receiver has performed the IPC call or the specified timeout interval has elapsed. IPC is also unbufferred, thereby improving efficiency and reducing complexity of the operation.

Figure 2 presents a definition of the IPC system call. An IPC system call combines an optional send phase followed by an optional receive phase. The phases that are included in a particular call are determined by the To and FromSpecifier parameters. The transition that is taken from the send to receive phase is atomic. In addition the to IPC system call parameters, IPC is controlled by the MRs, BRs, as well as some of the TCRs.

The messages that are passed contain the mandatory message tag field. If the IPC operation includes a receive phase, then the message received will contain a message tag. Within the message tag are the propagated IPC, cross-processor IPC, and error indicators. The propagated IPC indicator is used to detect that this IPC has been propagated through other threads before arriving at the receiver. The cross-processor IPC indicator simply says whether or not the message came from a sender running on a processor other than the one the receiver is currently running on. The error indicator is used to specify if an error occurred during this IPC operation. In a send-only IPC operation, the error indicator is still present in MR zero, the same place the message tag would be were there a receive phase, so that send-only errors can be detected. Once an error is detected, the detecting thread can read its ErrorCode TCR to determine the specific error that occurred.

There are three different types of pagefaults that may happen during IPC: pre-send, post-receive, and xfer pagefaults. Pre-send pagefaults occur in the sender's context before a message is transferred and the receiving thread is uninvolved. Post-send pagefaults happen in the receiver's context after the message has been transferred and the sender is not involved. Xfer pagefaults occur during message transmission and, for this reason, both the sender and receiver are involved, thereby exposing potential security problems. For example, if the pagefault happens in the receiver's address space, then a malicious receiver may starve the sender. Thankfully, messages without strings will not raise an xfer pagefault and, therefore, do not require special pagefault handling. Additionally, some protection is provided from the security problems incurred by xfer pagefaults because they are controlled by the minimum of the sender and receiver xfer timeouts.

A specialized system call named "LIPC" exists with an identical call signature to the IPC call, which is optimized for sending messages to local threads. It is intended to be used when it is clear to the programmer that there is a send and receive phase, the receiver is a local thread running on the same processor, the RcvTimeout is infinite, and the IPC does not include a map/grant operation.

The GPI defines the IPC and LIPC system calls as the C/C++ functions that are presented in Figure 3. The CPI provides several useful functions derived from the GPI as well as some support functions presented in Figure 4. It is interesting to see the plethora of useful functions that the L4 X.2 API designers have been able to derive from the IPC primitives.

IPC Implementation In Pistachio 0.4

The Pistachio microkernel is designed for high-performance IPC on a variety of processor architectures. It uses the most time-efficient data storage that is available, such as general-purpose registers, to move data as quickly as possible. All of this is done while preserving an architecture-independent 32-bit or 64-bit GBI.

The header file user/include/l4/ipc.h from the Pistachio 0.4 distribution is used to access the IPC functions in either C or C++. Because C++ supports function overloading, a single function that takes different arguments is defined. When writing in C, the L4 designers have appended substrings to the end of the function names to try to closely adhere to the polymorphic concept without having this support built into the language.

Take for example Listing 2, taken directly from the ipc.h header file. The C++ version of the L4_Call(...) function supports two different argument sets—with or without the timeouts. The same function in C has been defined twice, as L4_Call_Timeouts(...) and L4_Call(...). Also, notice that only the L4_Call_Timeouts function actually calls the L4_Ipc(...) system call and that the other two functions simply call this function.

The CPI functions that are defined for IPC within the ipc.h header file are given in Listing 3. The functions all use the substring L4_ to minimize name conflicts but are otherwise the same as those given in Figure 4.

Using IPC to communicate between two different address spaces requires the processor to be changed into privileged mode—typically a costly operation. Therefore, the specialized LIPC system call is defined specifically for communicating within the same address space.

The CPI functions that have been derived from the GPI will use the function that makes the most sense to support the required functionality. Listing 4 (available at http://www.cuj.com/code/) pulls the ipc.h include file from the definition of L4_Lcall(...). This function is implemented using the LIPC system call because it is used for local thread communication.

L4 uses a rather nice build and configuration system that is similar to what you find when you build the Linux kernel. You can use a make menuconfig to get a clean, menu-driven interface that is used to pick the processor architecture and other configurable parameters. When you pick a certain processor type, the proper binding is done such that the functions you are calling from ipc.h use the correct processor-specific code.

The processor-specific code for the two IPC functions lives in the syscalls.h file found in a subdirectory of user/include/l4 with the name of the processor type. For example, you will find the IA64 instruction set architecture-specific code in user/include/l4/ia64. Each syscalls.h file contains an L4_Ipc(...) and L4_Lipc(...) function that implements the IPC and LIPC system calls. The code that implements the two system calls on the various supported architectures is partially written in hand-tuned assembly code clearly for performance reasons. However, it is done in a manner that is remarkably easy to follow and is somewhat standard from one processor to another.

For example, Listings 5 and 6 (both available at http://www.cuj .com/code/) show the implementation of the IPC system call for the ARM and IA64 ISAs. Both functions are implemented using a good bit of native assembly code, yet they have a similar logical structure and use similar helper functions such as L4_IsNilThread(...). This clearly improves code maintainability and makes it as portable as possible. One very nice feature of having such similar code structure between architectures is that one can become familiar with the processor with which they have the most experience and then can leverage that knowledge to understand the implementation on other ISAs.

Take a look at the small number of lines of code used to implement the IPC system call in Listings 5 and 6. This improves not only maintainability and other software-engineering design goals but also performance. We speak to this topic in more detail in the following section.

IPC Performance

Arguably, the "poster child" of the first-generation microkernel is Mach. This first-generation kernel implemented a UNIX-like system call interface and other UNIX features such that it could become the next-step UNIX operating system. For this reason, Mach's design originally focused on the larger set of UNIX requirements and not simply the requirements of a microkernel. This resulted in unacceptably low performance. Additionally, performance was less important to system designers than was portability.

Over the years, designers tried valiantly to improve the performance of the system using the legacy code base with only modest success. Unfortunately, the operating-system research and development community took a strong interest in microkernel design during the same period that the Mach kernel was unable to significantly improve performance and the result of this is that today, most developers steer clear of the microkernel because of the understanding that with it comes poor performance. What is commonly believed is that the microkernel has very nice security, maintainability, concurrency, and redundancy features, but at the cost of significantly degraded system performance.

At the heart of the performance problem for the Mach microkernel is IPC. Since the microkernel design uses IPC extensively to pass messages between various server threads, the performance of IPC heavily dictates overall system performance. Even worse, performance drops as the system is chopped up into smaller threads that use IPC to communicate—working against the microkernel concept. After many attempts to improve the performance of IPC within Mach, the intense development activity in the late 1980's through the early 1990's all but died out.

Then, some interesting research into why the Mach microkernel's IPC is so slow began to emerge. The research indicated that IPC was not in itself a problem and that the real culprit behind performance degradation was the work required to check message validity and permissions. This was a very intriguing result and prompted Jochen Liedtke to create a new operating system, from scratch, that had a very thin IPC layer designed for speed. That new OS was the predecessor to L4 and he named it L3.

L3 simply passed messages as quickly as it could without any validation done within the microkernel. It was up to the user threads to do the message checking that they needed done—if any at all. IPC was tuned using assembly code and an intimate knowledge of the processor and hardware platform being used to garner every scrap of performance possible. This design resulted in a very fast IPC and a small microkernel. The results are impressive: L3's IPC mechanism is more than an order of magnitude faster than Mach.

L3 was heavily dependant on the hardware for which it was designed to run and still had some features that could be removed to further reduce its footprint. So, Jochen Liedtke created L4 after the brilliant IPC performance results of L3. L4 is even smaller than L3 and has shed more of the first generation—also known as Mach—features.

L4 is so small that it is not too difficult to make it portable across many architectures. So, the System Architecture Group at Karlsruhe University took L4 and created a C/C++ version called "Hazelnut" that is designed to run on many 32-bit ISAs. Then they moved on and created the topic of this article—Pistachio—that is portable across many 32-bit and 64-bit platforms. With the portability, great care has been taken through careful performace analysis at each stage of its development to preserve the high performance of the L4 microkernel's IPC mechanism.

System implementers tailor their IPC code to take advantage of the L1 cache-hashing function of the target processor. This, along with the small size of the Pistachio microkernel's IPC critical path means and the frequency that IPC is used results in the executed instructions residing entirely in the processor's L1 instruction cache most of the time. Also, the data that is used to pass a message is very small and can fit in the L1 data cache quite nicely. This significantly contributes to the microkernel's IPC performance because system implementers can tailor their code to take advantage of the L1 cache hashing functions of their target processor. Even in very bad cases, the IPC code and nontransient data will most certainly reside in the L2 cache due it its locality and small footprint.

The small amount of instructions and data used along the IPC critical path is a direct result of the simple implementation of IPC within L4 along with careful attention to performance. In a nutshell, the L4 community as a whole, and the Pistachio developers specifically, have focused on shortening the IPC critical path to such a degree that it tracks the processor performance improvement curve and no longer is subject to the much slower-to-increase memory curve.

Conclusion

The performance benefit of Pistachio on the new breed of chip multiprocessors (CMPs) is of particular interest. This category of processors can be split into two groups—the symmetric multiprocessors (SMPs) and the control-plane/data-plane processors (CPDPPs).

The SMP CMPs have internal bus designs that often provide extremely high bandwidth, nonblocking interconnects between processor cores. They may sport sophisticated L2 caches that can also be used as memory. They usually have specialized DMA engines on the same die that can be used to offload bulk data transfers.

The CPDPP CMPs contain many small RISC processors that do not have memory controllers, caches, or other common processor features. There can be anywhere from four to 32 of these "pico-processors" and they are controlled by one or two embedded control plane RISC chips that are much easier to program. This design supports some type of high-performance bus architecture and embedded memory along with local wide data busses going off-chip.

CUJ

1 2 3 4 5 6 7 8 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.