Parallel

What's Different About Multiprocessor Software: Part 1

By Wayne Wolf, Princeton University, April 03, 2007

In the first of a five-part series, Wayne Wolf, author of "High-Performance Embedded Computing," delineates the differences between running software on embedded multiprocessors versus general-purpose systems and the precautions that must be taken. This week: The role of the OS.

While real-time operating systems provide apparent concurrency on a single processor, multiprocessor platforms provide true concurrency. The concurrency and performance provided by multiprocessors can be very powerful but also harder to analyze and debug.

The purpose of this series of five articles is (1) to review what is unique about multiprocessor software as compared to both uniprocessor embedded systems and general-purpose systems; (2) to study scheduling and performance analysis of multiple tasks running on a multiprocessor; (3) to consider middleware and software stacks as well as design techniques for them; and (4) to look at design verification of multiprocessor systems.

As we move up to software running on embedded multiprocessors, we face two types of differences. First, how is embedded multiprocessor software different from traditional, general-purpose multiprocessor software? We can borrow many techniques from general-purpose computing, but some of the challenges in embedded computing systems are unique and require new methods.

Second, how is the software in a multiprocessor different from that in a uniprocessor-based system? On the one hand, we would hope that we could port an embedded application from a uniprocessor to a multiprocessor with a minimum of effort, if we use the proper abstractions to design the software. But there are some important, fundamental differences.

The first pervasive difference is that embedded multiprocessors are often heterogeneous, with multiple types of processing elements, specialized memory systems, and irregular communication systems. Heterogeneous multiprocessors are less common in general-purpose computing; they also make life considerably more challenging than in the embedded uniprocessor world. Heterogeneity presents several types of problems.

1) Getting software from several types of processors to work together can present challenges. Endianness is one common compatibility problem; library compatibility is another.

2) The development environments for heterogeneous multiprocessors are often loosely coupled. Programmers may have a hard time learning all the tools for all the component processors. It may be hard to debug problems that span multiple CPU types.

3) Different processors may offer different types of resources and interfaces to those resources. Not only does this complicate programming but it also makes it harder to decide certain things at runtime.
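
As an illustration of the endianness problem in item 1, the following is a minimal sketch of one common workaround: agreeing on a single byte order for any data shared between processing elements. The function names put_u32_be and get_u32_be are hypothetical and the choice of big-endian order is an assumption; neither comes from the book.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value in buf in big-endian order. The shifts operate on
 * the value, not on memory, so the result is the same on a little-endian
 * and a big-endian processor. */
void put_u32_be(uint8_t *buf, uint32_t value)
{
    assert(buf != NULL);
    buf[0] = (uint8_t)(value >> 24);
    buf[1] = (uint8_t)(value >> 16);
    buf[2] = (uint8_t)(value >> 8);
    buf[3] = (uint8_t)value;
}

/* Rebuild the 32-bit value from four big-endian bytes. */
uint32_t get_u32_be(const uint8_t *buf)
{
    assert(buf != NULL);
    return ((uint32_t)buf[0] << 24) |
           ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |
            (uint32_t)buf[3];
}
```

If every processor writes shared words with put_u32_be and reads them with get_u32_be, neither side needs to know the other's native byte order.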

Another important difference is that delays are much harder to predict in multiprocessors. Delay variations come from several sources: the true concurrency provided by multiprocessors, the larger size of multiprocessors, CPU heterogeneity, and the structure and the use of the memory system.

Larger delays and variances in delays result in many problems, including:

1) Delay variations help expose timing-sensitive bugs that can be hard to test for and even harder to fix. A methodology that avoids timing bugs is the best way to solve concurrency-related timing problems.

2) Variations in computation time make it hard to efficiently use system resources and require more decisions to be made at runtime.

3) Large delays for memory accesses make it harder to execute code that performs data-dependent operations.

Scheduling a multiprocessor is substantially more difficult than scheduling a uniprocessor. Optimum scheduling algorithms do not exist for most realistic multiprocessor configurations, so heuristics must be used. Equally important, the information that one processor needs to make good scheduling decisions often resides far away on another processor.

Part of the reason that multiprocessor scheduling is hard is that communication is no longer free. Even direct signaling on a wire can take several clock cycles and the memory system may take tens of clock cycles to respond to a request for a location in a remote memory.

Because information about the state of other processors takes too long to get, scheduling decisions must be made without full information about the state of those processors. Long delays also cause problems for the software processes that execute on top of the operating system.

Of course, low energy and power consumption are important in multiprocessors, just as in uniprocessors. The solutions to all the challenges of embedded multiprocessor software must be found so that energy-efficient techniques can be used.

Many of these problems boil down to resource allocation. Resources must be allocated dynamically to ensure that they are used efficiently. Just knowing which resources are available in a multiprocessor is hard enough.

Determining on-the-fly which resources are available in a multiprocessor is hard too. Figuring out how to use those resources to satisfy requests is even harder. As discussed later in this series, middleware takes up the task of managing system resources across the multiprocessor.

Figure 6-1. Kernels in the multiprocessor.

Real-Time Multiprocessor Operating Systems
An embedded multiprocessor may or may not have a true multiprocessor operating system. In many cases, the various processors run their own operating systems, which communicate to coordinate their activities. In other cases, a more tightly integrated operating system runs across several processing elements (PEs).

A simple form of multiprocessor operating system is organized with a master and one or more slaves. The master PE determines the schedules for itself and all the slave processors. Each slave PE simply runs the processes assigned to it by the master.

This organization scheme is conceptually simple and easy to implement. All the information that is needed for scheduling is kept by the master processor. However, this scheme is better suited to homogeneous multiprocessors that have pools of identical processors.

Figure 6-1 above shows the organization of a multiprocessor operating system in relation to the underlying hardware. Each processor has its own kernel, known as the PE kernel. The kernels are responsible for managing purely local resources, such as devices that are not visible to other processors, and implementing the decisions on global resources.

The PE kernel selects the processes to run next and switches contexts as necessary. But the PE kernel may not decide entirely on its own which process runs next. It may receive instructions from a kernel running on another processing element.

The kernel that operates as the master gathers information from the slave PEs. Based on the current state of the slaves and the processes that want to run on the slaves, the master PE kernel then issues commands to the slaves about their schedules. The master PE can also run its own jobs.
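
The master's side of this arrangement might look roughly like the sketch below. It assumes a greedy least-loaded heuristic, which is only one possible policy and is not taken from the book; the names slave_state, pick_slave, and assign_process, and the value of NUM_SLAVES, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SLAVES 3   /* hypothetical number of slave PEs */

/* The master's record of one slave PE, built from what the slave reports. */
struct slave_state {
    unsigned load;     /* total cost of the processes assigned so far */
    unsigned nprocs;   /* number of processes assigned so far */
};

/* Return the index of the slave with the smallest recorded load.
 * Ties go to the lowest index. */
size_t pick_slave(const struct slave_state slaves[], size_t n)
{
    assert(slaves != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (slaves[i].load < slaves[best].load)
            best = i;
    }
    return best;
}

/* Master-side assignment: choose a slave, update the master's record,
 * and return the chosen index. A real master kernel would then send a
 * command to that slave's PE kernel. */
size_t assign_process(struct slave_state slaves[], size_t n, unsigned cost)
{
    size_t s = pick_slave(slaves, n);
    slaves[s].load += cost;
    slaves[s].nprocs++;
    return s;
}
```

All scheduling state lives in the master's table, which is what makes the scheme simple. Treating every slave as interchangeable is also why this sketch fits a pool of identical processors better than a heterogeneous platform.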

One challenge in designing distributed schedulers is that communication is not free and any processor that makes scheduling decisions about other PEs usually will have incomplete information about the state of those PEs. When a kernel schedules its own processor, it can easily check on the state of that processor.

When a kernel must perform a remote read to check the state of another processor, the amount of information the kernel requests needs to be carefully budgeted.
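
One way to budget remote reads is to keep a cached copy of each remote PE's state and refresh it only when it is both stale and affordable. The sketch below illustrates that idea under stated assumptions: the structures remote_view and read_budget, the function should_refresh, and the tick-based notion of age are all hypothetical, not part of any kernel described in the book.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Cached copy of one remote PE's state, as seen by the local kernel. */
struct remote_view {
    unsigned ready_count;   /* last known number of ready processes */
    unsigned last_update;   /* local tick at which the copy was refreshed */
};

/* Number of remote reads the kernel may still issue in this scheduling pass. */
struct read_budget {
    unsigned remaining;
};

/* Decide whether to spend one remote read on refreshing this view.
 * Returns true, and charges the budget, only when the cached copy is older
 * than max_age ticks and budget remains. Otherwise the caller schedules
 * with the possibly stale cached copy. */
bool should_refresh(const struct remote_view *v, struct read_budget *b,
                    unsigned now, unsigned max_age)
{
    assert(v != NULL && b != NULL);
    if (now - v->last_update <= max_age)
        return false;   /* cached copy is fresh enough */
    if (b->remaining == 0)
        return false;   /* stale, but no budget left */
    b->remaining--;
    return true;
}
```

The trade-off is explicit: a larger budget or a smaller max_age gives the scheduler better information at the price of more communication delay.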

Vercauteren et al. [Ver96] developed a kernel architecture for custom heterogeneous processors. As shown in Figure 6-2 below, the kernel architecture includes two layers: a scheduling layer and a communication layer.

Figure 6-2. Custom multiprocessor scheduler and communications.

The basic communication operations are implemented by interrupt service routines (ISRs), while the communication layer provides more abstract communication operations.

The communication layer provides two types of communication services. The kernel channel is used only for kernel-to-kernel communication - it has high priority and is optimized for performance. The data channel is used by applications and is more general purpose.
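
The priority relationship between the two channels can be sketched as two queues, with the receiver always draining kernel traffic first. This is an illustrative model only, not the implementation from [Ver96]; the channel structure, the queue depth CHAN_CAP, and the function names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CHAN_CAP 8   /* hypothetical queue depth */

/* A small FIFO of integer messages. */
struct channel {
    int    msgs[CHAN_CAP];
    size_t head;    /* index of the next message to receive */
    size_t count;   /* number of queued messages */
};

/* Append a message; returns false if the channel is full. */
bool chan_send(struct channel *c, int msg)
{
    assert(c != NULL);
    if (c->count == CHAN_CAP)
        return false;
    c->msgs[(c->head + c->count) % CHAN_CAP] = msg;
    c->count++;
    return true;
}

/* Remove the oldest message; returns false if the channel is empty. */
bool chan_recv(struct channel *c, int *msg)
{
    assert(c != NULL && msg != NULL);
    if (c->count == 0)
        return false;
    *msg = c->msgs[c->head];
    c->head = (c->head + 1) % CHAN_CAP;
    c->count--;
    return true;
}

/* Deliver kernel-to-kernel traffic first; application data is delivered
 * only when the kernel channel is empty. */
bool next_message(struct channel *kernel_ch, struct channel *data_ch, int *msg)
{
    if (chan_recv(kernel_ch, msg))
        return true;
    return chan_recv(data_ch, msg);
}
```

With strict priority like this, a burst of kernel messages can delay application data, which is acceptable only if kernel traffic is short and infrequent.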

Example: TI's OMAP multiprocessor configuration
Consider the operating systems and communications in TI's OMAP. The OMAPI standard defines some core capabilities for multimedia systems. One of the things that OMAPI does not define is the operating systems used in the multiprocessor. The TI OMAP family implements the OMAPI architecture. The figure below shows the lower layers of the TI OMAP, including the hardware and operating systems.

Operating Systems and Communication in the TI OMAP

The main unifying structure in OMAP is the DSPBridge, which allows the DSP and RISC processor to communicate. The bridge includes a set of hardware primitives that are abstracted by a layer of software. The bridge is organized as a master/slave system in which the ARM is the master and the C55x is the slave.

This fits the nature of most multimedia applications, where the DSP is used to efficiently implement certain key functions while the RISC processor runs the higher levels of the application.

The DSPBridge API implements several functions: it initiates and controls DSP tasks, exchanges messages with the DSP, streams data to and from the DSP, and checks the status of the DSP.

The OMAP hardware provides several mailbox primitives - separate addressable memories that can be accessed by both processors. In the OMAP 5912, two of the mailboxes can be written only by the C55x but read by both it and the ARM, while two can be written only by the ARM and read by both processors.
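
The single-writer rule for the mailboxes can be modeled in software as follows. This is a plain C model of the access rule only, not the OMAP 5912 register interface: the 16-bit word size, the full flag, the mailbox indices, and all names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum cpu { CPU_ARM, CPU_DSP };

#define NUM_MAILBOXES 4

struct mailbox {
    enum cpu writer;   /* the only processor allowed to write this mailbox */
    uint16_t value;    /* last word written (word size is an assumption) */
    bool     full;     /* set by a write, cleared by a read */
};

/* Two mailboxes are writable by the ARM and two by the DSP.
 * Which index belongs to which processor is arbitrary in this model. */
void mailboxes_init(struct mailbox mb[NUM_MAILBOXES])
{
    assert(mb != NULL);
    for (int i = 0; i < NUM_MAILBOXES; i++) {
        mb[i].writer = (i < 2) ? CPU_ARM : CPU_DSP;
        mb[i].value = 0;
        mb[i].full = false;
    }
}

/* Write succeeds only for the owning processor and only if the previous
 * word has been consumed. */
bool mailbox_write(struct mailbox *m, enum cpu who, uint16_t value)
{
    assert(m != NULL);
    if (who != m->writer || m->full)
        return false;
    m->value = value;
    m->full = true;
    return true;
}

/* Either processor may read; reading consumes the word. */
bool mailbox_read(struct mailbox *m, uint16_t *out)
{
    assert(m != NULL && out != NULL);
    if (!m->full)
        return false;
    *out = m->value;
    m->full = false;
    return true;
}
```

Giving each mailbox exactly one writer means the two processors never compete to write the same location, so no lock is needed on the write side.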

Next in Part 2: Multiprocessor Scheduling

Used with the permission of the publisher, Newnes/Elsevier, this series of five articles is based on copyrighted material from "High-Performance Embedded Computing," by Wayne Wolf. The book can be purchased online.

Wayne Wolf is professor of electrical engineering at Princeton University. Prior to joining Princeton he was with AT&T Bell Laboratories. He has served as editor in chief of the ACM Transactions on Embedded Computing and of Design Automation for Embedded Systems.

References:
[Ver96] S. Vercauteren, B. Lin, and H. De Man, "A strategy for real time kernel support in application specific HW/SW embedded architectures," in Proceedings, 33rd Design Automation Conference, ACM Press, 1996, pp. 678-682.

