November 01, 2006
Embedded Multicore Needs Communications Standards
Markus Levy and Sven Brehmer
Multicore systems are popular but problematic. The authors describe the problem with communications APIs and what the Multicore Association is doing about it.
What do firefighting and multicore programming have in common? Both are hot jobs. Firefighters and multicores both need to get the job done as quickly and effectively as possible. They both require reliable, standardized tools. Firefighters always act as a team, and the same goes for multicore. But most importantly, they both have to communicate well. Without communication, firefighters don't survive and the cores in a multicore system may as well be operating alone.
Analogies aside, it's important to point out that "excellent communication" is a relative term that depends on the application's requirements. Regardless of the implementation, however, multicore systems can be classified according to their memory architectures and their communication mechanisms.
Before we go on, we should point out that when we say multicore, we're talking about systems with two or more processing elements, including homogeneous (same processor type) and heterogeneous (different processor types) multiprocessor systems, as well as coprocessors and hardware accelerators. We should also point out that this article focuses on multicore-enabled closely distributed embedded applications, but we'll take a look at the similarities and differences of the memory architectures and communication application programming interfaces (APIs) used in desktops, servers, and networks.
Memory architectures and communication APIs
In a distributed memory system each processor can access only its own local memory; no global memory address space exists across the processors, and communication relies on various forms of message passing. A core with its own local memory doesn't have to share access to that memory, which makes for an efficient and scalable structure. When one core requires data from another core, or when cores need to synchronize among themselves, data must be physically moved (in other words, copied rather than passed by reference). Message passing can be asynchronous, meaning that a task can perform other computations while it waits for data to arrive. Alternatively, message passing can be synchronous; the waiting task is blocked until data arrives. If both shared and local memory are available, it's possible to create efficient communications structures by combining the best features of both.
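
To make the copy-versus-reference point concrete, here is a minimal sketch in C of a single-slot mailbox with an asynchronous (nonblocking) receive. Every name in it (`mailbox_t`, `mbox_send`, `mbox_try_recv`, `MSG_MAX`) is hypothetical and chosen only for illustration; none comes from an existing API. The sketch is single-threaded: a real inter-core mailbox would also need atomic operations or memory barriers suited to the hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MSG_MAX 32  /* hypothetical maximum payload size in bytes */

/* Single-slot mailbox that lives in the receiving core's local memory. */
typedef struct {
    unsigned char data[MSG_MAX];
    size_t len;
    bool full;
} mailbox_t;

/* Copy the payload into the mailbox. Returns false if the slot is occupied. */
bool mbox_send(mailbox_t *mb, const void *payload, size_t len)
{
    assert(len <= MSG_MAX);
    if (mb->full)
        return false;
    memcpy(mb->data, payload, len);  /* the data is moved, not referenced */
    mb->len = len;
    mb->full = true;
    return true;
}

/* Asynchronous receive: returns at once, true only if a message was
 * copied into out. The out buffer must hold at least MSG_MAX bytes or
 * the known message size. */
bool mbox_try_recv(mailbox_t *mb, void *out, size_t *len)
{
    if (!mb->full)
        return false;  /* nothing yet; the caller can do other work */
    memcpy(out, mb->data, mb->len);
    *len = mb->len;
    mb->full = false;
    return true;
}
```

Because the payload is copied on send, the sender can overwrite its own buffer afterward without affecting what the receiver sees. A synchronous receive could be layered on top by having the receiver wait until `mbox_try_recv` succeeds.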
Traditional comms APIs
For shared memory architectures that use simple communication schemes, the most widely used implementations and APIs are proprietary. For example, TI's DSP/BIOS Link is specifically designed for the company's chips. For more complex implementations and communication schemes, operating systems with symmetric multiprocessing (SMP) are commonly used.
From a standards perspective, OpenMP is probably the most widely used API for shared memory architectures, supporting multiprocessing programming in C/C++ and Fortran on many architectures, including UNIX and Microsoft Windows platforms. OpenMP consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior. OpenMP is a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer. The core elements of OpenMP are the constructs for thread creation, workload distribution (work sharing), data environment management, thread synchronization, user-level run-time routines, and environment variables.
OpenMP supports incremental parallelism, allowing developers to parallelize one portion of the program at a time; no dramatic code changes are needed. This means that OpenMP can be gradually introduced into existing applications, thereby reducing the pain incurred when transitioning from a single core to a multicore system. Since OpenMP is most useful for parallelizing loops, it may only be applicable to a portion of an application, which limits the opportunity to exploit all forms of parallel execution. Since embedded applications are generally more event driven than general-purpose and high-performance computing applications, it's possible that the opportunity for parallel (loop) threads is further reduced.
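
As an illustration of the incremental, loop-oriented style described above, the following C sketch parallelizes a single loop with one OpenMP directive. The function name `dot` and its parameters are made up for the example. When the compiler's OpenMP support is switched on (for example with GCC's `-fopenmp` option), the iterations are divided among threads; otherwise the pragma is ignored and the same code runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of two vectors of length n. The directive asks OpenMP to
 * split the loop iterations across threads and to combine the per-thread
 * partial sums into sum when the loop finishes. */
double dot(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double sum = 0.0;
    long i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < (long)n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

Only the one loop changes; the rest of the program is untouched, which is what makes gradual adoption practical.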
Supporting communication
MPI (the Message Passing Interface) is the most popular message-passing API for widely distributed computing, being both portable (MPI has been implemented for many distributed memory architectures) and fast (each implementation is optimized for the hardware on which it runs). However, although MPI is powerful, it's also complex. In its full form it will likely consume more memory than a multicore chip has available, and it may introduce too much computational overhead (latency). Nevertheless, useful features can be borrowed from MPI for an embedded application. Computer clusters are often programmed with a so-called hybrid model that uses both OpenMP and MPI.
Embedding communication
The resource management, communications, and synchronization required for embedded distributed systems are some specific areas of programming multicore systems that must be addressed. The reality is that such systems can't rely only on a single operating system--or even an SMP operating system--for such services. When heterogeneous multicore systems employ a range of operating systems across multiple cores, it means they have resources that can't be managed by any single operating system. This situation is exacerbated further by the presence of hardware accelerators that don't run any form of operating system but must interact with processes that are potentially running on multiple operating systems on different cores.
Before going into the details of the high-performance communication mechanisms required by embedded systems, let's examine the functions and communication requirements of some example embedded applications. In particular, let's look at an automotive application employing tens to hundreds of sensor inputs, which must be read on a periodic basis, and an application that processes network packets.
Controlling an engine
Ideally, the sum of the latencies plus message send/receive times should be less than the latency of the control loop at the current engine RPM. In general, individual tasks are expected to complete in times varying from 1ms up to 1,600ms, depending on the nature of the sensor and the type of processing required for its data.
In this application, the control task must be able to determine whether data is available from each data task. If it isn't, the control task should be able to proceed in a nonblocking fashion using the last data from the sensor in question; this implies some form of nonblocking message test or select mechanism. It would also be desirable for the control task to use a lightweight communication API to send updates to actuators as "messages." The data task should use this API to read data from the sensor, implying some sort of driver implementation underneath this API. Finally, the data task should be able to do a nonblocking message send to the control task.
Ideally, we'd be able to try out different ways of partitioning an application to distribute it optimally across multiple cores. For example, the automotive application could use one processor for control and one for data tasks; one processor for both control and data tasks plus a SIMD core for signal processing and other special-purpose processors for the remaining data processing; or simply dedicate one core to each cylinder or group of cylinders. Using a standard API that's the same for the different cores and operating systems would make it much easier to try different implementations.
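
The nonblocking test described above might look like the following C sketch, in which the control task polls one channel per data task and falls back on the previous reading when nothing new has arrived. The types and functions (`sensor_channel_t`, `channel_post`, `channel_poll`, `control_step`) are hypothetical names invented for this illustration, and the code is single-threaded; a real implementation would sit on top of whatever inter-core transport and synchronization the platform provides.

```c
#include <assert.h>
#include <stdbool.h>

/* One channel per data task: the newest sample plus a "fresh" flag. */
typedef struct {
    int  sample;
    bool fresh;
} sensor_channel_t;

/* Data task side: nonblocking send that overwrites any unread sample. */
void channel_post(sensor_channel_t *ch, int sample)
{
    ch->sample = sample;
    ch->fresh = true;
}

/* Control task side: nonblocking test. Returns true and stores the
 * sample in out only if new data has arrived since the last poll. */
bool channel_poll(sensor_channel_t *ch, int *out)
{
    if (!ch->fresh)
        return false;
    *out = ch->sample;
    ch->fresh = false;
    return true;
}

/* One control-loop iteration over n sensors: refresh last[i] where new
 * data is available and keep the previous value otherwise. Returns the
 * number of sensors that delivered new data. */
int control_step(sensor_channel_t ch[], int last[], int n)
{
    assert(n >= 0);
    int updated = 0;
    for (int i = 0; i < n; i++) {
        int v;
        if (channel_poll(&ch[i], &v)) {
            last[i] = v;
            updated++;
        }
        /* otherwise proceed with last[i] from an earlier iteration */
    }
    return updated;
}
```

The control task never waits on a slow sensor: each call to `control_step` finishes in time proportional to the number of channels, regardless of which data tasks have reported.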
Processing packets
There are two desired modes of packet transfer between modules. In one mode, IP packets are brought in from the outside world and streamed between the modules without going through external shared memory. In another mode, packets are placed in shared memory between modules, and each module accesses the packets from the shared memory. A commonly used hybrid of these modes is one in which packets are placed in shared memory, while packet descriptors and metadata are streamed directly between cores without going through shared memory.
In this application, although each of the modules has an independent flow of control, the modules must also communicate and synchronize with each other. The modules access both private and shared data. Therefore, the modules need both local memory (preferably cache) and global shared memory. In general, it's preferable for the shared memory to be globally shared between all the cores. Much of the communication between the cores follows a stream pattern, occurring once or twice per packet. Thus, Module 3 must efficiently receive (from Module 2) and transmit (to Module 4) 20 to 50 bytes of data for every packet it processes, in other words for each 100 to 2,000 cycles of execution.
From a computational perspective, all processing in this application is both parallel and pipelined. Each packet is first processed in a pipelined manner by Module 1 and then by Module 2 (representing hardware acceleration). Modules 1 and 2 perform very little computation for each packet, but they must quickly examine the metadata before the packet is sent to Module 3. All the compute elements represented by Module 3 handle the compute-intensive parallel processing of packets. Each instance of Module 3 performs between 100 and 4,000 cycles of computation on each packet. In other words, multiple Module 3s process independent packets in parallel. Module 3 also accesses shared memory for packet data while it's computing. After Module 3 is done, it might send some data to Module 4 indicating how to process the packets before they're shipped back out on the network.
Performance is commonly scaled by adding more modules (for example, more instances of Module 3) without changing the code in Module 3. This implies several important factors. First, the communication and synchronization APIs must be flexible and scalable enough to allow simple upgrades of the system. Second, although the latency of processing a packet is important, the packet throughput through the system is the key processing metric and potential bottleneck. Hence, the communication mechanisms that support the packet transfers must be able to move the data efficiently. Another important factor relates to time-to-market and the ability to quickly port third-party software that previously ran on sequential processors. With a standard communications API, a communications module can simply be added to each functional module, simplifying the porting.
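
The hybrid transfer mode, in which packets stay in shared memory while small descriptors are streamed between cores, could be sketched in C as a single-producer, single-consumer ring of descriptors between two adjacent modules. Everything here is an assumption made for illustration: the descriptor fields, the ring depth `RING_SLOTS`, and the function names are hypothetical, and the C11 atomics stand in for whatever ordering guarantees the actual hardware offers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 8u  /* hypothetical depth; must be a power of two */
static_assert((RING_SLOTS & (RING_SLOTS - 1u)) == 0u,
              "RING_SLOTS must be a power of two");

/* Descriptor streamed between modules; the packet stays in shared memory. */
typedef struct {
    uint32_t offset;  /* packet location within the shared packet buffer */
    uint16_t length;  /* packet length in bytes */
    uint16_t flags;   /* metadata, for example a classification result */
} pkt_desc_t;

/* Single-producer, single-consumer ring between two adjacent modules. */
typedef struct {
    pkt_desc_t slot[RING_SLOTS];
    atomic_uint head;  /* next slot to write; updated only by the producer */
    atomic_uint tail;  /* next slot to read; updated only by the consumer */
} desc_ring_t;

void ring_init(desc_ring_t *r)
{
    atomic_init(&r->head, 0u);
    atomic_init(&r->tail, 0u);
}

/* Producer side (upstream module). Returns false if the ring is full. */
bool ring_push(desc_ring_t *r, pkt_desc_t d)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return false;
    r->slot[head % RING_SLOTS] = d;
    /* release: the descriptor is visible before the new head is */
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);
    return true;
}

/* Consumer side (downstream module). Returns false if the ring is empty. */
bool ring_pop(desc_ring_t *r, pkt_desc_t *out)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slot[tail % RING_SLOTS];
    atomic_store_explicit(&r->tail, tail + 1u, memory_order_release);
    return true;
}
```

Neither side blocks or takes a lock, so per-packet communication cost stays small relative to the per-packet compute budget. Scaling out by adding more compute modules would then mean adding more rings rather than changing the module code.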
There's an API for us
The Multicore Association's APIs will form a layer on top of which other abstractions or applications may be built, as shown in Figure 2. For maximum performance, an application can interact directly with the API for inter-core communication, synchronization, and resource allocation. In other words, the application can avoid a series of expensive operating-system calls.
There's a long list of features and functions that these APIs could support, but in actuality these APIs represent a subset of existing APIs. The real challenge in developing such APIs is determining what functions and features not to use. In other words, to meet the stringent demands of the single-chip (or otherwise closely distributed) multicore platform, the APIs may draw upon concepts implemented in the traditional protocols (programming models, semantics, and so forth), which were originally intended for large-scale computing platforms, but must do so within the resource constraints of an embedded multicore system.
Some of the features the Multicore Association is considering for the APIs include making the source code portable and reusable so the architecture can be processor independent, and enabling implementations to be scalable for messaging performance and memory footprint. It should also be possible to build more powerful and complex capabilities on top of this API to enable system-level control via message passing.
The embedded systems industry appears to have endorsed multicore technology, but the gap between its capabilities and the available software support for multicore implementations continues to grow. In this article, we've barely scratched the surface of the issues being resolved in the embedded multicore world, as well as the standards that are being developed within the Multicore Association. Also being explored within the Multicore Association is multicore debugging, an entire topic unto itself. More information can be found at www.multicore-association.org.
Markus Levy is founder and president of the Embedded Microprocessor Benchmark Consortium and serves as president of the Multicore Association. He's worked for EDN and Instat/MDR and is coauthor of Designing with Flash Memory. He also worked for Intel as a senior applications engineer and customer training specialist for Intel's microprocessor and flash memory products. You can reach him at markus@multicore-association.org.
Prior to founding PolyCore Software, Sven Brehmer served as senior director in charge of Wind River's Embedded Platforms Division, then home of VxWorks, pSOS, and VSPWorks. He came to Wind River through its acquisition of ISI in 2000, where Brehmer served as the COO and executive vice president of DIAB-SDS, a subsidiary of ISI. Prior to DIAB-SDS, Brehmer was the president and CEO of Diab Data. Brehmer has a master's in electronics engineering from the Royal Institute of Technology, Stockholm, Sweden.