February 12, 2007
Next-gen Multicore Networks-on-chip Systems: Part 2
Luca Benini and Giovanni De Micheli
In the second of a six-part series based on their book "Networks On Chips," Luca Benini and Giovanni De Micheli describe the architectural, programming and debug challenges of next-gen multicore networks-on-chips. This week: SoC objectives and NoC needs.
SoC hardware designs come in several types, defined by the required functionality and the target market. In general, SoCs can be classified in terms of their versatility (i.e., support for programming) and application domains. A simple taxonomy is described next.

General-purpose on-chip multiprocessors are chips that exploit spatial locality to achieve high performance. They are designed to support various applications, and thus the processor core usage and traffic patterns may vary widely. They are the evolution of on-board multiprocessors, and they are typified by a homogeneous set of processing and storage arrays. For these reasons, on-chip network design can draw on the many architectures and techniques developed for on-board multiprocessors, with the appropriate adjustments to operate on a silicon substrate.

Application-specific SoCs are hardware chips dedicated to an application. In some cases, as in all mobile applications, energy consumption is a major concern. Most application-specific SoCs are programmable, but their application domain is limited and the software characteristics are known a priori. Thus, some knowledge of the traffic pattern is available when the NoC is designed. In many cases, these systems contain fairly heterogeneous computing elements, such as processors, controllers, digital signal processors (DSPs) and a number of domain-specific hardware accelerators. This heterogeneity may lead to specific traffic patterns and requirements, thus requiring NoCs with specialized architectures and protocols.

SoC platforms are
application-specific SoCs dedicated to a family of applications in a specific domain. Examples are SoCs for GSM telephony support and platforms for automotive control. A platform is more versatile in nature, as it can be used in different (embedded) systems by different manufacturers. Thus, versatility and programmability are preferred to customization, yielding SoCs that can be produced in high volumes and thus offset the non-recurring engineering (NRE) costs. Although the processing and storage units may differ in nature and performance, the traffic patterns are harder to predict a priori because the application software may vary widely.

Field-programmable gate arrays
(FPGAs) are hardware systems whose functionality is determined after manufacturing by connecting and configuring components. Components vary in size and in functionality and are connected by reprogrammable networks. These networks are simple and provide bit-level connectivity with little or no control. Nevertheless, we expect FPGAs to grow substantially over the coming years and to require effective NoC communication.

Some Design Examples

One of the first multiprocessors
designed around an NoC is the RAW architecture [32]. This is a fully programmable SoC consisting of an array of identical computational tiles with local storage. Full programmability means that the compiler can program both the function of each tile and the interconnections among them. The name RAW stems from the fact that the "raw" hardware is fully exposed to the compiler. To accomplish programmable communication, each tile has a router. The compiler programs the routers on all tiles to issue a sequence of commands that determine exactly which set of wires connect at every cycle. Moreover, the compiler pipelines the long wires to support high clock frequency.

The Cell processor [26] was
developed by Sony, Toshiba and IBM as a general-purpose processor for a computer, even though it is primarily targeted at Sony's PlayStation 3. Its architecture resembles multiprocessor vector supercomputers, targeting high-performance distributed computing.

The architecture comprises one 64-bit Power processor element (PPE), eight synergistic processor elements (SPEs), memory and interconnection. The PPE is a dual-issue, dual-threaded in-order RISC processor with a 512 KB cache. Each SPE is a self-contained in-order vector processor which acts as an independent processor. Each contains a 128 × 128-bit register file, four (single-precision) floating-point units and four integer units. The element interconnection bus (EIB) connects the PPE, the eight SPEs and the memory interface controller (Figure 1.8, below).

The EIB has independent networks for commands (requests for data from other sources) and for the data being moved. Commands are filtered through address concentrators which handle collision detection and prevention, and ensure that all units have equal access to the command bus. There are multiple address concentrators, all of which forward data to a single serial command reflection point.

Data transfer is elaborate. There are four "rings," each of which is a chain connecting all data ports. Data can move down a ring in only one direction. For instance, a connection that allows data to move from the PPE to SPE1 cannot be used to move data from SPE1 back to the PPE. Two rings go clockwise and two counterclockwise, and all four rings have the components attached in the same order. Each ring can move 16 bytes at a time from any position on the ring to any other position. In fact, each ring can carry three concurrent transfers, provided those transfers do not overlap.
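
The ring rules just described (one direction per ring, at most three concurrent transfers per ring, no overlapping segments) can be illustrated with a small model. This is only a sketch under stated assumptions, not the actual EIB arbiter: the attachment order in PORTS, the helper names, and the policy of trying the shorter direction first are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field

# Attachment order is a hypothetical choice; the text only says that all
# four rings attach the components in the same order.
PORTS = ["PPE", "SPE1", "SPE2", "SPE3", "SPE4",
         "MIC", "SPE5", "SPE6", "SPE7", "SPE8"]
N = len(PORTS)
MAX_TRANSFERS_PER_RING = 3   # from the text: three concurrent transfers per ring
BYTES_PER_BEAT = 16          # from the text: each ring moves 16 bytes at a time


def links_used(src: int, dst: int, clockwise: bool) -> set[int]:
    """Return the link indices a transfer occupies.

    Link i joins position i and position (i + 1) % N.
    """
    links = set()
    pos = src
    while pos != dst:
        if clockwise:
            links.add(pos)
            pos = (pos + 1) % N
        else:
            pos = (pos - 1) % N
            links.add(pos)
    return links


@dataclass
class Ring:
    clockwise: bool
    active: list[set[int]] = field(default_factory=list)

    def try_grant(self, src: int, dst: int) -> bool:
        """Accept the transfer only if the ring has capacity and no link overlaps."""
        if len(self.active) >= MAX_TRANSFERS_PER_RING:
            return False
        wanted = links_used(src, dst, self.clockwise)
        if any(wanted & busy for busy in self.active):
            return False
        self.active.append(wanted)
        return True


def make_rings() -> list[Ring]:
    """Two clockwise and two counterclockwise rings, as in the text."""
    return [Ring(True), Ring(True), Ring(False), Ring(False)]


def grant(rings: list[Ring], src_name: str, dst_name: str):
    """Return the index of the ring that accepted the transfer, or None.

    Assumed policy: try rings running in the shorter direction first.
    """
    src, dst = PORTS.index(src_name), PORTS.index(dst_name)
    if src == dst:
        raise ValueError("source and destination must differ")
    cw_hops = (dst - src) % N
    prefer_cw = cw_hops <= N - cw_hops
    order = sorted(range(len(rings)),
                   key=lambda i: rings[i].clockwise != prefer_cw)
    for i in order:
        if rings[i].try_grant(src, dst):
            return i
    return None


if __name__ == "__main__":
    demo_rings = make_rings()
    for s, d in [("PPE", "SPE1"), ("SPE1", "PPE"), ("PPE", "SPE2")]:
        print(s, "->", d, "ring", grant(demo_rings, s, d))
```

In this model a transfer from the PPE to SPE1 is placed on a clockwise ring, while the reply from SPE1 to the PPE has to use a counterclockwise ring, and a second transfer leaving the PPE clockwise is pushed to the other clockwise ring because its segment overlaps the first.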

The Nexperia architecture, developed by Philips (NXP Semiconductors), is a platform for handling digital video and audio in consumer electronics (Figure 1.9, below). It uses one or more 32-bit MIPS CPUs for control processing, and one or more 32-bit Trimedia processors for streaming data. Moreover, the platform can house a flexible range of programmable modules, such as an MPEG decoder, a UART, etc. To connect the CPUs and other modules with each other and with the main external memory, a high-speed memory access network and two device control and status (DCS) networks are used. These DCS networks enable each processor to control and observe, on chip, the status of the other modules. One advantage of the platform is that the number of CPUs can vary, which lets Nexperia fit a variety of applications. A specific implementation of Nexperia, the PNX8550 system chip, houses 10 million gates in 62 cores, of which five are hard cores (including the MIPS and Trimedia CPUs) and the others are soft cores [15].

The Xilinx Spartan-II FPGA chips are rectangular arrays of configurable logic blocks (CLBs). Each block can be programmed to perform a specific logic function. CLBs are connected via a hierarchy of routing channels. A more complex and interesting family of products is the Xilinx Virtex-II and Virtex-II Pro. These FPGAs have various complex elements, such as CLBs, RAMs, processor cores, multipliers and clock managers. Programmable interconnection is achieved by routing switches. Each programmable element is connected to a switch matrix, allowing multiple connections to the general routing matrix. All programmable elements, including the routing resources, are controlled by values stored in static memory cells. Thus, Virtex-II can also be seen as an NoC over a heterogeneous fabric of components.

The complexity of the chip designs described above has prompted the development of infrastructure to support communication. For example, STMicroelectronics has developed the STBus kit, which can provide various functions including full (and partial) crossbar connection. A similar framework is provided by the advanced microcontroller bus architecture (AMBA) multi-layer bus system.

Distinguishing Characteristics of NoCs

NoCs differ from wide-area networks because of local proximity and because they exhibit much less non-determinism. Indeed, despite the undesirable variability of deep-submicron (DSM) CMOS technologies, it is still possible to predict many physical and electrical parameters with reasonable accuracy. On the other hand, on-chip networks
have a few distinctive characteristics, namely low communication latency, energy consumption constraints and design-time specialization.

Latency of communication on chip needs to be small, that is, on the order of a few clock periods. The shortest latencies can be achieved by fully hard-wired implementations, which sacrifice the flexibility required by on-chip networks. Clearly, smart protocols for communication may add to the latency of the signals. Thus, to be competitive in performance, NoCs require streamlined protocols.

Energy consumption in NoCs is often a major concern: whereas computation and storage energy benefit greatly from device scaling (smaller gates and smaller memory cells), the energy for global communication does not scale down. On the contrary, projections based on current delay optimization techniques for global wires [18, 29, 31] show that global communication on chip will require increasingly higher energy consumption. Hence, communication-energy minimization will be a growing concern in future technologies. Furthermore, network traffic control and monitoring can help in better managing the power consumed by networked computational resources. For instance, clock speed and voltage of end nodes can be varied according to available network bandwidth.

Design-time specialization is another
facet of NoC design, and it is relevant to application-specific and platform SoCs. Whereas macroscopic networks emphasize general-purpose communication and modularity, in NoCs these constraints are less restrictive because most on-chip solutions are proprietary. Thus, an NoC implementation may separate data from control, and use arbitrary bus widths and control flow schemes. Such flexibility needs to be constrained at the NoC boundary, that is, where the communication infrastructure connects to end nodes (e.g., processors). Existing standards like the Open Core Protocol (OCP) are extremely useful in defining the interface between processor/storage arrays and NoCs. Interestingly enough, the flexibility in tailoring the NoC to the specific application can be used effectively to design low-energy communication schemes.
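
As a rough illustration of design-time specialization, the sketch below picks a link width for traffic that is known in advance. It assumes a simple pipelined model that is not taken from the text: the first word of a payload needs a fixed number of cycles per hop, and the remaining words follow at one per cycle. The function names, the candidate widths and the cycles_per_hop parameter are all hypothetical.

```python
import math


def transfer_cycles(payload_bits: int, link_width_bits: int,
                    hops: int, cycles_per_hop: int = 1) -> int:
    """Cycles for a payload to cross the network under the assumed model.

    The first word takes hops * cycles_per_hop cycles; every further
    word arrives one cycle after the previous one.
    """
    if payload_bits <= 0 or link_width_bits <= 0 or hops <= 0 or cycles_per_hop <= 0:
        raise ValueError("all arguments must be positive")
    words = math.ceil(payload_bits / link_width_bits)
    return hops * cycles_per_hop + (words - 1)


def narrowest_width(payload_bits: int, hops: int, budget_cycles: int,
                    candidates=(8, 16, 32, 64, 128)):
    """Return the narrowest candidate width that meets the latency budget.

    Returns None if no candidate is fast enough.
    """
    for width in sorted(candidates):
        if transfer_cycles(payload_bits, width, hops) <= budget_cycles:
            return width
    return None


if __name__ == "__main__":
    # A 512-bit payload over three hops with a ten-cycle budget.
    print(narrowest_width(512, hops=3, budget_cycles=10))
```

Choosing the narrowest width that still meets the budget is only one plausible objective; a real design flow would weigh wiring area, energy per transferred bit and contention as well.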

To read Part 1 go to "Why on-chip networking?"

Used with the permission of the publisher, Newnes/Elsevier, this series of six articles is based on material from "Networks On Chips: Technology and Tools," by Luca Benini and Giovanni De Micheli.

Luca Benini is professor at the Department of Electrical Engineering and Computer Science at the University of Bologna, Italy. Giovanni De Micheli is professor and director of the Integrated Systems Center at EPF in Lausanne, Switzerland.

References

[1] A. Adriahantenaina, H. Charlery, A. Greiner, L. Mortiez and C. Zeferino, "SPIN: A Scalable, Packet Switched, On-Chip Micro-network," DATE - Design, Automation and Test in Europe Conference and Exhibition, 2003, pp. 70-73.
[2] A.H. Ajami, K. Banerjee and M. Pedram, "Modeling and Analysis of Nonuniform Substrate Temperature Effects on Global ULSI Interconnects," IEEE Transactions on CAD, Vol. 24, No. 6, June 2005, pp. 849-861.
[3] H. Bakoglu, Circuits, Interconnections, and Packaging for VLSI, Addison-Wesley, Upper Saddle River, NJ, 1990.
[4] L. Benini, A. Bogliolo and G. De Micheli, "A Survey of Design Techniques for System-Level Dynamic Power Management," IEEE Transactions on Very Large-Scale Integration Systems, Vol. 8, No. 3, June 2000, pp. 299-316.
[5] W.O. Cesario, D. Lyonnard, G. Nicolescu, Y. Paviot, S. Yoo, L. Gauthier, M. Diaz-Nava and A.A. Jerraya, "Multiprocessor SoC Platforms: A Component-Based Design Approach," IEEE Design and Test of Computers, Vol. 19, No. 6, November-December 2002, pp. 52-63.
[6] W. Dally and B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann, San Francisco, CA, 2004.
[7] W. Dally and B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks," Proceedings of the 38th Design Automation Conference, 2001.
[8] W.J. Dally and H. Aoki, "Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels," IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 4, April 1993, pp. 466-475.
[9] W. Dally and C. Seitz, "The Torus Routing Chip," Distributed Processing, Vol. 1, 1996, pp. 187-196.
[10] M. Dall'Osso, G. Biccari, L. Giovannini, D. Bertozzi and L. Benini, "Xpipes: A Latency Insensitive Parameterized Network-on-Chip Architecture for Multiprocessor SoCs," International Conference on Computer Design, 2003, pp. 536-539.
[11] D. Ernst, S. Das, S. Lee, D. Blaauw, T. Austin, T. Mudge, N.S. Kim and K. Flautner, "Razor: Circuit-Level Correction of Timing Errors for Low-Power Operation," IEEE Micro, Vol. 24, No. 6, November-December 2004, pp. 10-20.
[12] W. Dally and J. Poulton, Digital Systems Engineering, Cambridge University Press, Cambridge, MA, 1998.
[13] J. Duato, S. Yalamanchili and L. Ni, Interconnection Networks: An Engineering Approach, Morgan Kaufmann, San Francisco, CA, 2003.
[14] T. Dumitra, S. Kerner and R. Marculescu, "Towards On-Chip Fault-Tolerant Communication," ASPDAC - Proceedings of the Asian-South Pacific Design Automation Conference, 2003, pp. 225-232.
[15] S. Goel, K. Chiu, E. Marinissen, T. Nguyen and S. Oostdijk, "Test Infrastructure Design for the Nexperia Home Platform PNX8550 System Chip," DATE - Proceedings of the Design Automation and Test Europe Conference, 2004.
[16] K. Goossens, J. van Meerbergen, A. Peeters and P. Wielage, "Networks on Silicon: Combining Best Efforts and Guaranteed Services," Design Automation and Test in Europe Conference, 2002, pp. 423-427.
[17] R. Hegde and N. Shanbhag, "Toward Achieving Energy Efficiency in Presence of Deep Submicron Noise," IEEE Transactions on VLSI Systems, Vol. 8, No. 4, August 2000, pp. 379-391.
[18] R. Ho, K. Mai and M. Horowitz, "The Future of Wires," Proceedings of the IEEE, January 2001.
[19] J. Hu and R. Marculescu, "Energy-Aware Mapping for Tile-Based NOC Architectures Under Performance Constraints," Asian-Pacific Design Automation Conference, 2003.
[20] F. Karim, A. Nguyen and S. Dey, "On-Chip Communication Architecture for OC-768 Network Processors," Proceedings of the 38th Design Automation Conference, 2001.
[21] B. Khailany, et al., "Imagine: Media Processing with Streams," IEEE Micro, Vol. 21, No. 2, 2001, pp. 35-46.
[22] S. Kumar, et al., "A Network on Chip Architecture and Design Methodology," IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2002.
[23] D. Lackey, P. Zuchowski, T. Bednar, D. Stout, S. Gould and J. Cohn, "Managing Power and Performance for Systems on Chip Design Using Voltage Islands," ICCAD - International Conference on Computer Aided Design, 2002, pp. 195-202.
[24] P. Lieverse, P. van der Wolf, K. Vissers and E. Deprettere, "A Methodology for Architecture Exploration of Heterogeneous Signal Processing Systems," Journal of VLSI Signal Processing for Signal, Image and Video Technology, Vol. 29, No. 3, 2001, pp. 197-207.
[25] M. Oka and M. Suzuoki, "Designing and Programming the Emotion Engine," IEEE Micro, Vol. 19, No. 6, November-December 1999, pp. 20-28.
[26] D. Pham, et al., "Overview of the Architecture, Circuit Design, and Physical Implementation of a First-Generation Cell Processor," IEEE Journal of Solid-State Circuits, Vol. 41, No. 1, January 2006, pp. 179-196.
[27] A. Pinto, L. Carloni and A. Sangiovanni-Vincentelli, "Constraint-Driven Communication Synthesis," Design Automation Conference, 2002, pp. 195-202.
[28] K. Skadron, et al., "Temperature-Aware Computer Systems: Opportunities and Challenges," IEEE Micro, Vol. 23, No. 6, November-December 2003, pp. 52-61.
[29] D. Sylvester and K. Keutzer, "A Global Wiring Paradigm for Deep Submicron Design," IEEE Transactions on CAD/ICAS, Vol. 19, No. 2, February 2000, pp. 242-252.
[30] R. Tamhankar, S. Murali and G. De Micheli, "Performance Driven Reliable Link for Networks on Chip," ASPDAC - Proceedings of the Asian Pacific Conference on Design Automation, Shanghai, 2005, pp. 749-754.
[31] T. Theis, "The Future of Interconnection Technology," IBM Journal of Research and Development, Vol. 44, No. 3, May 2000, pp. 379-390.
[32] E. Waingold, et al., "Baring It All to Software: Raw Machines," IEEE Computer, Vol. 30, No. 9, September 1997, pp. 86-93.
[33] J. Walrand and P. Varaiya, High-Performance Communication Networks, Morgan Kaufmann, San Francisco, CA, 2000.
[34] M. Wolfe, High Performance Compilers for Parallel Computing, Addison-Wesley, Upper Saddle River, NJ, 1995.
[35] F. Worm, P. Ienne, P. Thiran and G. De Micheli, "An Adaptive Low-Power Transmission Scheme for On-Chip Networks," ISSS, Proceedings of the International Symposium on System Synthesis, Kyoto, October 2002, pp. 92-100.
[36] H. Zhang, V. George and J. Rabaey, "Low-Swing On-Chip Signaling Techniques: Effectiveness and Robustness," IEEE Transactions on VLSI Systems, Vol. 8, No. 3, June 2000, pp. 264-272.
[37] International Technology Roadmap for Semiconductors (http://public.itrs.net/)