February 12, 2007
Next-gen Multicore Networks-on-chip Systems: Part 2

Luca Benini and Giovanni De Micheli
In the second of a six-part series based on their book "Networks On Chips," Luca Benini and Giovanni De Micheli describe the architectural, programming and debug challenges of next-generation multicore networks-on-chips. This week: SoC objectives and NoC needs.

SoC designs come in several types, defined by the required functionality and the target market. In general, SoCs can be classified in terms of their versatility (i.e., support for programming) and application domains. A simple taxonomy is described next:

General-purpose on-chip multiprocessors are high-performance chips that benefit from spatial locality to achieve high performance. They are designed to support various applications, and thus the processor core usage and traffic patterns may vary widely. They are the evolution of on-board multiprocessors, and they are typified by having a homogeneous set of processing and storage arrays.

For these reasons, on-chip network design can draw on experience with the many architectures and techniques developed for on-board multiprocessors, with the appropriate adjustments to operate on a silicon substrate.

Figure 1.6. Razor [11] is another realization of self-calibrating circuits, where a processor's supply voltage is lowered until errors occur. The correct operation of the processor is preserved by an error detection and pipeline adjustment technique. As a result, the processor settles on-line to an operating voltage that minimizes energy consumption even in the presence of variation in technological parameters.

Application-specific SoCs are hardware chips dedicated to an application. In some cases, as for all mobile applications, energy consumption is a major concern. Most application-specific SoCs are programmable, but their application domain is limited and the software characteristics are known a priori.

Thus, some knowledge of the traffic pattern is available when the NoC is designed. In many cases, these systems contain fairly heterogeneous computing elements, such as processors, controllers, digital signal processors (DSPs) and a number of domain-specific hardware accelerators. This heterogeneity may lead to specific traffic patterns and requirements, thus requiring NoCs with specialized architectures and protocols.

SoC platforms are application-specific SoCs dedicated to a family of applications in a specific domain. Examples are SoCs for GSM telephony support and platforms for automotive control. A platform is more versatile in nature, as it can be used in different (embedded) systems by different manufacturers.

Figure 1.7. T-error is a timing methodology for NoCs in which data is pipelined through double latches, the first clocked with an aggressive period and the second with a safe one. For most patterns, T-error forwards data from the first latch. When the slowest patterns are transmitted and miss the deadline at the first latch, correct but slower operation is performed by the second latch [30].

Thus, versatility and programmability are preferred to customization, yielding SoCs that can be produced in high volumes and thus offset the non-recurring engineering (NRE) costs. Whereas the processing and storage units may differ in nature and performance, the traffic patterns are harder to guess a priori because the application software may vary widely.

Field-programmable gate arrays (FPGAs) are hardware systems where the functionality is determined after manufacturing by connecting and configuring components. Components vary in size and in functionality and are connected by reprogrammable networks.

These networks are simple and provide bit-level connectivity with little or no control. Nevertheless we expect FPGAs to grow substantially over the coming years and require effective NoC communication.

Some Design Examples

One of the first multiprocessors designed around an NoC is the RAW architecture [32]. This is a fully programmable SoC consisting of an array of identical computational tiles with local storage. Full programmability means that the compiler can program both the function of each tile and the interconnections among them. The name RAW stems from the fact that the "raw" hardware is fully exposed to the compiler.

To accomplish programmable communication, each tile has a router. The compiler programs the routers on all tiles to issue a sequence of commands that determine exactly which set of wires connect at every cycle. Moreover, the compiler pipelines the long wires to support high clock frequency. 
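The idea of a compiler-programmed router can be sketched in a few lines. This is only an illustration of static, cycle-by-cycle switching, not the RAW instruction set: the port names, the `run_switch` function and the schedule format are all hypothetical.

```python
from typing import Dict, List

# Hypothetical port names for one tile's switch (not taken from the RAW papers).
PORTS = ("north", "south", "east", "west", "proc")

# One schedule entry per clock cycle: output port -> input port that drives it.
Connection = Dict[str, str]


def run_switch(schedule: List[Connection],
               inputs: List[Dict[str, int]]) -> List[Dict[str, int]]:
    """Apply a compile-time switch schedule to per-cycle input words.

    schedule[c] says which input port drives each output port in cycle c.
    inputs[c] holds the word present on each input port in cycle c.
    Returns the words that appear on the output ports in each cycle.
    """
    outputs = []
    for cycle, connections in enumerate(schedule):
        words = inputs[cycle]
        out = {}
        for out_port, in_port in connections.items():
            if out_port not in PORTS or in_port not in PORTS:
                raise ValueError(f"unknown port in cycle {cycle}")
            if in_port in words:
                out[out_port] = words[in_port]
        outputs.append(out)
    return outputs


# A two-cycle schedule that a compiler might emit for one tile.
schedule = [
    {"east": "proc"},                   # cycle 0: send the local word east
    {"east": "west", "proc": "north"},  # cycle 1: pass through and deliver
]
inputs = [
    {"proc": 7},
    {"west": 11, "north": 13},
]
result = run_switch(schedule, inputs)
```

Because the schedule is fixed before execution, the switch in this sketch needs no arbitration or header decoding at run time; all routing decisions have been moved into the compiler.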

The Cell processor [26] was developed by Sony, Toshiba and IBM as a general-purpose processor for computers, even though it is primarily targeted at Sony's PlayStation 3. Its architecture resembles multiprocessor vector supercomputers, targeting high-performance distributed computing.

The architecture comprises one 64-bit Power processor element (PPE), eight synergistic processor elements (SPEs), memory and interconnection. The PPE is a dual-issue, dual-threaded, in-order RISC processor with a 512 KB cache. Each SPE is a self-contained in-order vector processor which acts as an independent processor.

Each contains a register file of 128 128-bit registers, four single-precision floating-point units and four integer units. The element interconnection bus (EIB) connects the PPE, the eight SPEs and the memory interface controller (Figure 1.8, below). The EIB has independent networks for commands (requests for data from other sources) and for the data being moved.

Commands are filtered through address concentrators which handle collision detection and prevention, and ensure that all units have equal access to the command bus. There are multiple address concentrators, all of which forward data to a single serial command reflection point. Data transfer is elaborate.

There are four "rings," each of which is a chain connecting all data ports. Data can move down a ring only in one direction. For instance, a connection that allows data to move from the PPE to SPE1 cannot be used to move data from SPE1 back to the PPE.

Two rings go clockwise and two counterclockwise, and all four rings have the components attached in the same order. Each ring can move 16 bytes at a time from any position on the ring to any other position. In fact, each ring can transmit three concurrent transfers, but those transfers cannot overlap.
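The ring rules above (one direction per ring, at most three concurrent transfers, no overlap) can be captured in a small model. This is a sketch under stated assumptions rather than a description of the actual EIB arbiter: the `Ring` class, the segment numbering and the choice of ten ports (one per unit named in the text) are hypothetical.

```python
BYTES_PER_BEAT = 16  # each ring moves 16 bytes at a time (from the text)


def beats(num_bytes: int) -> int:
    """Number of 16-byte beats needed to move num_bytes (ceiling division)."""
    return -(-num_bytes // BYTES_PER_BEAT)


class Ring:
    """One unidirectional data ring; segment i links port i and port i + 1."""

    MAX_TRANSFERS = 3  # concurrent transfers allowed per ring (from the text)

    def __init__(self, num_ports: int, clockwise: bool):
        self.num_ports = num_ports
        self.clockwise = clockwise
        self.active = []  # one set of occupied segments per active transfer

    def segments(self, src: int, dst: int) -> set:
        """Segments a transfer occupies when travelling in the ring's direction."""
        if src == dst:
            raise ValueError("source and destination must differ")
        n = self.num_ports
        if self.clockwise:
            hops = (dst - src) % n
            return {(src + k) % n for k in range(hops)}
        hops = (src - dst) % n
        return {(src - 1 - k) % n for k in range(hops)}

    def try_start(self, src: int, dst: int) -> bool:
        """Admit the transfer only if a slot is free and no segment overlaps."""
        needed = self.segments(src, dst)
        if len(self.active) >= self.MAX_TRANSFERS:
            return False
        if any(needed & used for used in self.active):
            return False
        self.active.append(needed)
        return True


# Hypothetical 10-port ring: PPE, eight SPEs and the memory interface controller.
ring = Ring(num_ports=10, clockwise=True)
granted = [
    ring.try_start(0, 3),  # occupies segments 0, 1, 2
    ring.try_start(2, 5),  # refused: shares segment 2 with the first transfer
    ring.try_start(3, 6),  # occupies segments 3, 4, 5
    ring.try_start(6, 9),  # occupies segments 6, 7, 8
    ring.try_start(9, 0),  # refused: three transfers are already active
]
```

In this model a request refused on one ring could be offered to another ring running in the same or the opposite direction, which is one way to read the purpose of having four rings.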

Figure 1.8. The Element Interconnection Bus (EIB) in Cell.

The Nexperia architecture, developed by Philips (now NXP Semiconductors), is a platform for handling digital video and audio in consumer electronics (Figure 1.9, below). It uses one or more 32-bit MIPS CPUs for control processing, and one or more 32-bit TriMedia processors for streaming data. Moreover, the platform can house a flexible range of programmable modules, such as an MPEG decoder, a UART, etc.

To connect the CPUs and other modules with each other and with the main external memory, a high-speed memory access network, and two device control and status (DCS) networks are used. These DCS networks enable each processor to control and observe on chip the status of the other modules.

One of the advantages of the platform is that the number of CPUs is variable, which lets Nexperia fit a variety of applications. A specific implementation of Nexperia, the PNX8550 system chip, houses 10 million gates in 62 cores, of which five are hard cores (including the MIPS and TriMedia CPUs) and the others are soft cores [15].

Figure 1.9. The Nexperia architecture.

The Xilinx Spartan-II FPGA chips are rectangular arrays of configurable logic blocks (CLBs). Each block can be programmed to perform a specific logic function. CLBs are connected via a hierarchy of routing channels. A more complex and interesting family of products is the Xilinx Virtex-II and Virtex-II Pro. These FPGAs have various complex elements, such as CLBs, RAMs, processor cores, multipliers and clock managers.

Programmable interconnection is achieved by routing switches. Each programmable element is connected to a switch matrix, allowing multiple connections to the general routing matrix. All programmable elements, including the routing resources, are controlled by values stored in static memory cells. Thus, Virtex-II can also be seen as an NoC over a heterogeneous fabric of components.
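As a rough illustration of routing controlled by stored configuration bits, the sketch below models a tiny switch matrix. It is not the Virtex-II routing architecture: the `route` function and the bit layout are hypothetical, chosen only to show how memory cell values select bit-level connections.

```python
from typing import List, Optional


def route(config: List[List[int]],
          inputs: List[bool]) -> List[Optional[bool]]:
    """Drive each output from the input selected by the configuration bits.

    config[i][j] == 1 means the memory cell joining input i to output j is set.
    An output with no cell set is left undriven (None). Two cells set for the
    same output would short two drivers together, so that is rejected.
    """
    num_outputs = len(config[0])
    outputs: List[Optional[bool]] = [None] * num_outputs
    for j in range(num_outputs):
        drivers = [i for i in range(len(inputs)) if config[i][j]]
        if len(drivers) > 1:
            raise ValueError(f"output {j} has more than one driver")
        if drivers:
            outputs[j] = inputs[drivers[0]]
    return outputs


# Two inputs, three outputs: input 0 feeds output 1, input 1 feeds outputs 0 and 2.
config = [
    [0, 1, 0],
    [1, 0, 1],
]
routed = route(config, [True, False])
```

Rewriting `config` changes the connectivity without touching the logic, which is the sense in which the network is reprogrammable but offers little or no run-time control.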

The complexity of the chip designs described above has prompted the development of infrastructure to support communication. For example, STMicroelectronics has developed the STBus kit that can provide various functions including full (and partial) crossbar connection. A similar framework is provided by the advanced microcontroller bus architecture (AMBA) multi-layer bus system.

Distinguishing Characteristics of NoCs

NoCs differ from wide-area networks because of local proximity and because they exhibit much less non-determinism. Indeed, despite the undesirable variability of deep submicron (DSM) CMOS technologies, it is still possible to predict many physical and electrical parameters with reasonable accuracy.

On the other hand, on-chip networks have a few distinctive characteristics, namely low communication latency, energy consumption constraints and design-time specialization. Latency of communication on chip needs to be small, that is, on the order of a few clock periods.

The shortest latencies are achieved by fully hard-wired implementations, which sacrifice the flexibility required of on-chip networks. Clearly, smart protocols for communication may add to the latency of the signals. Thus, to be competitive in performance, NoCs require streamlined protocols.

Energy consumption in NoCs is often a major concern, because whereas computation and storage energy benefit greatly from device scaling (smaller gates and smaller memory cells), the energy for global communication does not scale down.

On the contrary, projections based on current delay optimization techniques for global wires [18, 29, 31] show that global communication on chip will require increasingly higher energy consumption. Hence, communication-energy minimization will be a growing concern in future technologies.

Furthermore, network traffic control and monitoring can help in better managing the power consumed by networked computational resources. For instance, clock speed and voltage of end nodes can be varied according to available network bandwidth.
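A minimal sketch of this kind of policy follows. It assumes a node that consumes a fixed number of bytes per clock cycle, a hypothetical table of operating points, and the common approximation that dynamic power scales with V²f; none of these figures come from the text.

```python
from typing import List, Tuple

OperatingPoint = Tuple[float, float]  # (clock frequency in Hz, supply voltage in V)

# Hypothetical operating points for an end node; illustrative values only.
OPERATING_POINTS: List[OperatingPoint] = [
    (200e6, 0.8),
    (400e6, 1.0),
    (600e6, 1.2),
]


def pick_operating_point(available_bw: float,
                         bytes_per_cycle: float,
                         points: List[OperatingPoint] = OPERATING_POINTS
                         ) -> OperatingPoint:
    """Pick the slowest point that can still consume the available bandwidth.

    available_bw is in bytes per second. Running faster than the network can
    feed the node wastes energy, so the first point (in ascending frequency)
    whose consumption rate covers the bandwidth is chosen. If no point is
    fast enough, the fastest one is returned.
    """
    for freq, volt in sorted(points):
        if freq * bytes_per_cycle >= available_bw:
            return freq, volt
    return max(points)


def relative_dynamic_power(point: OperatingPoint,
                           reference: OperatingPoint) -> float:
    """Dynamic power of point relative to reference, assuming P ~ V^2 * f."""
    freq, volt = point
    ref_freq, ref_volt = reference
    return (volt ** 2 * freq) / (ref_volt ** 2 * ref_freq)


# The network can deliver 1.2 GB/s to a node that consumes 4 bytes per cycle.
chosen = pick_operating_point(available_bw=1.2e9, bytes_per_cycle=4)
saving = relative_dynamic_power(chosen, max(OPERATING_POINTS))
```

With these assumed numbers the node drops to the middle operating point and, under the V²f model, draws a little under half the dynamic power of the fastest setting.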

Design-time specialization is another facet of NoC design, and it is relevant to application-specific and platform SoCs. Whereas macroscopic networks emphasize general-purpose communication and modularity, in NoCs these constraints are less restrictive because most on-chip solutions are proprietary.

Thus, an NoC implementation may separate data from control and use arbitrary bus widths and control flow schemes. Such flexibility needs to be mitigated at the NoC boundary, that is, where the communication infrastructure connects to end nodes (e.g., processors).

Existing standards like the Open Core Protocol (OCP) are extremely useful in defining the interface between processor/storage arrays and NoCs. Interestingly enough, the flexibility in tailoring the NoC to the specific application can be used effectively to design low-energy communication schemes.

To read Part 1, go to "Why on-chip networking?"
Next in Part 3: Once over lightly - a survey of the issues related to NoC design.

Used with the permission of the publisher, Newnes/Elsevier, this series of six articles is based on material from "Networks On Chips: Technology and Tools," by Luca Benini and Giovanni De Micheli.

Luca Benini is a professor at the Department of Electrical Engineering and Computer Science at the University of Bologna, Italy. Giovanni De Micheli is a professor and director of the Integrated Systems Center at EPF in Lausanne, Switzerland.

References
[1] A. Adriahantenaina, H. Charlery, A. Greiner, L. Mortiez and C. Zeferino, "SPIN: A Scalable, Packet Switched, On-Chip Micro-network," DATE - Design, Automation and Test in Europe Conference and Exhibition, 2003, pp. 70-73.
[2] A.H. Ajami, K. Banerjee and M. Pedram, "Modeling and Analysis of Nonuniform Substrate Temperature Effects on Global ULSI Interconnects," IEEE Transactions on CAD, Vol. 24, No. 6, June 2005, pp. 849-861.
[3] H. Bakoglu, Circuits, Interconnections, and Packaging for VLSI, Addison-Wesley, Upper Saddle River, NJ, 1990.
[4] L. Benini, A. Bogliolo and G. De Micheli, "A Survey of Design Techniques for System-Level Dynamic Power Management," IEEE Transactions on Very Large-Scale Integration Systems, Vol. 8, No. 3, June 2000, pp. 299-316.
[5] W.O. Cesario, D. Lyonnard, G. Nicolescu, Y. Paviot, S. Yoo, L. Gauthier, M. Diaz-Nava and A.A. Jerraya, "Multiprocessor SoC Platforms: A Component-Based Design Approach," IEEE Design and Test of Computers, Vol. 19, No. 6, November-December 2002, pp. 52-63.
[6] W. Dally and B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann, San Francisco, CA, 2004.
[7] W. Dally and B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks," Proceedings of the 38th Design Automation Conference, 2001.
[8] W.J. Dally and H. Aoki, "Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels," IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 4, April 1993, pp. 466-475.
[9] W. Dally and C. Seitz, "The Torus Routing Chip," Distributed Processing, Vol. 1, 1996, pp. 187-196.
[10] M. Dall'Osso, G. Biccari, L. Giovannini, D. Bertozzi and L. Benini, "Xpipes: A Latency Insensitive Parameterized Network-on-Chip Architecture for Multiprocessor SoCs," International Conference on Computer Design, 2003, pp. 536-539.
[11] D. Ernst, S. Das, S. Lee, D. Blaauw, T. Austin, T. Mudge, N. S. Kim and K. Flautner, "Razor: Circuit-Level Correction of Timing Errors for Low-Power Operation," IEEE Micro, Vol. 24, No. 6, November-December 2004, pp. 10-20.
[12] W. Dally and J. Poulton, Digital Systems Engineering, Cambridge University Press, Cambridge, MA, 1998.
[13] J. Duato, S. Yalamanchili and L. Ni, Interconnection Networks: An Engineering Approach, Morgan Kaufmann, San Francisco, CA, 2003.
[14] T. Dumitra, S. Kerner and R. Marculescu, "Towards On-Chip Fault-Tolerant Communication," ASPDAC - Proceedings of the Asian-South Pacific Design Automation Conference, 2003, pp. 225-232.
[15] S. Goel, K. Chiu, E. Marinissen, T. Nguyen and S. Oostdijk, "Test Infrastructure Design for the Nexperia Home Platform PNX8550 System Chip," DATE - Proceedings of the Design Automation and Test Europe Conference, 2004.
[16] K. Goossens, J. van Meerbergen, A. Peeters and P. Wielage, "Networks on Silicon: Combining Best Efforts and Guaranteed Services," Design Automation and Test in Europe Conference, 2002, pp. 423-427.
[17] R. Hegde and N. Shanbhag, "Toward Achieving Energy Efficiency in Presence of Deep Submicron Noise," IEEE Transactions on VLSI Systems, Vol. 8, No. 4, August 2000, pp. 379-391.
[18] R. Ho, K. Mai and M. Horowitz, "The Future of Wires," Proceedings of the IEEE, January 2001.
[19] J. Hu and R. Marculescu, "Energy-Aware Mapping for Tile-Based NOC Architectures Under Performance Constraints," Asian-Pacific Design Automation Conference, 2003.
[20] F. Karim, A. Nguyen and S. Dey, "On-Chip Communication Architecture for OC-768 Network Processors," Proceedings of the 38th Design Automation Conference, 2001.
[21] B. Khailany, et al., "Imagine: Media Processing with Streams," IEEE Micro, Vol. 21, No. 2, 2001, pp. 35-46.
[22] S. Kumar, et al., "A Network on Chip Architecture and Design Methodology," IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2002.
[23] D. Lackey, P. Zuchowski, T. Bednar, D. Stout, S. Gould and J. Cohn, "Managing Power and Performance for Systems on Chip Design Using Voltage Islands," ICCAD - International Conference on Computer Aided Design, 2002, pp. 195-202.
[24] P. Lieverse, P. van der Wolf, K. Vissers and E. Deprettere, "A Methodology for Architecture Exploration of Heterogeneous Signal Processing Systems," Journal of VLSI Signal Processing for Signal, Image and Video Technology, Vol. 29, No. 3, 2001, pp. 197-207.
[25] M. Oka and M. Suzuoki, "Designing and Programming the Emotion Engine," IEEE Micro, Vol. 19, No. 6, November-December 1999, pp. 20-28.
[26] D. Pham, et al., "Overview of the Architecture, Circuit Design, and Physical Implementation of a First-Generation Cell Processor," IEEE Journal of Solid-State Circuits, Vol. 41, No. 1, January 2006, pp. 179-196.
[27] A. Pinto, L. Carloni and A. Sangiovanni-Vincentelli, "Constraint-Driven Communication Synthesis," Design Automation Conference, 2002, pp. 195-202.
[28] K. Skadron, et al., "Temperature-Aware Computer Systems: Opportunities and Challenges," IEEE Micro, Vol. 23, No. 6, November-December 2003, pp. 52-61.
[29] D. Sylvester and K. Keutzer, "A Global Wiring Paradigm for Deep Submicron Design," IEEE Transactions on CAD/ICAS, Vol. 19, No. 2, February 2000, pp. 242-252.
[30] R. Tamhankar, S. Murali and G. De Micheli, "Performance Driven Reliable Link for Networks on Chip," ASPDAC - Proceedings of the Asian Pacific Conference on Design Automation, Shanghai, 2005, pp. 749-754.
[31] T. Theis, "The Future of Interconnection Technology," IBM Journal of Research and Development, Vol. 44, No. 3, May 2000, pp. 379-390.
[32] E. Waingold, et al., "Baring It All to Software: Raw Machines," IEEE Computer, Vol. 30, No. 9, September 1997, pp. 86-93.
[33] J. Walrand and P. Varaiya, High-Performance Communication Networks, Morgan Kaufmann, San Francisco, CA, 2000.
[34] M. Wolfe, High Performance Compilers for Parallel Computing, Addison-Wesley, Upper Saddle River, NJ, 1995.
[35] F. Worm, P. Ienne, P. Thiran and G. De Micheli, "An Adaptive Low-Power Transmission Scheme for On-Chip Networks," ISSS, Proceedings of the International Symposium on System Synthesis, Kyoto, October 2002, pp. 92-100.
[36] H. Zhang, V. George and J. Rabaey, "Low-Swing On-Chip Signaling Techniques: Effectiveness and Robustness," IEEE Transactions on VLSI Systems, Vol. 8, No. 3, June 2000, pp. 264-272.
[37] International Technology Roadmap for Semiconductors (http://public.itrs.net/)

