Design for Manycore Systems

Why worry about "manycore" today?

Herb Sutter is a bestselling author and consultant on software development topics, and a software architect at Microsoft. He can be contacted at www.gotw.ca.


Dual- and quad-core computers are obviously here to stay for mainstream desktops and notebooks. But do we really need to think about "many-core" systems if we're building a typical mainstream application right now? I find that, to many developers, "many-core" systems still feel fairly remote, and not an immediate issue to think about as they're working on their current product.

This column is about why it's time right now for most of us to think about systems with lots of cores. In short: Software is the (only) gating factor; as that gate falls, hardware parallelism is coming more and sooner than many people yet believe.

Recap: What "Everybody Knows"

Figure 1 is the canonical "free lunch is over" slide showing major mainstream microprocessor trends over the past 40 years. These numbers come from Intel's product line, but every CPU vendor from servers (e.g., Sparc) to mobile devices (e.g., ARM) shows similar curves, just shifted slightly left or right. The key point is that Moore's Law is still generously delivering transistors at the rate of twice as many per inch or per dollar every couple of years. Of course, any exponential growth curve must end, and so eventually will Moore's Law, but it seems to have yet another decade or so of life left.

Figure 1: Canonical "free lunch is over" slide. Note Pentium vs. dual-core Itanium transistor counts.

Mainstream microprocessor designers used to be able to use their growing transistor budgets to make single-threaded code faster by making the chips more complex, such as by adding out-of-order ("OoO") execution, pipelining, branch prediction, speculation, and other techniques. Unfortunately, those techniques have now been largely mined out. But CPU designers are still reaping Moore's harvest of transistors by the boatload, at least for now. What to do with all those transistors? The main answer is to deliver more cores rather than more complex cores. Additionally, some of the extra transistor real estate can also be soaked up by bringing GPUs, networking, and/or other functionality on-chip as well, up to putting an entire "system on a chip" (aka "SoC") like the Sun UltraSPARC T2.

How Much, How Soon?

How quickly can we expect more parallelism in our chips? The naive answer would be: Twice as many cores every couple of years, just continuing on with Moore's Law. That's the baseline projection approximated in Figure 2, assuming that some of the extra transistors aren't also used for other things.

Figure 2: Simple extrapolation of "more of the same big cores" (not counting some transistors being used for other things like on-chip GPUs, or returning to smaller cores).
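
The baseline in Figure 2 amounts to simple doubling arithmetic. Here is a minimal Python sketch of it, assuming a hypothetical starting point of 4 cores in 2009 and a two-year doubling period. The base year, base core count, and function name are illustrative assumptions, not values read from the figure.

```python
# Naive "more of the same big cores" projection.
# ASSUMPTIONS (illustrative only): 4 cores in 2009, doubling every 2 years.

def projected_cores(year, base_year=2009, base_cores=4, doubling_period_years=2):
    """Return the projected core count if cores simply double every period."""
    elapsed_years = year - base_year
    return base_cores * 2 ** (elapsed_years / doubling_period_years)


if __name__ == "__main__":
    for year in range(2009, 2020, 2):
        print(year, round(projected_cores(year)))
```

Under these assumptions the count grows from 4 cores in 2009 to 128 cores in 2019. Changing `base_cores` or `doubling_period_years` shifts the curve left or right but does not change its shape.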

However, the naive answer misses several essential ingredients. To illustrate, notice one interesting fact hidden inside Figure 1. Consider the two highlighted chips and their respective transistor counts in million transistors (Mt):

  • 4.5Mt: 1997 "Tillamook" Pentium P55C. This isn't the original Pentium; it's a later and pretty attractive little chip that has some nice MMX instructions for multimedia processing. Imagine running this 1997 part at today's clock speeds.
  • 1,700Mt: 2006 "Montecito" Itanium 2. This chip handily jumped past the billion-transistor mark to deliver two Itanium cores on the same die. [1]

So what's the interesting fact? (Hint: 1,700 ÷ 4.5 = ???.)

In 2006, instead of shipping a dual-core Itanium part, with exactly the same transistor budget Intel could have shipped a chip that contained 100 decent Pentium-class cores with enough space left over for 16 MB of Level 3 cache. True, it's more than a matter of just etching the logic of 100 cores on one die; the chip would need other engineering work, such as improving the memory interconnect to make the whole chip a suitably balanced part. But we can view those as being relatively 'just details' because they don't require engineering breakthroughs.
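
The transistor budget behind that claim can be checked with back-of-the-envelope arithmetic. The Python sketch below uses the two transistor counts quoted above. The cache estimate assumes a conventional 6-transistor SRAM cell per bit and ignores tags, ECC, and interconnect; that figure and the helper name `budget` are assumptions made for illustration, not numbers from Intel.

```python
# Back-of-the-envelope transistor budget, in millions of transistors (Mt).
MONTECITO_MT = 1700.0  # 2006 "Montecito" Itanium 2 (from the text)
PENTIUM_MT = 4.5       # 1997 "Tillamook" Pentium P55C (from the text)

# The hint: how many Pentium-class cores fit in one Montecito budget?
ratio = MONTECITO_MT / PENTIUM_MT


def budget(cores, cache_mb, transistors_per_bit=6):
    """Split the Montecito budget into cores, cache, and leftover (all in Mt).

    ASSUMPTION: 6 transistors per cache bit (a plain 6T SRAM cell),
    with no allowance for tags, ECC, or on-chip interconnect.
    """
    core_mt = cores * PENTIUM_MT
    cache_bits = cache_mb * 1024 * 1024 * 8
    cache_mt = cache_bits * transistors_per_bit / 1e6
    leftover_mt = MONTECITO_MT - core_mt - cache_mt
    return core_mt, cache_mt, leftover_mt


if __name__ == "__main__":
    core_mt, cache_mt, leftover_mt = budget(cores=100, cache_mb=16)
    print(f"ratio: about {ratio:.0f} Pentium-class cores per Montecito budget")
    print(f"100 cores: {core_mt:.0f} Mt")
    print(f"16 MB cache: {cache_mt:.0f} Mt")
    print(f"leftover: {leftover_mt:.0f} Mt")
```

Under these assumptions, 100 cores take about 450 Mt and 16 MB of cache about 805 Mt, which leaves roughly 445 Mt of the 1,700 Mt budget for everything the estimate ignores.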

Repeat: Intel could have shipped a 100-core desktop chip with ample cache -- in 2006. So why didn't they? (Or AMD? Or Sun? Or anyone else in the mainstream market?) The short answer is the counter-question: Who would buy it? The world's popular mainstream client applications are largely single-threaded or nonscalably multithreaded, which means that existing applications create a double disincentive:

  • They couldn't take advantage of the extra cores, because they don't contain enough inherent parallelism to scale well.
  • They wouldn't run as fast on a smaller and simpler core, compared to a bigger core that contains extra complexity to run single-threaded code faster.

Astute readers might have noticed that when I said, "why didn't Intel or Sun," I left myself open to contradiction, because Sun (in particular) did do something like that already, and Intel is doing it now. Let's find out what, and why.

