Multicore Processors

How long will the pain last?


January 13, 2009
URL:http://drdobbs.com/parallel/multicore-processors/212900103

Richard Kaufmann is a Distinguished Technologist at Hewlett-Packard. Bruce Gayliard is a High Performance Computing Software Engineer at Hewlett-Packard.


Until a few years ago, the processor hardware community translated Moore's Law of transistor density directly into single-threaded performance gains as a result of increasing clock frequencies. Lately, this translation has been hampered by the effects of clock frequency on power consumption and heat generation. The new reality is that per-thread performance is essentially static, and an increase in performance is delivered by an increase in the number of available processor cores per socket. This greatly simplifies a processor developer's job in two ways: First, it avoids radically increasing per-socket power consumption; and second, it holds the complexity of a single core somewhat static over time. Thus, processor design remains a tractable discipline.

However, nothing in life comes for free. Just as RISC stood for "Relegate the Important Stuff to the Compiler," increasing the number of threads per socket places enormous stress on almost the entire computing industry, which must parallelize applications to reap the performance increases. Those affected include the system manufacturers, the compiler writers (who are used to being the punching bags), the operating system designers, the application writers, and finally (and inevitably) the users. A single-threaded application typically does not benefit from multicore hardware; it may actually suffer degraded performance. And the great majority of applications on the market today are single-threaded in design. For years, clock frequency increases allowed developers to avoid the inevitable transition to multiprocessor environments (and thus the parallelization of their applications), but that free ride is now over.

In addition to the difficulty of parallelizing an application, there is a limit to the performance gain that a parallelized version of an application can achieve. This limit is defined by a variant of "Amdahl's Law," which fundamentally states that an application's speedup is bounded by its sequential portions: no matter how many cores are applied, the program cannot run faster than the aggregate of its serial components. An analysis of the application may indicate that parallelization at the code level will not yield the sought-after performance gains, and that some other approach (for instance, running multiple copies of a program to increase throughput) will be more effective. Furthermore, on the hardware front, increasing core counts will put pressure on communication bandwidth, particularly for memory accesses. Maintaining the traditional clocked bus architectures will require increased bus clock frequencies, again pushing on the very power dissipation issues that drove the move to multicore architectures in the first place. Increasing the width of the buses involved may provide some temporary relief, but that approach can only be viewed as a patch.
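
For readers who want the formula, the standard statement of Amdahl's Law (a textbook form, not specific to this article) is given below, where p is the fraction of the program's runtime that can be parallelized and n is the number of cores:

    \text{Speedup}(n) = \frac{1}{(1 - p) + p/n}, \qquad \lim_{n \to \infty} \text{Speedup}(n) = \frac{1}{1 - p}

For example, if 90 percent of a program parallelizes perfectly (p = 0.9), no number of cores can deliver more than a 10x speedup.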

Despite these challenges, the component manufacturers (Intel, AMD, etc.) are gambling that system and software developers will pick up the ball and move toward parallelizing their applications quickly enough to keep up with the core count arms race.

Okay, It's Not All Doom and Gloom

Luckily, some applications are inherently scalable and can accommodate exponential growth in data by applying additional computing resources. For example, digital content creators generally assign one frame of a movie to a single thread. It really doesn't matter how long that thread takes to complete, as long as the number of thread completions per unit of time scales appropriately. Other examples include some High Performance Computing (HPC) applications (often developed using the Message Passing Interface (MPI) or OpenMP programming models) and applications built on Google's MapReduce model (see also Apache Hadoop).
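
To make this kind of "embarrassingly parallel" work concrete, here is a minimal OpenMP sketch of the one-frame-per-thread pattern. It is a hypothetical example assembled for illustration (the frame count and the render_frame function are placeholders), not code from any particular renderer:

    #include <cstdio>

    // Hypothetical stand-in for rendering one frame of a movie.
    // Each call is completely independent of the others.
    static void render_frame(int frame_number)
    {
        // ... expensive, self-contained computation for this frame ...
        std::printf("rendered frame %d\n", frame_number);
    }

    int main()
    {
        const int num_frames = 1000;

        // Each iteration is independent, so OpenMP can hand frames to however
        // many cores are available; throughput scales with core count even
        // though the time to render any single frame does not improve.
        #pragma omp parallel for schedule(dynamic)
        for (int frame = 0; frame < num_frames; ++frame)
            render_frame(frame);

        return 0;
    }

Built with OpenMP enabled (for example, g++ -fopenmp), the loop spreads frames across all available cores; without it, the pragma is simply ignored and the program still runs correctly on one core.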

Furthermore, a nice property of the transition from clock frequency increases to multicore processors is that memory latency is no longer a worsening problem. Those of us who remember the utter joy of upgrading from 026 to 029 keypunches (the 10 or so of you who are now in danger of dropping your dentures!) also remember when memory systems ran at the same speed as the CPU; a word of memory was one clock tick away. As processor speeds increased, memory failed to keep pace, to the point where, at 3 GHz (a 0.333 ns cycle) and approximately 80 ns of latency, a word of memory is roughly 240 ticks away from the CPU.
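
The arithmetic behind that figure is simply the memory latency divided by the cycle time:

    \frac{80\,\text{ns}}{0.333\,\text{ns per tick}} \approx 240\ \text{ticks}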

A particularly helpful side effect of keeping a constant CPU frequency is that the relative time (in clock ticks) to access memory no longer increases with new generations of processors.

Many applications aren't able to take full advantage of current CPU clock speeds because their performance is dictated by memory latency (for example, programs that spend most of their time chasing linked lists). If many copies of these programs are run in parallel, that latency can be overcome -- the cores collectively provide enough outstanding loads to "hide" the underlying memory system's latency. In fact, early multicore systems were designed with such workloads in mind.
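
A stripped-down sketch of the kind of latency-bound code in question (a hypothetical example, not taken from any particular application) looks like this; each load depends on the result of the previous one, so a single thread spends almost all of its time waiting on memory:

    // Each step must wait for the previous load to complete before it knows
    // which node to fetch next, so at 3GHz and ~80ns memory latency, every
    // node visited costs roughly 240 ticks of waiting rather than computing.
    struct Node {
        Node* next;
        long  payload;
    };

    long chase(const Node* head)
    {
        long sum = 0;
        for (const Node* p = head; p != nullptr; p = p->next)
            sum += p->payload;   // dominated by cache misses, not arithmetic
        return sum;
    }

A single instance of this loop gains almost nothing from a faster clock, but running one copy per core (each chasing its own list) gives the memory system many independent streams of outstanding loads -- exactly the throughput-oriented workload early multicore parts targeted.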

Also, keeping a relatively stable clock speed has had a positive effect on CPU power consumption. Calculations by Intel for 65nm CPUs showed that increasing the core frequency by a factor of 1.3 would double CPU power consumption. Doubling the core count and slowing the clock by 10 percent, in contrast, would boost performance by a factor of 1.8 with no increase in CPU power consumption.
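
The usual back-of-the-envelope model behind figures like these (a rough approximation on our part, not Intel's published derivation) is that dynamic power scales with capacitance, the square of the supply voltage, and the clock frequency, and that voltage must rise roughly in step with frequency:

    P_{\text{dyn}} \approx C\,V^2 f, \qquad V \propto f \;\Rightarrow\; P_{\text{dyn}} \propto f^3, \qquad 1.3^3 \approx 2.2

Running each core slower also lets its supply voltage drop, which is how doubling the core count at a 10 percent lower clock can stay within the original power envelope while still delivering roughly 2 x 0.9 = 1.8 times the throughput.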

The Reality

But what if the application can't be scaled by simply increasing the number of independent runs of identical programs? What if you want to double the speed of your single application every 18 to 24 months in accordance with the rights and privileges afforded by the prior single-threaded performance aspect of Moore's Law? And you want to do this without having to become a wizard at parallelizing your program, or even worse, rewriting your application? If you're busy nodding your head up and down, you'll agree with this statement: Exploiting parallelism within a socket should be the responsibility of the compilers, tools, schedulers, operating systems, and so on. It's okay if parallelism across sockets remains "hard," but single socket stuff should just happen automatically.

Unfortunately, the industry does not yet have the technology to take on that responsibility automatically. It remains up to software developers to understand parallelism and to rewrite their applications accordingly.

The Solution: A Collaborative Approach

With these issues in mind, many industry specialists are devising technologies to make the parallelization task easier and more productive. However, technology alone is not enough. The entire industry must work together to address "the multicore menace." Areas to be addressed include:

This list is a good start, but it is not all-inclusive. It is imperative that the industry distinguish the pros and cons of each technology, identify the preferences of developers and users, and address those preferences. The industry must quickly provide effective solutions for parallelizing applications, and it is vital that developers and users be involved in defining those solutions.

Along these lines, a number of developments are already under way. Some examples:

What's Next?

The memory latency aspect mentioned earlier may work to the advantage of multicore designers for a while, but an increase in the number of cores will bring an increase in memory bus bandwidth requirements. A fundamental change in the memory subsystem will eventually be required to keep up with core counts. One such technology might be stacked memory -- a technique that places memory on a die directly above or below the CPU. Each core would have access to the memory directly attached to it, and bandwidth could scale with core counts. Another possibility is to replace the copper connections to memory with optical links. A move away from the traditional globally clocked bus architecture is also possible, replacing it with, for example, a self-timed communications fabric. A melding of techniques may be necessary to deliver the right combination of price, power consumption, capacity, and bandwidth.

Although it makes sense in 2008 to assume a modest number of sockets will have access to a coherent shared memory, there may come a time when this is no longer true. For example, even the cores within a single socket may someday be partitioned into multiple coherency domains. This is game changing: Applications written today to be "multicore aware" quite commonly assume coherent memory access across all cores. When that is no longer true, other techniques (such as message passing) are required to take advantage of all the cores. Thus, even single-socket applications might come to look more like the cluster-aware, distributed-memory codes currently popular in the supercomputing community.
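
To make that concrete, here is a minimal message-passing sketch using standard MPI calls (a generic, hypothetical example rather than anything prescribed above): each rank owns its private data, and the only sharing happens through explicit communication -- a style that works the same whether the ranks sit within one socket, one box, or across a cluster.

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Each rank owns its slice of the problem; no shared memory is assumed.
        double local = static_cast<double>(rank + 1);   // stand-in for real work

        // Explicit communication replaces loads and stores to shared data.
        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            std::printf("sum over %d ranks = %g\n", size, total);

        MPI_Finalize();
        return 0;
    }

If per-socket coherency domains do appear, code structured this way keeps working; code that assumes all cores see one coherent memory does not.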

Conclusion

As stated, though multicore techniques have made processor design more practical, they have complicated the lives of application developers, tool writers, and so on. Therefore, the entire computer ecosystem (academic and commercial) must respond to this challenge, and quickly develop a set of technologies to allow developers and users to take advantage of the potential performance improvements afforded by new platforms. Though this is well underway, the pain may last for years while application developers gain an understanding of how to use these emerging technologies. They must adjust to the relatively constant clock rates of newly developed processors and take advantage of the explosion of available computing resources in the form of cores. The hardware is way ahead of everyone else at this point, and catching up will be very, very difficult.
