Understanding Parallel Performance


Let's say that we've slickly written our code to apply divide-and-conquer algorithms and concurrent data structures and parallel traversals and all our other cool tricks that make our code wonderfully scalable in theory. Question: How do we know how well we've actually succeeded? Do we really know, or did we just try a couple of tests on a quad-core that looked reasonable and call it good? What key factors must we measure to understand our code's performance, and answer not only whether our code scales, but quantify how well under different circumstances and workloads? What costs of concurrency do we have to take into account?

This month, I'll summarize some key issues we need to keep in mind to accurately analyze the real performance of our parallel code. I'll list some basic considerations, and then some common costs. Next month, I have a treat in store: We'll take some real code and apply these techniques to analyze its performance in detail as we successively apply a number of optimizations and measure how much each one actually buys us, under what conditions and in what directions, and why.

Fundamentals

To understand our code's scalability, we need to know what to measure and what to look for in the results. First, identify the workload variables: What are the key different kinds of work and/or data? For example, we may want to measure how well a producer-consumer system performs with varying numbers of producer threads and consumer threads, or measure a container by how well it scales when holding data items of different sizes.

Second, use stress tests to measure throughput, or the total amount of work accomplished per unit time, while varying each of these dimensions so that we can measure their relative impact. Look for scalability trends, the change in throughput as we add hardware resources: Can we effectively use more cores to either get the answer faster or to get more work done?
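
To make this concrete, here is a minimal stress-harness sketch (mine, not from this article) that sweeps the worker count and reports throughput for each configuration; DoUnitOfWork is a placeholder for whatever operation is actually under test:

// Sketch: measure ops/sec for a given worker count over a fixed interval
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

// Placeholder workload; substitute the real operation under test,
// e.g., a queue push/pop or a container insert
void DoUnitOfWork() {
  volatile int x = 0;
  for (int i = 0; i < 1000; ++i) x += i;
}

double MeasureThroughput(int workers, std::chrono::seconds duration) {
  std::atomic<bool> stop{false};
  std::atomic<long long> completed{0};
  std::vector<std::thread> threads;
  for (int i = 0; i < workers; ++i)
    threads.emplace_back([&] {
      while (!stop.load(std::memory_order_relaxed)) {
        DoUnitOfWork();
        completed.fetch_add(1, std::memory_order_relaxed);
      }
    });
  std::this_thread::sleep_for(duration);
  stop = true;
  for (auto& t : threads) t.join();
  return double(completed.load()) / duration.count();
}

int main() {
  // sweep the worker count to expose the scalability trend
  for (int workers = 1; workers <= 32; workers *= 2)
    std::cout << workers << " workers: "
              << MeasureThroughput(workers, std::chrono::seconds(2))
              << " ops/sec\n";
}

Plotting the results of runs like this, across each workload dimension, gives us graphs like the ones below.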

Figures 1 and 2 show two useful visualization tools that will help us understand our code's parallel performance. Figure 1 shows a sample scatter graph that charts throughput results for a selected algorithm against different numbers of two kinds of workers: producer threads and consumer threads. The larger the bubble, the greater the throughput. We can get a sense of how this particular algorithm scales, and in what directions, by examining how and where throughput grows and shrinks. In this example, we have good scalability up to a total of about 15 threads before we start to peak and realize no further gains, and we can see that scalability is better when there are somewhat more producers than consumers in the system.


Figure 1: Sample graph for measuring scalability of the same algorithm for different workloads.

Figure 2 directly compares different candidate algorithms running the same kind of workload. Peak throughput occurs, naturally enough, at the peak of each curve. Scalability shows up directly as the left-hand ascending side of the curve; the steeper it is, and the farther to the right that it goes before topping out, the more scalable our code will be. Here, the blue algorithm demonstrates the horror of negative scalability; it actually manages to get less work done using additional cores, which probably won't earn us our next raise.


Figure 2: Sample graph for measuring scalability of alternative algorithms for the same workload.

But both figures also let us see two basic penalties. Contention arises when different workers interfere with each other by fighting for resources, such as contending for mutexes, cache space, or cache lines via false sharing. In the most extreme case, adding a new worker might actually cost more in contention than it adds in extra work, resulting in less total work being done, and we can see this extreme effect in both graphs: In Figure 1, in several directions we reach areas where adding more workers makes total throughput actually go down. In Figure 2, we see the same effect in the form of the right-hand downslope, where adding more work actually decreases throughput even when there are otherwise-idle cores. But Figure 2 also lets us more clearly see the effect of contention before it gets that far: On the left-hand upslope, even as throughput is still rising, the rate of increase is slowing down as the curve begins to bend. That's a classic effect of growing contention.
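
False sharing in particular is easy to create by accident. Here is a small illustrative sketch (an assumed example, not taken from this article): two threads increment logically independent counters, but if the counters happen to sit on the same cache line, the line ping-pongs between cores and throughput suffers.

#include <atomic>
#include <cstddef>
#include <thread>

constexpr std::size_t kCacheLine = 64;  // typical cache-line size

struct Packed {                  // a and b likely share one cache line
  std::atomic<long> a{0}, b{0};
};

struct Padded {                  // a and b forced onto separate lines
  alignas(kCacheLine) std::atomic<long> a{0};
  alignas(kCacheLine) std::atomic<long> b{0};
};

template<typename Counters>
void HammerBoth(Counters& c, long iterations) {
  std::thread t1([&] { for (long i = 0; i < iterations; ++i) ++c.a; });
  std::thread t2([&] { for (long i = 0; i < iterations; ++i) ++c.b; });
  t1.join();
  t2.join();
}

Timing HammerBoth on Packed versus Padded typically shows a substantial gap, even though the two threads never touch the same variable.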

The other basic penalty is oversubscription, or having more CPU-bound work ready to execute than we have available hardware. These samples were taken from test runs on a 24-core machine; sure enough, in Figure 1 we see a faint diagonal line where #producers + #consumers = 24, above which throughput is noticeably thinner; sometimes the result is even more dramatic. Similarly, in Figure 2 we see even the best algorithms can't scale beyond the available cores, and incur a penalty for trying to exceed that number by adding contention at least for CPU time and often also for other resources.
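
A simple defense against oversubscribing with CPU-bound work is to size the number of ready workers to the hardware rather than to the amount of pending work. A minimal sketch:

#include <algorithm>
#include <thread>

// Sketch: cap CPU-bound workers at the core count; queue the rest
unsigned ChooseWorkerCount(unsigned pendingTasks) {
  // hardware_concurrency() may return 0 when it can't tell; assume 1
  unsigned cores = std::max(1u, std::thread::hardware_concurrency());
  return std::min(pendingTasks, cores);
}

This is essentially what a good thread pool does for us automatically, as we'll see below.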

With these fundamentals in mind, let's consider a few specific costs that arise and impact scalability because of contention, oversubscription, and other effects.

Sources of Overhead, and Threads versus Pools

We incur a basic concurrency overhead from just expressing code in a parallel way. Consider the following toy example that performs three independent subcomputations to generate a result:


// Example 1: Sequential code in MyApp 1.0
//
int NetSales() {
  int wholesale = CalcWholesale();
  int retail = CalcRetail();
  int returns = TotalReturns();
  return wholesale + retail - returns;
}

Assume that this code is entirely CPU-bound, and that the subcomputations really are independent and free of side effects on each other. Then we can get the answer faster on more cores by running the subparts in parallel. Here's some pseudocode showing how to accomplish this using futures and the convenience of lambda functions, but with a common mistake:


// Example 2 (flawed): Naïve parallel code for MyApp 2.0?
//
int NetSales() {
  // perform the subcomputations concurrently
  future<int> wholesale = new thread( [] { return CalcWholesale(); } );
  future<int> retail    = new thread( [] { return CalcRetail(); } );
  future<int> returns   = new thread( [] { return TotalReturns(); } );

  // now block for the results
  return wholesale.value() + retail.value() - returns.value();
}

Seeing the words new thread explicitly in code is often an indicator that the code may not be as scalable as it could be. In most environments, it's much less efficient to spin up a new thread for each piece of work than to run the work on a thread pool: First, spinning up a new thread and throwing it away again each time incurs substantially more overhead than giving work to an existing pool thread. Second, spinning up a number of threads and turning them loose to fight for available cores via the operating system scheduler can cause needless contention whenever we spin up more work than there are cores currently available, which can happen not only on low-core hardware but also on many-core hardware if the application or the system happens to be doing a lot of other work at that moment. Both situations are different kinds of oversubscription, and some threads will have to incur extra context switches to interleave on an available core, when sharing the core by simply running one piece of work after the other would be both more efficient and more cache-friendly.

Thread pools address these problems because a pool is designed to "rightsize" itself to the amount of hardware concurrency available on the machine. The pool will automatically try to maintain exactly one ready thread per core, and if there is more work than cores the pool naturally queues the work up and lets the machine concentrate on only as many tasks at a time as it has cores to perform. Here's pseudocode for a revised version that uses pools instead:


// Example 3 (better): Partly fixed parallel code for MyApp 2.0
//
int NetSales() {
  // perform the subcomputations concurrently
  future<int> wholesale = pool.run( [] { return CalcWholesale(); } );
  future<int> retail    = pool.run( [] { return CalcRetail(); } );
  future<int> returns   = pool.run( [] { return TotalReturns(); } );

  // now block for the results
  return wholesale.value() + retail.value() - returns.value();
}

The good news is that this can enable us to get the answer faster on more cores, at least up to three cores. But there are costs, too: With today's thread pools, we typically pay a tax of two context switches when we ship work over to a pool thread and then ship the result back. How can we reduce this cost?
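
One common answer, sketched below (my sketch, not the article's code, using the same assumed pool API as Example 3): ship only two of the three subcomputations to the pool, and run the third on the calling thread, which would otherwise sit blocked waiting for results. That removes one round trip of context switches.

// Sketch (assumed): run one subcomputation on the calling thread
//
int NetSales() {
  // ship two subcomputations to the pool...
  future<int> wholesale = pool.run( [] { return CalcWholesale(); } );
  future<int> retail    = pool.run( [] { return CalcRetail(); } );

  // ...and do the third ourselves instead of just blocking
  int returns = TotalReturns();

  return wholesale.value() + retail.value() - returns;
}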

