December 01, 2008
Measuring Parallel Performance: Optimizing a Concurrent Queue
A Word About Trendlines
Beware automatic trendlines: They are often useful, but they can also lie.
Look closely again at Figure 6: The trendlines actually obscure what's really going on by creating a false visual smoothness and continuity that our human eyes are all too willing to believe. Try to imagine the graph without any trendlines, and ask yourself: Are these really the trendlines you would draw? Yes, they're close to the data up to about 21 threads. But they under-represent throughput and scalability at the machine-saturation point of 24 threads, and then don't account for the sudden drop and increased variability beyond saturation.
One of the weaknesses of curve-fitting trendlines is that they expect all of the data to fit one consistent pattern, but that is often only true within certain regions. Trendlines don't deal well with discontinuities, where we shift from one region to another one with fundamentally different characteristics and assumptions. Clearly, using fewer cores than exist on the actual hardware and trying to use more cores than exist are two different regions, and what the data correctly shows is that there's a jump between these regions.
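To make the point concrete, here is a minimal sketch of fitting each region separately instead of forcing one curve through all of the data. It is not how the figures in this article were produced; the names `Sample`, `Fit`, `fit_line`, and `fit_by_region` are hypothetical, the fit is an ordinary least-squares straight line, and the samples are assumed to be sorted by thread count with at least two points on each side of the core count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One measured point: thread count and observed throughput.
struct Sample { double threads; double throughput; };

// A least-squares line y = slope * x + intercept, plus its sum of squared errors.
struct Fit { double slope; double intercept; double sse; };

// Fit a straight line to samples in the half-open range [first, last).
Fit fit_line(const std::vector<Sample>& s, std::size_t first, std::size_t last) {
    assert(last <= s.size() && last - first >= 2);
    const double n = static_cast<double>(last - first);
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = first; i < last; ++i) {
        sx  += s[i].threads;
        sy  += s[i].throughput;
        sxx += s[i].threads * s[i].threads;
        sxy += s[i].threads * s[i].throughput;
    }
    const double denom = n * sxx - sx * sx;
    assert(denom != 0.0);  // needs at least two distinct thread counts

    Fit f;
    f.slope = (n * sxy - sx * sy) / denom;
    f.intercept = (sy - f.slope * sx) / n;
    f.sse = 0;
    for (std::size_t i = first; i < last; ++i) {
        const double err = s[i].throughput - (f.slope * s[i].threads + f.intercept);
        f.sse += err * err;
    }
    return f;
}

// Two independent fits: one for thread counts up to `cores`, one beyond.
struct RegionFit { Fit below; Fit above; };

RegionFit fit_by_region(const std::vector<Sample>& s, double cores) {
    std::size_t split = 0;
    while (split < s.size() && s[split].threads <= cores) ++split;
    return { fit_line(s, 0, split), fit_line(s, split, s.size()) };
}
```

Comparing the `sse` of a single fit over the whole range with the two `sse` values from `fit_by_region` gives a rough numeric check of what the eye should already suspect: if the single line's error is much larger, the data probably spans more than one region.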
If we replace the automatically generated trendlines with hand-fitted ones, a very different and much truer picture emerges, as shown in Figure 7.
Figure 7: Another view of Figure 6, without misleading automatic trendlines.
Now the discontinuity is glaring and clear: On the left-hand graph, we move from a region of consistent and tight throughput increase through a wall, beyond which we experience a sudden drop and find ourselves in a new region where throughput is both dramatically lower and far less consistent and predictable; for example, multiple runs of the same Example 4 code at >24 threads show dramatically variable results. On the right-hand graph, we move from a region of good and linearly decreasing scalability through a sudden drop, beyond which lies a new region where scalability too is both much lower and more variable.
What Have We Learned?
To improve scalability, we need to minimize contention:
To understand our code's scalability, we need to know what to measure and what to look for in the results:
Be a scientist: Gather data. Analyze it. Especially when it comes to parallelism and scalability, there's just no substitute for the advice to measure, measure, measure, and understand what the results mean. Putting together test harnesses and generating and analyzing numbers is work, but the work will reward you with a priceless understanding of how your code actually runs, especially on parallel hardware, an understanding you will never gain from just reading the code or in any other way. And then, at the end, you will ship high-quality parallel code not because you think it's fast enough, but because you know under what circumstances it is and isn't (there will always be an "isn't"), and why.
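As one small piece of that analysis work, here is a hedged sketch of summarizing repeated runs at a single thread count, so that run-to-run variability shows up as a number rather than an impression. The names `RunStats` and `summarize` are hypothetical and are not part of the test harness used for this article; the input is assumed to be the throughput measured in each of at least two runs.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Summary of repeated throughput measurements at one thread count.
struct RunStats {
    double mean;    // average throughput across runs
    double stddev;  // sample standard deviation
    double cv;      // coefficient of variation: stddev / mean
};

RunStats summarize(const std::vector<double>& runs) {
    assert(runs.size() >= 2);  // need at least two runs to estimate spread

    double sum = 0;
    for (double r : runs) sum += r;
    const double mean = sum / static_cast<double>(runs.size());

    double squares = 0;
    for (double r : runs) squares += (r - mean) * (r - mean);
    const double stddev = std::sqrt(squares / static_cast<double>(runs.size() - 1));

    // Guard against dividing by zero if every run reported zero throughput.
    return { mean, stddev, mean != 0.0 ? stddev / mean : 0.0 };
}
```

A `cv` that stays small up to the core count and then jumps is one numeric sign of the kind of unpredictability described above for oversubscribed thread counts.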
Notes
[1] H. Sutter. "Writing Lock-Free Code: A Corrected Queue" (DDJ, October 2008).
[2] H. Sutter. "Maximize Locality, Minimize Contention" (DDJ, September 2008). www.ddj.com/architect/208200273.