FREE Subscription to Dr. Dobb’s Digest: Same Great Content, New Digital Edition
Site Archive (Complete)
High Performance Computing
Email
Print
Reprint

add to:
Del.icio.us
Digg
Google
Furl
Slashdot
Y! MyWeb
Blink
December 01, 2008

Measuring Parallel Performance: Optimizing a Concurrent Queue

(Page 2 of 5)

Example 2: Shrinking the Consumer Critical Section

Our first optimization involves having each node allocate its contained queue item on the heap and hold it by pointer instead of by value. To experienced parallel programmers, performing an extra heap allocation might seem like a bad idea at first, because heap allocations are notorious scalability busters on many of today's memory allocators. But experience is no substitute for measurement: It turns out that, even on a system with a nonscalable allocator, the benefits often outweigh the advantages. Here, holding the queued object by pointer lets us get greater concurrency and scalability among the consumer threads, because we can take the work of actually copying its value out of the critical section of code that updates the shared data structure.

I'll show only the changed lines. In Node itself, of course, we need to hold the values by pointer, and adjust the constructor slightly to match:

// Example 2 (diffs from Example 1): // Moving the copying work out of // the consumer's critical section // struct Node { Node( T* val ) : value(val), next(nullptr) { } T* value; atomic<Node*> next; };

LowLockQueue() { first = divider = last = new Node( nullptr ); producerLock = consumerLock = false; //size = maxSize = 0; }

The key changes we've enabled are in Consume. The key is that we can move the copying of the dequeued object, and the deletion of the value, outside the critical section:

bool Consume( T& result ) { while( consumerLock.exchange(true) ) {} // acquire exclusivity

if( divider->next != nullptr ) { // if queue is nonempty

T* value = divider->next->value; // take it out divider->next->value = nullptr; // of the Node divider = divider->next; // publish that we took an item consumerLock = false; // release exclusivity

result = *value; // now copy it back to the caller delete value; return true; // and report success }

consumerLock = false; // release exclusivity return false; // queue was empty }

Produce only changes in its first line, where new Node( t ) becomes new Node( new T(t) ).

This change should allow better concurrency among consumers. But will it? Before reading on, ask yourself: How would you expect this to affect the magnitude and shape of the performance graphs? How will it affect overall throughput, scalability, contention, and the cost of oversubscription?

Measuring Example 2

Figure 2 shows the results of running Example 2 in the same test harness and on the same hardware. Again, larger circles mean better throughput on the same graph, and pay attention to each graph's "max" (largest circle) and "min" (smallest circle) notes.

Clearly, both throughput and scalability have improved across the board. Our peak throughput is better by a factor of 3 for small objects, and by an order of magnitude for large objects. From this we can deduce that Example 1's throttling of consumers was a severe limiting factor. But the reason peak throughput improved is because we improved scalability: We reach the peak now at larger numbers of producers and consumers. For large objects, we're able to saturate the 24-core machine, and get peak throughput at 15 producer threads and 9 consumer threads.

Unfortunately, for small objects we peak too early and can only usefully harness about half of the available cores before contention is the limiting factor, and adding more producers and consumers, even on otherwise-idle cores, actually accomplishes less total work. For small objects, we don't even reach the dashed line. Even for large objects, contention among consumers still seems to be a limiting factor as shown by the thinness of the top half of the graph.

[Click image to view at full size]

Figure 2: Example 2 throughput (total items through queue per second).

Finally, note that for large objects, not only did we reach the dashed line with throughput still rising, but we spilled across it in the lower fewer-consumers region without losing much throughput. The cost of oversubscription is lessened, at least when adding extra producers; more producers doesn't get more total work done, but they don't negatively impact total work much either. Both contention and oversubscription are still bad, however, when we add extra consumers.

That's a good first step, but let's keep going.

Previous Page | 1 Example 1: Baseline | 2 Example 2: Shrinking the Consumer Critical Section | 3 Example 3: Reducing Head Contention | 4 Example 4: Padding To Avoid False Sharing | 5 A Word About Trendlines Next Page
TOP 5 ARTICLES
No Top Articles.



MICROSITES
FEATURED TOPIC

ADDITIONAL TOPICS

INFO-LINK