Use Thread Pools Correctly: Keep Tasks Short and Nonblocking

Tools
  • Print
  • Email
How many threads must a thread pool pool? For a thread pool must pool threads.

Herb Sutter is a bestselling author and consultant on software development topics, and a software architect at Microsoft. He can be contacted at www.gotw.ca


What are thread pools for, and how can you use them effectively? As shown in Figure 1, thread pools are about letting the programmer express lots of pieces of independent work as a "sea" of tasks to be executed, and then running that work on a set of threads whose number is automatically chosen to spread the work across the hardware parallelism available on the machine (typically, the number of hardware cores [1]). Conceptually, this lets us execute the tasks correctly one at a time on a single-core machine, execute them faster by running four at a time on a four-core machine, and so on.

Figure 1: Thread pools are about taking a sea of work items and spreading them across available parallel hardware.

Besides scalable tasks, one other good candidate of work to run on a thread pool is the small "one-shot" thread. This is work that we might ordinarily express as a separate thread, but that is so short that the overhead of creating a thread is comparable to the work itself. Instead of creating a brand new thread and quickly throwing it away again, we can avoid the thread creation overhead by running the work on a thread pool, in effect playing "rent-a-thread" to reuse an existing pool thread instead. (See [2] for more about using threads correctly, including running small threads as pool work items.)

But the thread pool is a leaky abstraction. That is, the pool hides a lot of details from us, but to use it effectively we do need to be aware of some things a pool does under the covers so that we can avoid inadvertently hitting performance and correctness pitfalls. Here's the summary up front:

  • Tasks should be small, but not too small, otherwise performance overheads will dominate.
  • Tasks should avoid blocking (waiting idly for other events, including inbound messages or contested locks), otherwise the pool won't consistently utilize the hardware well -- and, in the extreme worst case, the pool could even deadlock.

Let's see why.

Tasks Should Be Small, but Not Too Small

Thread pool tasks should be as small as possible, but no smaller.

One reason to prefer making tasks short is because short tasks can spread more evenly and thus use hardware resources well. In Figure 1, notice that we keep the full machine busy until we start to run out of work, and then we have a ragged ending as some threads complete their work sooner and sit idle while others continue working for a time. The larger the tasks, the more unwieldy the pool's workload is, and the harder it will be to spread the work evenly across the machine all the time.

On the other hand, tasks shouldn't be too short because there is a real cost to executing work as a thread pool task. Consider this code:

// Example 1: Running work on a thread pool. pool.run( [=] { SomeWork(); } );

By definition, SomeWork must be queued up in the pool and then run on a different thread than the original thread. This means we necessarily incur queuing overhead plus a context switch just to move the work to the pool. If we need to communicate an answer back to the original thread, such as through a message or Future or similar, we will incur another context switch for that. Clearly, we aren't going to want to ship int result = int1 + int2; over to a thread pool as a distinct task, even if it could run independently of other work. It's just like the sign you see in a theme park at the entrance to the roller coaster: "You must be at least this big to go on this ride."

So although we like to keep thread pool tasks small, a task should still be big enough to be worth the overhead of executing it on the pool. Measure the overhead of shipping an empty task on your particular thread pool implementation, and as a rule of thumb, aim to make the work you actually ship an order of magnitude larger.

PAGE 1 | 2 | 3 Next Page »

Real World Parallelism Webinar Series
  • November 17, 2009
    Visual Effects for Animation - presented by DreamWorks Animation
    Speaker: Ron Henderson (Bio)

    Ron Henderson manages the FX Tools group at DreamWorks Animation, where he is responsible for developing physical simulation and procedural modeling tools. These systems have been used for key visual effects in recent films such as Kung Fu Panda and Monsters vs. Aliens (March 2009).

    Prior to joining DreamWorks in 2002 he was a senior scientist at Caltech with a joint appointment to the Applied Math and Aeronautics departments, where he worked on efficient techniques for the direct numerical simulation of fluid turbulence.

    Abstract:
    In this webinar, Ron Henderson will show examples of visual effects, from hair and feathers to smoke and fire, from a variety of DreamWorks Animation feature films. He will discuss in general terms the kinds of techniques used to achieve particular visual effects. Finally, Henderson will show a detailed breakdown of the dam-breaking scene from Madagascar: Escape 2 Africa, demonstrating how different elements of key frame animation, simulation, and rendering are combined in a real production shot.

  • December 1, 2009
    A Quick and Easy Way to Parallelize a Legacy Codebase with Intel® Threading Building Blocks (TBBs)
    Speaker: Bernard Laberge, Avid, Senior Principal Engineer (Bio)

    Bernard Laberge is a senior principal engineer in the video editors division at Avid. During his seven years with the company he has been actively involved in the replacement of the legacy video processing engines used by Avid editors with a common hardware-abstracted, component-based video processing engine currently running on the CPU with SIMD optimized code, GPU, and dedicated hardware.

    Abstract:
    Learn how to overcome the limitations of a thread-based scheduler, including dealing with the absence of recursive parallelism support and the inefficient handling of unbalanced processing load. Bernard Laberge addresses how Avid resolved the expensive refactoring of their thread-based scheduler into a task-based solution by choosing Intel® Threading Building Blocks (TBBs). He explores how Avid was able to easily integrate the Intel TBBs into their video editor applications and more than 5 million lines of code.

  • December 15, 2009
    How to Use Intel® Parallel Studio to Streamline Code Development in a Multicore Environment
    Speaker: Matt Dunbar, Director for Performance Technology, SIMULIA (Bio)

    Matt Dunbar is the director for performance technology at SIMULIA. Since joining the company in 1993, he has worked on parallelization of the Abaqus suite of products, initially for shared memory architectures and more recently for distributed memory architectures. Dunbar has also been intimately involved in selecting both the hardware and software tools used in the development of the Abaqus product line.

    Abstract:
    Resolve elusive, costly multithreading errors quickly and efficiently with Intel® Parallel Studio. While many coding problems that lead to bugs in software applications are typically straightforward logic errors, errors in managing memory and in multithreading code can sometimes take weeks to months to diagnose and fix. Matt Dunbar explores how and why taking advantage of multicore processors through multithreaded code is critical for compute-intensive applications. While spotlighting his work on SIMULIA's Abaqus finite element solver, Dunbar addresses the need for multicore execution and shares his experiences using Intel Parallel Studio to streamline code development in a multicore environment.