Intel Threading Building Blocks: parallel_for()

Time your code, find where concurrent processing fits in, sniff out shared data points and you're ready to roll with parallel_for()

Concurrency needn't be so complicated that you avoid it completely. One of the easiest ways to gain performance increases on multi-core platforms is with the parallel_for algorithm. To get a sense of how real-world developers are using Intel Threading Building Blocks, we spoke with Vincent Tan, a programmer with Pongrass Australia.

As described on the Intel Software Network, Tan created a multithreaded version of par2cmdline 0.4, a utility commonly used to repair corrupted Usenet postings via Reed-Solomon coding. By using the Intel Threading Building Blocks 2.0 library (specifically TBB's mutex, concurrent_hash_map, atomic, and parallel_for constructs), the program can process files concurrently instead of serially. As a result, dual-core machines can nearly double their performance when creating or repairing data files.

Q: How did you learn about parallel_for?

A: I read the Intel TBB tutorial and reference manuals. From there, I looked at the sample code.

Q: Was it easy to add the algorithm to your application?

A: After studying the sample code, it was straightforward to convert the code. The harder part was finding all of the shared resources (such as member variables) and then ensuring that access to them was thread-safe.
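The kind of fix Tan describes can be sketched roughly as follows. Everything in the block is hypothetical: `ProgressTracker`, `block_finished`, and `run_workers` are illustration names, not par2cmdline code. The sketch also uses standard C++ `std::atomic` and `std::thread` as stand-ins for TBB's atomic and scheduler so that it builds without the library. The point is only that a member variable updated from several threads needs an atomic (or mutex-protected) update.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a class whose member variable is updated
// from inside a parallel loop body.
class ProgressTracker {
public:
    // Called concurrently by every worker; fetch_add makes the
    // read-modify-write a single indivisible step.
    void block_finished() { blocks_done_.fetch_add(1, std::memory_order_relaxed); }

    std::size_t blocks_done() const { return blocks_done_.load(); }

private:
    std::atomic<std::size_t> blocks_done_{0};
};

// Runs `workers` threads that each report `blocks_per_worker` finished
// blocks, then returns the total the tracker recorded.
std::size_t run_workers(unsigned workers, std::size_t blocks_per_worker) {
    ProgressTracker tracker;
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&tracker, blocks_per_worker] {
            for (std::size_t i = 0; i < blocks_per_worker; ++i)
                tracker.block_finished();
        });
    }
    for (std::thread& t : pool) t.join();  // wait before reading the total
    return tracker.blocks_done();
}
```

With a plain `std::size_t` member instead of the atomic, concurrent increments could be lost and the total would come up short on some runs, which is the sort of bug that makes hunting down shared state the harder part of the conversion.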

Q: Did you make any mistakes?

A: I originally specified a grain size, but I found that it did not really help, because TBB's default behavior was already good enough for the code to which I tried to apply the grain size.
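To see what a grain size controls, here is a small model of range splitting. It is a sketch, not TBB code: `ModelRange` and `count_chunks` are hypothetical names, and the model assumes the rule documented for `tbb::blocked_range`, namely that a range is divisible while its size exceeds the grain size and that it splits at the midpoint. It counts how many chunks result if splitting continues until no piece is divisible.

```cpp
#include <cstddef>

// Hypothetical model of a splittable half-open index range [begin, end).
// Assumed rule: divisible while size() exceeds the grain size.
struct ModelRange {
    std::size_t begin;
    std::size_t end;
    std::size_t grainsize;

    std::size_t size() const { return end - begin; }
    bool is_divisible() const { return grainsize < size(); }
};

// Counts the leaf chunks produced when a range is halved recursively
// until no piece is divisible any more.
std::size_t count_chunks(const ModelRange& r) {
    if (!r.is_divisible()) return 1;
    const std::size_t middle = r.begin + r.size() / 2;
    return count_chunks({r.begin, middle, r.grainsize}) +
           count_chunks({middle, r.end, r.grainsize});
}
```

In this model, `count_chunks({0, 1000, 100})` yields 16 chunks, while a grain size of 1 yields 1000. The grain size therefore only sets a lower bound on how finely work is cut. The library's partitioner may settle on larger chunks by itself, which is one plausible reason a hand-picked value made little difference here.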

Q: What would be the most interesting use for this algorithm?

A: To be honest, I view it as a tool to solve a particular problem. The obvious for loops in the project's code pretty much dictated the use of parallel_for. I'll put it another way: If you can process elements of a random-accessible array in parallel (i.e., the elements have no interdependencies) then parallel_for is the tool you probably want.

Q: What performance or productivity benefits did you gain?

A: CPU utilization on a dual-core machine went from approximately 40-45 percent to approximately 80-85 percent. Because I/O is still performed serially (non-overlapped), the code never achieves 100 percent utilization -- but a doubling of performance is good enough for most users.

Q: How should a developer get started with parallel_for?

A: Read the Intel TBB tutorial on the Documentation page of threadingbuildingblocks.org and study the sample code. The reference manual helps out with the nitty-gritty details but you'll probably only need it if you need to specify the grain size.

Here's a snippet of parallel_for at work in the par2cmdline source code.

// par2creator.cpp::973

// New function to hold the original loop body
void ProcessData(u32 outputblk, u32 endindex, size_t blklength, u32 inputblk)
{
  for( ; outputblk != endindex; ++outputblk ) {
    // Select the appropriate part of the output buffer
    void *outbuf = &((u8*)outputbuf)[chunksize * outputblk];

    // Process the data through the RS matrix
    rs.Process(blklength, inputblk, inputbuf, outputblk, outbuf);
  }
}

// Encapsulates the loop body
class ApplyRSProcess {
public:
  ApplyRSProcess(Par2Creator* obj, size_t blklength, u32 inputblk)
    : _obj(obj), _blklength(blklength), _inputblk(inputblk) {}

  void operator()(const tbb::blocked_range<u32>& r) const {
    _obj->ProcessData(r.begin(), r.end(), _blklength, _inputblk);
  }

private:
  Par2Creator* _obj;
  size_t _blklength;
  u32 _inputblk;
};
