A Cute Technique for Avoiding Certain Race Conditions
Here's the problem: You have N particles. Particle j exerts force f(i,j) upon particle i. The usual action/reaction law holds, i.e., f(i,j) == -f(j,i). Compute the total force acting on each particle, in parallel. It turns out that the solution is nontrivial.
The natural sequential solution is something like this:
for (int i = 0; i < N; ++i)
for (int j = i; j < N; ++j) {
double t = f(i, j);
force[i] += t;
force[j] -= t;
}
The sequential implementation calls f(i,j) once to compute both the force f(i,j) exerted by particle j upon particle i, and the force -f(i,j) exerted by i upon j. Calling f once is better than calling f twice, especially because f tends to be expensive to compute (e.g., computing gravitational forces involves a division and a square root.)
The serial solution does not parallelize easily, however. You cannot execute the outer loop in parallel, because then we have a race condition among parallel updates of force[j] for different values of i. For a similar reason, parallelizing the inner loop does not work either.
If you are willing to call f twice, both both as f(i,j) and as f(j,i), then a simple parallel implementation looks like this:
cilk_for (int i = 0; i < N; ++i) for (int j = 0; j < N; ++j) force[i] += f(i, j);
That is, for each particle i (in parallel) compute (sequentially) the force exerted upon i by all other particles j. The inner loop is sequential, which avoids the race condition on the update of force[i]. However, this parallel solution calls f twice as much as the serial one, which is not good.
So how do you do it? My solution appears below. It calls f only once for every pair (i,j), for i <= j, so it is does not double the work, and it is race-free. It is a complete Cilk++program that you can run and certify to be race-free by using Cilkscreen, a race-detecting tool .
#include
#define N 749 // number of particles
double force[N];
double f(int i, int j)
{
// compute your favorite force here. We use a constant
// force 1.0 for illustration purposes
return 1.0;
}
void basecase(int i, int j)
{
assert(i <= j);
double t = f(i, j);
force[i] += t;
force[j] -= t;
}
/* traverse the rectangle i0 <= i < i1, j0 <= j < j1 */
void rect(int i0, int i1, int j0, int j1)
{
int di = i1 - i0, dj = j1 - j0;
if (di > 1 && dj > 1) {
int im = i0 + di / 2;
int jm = j0 + dj / 2;
cilk_spawn rect(i0, im, j0, jm);
cilk_spawn rect(im, i1, jm, j1);
cilk_sync;
cilk_spawn rect(i0, im, jm, j1);
cilk_spawn rect(im, i1, j0, jm);
} else {
for (int i = i0; i < i1; ++i)
for (int j = j0; j < j1; ++j)
basecase(i, j);
}
}
/* traverse the triangle n0 <= i <= j < n1 */
void interact(int n0, int n1)
{
int dn = n1 - n0;
if (dn > 1) {
int nm = n0 + dn / 2;
cilk_spawn interact(n0, nm);
cilk_spawn interact(nm, n1);
cilk_sync;
rect(n0, nm, nm, n1);
} else if (dn == 1) {
basecase(n0, n0);
}
}
int cilk_main()
{
interact(0, N);
}
In this program, the sequential program is equivalent to calling basecase(i,j) for all 0 <= i <= j < N. Another way to look at it is that we call basecase(i,j) for all points (i,j) in the triangle 0 <= i <= j < N, which looks like the shaded area in this figure:
Procedure interact traverses this triangle in parallel, and in fact it is a little bit more general, because it traverses triangles of the form n0 <= i <= j < n1. Initially, we set n0 = 0 and n1 = N in cilk_main.
Procedure interact works by recursively partitioning the triangle. If the triangle consists of only one point, then it visits the point (n0, n0) directly. Otherwise, the procedure cuts the triangle into one rectangle and two triangles, like this:
The two smaller triangles can be executed in parallel, because one consists only of points (i,j) such that i < nm and j < nm, and the other consists only of points (i,j) such that i >= nm and j >= nm. Thus, the two triangles update nonoverlapping regions of the force array, and thus they do not race with each other. However, the rectangle races with both triangles, and thus we need the cilk_sync statement before processing the rectangle.
To traverse a rectangle we use procedure rect, which also works recursively. Specifically, if the rectangle is large enough, the procedure cuts the rectangle i0 <= i < i1 , j0 <= j < j1, into four smaller subrectangles.
The amazing thing is that the two black subrectangles can be traversed in parallel with each other without races. Similarly, the two gray subrectangles can be traversed in parallel with each other without races. However, the black subrectangles race with the gray, so we must use a cilk_sync statement after processing the first pair of subrectangles.
To see why there are no races between the two black subrectangles (the same argument applies to the grey) observe that the i -ranges of the two subrectangles do not overlap, because one is smaller than im and the other is larger. For the same reason, the j -ranges do not overlap either. In order for races not to occur, however, we must also prove that the i -range of one subrectangle does not overlap with the j -range of the other, because the base case updates both force[i] and force[j]. This property holds because when interact calls rect initially, the i -range is n0 <= i < nm, whereas the j -range is nm <= j < n1 , so the two ranges never overlap.
This Week's Multicore Reading List
MATLAB and Google App Engine
Logging In C++ : Part 2
Improving log granularityA Conversation with BitMagic's Developer
Prefer Structured Lifetimes: Local, Nested, Bounded, Deterministic
- Intel Parallel Studio; Download the free eval today!
- Parallelism Breakthrough Video Series; Watch and learn more about Intel® Parallel Studio
- 2009 Intel Software Webinar Series; View On-Demand webinars
- Coding for Multi-core Processes; Intel® Compiler Pro eBook
- Performance Through Parallelism; Intel® Tuning for Vista eBook
- Intel® Software Network; Connect with developers and Intel engineers
-
November 17, 2009
Visual Effects for Animation - presented by DreamWorks Animation
Speaker: Ron Henderson (Bio)Ron Henderson manages the FX Tools group at DreamWorks Animation, where he is responsible for developing physical simulation and procedural modeling tools. These systems have been used for key visual effects in recent films such as Kung Fu Panda and Monsters vs. Aliens (March 2009).
Prior to joining DreamWorks in 2002 he was a senior scientist at Caltech with a joint appointment to the Applied Math and Aeronautics departments, where he worked on efficient techniques for the direct numerical simulation of fluid turbulence.Abstract:
In this webinar, Ron Henderson will show examples of visual effects, from hair and feathers to smoke and fire, from a variety of DreamWorks Animation feature films. He will discuss in general terms the kinds of techniques used to achieve particular visual effects. Finally, Henderson will show a detailed breakdown of the dam-breaking scene from Madagascar: Escape 2 Africa, demonstrating how different elements of key frame animation, simulation, and rendering are combined in a real production shot. -
December 1, 2009
A Quick and Easy Way to Parallelize a Legacy Codebase with Intel® Threading Building Blocks (TBBs)
Speaker: Bernard Laberge, Avid, Senior Principal Engineer (Bio)Bernard Laberge is a senior principal engineer in the video editors division at Avid. During his seven years with the company he has been actively involved in the replacement of the legacy video processing engines used by Avid editors with a common hardware-abstracted, component-based video processing engine currently running on the CPU with SIMD optimized code, GPU, and dedicated hardware.
Abstract:
Learn how to overcome the limitations of a thread-based scheduler, including dealing with the absence of recursive parallelism support and the inefficient handling of unbalanced processing load. Bernard Laberge addresses how Avid resolved the expensive refactoring of their thread-based scheduler into a task-based solution by choosing Intel® Threading Building Blocks (TBBs). He explores how Avid was able to easily integrate the Intel TBBs into their video editor applications and more than 5 million lines of code. -
December 15, 2009
How to Use Intel® Parallel Studio to Streamline Code Development in a Multicore Environment
Speaker: Matt Dunbar, Director for Performance Technology, SIMULIA (Bio)Matt Dunbar is the director for performance technology at SIMULIA. Since joining the company in 1993, he has worked on parallelization of the Abaqus suite of products, initially for shared memory architectures and more recently for distributed memory architectures. Dunbar has also been intimately involved in selecting both the hardware and software tools used in the development of the Abaqus product line.
Abstract:
Resolve elusive, costly multithreading errors quickly and efficiently with Intel® Parallel Studio. While many coding problems that lead to bugs in software applications are typically straightforward logic errors, errors in managing memory and in multithreading code can sometimes take weeks to months to diagnose and fix. Matt Dunbar explores how and why taking advantage of multicore processors through multithreaded code is critical for compute-intensive applications. While spotlighting his work on SIMULIA's Abaqus finite element solver, Dunbar addresses the need for multicore execution and shares his experiences using Intel Parallel Studio to streamline code development in a multicore environment.



