Going Parallel: Part 3: Let's Get Started!
When I first dipped my toes into parallel programming I hunted around for a legacy application to change from serial to parallel. Rather than work on one of my own apps, I looked for a serial application on the web that had lots of CPU activity. I ended up choosing the Dhrystone benchmark. It was a great learning experience. In the next few blog entries I'm going to use the same code and go through the process of making it parallel. I do remember that the first time I did this it took me quite a number of sessions to get the job completed. Lots of silly errors. Learning to use the tools at the same time as learning how to add the parallelism took more time than I expected. On one hand, I was proud that I had made a program parallel. On the other hand, I was a little disappointed that it took me so much time to parallelise a program not that much more complicated than hello world.
In this part of the Going Parallel blog I'll show how I identified where to parallelise.
Links to the previous blog entries
Going Parallel: Part 1: Doing Two Things at Once - Impossible!
Going Parallel: Part 2: So Who's Really Writing Parallel Applications?
The Application
So, on with the show! I'm curious as to how I'll get on. I'm a lot wiser than when I first did this exercise in the dim and distant past. Also, there are many more tools now available to help in the task, not least the upcoming Intel Parallel Studio. The first thing was to get the source code, which I obtained from http://www.netlib.org/benchmark/dhry-cc. The file is a self-extracting shell script. If you don't have a shell available on your Windows machine, you can get a copy of my extracted and modified files here: dhry_1.c, dhry_2.c, dhry.h
Building the Code
I'm using the latest released version of the Intel Compiler (v11.0.72). Building was straightforward:
icl /ZI /DTIME dhry_1.c dhry_2.c /o intel.exe
I also built a Microsoft version:
cl /Zi /DTIME dhry_1.c dhry_2.c /o ms.exe
Modifying the Source Code
I modified the code so I could pass in the loop count at the command prompt. The snippet below shows the change: main now takes argc and argv, and the original prompt is used only when no argument is given.
main (int argc, char * argv[])
.
.
.
  if (argc > 1) {
    Number_Of_Runs = atoi(argv[1]);
  }
  else
  {
    printf ("Please give the number of runs through the benchmark: ");
    {
      int n;
      scanf ("%d", &n);
      Number_Of_Runs = n;
    }
    printf ("\n");
  }
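One thing to watch with atoi is that it returns 0 for text that is not a number, so a mistyped argument would quietly give zero runs. A minimal sketch of a stricter check using strtol is shown below. The helper name parse_run_count is hypothetical and is not part of the Dhrystone sources.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper (not part of dhry_1.c): parse a run count from text.
   Returns the count, or -1 if the text is not a positive integer that
   fits in an int. */
int parse_run_count(const char *text)
{
    char *end = NULL;
    long value;

    assert(text != NULL);

    errno = 0;
    value = strtol(text, &end, 10);

    /* Reject input with no digits at all, or with trailing junk. */
    if (end == text || *end != '\0')
        return -1;

    /* Reject overflow, zero, negatives, and values too large for int. */
    if (errno == ERANGE || value <= 0 || value > INT_MAX)
        return -1;

    return (int)value;
}
```

In main this could stand in for the atoi call, for example by assigning the result to Number_Of_Runs and falling back to the prompt when it is -1.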
Why Does the Intel-built Application Run So Much Faster?
The first thing I noticed is that the executable built with the Intel compiler runs much faster than the Microsoft version. Built with the Intel compiler and the /O2 option, it runs roughly 3 times faster.
For the moment I'm going to use the /Od option and get on with the task of making the program parallel. In a subsequent blog I'll do some analysis of the speedup. My suspicion is that the number of calculations has been reduced by the optimisation process, which strictly speaking is not what I want for the benchmark.
| Compiler | Optimisation Flag | Num Loops | Dhrystones per Second | Improvement Ratio WRT Row 1 |
| Microsoft | /O2 (for speed) | 10000000 | 1.6 M | 1 |
| Intel | /O2 (for speed) | 10000000 | 5.0 M | 3 |
| Microsoft | /Od (no optimisation) | 10000000 | 1.6 M | 1 |
| Intel | /Od (no optimisation) | 10000000 | 1.6 M | 1 |
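As a general illustration of how an optimiser can reduce the number of calculations (not a diagnosis of the Dhrystone result, which I have not analysed yet), here is a small sketch. Both functions are hypothetical. In the first, the result is never used, so a compiler is allowed to remove the loop entirely. In the second, the result is returned, so the value still has to be produced, although the compiler remains free to compute it in a cheaper way.

```c
#include <assert.h>

/* Hypothetical workload whose result is thrown away. Because nothing
   observable depends on sum, an optimising compiler may delete the loop. */
void work_discarded(int runs)
{
    int i;
    long sum = 0;

    for (i = 1; i <= runs; ++i)
        sum += i % 7;

    (void)sum;  /* result deliberately unused */
}

/* The same loop, but the caller receives the result, so the value
   must still be computed one way or another. */
long work_kept(int runs)
{
    int i;
    long sum = 0;

    assert(runs >= 0);

    for (i = 1; i <= runs; ++i)
        sum += i % 7;

    return sum;
}
```

Timing a loop like the first one at /O2 would say more about the optimiser than about the processor, which is the kind of effect I want to rule out.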
Finding the Hotspot
I'm going to use Intel VTune to determine where the hotspots are. The next few screens show the loading and running of the program under VTune. In the next blog I'll use Intel Parallel Amplifier to look for hotspots.
Looking at the number of clock-tick samples shows that a number of functions contribute to the activity of the application. In some applications it may be that just one or two functions are "hot". In this code the activity is spread across quite a number of functions.
Strategy for Next Steps
It looks to me like the best strategy would be to parallelise the high-level loop in main.
for (Run_Index = 1; Run_Index <= Number_Of_Runs; ++Run_Index){
In the next blog, I'll introduce some parallelism using OpenMP, then try to discover what pitfalls I'm falling into.
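As a rough preview of that strategy, the sketch below applies an OpenMP parallel for to a loop of the same shape. It is not the Dhrystone code: one_run and run_all are hypothetical stand-ins, and one_run deliberately touches no shared data, which is an assumption that may not hold for the real loop body.

```c
#include <assert.h>

/* Hypothetical stand-in for one pass through the benchmark body.
   It reads and writes no shared state. */
static long one_run(int run_index)
{
    return (long)(run_index % 10);
}

/* Run all the passes and add up their results. */
long run_all(int number_of_runs)
{
    long total = 0;
    int run_index;

    assert(number_of_runs >= 0);

    /* Each thread takes a share of the iterations and accumulates into
       its own private copy of total; the copies are summed at the end.
       If OpenMP is not enabled the pragma is ignored and the loop
       simply runs serially. */
    #pragma omp parallel for reduction(+:total)
    for (run_index = 1; run_index <= number_of_runs; ++run_index) {
        total += one_run(run_index);
    }

    return total;
}
```

OpenMP has to be switched on at build time, for example with /Qopenmp for icl on Windows or /openmp for cl. The reduction clause matters here: without it, every thread would update the same total and the result could be wrong.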
This Week's Multicore Reading List
MATLAB and Google App Engine
Logging In C++ : Part 2
Improving log granularity
A Conversation with BitMagic's Developer
Prefer Structured Lifetimes: Local, Nested, Bounded, Deterministic



