Work Stealing Queues in .Net 4 and in Previous Versions
.Net 4.0 Beta 1 offers a new and improved thread pool engine. It uses work stealing queues to provide nice load-balancing capabilities, better performance and greater scalability. The work stealing mechanism allows you to reduce the scheduling overhead in highly parallelized algorithms.
A well-known proverb says "A problem shared is a problem halved". In my previous post "Lightweight Concurrency: Threads are on a Diet", I began talking about the advantages of following James Reinders' eight key rules for multicore programming.
.Net Framework 4.0 with its Parallel Extensions, still in Beta 1, will add the possibility to work with Tasks. Thus, it will be easy to follow James Reinders' rule #3. However, it will require waiting for .Net Framework 4.0 final version. I've received many e-mails talking about the same situation: C# applications looking forward for task-based programming support in current stable .Net versions. I promised a new post offering a nice solution to this problem this week. I'll begin talking about .Net 4.0 work stealing advantages and then, I'll offer an alternative to implement something similar in .Net 2.0 or greater.
You already know the story. .Net 3.5 threading model has some limitations to create highly parallelized algorithms. The great problem is that it becomes really difficult to achieve great scalability without needing to create very complex code. The number of cores will go on increasing and you need more scalability. Work stealing queues offer a great alternative to reduce the locks and to schedule small work chunks without adding a significant overhead.
Synchronizing code running in multiple threads is indeed complex. Thus, a task-based approach offers an excellent alternative to leave some synchronization problems behind, especially those about work scheduling mechanisms.
.Net Framework 4.0 Beta 1 offers a new thread pool engine that uses a global queue and local queues. It uses work stealing queues (WSQ) to reduce the required locks. Hence, it reduces the scheduling overhead and improves overall performance. Besides, it improves scalability as the number of cores increases. With eight logical cores, you should expect at least eight threads running concurrently and scheduling tasks as necessary. This number should increase in the forthcoming years. Thus, the work stealing queues are appropriate for this new scenario.
Work stealing queues follow James Reinders' rule #5 "Avoid using locks. Simply say "no" to locks. Locks slow programs, reduce their scalability…". Therefore, the new thread pool engine reduces the number of locks and allows multicore developers to achieve greater scalabilities.
The task scheduler relies on the underlying thread pool engine. Thus, when you create a new task, it will use the steal working queues to find the most appropriate thread to enqueue it. The good news is that you'll see performance improvement in your threaded code even without using tasks.
However, there are a few problems. On the one hand, you have to wait for Visual Studio 2010 final version to take advantage of these improvements. On the other hand, you'll be able to use Parallel Extensions in neither C# 3.0 nor Silverlight 3. How can you implement a task-based approach in current C# versions?
Luckily, if you do like to research and you want to learn how things work in the background, there is a nice free implementation of a task-based work stealing threading library in Codeplex, ParallelTasks. It is a beta version, an open source homemade implementation of some of the features found in the Task Parallel Library. However, it is very interesting to read the code and to use it in your projects to implement a task-based approach. There are some issues to solve, but you'll be able to learn very interesting techniques from this project. Besides, you'll be able to use it in current C# versions, in Silverlight applications and in XNA Game Studio, among others.
If you need to code a task-oriented design and you cannot wait for Visual Studio 2010, you can try ParallelTasks and it will help you to find your way to Go Parallel.
Besides, you can find a nice demonstration of the work stealing queues on .Net 4.0's thread pool engine in this post from Daniel Moth.
You can read an in-depth analysis of work stealing queues in this great post from Parallel Extensions' lead developer and architect, Joe Duffy.
Multicore-enabling the N-Queens Problem Using Cilk++
Multicore Storage Allocation
The four basic problems that a good parallel storage allocator solvesQuickPath Interconnect: Rules of the Revolution
This Week's Multicore Reading List
- Intel Parallel Studio; Download the free eval today!
- Parallelism Breakthrough Video Series; Watch and learn more about Intel® Parallel Studio
- 2009 Intel Software Webinar Series; View On-Demand webinars
- Coding for Multi-core Processes; Intel® Compiler Pro eBook
- Performance Through Parallelism; Intel® Tuning for Vista eBook
- Intel® Software Network; Connect with developers and Intel engineers
-
November 17, 2009
Visual Effects for Animation - presented by DreamWorks Animation
Speaker: Ron Henderson (Bio)Ron Henderson manages the FX Tools group at DreamWorks Animation, where he is responsible for developing physical simulation and procedural modeling tools. These systems have been used for key visual effects in recent films such as Kung Fu Panda and Monsters vs. Aliens (March 2009).
Prior to joining DreamWorks in 2002 he was a senior scientist at Caltech with a joint appointment to the Applied Math and Aeronautics departments, where he worked on efficient techniques for the direct numerical simulation of fluid turbulence.Abstract:
In this webinar, Ron Henderson will show examples of visual effects, from hair and feathers to smoke and fire, from a variety of DreamWorks Animation feature films. He will discuss in general terms the kinds of techniques used to achieve particular visual effects. Finally, Henderson will show a detailed breakdown of the dam-breaking scene from Madagascar: Escape 2 Africa, demonstrating how different elements of key frame animation, simulation, and rendering are combined in a real production shot. -
December 1, 2009
A Quick and Easy Way to Parallelize a Legacy Codebase with Intel® Threading Building Blocks (TBBs)
Speaker: Bernard Laberge, Avid, Senior Principal Engineer (Bio)Bernard Laberge is a senior principal engineer in the video editors division at Avid. During his seven years with the company he has been actively involved in the replacement of the legacy video processing engines used by Avid editors with a common hardware-abstracted, component-based video processing engine currently running on the CPU with SIMD optimized code, GPU, and dedicated hardware.
Abstract:
Learn how to overcome the limitations of a thread-based scheduler, including dealing with the absence of recursive parallelism support and the inefficient handling of unbalanced processing load. Bernard Laberge addresses how Avid resolved the expensive refactoring of their thread-based scheduler into a task-based solution by choosing Intel® Threading Building Blocks (TBBs). He explores how Avid was able to easily integrate the Intel TBBs into their video editor applications and more than 5 million lines of code. -
December 15, 2009
How to Use Intel® Parallel Studio to Streamline Code Development in a Multicore Environment
Speaker: Matt Dunbar, Director for Performance Technology, SIMULIA (Bio)Matt Dunbar is the director for performance technology at SIMULIA. Since joining the company in 1993, he has worked on parallelization of the Abaqus suite of products, initially for shared memory architectures and more recently for distributed memory architectures. Dunbar has also been intimately involved in selecting both the hardware and software tools used in the development of the Abaqus product line.
Abstract:
Resolve elusive, costly multithreading errors quickly and efficiently with Intel® Parallel Studio. While many coding problems that lead to bugs in software applications are typically straightforward logic errors, errors in managing memory and in multithreading code can sometimes take weeks to months to diagnose and fix. Matt Dunbar explores how and why taking advantage of multicore processors through multithreaded code is critical for compute-intensive applications. While spotlighting his work on SIMULIA's Abaqus finite element solver, Dunbar addresses the need for multicore execution and shares his experiences using Intel Parallel Studio to streamline code development in a multicore environment.



