Multicore Storage Allocation
When multicore-enabling a C/C++ application, it's common to discover that malloc() (or new) is a bottleneck that limits the speedup your parallelized application can obtain. This article explains the four basic problems that a good parallel storage allocator solves:
- Thread safety
- Overhead
- Contention
- Memory drift
Thread Safety
Basic storage allocators are not thread safe, although recent efforts have started to remedy this problem for many concurrency platforms. In other words, two parallel threads that attempt to allocate or deallocate at the same time can race on the storage allocator's internal data structures and cause improper behavior. When threads have unrestricted access to the storage allocator, they may end up "stomping on each other's toes," leading to anomalous behavior.
The simple solution to this problem is for applications to acquire a mutex (mutual exclusion) lock on the allocator before calling malloc() or free(), which lets only one thread access the allocator's internal data structures at a time.
If the storage allocator is thread safe, the locking protocol is incorporated into the logic of the storage allocator itself.
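As a minimal sketch of the application-level locking described above, the following C++ wraps malloc() and free() in a single mutex. The names locked_malloc, locked_free, and run_locked_demo are hypothetical and chosen only for illustration. Mainstream C library allocators today already lock internally, so the wrapper shows the protocol rather than something you would normally need.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical wrappers for illustration; not part of any library.
static std::mutex g_alloc_mutex;  // guards every call into the allocator

void* locked_malloc(std::size_t size) {
    std::lock_guard<std::mutex> guard(g_alloc_mutex);  // one thread at a time
    return std::malloc(size);
}

void locked_free(void* p) {
    std::lock_guard<std::mutex> guard(g_alloc_mutex);
    std::free(p);
}

// Starts num_threads threads that each allocate and free allocs_per_thread
// blocks through the locked wrappers. Returns the number of allocations
// that succeeded across all threads.
std::size_t run_locked_demo(unsigned num_threads, unsigned allocs_per_thread) {
    std::vector<std::size_t> successes(num_threads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&successes, t, allocs_per_thread] {
            for (unsigned i = 0; i < allocs_per_thread; ++i) {
                void* p = locked_malloc(64);
                if (p != nullptr) {
                    ++successes[t];  // each thread writes only its own slot
                    locked_free(p);
                }
            }
        });
    }
    for (auto& w : workers) w.join();
    std::size_t total = 0;
    for (std::size_t s : successes) total += s;
    return total;
}
```

Every call pays for acquiring and releasing the lock, and all threads queue on the same mutex, which is exactly the overhead and contention discussed next.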
Overhead and Contention
Two problems may arise when an allocator is made thread safe by locking. The first is that allocation and deallocation may now be slower due to the overhead of locking. The second is that contention may arise in accessing the storage allocator, which can slow down the application and limit its scalability. Contention may not be a big problem for 2 or 4 cores, but as Moore's Law brings us dozens and even hundreds of cores per chip, contention can threaten scalability.
Both problems can be solved using a distributed allocator, which provides a local storage pool per thread.
A distributed allocator allows allocation and deallocation to run out of the local storage pool most of the time. In the uncommon case that a thread's local pool is exhausted, the thread can obtain additional storage, typically in large blocks, from the global pool. The contention problem is solved, because threads only rarely access the global pool. The overhead problem is solved as well, because no locking is needed to access the local pool.
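The scheme can be sketched in C++ as follows, under simplifying assumptions: all blocks have one fixed size, the local pool is a thread_local free list, and the global pool hands out batches carved from large chunks. GlobalPool, pool_alloc, pool_free, and the constants are hypothetical names for illustration and do not describe how any particular allocator is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <vector>

constexpr std::size_t kBlockSize = 64;    // assumed fixed block size
constexpr std::size_t kRefillCount = 32;  // blocks fetched per global-pool visit

// Global pool: shared by all threads, so every access takes the lock.
class GlobalPool {
public:
    // Carves one large chunk into blocks and appends them to a local pool.
    void refill(std::vector<void*>& local, std::size_t count) {
        std::lock_guard<std::mutex> guard(mutex_);
        char* chunk = static_cast<char*>(std::malloc(count * kBlockSize));
        if (chunk == nullptr) return;  // out of memory: local pool stays empty
        chunks_.push_back(chunk);
        for (std::size_t i = 0; i < count; ++i) {
            local.push_back(chunk + i * kBlockSize);
        }
        ++refills_;
    }

    // Number of times any thread had to visit the global pool.
    std::size_t refills() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return refills_;
    }

    ~GlobalPool() {
        for (void* c : chunks_) std::free(c);
    }

private:
    mutable std::mutex mutex_;
    std::vector<void*> chunks_;
    std::size_t refills_ = 0;
};

GlobalPool g_pool;

// Per-thread local pool: only the owning thread touches it, so no lock.
thread_local std::vector<void*> t_free_list;

void* pool_alloc() {
    if (t_free_list.empty()) {
        g_pool.refill(t_free_list, kRefillCount);  // uncommon, locked path
        if (t_free_list.empty()) return nullptr;
    }
    void* p = t_free_list.back();  // common, lock-free path
    t_free_list.pop_back();
    return p;
}

void pool_free(void* p) {
    t_free_list.push_back(p);  // freed into the calling thread's pool
}
```

In this sketch a thread takes the lock once per 32 allocations at most, and not at all once its local pool holds enough freed blocks. Note that pool_free always frees into the calling thread's pool, whichever thread allocated the block.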
Memory Drift
Unfortunately, local pools introduce yet another problem, especially on concurrency platforms that actively share storage among threads or that load-balance a computation across the threads. A thread A may continually allocate storage out of its local pool and pass it off to another thread B, which frees it into B's own local pool. When thread A's local pool runs out, it allocates more storage from the global pool. This storage is passed to B, which proceeds to free it into its local pool. Over time, B's local pool grows without bound, creating something akin to a memory leak: the virtual-memory footprint of the application continues to grow even though the storage is never lost.
This memory drift problem can be solved in two ways. One solution is for a thread whose local pool becomes too large to return some of its storage to the global pool. The other is for every thread to return storage to the local pool of the thread that allocated it. Either method can be implemented with low overhead, and both provide satisfactory solutions to the memory drift problem.
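The first solution can be illustrated with a small single-threaded simulation in C++. It is a sketch under stated assumptions: blocks are modeled as counts rather than real memory, the two LocalPool objects stand in for the local pools of threads A and B, and the batch size, the high-water mark, and all names are hypothetical. A pool that grows past its high-water mark returns half of its blocks to the global pool.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

constexpr std::size_t kBatch = 32;  // blocks fetched per global-pool visit (assumed)
constexpr std::size_t kNoLimit = static_cast<std::size_t>(-1);  // disables the bound

// Global pool, modeled as a count of free blocks.
class GlobalPool {
public:
    // Hands out count blocks, creating new ones only when it runs short.
    void take(std::size_t count) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (free_blocks_ < count) {
            total_created_ += count - free_blocks_;
            free_blocks_ = count;
        }
        free_blocks_ -= count;
    }
    void give_back(std::size_t count) {
        std::lock_guard<std::mutex> guard(mutex_);
        free_blocks_ += count;
    }
    // Total blocks ever created: a stand-in for the memory footprint.
    std::size_t total_created() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return total_created_;
    }
private:
    mutable std::mutex mutex_;
    std::size_t free_blocks_ = 0;
    std::size_t total_created_ = 0;
};

// Local pool with a high-water mark.
class LocalPool {
public:
    LocalPool(GlobalPool& global, std::size_t high_water)
        : global_(global), high_water_(high_water) {}

    void allocate() {
        if (free_blocks_ == 0) {  // local pool exhausted: refill in bulk
            global_.take(kBatch);
            free_blocks_ = kBatch;
        }
        --free_blocks_;
    }

    void release() {
        ++free_blocks_;
        if (free_blocks_ > high_water_) {  // too large: return half
            std::size_t excess = free_blocks_ / 2;
            global_.give_back(excess);
            free_blocks_ -= excess;
        }
    }

    std::size_t size() const { return free_blocks_; }

private:
    GlobalPool& global_;
    std::size_t high_water_;
    std::size_t free_blocks_ = 0;
};

struct DriftResult {
    std::size_t blocks_created;      // footprint proxy
    std::size_t consumer_pool_size;  // blocks sitting in B's local pool
};

// A allocates and B frees, once per iteration.
DriftResult simulate_drift(std::size_t high_water, std::size_t iterations) {
    GlobalPool global;
    LocalPool a(global, high_water);
    LocalPool b(global, high_water);
    for (std::size_t i = 0; i < iterations; ++i) {
        a.allocate();
        b.release();
    }
    return {global.total_created(), b.size()};
}
```

Calling simulate_drift(kNoLimit, n) reproduces the drift: every block A allocates ends up parked in B's pool, so the number of blocks created grows with n. With a finite high-water mark such as simulate_drift(64, n), B's pool stays bounded and the global pool recycles the returned blocks to A.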
Conclusion
There are other problems that can arise with parallel storage allocators. For example, false sharing is a particularly pernicious problem, where two threads access independent blocks of storage that happen to lie on the same cache line, leading to a thrashing of the cache coherency protocol in the processor. A storage allocator that fails to respect cache line boundaries and gives blocks of storage that share the same cache line to different threads may induce false sharing, which is hard to detect, because the logic of the code shows that the threads are accessing independent locations.
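One way an allocator can avoid handing two threads blocks on the same cache line is to round every request up to a whole number of cache lines and align it on a line boundary. The C++ sketch below assumes a 64-byte line, which is common but not universal, and uses std::aligned_alloc from C++17. The names line_alloc, line_free, round_up_to_line, and same_cache_line are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes

// Rounds n up to the next multiple of the cache-line size.
std::size_t round_up_to_line(std::size_t n) {
    return (n + kCacheLine - 1) / kCacheLine * kCacheLine;
}

// Returns a block that starts on a cache-line boundary and occupies whole
// lines, so no other block from this allocator shares a line with it.
void* line_alloc(std::size_t size) {
    if (size == 0) size = 1;
    // std::aligned_alloc requires the size to be a multiple of the alignment.
    return std::aligned_alloc(kCacheLine, round_up_to_line(size));
}

void line_free(void* p) {
    std::free(p);  // memory from std::aligned_alloc is released with free
}

// True if the two addresses fall on the same cache line.
bool same_cache_line(const void* a, const void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / kCacheLine ==
           reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
}
```

The trade-off is internal fragmentation: an 8-byte request consumes a full line in this sketch, so a production allocator would apply such padding selectively.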
Two examples of parallel storage allocators are Hoard, written by Emery Berger of the University of Massachusetts, and the Miser allocator, distributed by Intel as part of the Cilk++ distribution.



