Intel's Hyper-Threading Strikes Back

Intel's Hyper-Threading technology reached desktop PCs with the 3.06 GHz Pentium 4 microprocessor. A few years later, the new Intel Core i7 processors offer Hyper-Threading again.

I must say that I worked with Hyper-Threading technology in the Pentium 4 era and I really liked it, so I am happy that it is back in the new microprocessors. Why? Because it usually offers a worthwhile performance improvement in applications that were parallelized with its possible presence in mind.

Without Hyper-Threading, a Core i7 microprocessor is a quad-core CPU offering four (4) physical processing cores. With Hyper-Threading enabled, it offers eight (8) logical processing cores (4 x 2 = 8). This means that an operating system prepared for SMP (symmetric multiprocessing) will see eight processors and schedule threads on all of them.
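A quick way to see what the operating system reports is to ask the runtime. The following is a minimal Java sketch; the class and method names are my own, chosen only for illustration. `Runtime.availableProcessors()` reports the processors available to the JVM, which on a Hyper-Threading machine normally means logical cores, not physical ones. The value can be lower if the process is restricted to a subset of the processors.

```java
// Minimal sketch: ask the JVM how many processors the operating system exposes to it.
// LogicalCoreCount and logicalCores() are illustrative names, not part of any library.
final class LogicalCoreCount {

    /** Returns the number of processors available to the JVM (never less than 1). */
    static int logicalCores() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("Processors visible to the JVM: " + logicalCores());
    }
}
```

On a quad-core Core i7 with Hyper-Threading enabled, I would expect this to print 8. With Hyper-Threading disabled in the BIOS, I would expect 4.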

So, here is the question: Should I use eight (8) threads, matching the total number of logical processing cores, or should I use four (4) threads, matching the total number of physical processing cores? If the application was designed with the possibility of running on a multi-core microprocessor with Hyper-Threading technology in mind, it should use eight (8) threads and there should be a significant performance improvement.
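One way to keep that decision out of the algorithm is to make the thread count a parameter. The sketch below assumes Java 8 or later and uses a trivial workload (summing 1..n) as a stand-in for real work. `ThreadCountSketch` and `parallelSum` are hypothetical names of my own. The caller decides whether to pass the physical or the logical core count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: the number of worker threads is a parameter, so the same code
// can be tried with four threads (physical cores) or eight (logical cores).
final class ThreadCountSketch {

    /** Sums 1..n by splitting the range into one contiguous chunk per thread. */
    static long parallelSum(long n, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            long chunk = (n + threads - 1) / threads; // ceiling division
            for (int t = 0; t < threads; t++) {
                final long from = t * chunk + 1;
                final long to = Math.min(n, (t + 1) * chunk);
                parts.add(pool.submit(() -> {
                    long sum = 0;
                    for (long i = from; i <= to; i++) {
                        sum += i;
                    }
                    return sum;
                }));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // wait for each chunk and combine
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling `ThreadCountSketch.parallelSum(n, Runtime.getRuntime().availableProcessors())` uses one thread per logical core. Timing the same call with half that number is a simple way to check whether a given workload actually benefits from Hyper-Threading.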

One of the techniques to parallelize an algorithm is to transform sequential code that is going to be run many times into a pipelined producer-consumer waterfall. This technique is very useful when you want to use multiple cores simultaneously but the steps do not all take the same amount of time. Therefore, you use concurrent collections or lists to create a chain of independent producer-consumers. Many modern programming languages offer high-level structures to work with pipelines, concurrent collections or lists. Some will offer them in future versions (like Java with JDK 7 and C# with .NET 4.0 and its Parallel Extensions). Hence, it will be simpler than ever to create these complex but highly scalable pipelines.
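To make the idea concrete, here is a minimal sketch of such a waterfall in Java, using `BlockingQueue` from `java.util.concurrent` as the concurrent collection between steps. It assumes Java 8 or later for the lambda syntax. The three stages (generate, square, collect), the class name `PipelineSketch` and the `POISON` end-of-stream marker are all illustrative choices of mine, not a prescribed design.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of a three-stage producer-consumer waterfall. Each stage runs on its
// own thread and hands its output to the next stage through a bounded queue.
final class PipelineSketch {
    // Hypothetical end-of-stream marker; real data must never contain this value.
    private static final int POISON = Integer.MIN_VALUE;

    /** Generates 1..count, squares each value, and returns the results in order. */
    static List<Integer> run(int count) {
        BlockingQueue<Integer> raw = new ArrayBlockingQueue<>(16);
        BlockingQueue<Integer> squared = new ArrayBlockingQueue<>(16);
        List<Integer> results = new ArrayList<>();

        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= count; i++) {
                    raw.put(i);
                }
                raw.put(POISON); // tell the next stage that no more data is coming
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread transformer = new Thread(() -> {
            try {
                for (int v = raw.take(); v != POISON; v = raw.take()) {
                    squared.put(v * v); // consumer of "raw", producer of "squared"
                }
                squared.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread collector = new Thread(() -> {
            try {
                for (int v = squared.take(); v != POISON; v = squared.take()) {
                    results.add(v);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        transformer.start();
        collector.start();
        try {
            producer.join();
            transformer.join();
            collector.join(); // after join, the results list is safe to read
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return results;
    }
}
```

The bounded queues keep a fast stage from running far ahead of a slow one. Because each stage is a single thread, the order of the items is preserved from one end of the pipeline to the other.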

If you run this kind of application on a microprocessor with Hyper-Threading technology, you will be able to see a great performance improvement. Why? Because the two logical cores offered by each physical core allow you to run twice the number of threads, and this allows the pipeline to have more operations running in parallel. Besides, it helps to absorb the asymmetries generated by the complexity of the algorithm.
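One common way to use the extra threads against such an asymmetry is to let several workers serve the slowest step from the same queue. The Java sketch below (Java 8 or later) shows only that one step. The class name `UnevenStageSketch`, the `STOP` marker and the squaring operation that stands in for the expensive work are assumptions made for illustration.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: one slow pipeline step served by several worker threads that all
// take their input from the same queue.
final class UnevenStageSketch {
    // Hypothetical end-of-stream marker; the inputs in this sketch are positive.
    private static final int STOP = -1;

    /** Feeds 1..count through the step using `workers` threads; returns the sum of squares. */
    static long run(int count, int workers) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        AtomicLong total = new AtomicLong();
        Thread[] pool = new Thread[workers];

        for (int w = 0; w < workers; w++) {
            pool[w] = new Thread(() -> {
                try {
                    for (int v = queue.take(); v != STOP; v = queue.take()) {
                        total.addAndGet((long) v * v); // stand-in for the expensive step
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            pool[w].start();
        }

        try {
            for (int i = 1; i <= count; i++) {
                queue.put(i);
            }
            for (int w = 0; w < workers; w++) {
                queue.put(STOP); // one marker per worker so every thread terminates
            }
            for (Thread t : pool) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return total.get();
    }
}
```

With several workers on one step, items may finish out of order. That is fine for an order-independent result like this sum, but a pipeline that must preserve order would need sequence numbers or a reordering step.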

Hyper-Threading technology works great when the data necessary for each pair of threads is available in the cache shared by each pair of logical cores (that is, by the same physical core). A pipelined producer-consumer waterfall usually has the data in cache to begin working in the next producer-consumer step. However, the data flow decomposition used in this kind of design requires special care to eliminate startup and shutdown latencies. The good news is that Hyper-Threading technology helps with this problem and allows the code to run more concurrently. Hence, it can show performance improvements of more than 50% using one thread per logical core (eight threads on a quad-core i7 processor). Thus, it is nice to see Hyper-Threading is back. Good news for the parallelization fans.

If you are interested in taking full advantage of Hyper-Threading, I recommend two articles about pipeline design:
* Fundamental Concepts of Parallel Programming by Shameem Akhter and Jason Roberts.
* The Challenges of Developing Multithreaded Processing Pipelines by Ryan Bloom.
