TMonitor: Understanding What Happens With Each Hardware Thread

TMonitor, a new tool developed by the CPUID team, lets you see what is going on with each hardware thread (logical core) on some modern multicore microprocessors.

Very old hardware running a single sequential application was easier to understand than modern hardware running many applications that execute dozens of software threads distributed across the available hardware threads (logical cores).

Parallelized code creates many tasks (the packages), which are distributed, often through work stealing, among many threads (the cars), running on hardware threads (logical cores, the lanes). Packages, cars and lanes make it simpler to explain the main software and hardware layers involved in the execution of parallelized code. In reality, there is much more to it than this. However, I'll keep the focus on the packages, the cars and the lanes.

How long is it going to take to travel 800 miles with 4 packages, using 4 cars (1 package in each car)? It depends on three variables: the number of available lanes, the cars' maximum speed and the maximum speed limit for each lane. I'll assume the cars' drivers are going to respect the maximum speed limits. It also depends on other variables. However, I'll keep the focus on these three variables.
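The arithmetic behind that question can be sketched in a few lines of Python. This is only a toy model, and its rules are assumptions made for the example: each car keeps one lane for the whole trip, travels at the lower of its own top speed and the lane's limit, and when there are fewer lanes than cars, the cars leave in successive waves. The function name travel_hours and the speeds used are hypothetical.

```python
import math


def travel_hours(distance_miles, cars, lanes, car_max_mph, lane_limit_mph):
    """Hours until the last car arrives, under the simplified wave model."""
    if cars < 0 or lanes <= 0 or min(car_max_mph, lane_limit_mph) <= 0:
        raise ValueError("cars must be >= 0; lanes and speeds must be positive")
    # Drivers respect the limit, so a car never exceeds the lane's speed limit.
    speed = min(car_max_mph, lane_limit_mph)
    # With fewer lanes than cars, the cars have to travel in successive waves.
    waves = math.ceil(cars / lanes)
    return waves * distance_miles / speed


# 800 miles, 4 packages in 4 cars; all speeds are illustrative values.
print(travel_hours(800, cars=4, lanes=4, car_max_mph=100, lane_limit_mph=80))   # 10.0
print(travel_hours(800, cars=4, lanes=2, car_max_mph=100, lane_limit_mph=80))   # 20.0
print(travel_hours(800, cars=4, lanes=4, car_max_mph=100, lane_limit_mph=100))  # 8.0
```

Halving the lanes doubles the trip time, while raising the lane limit from 80 to 100 mph only helps because the cars are able to go that fast. Real hardware is far less tidy than this model.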

There are many problems:

  • Some lanes aren't completely independent lanes. They share some regions with other lanes (Hyper-Threading technology).
  • The maximum speed limit for each lane could be reduced or increased (Energy saving schemes, Enhanced Intel SpeedStep Technology and Intel Turbo Boost Technology among others).
  • The cars' speed isn't constant. It changes because the cars' drivers find some traffic jams on the roads (operating system's scheduler decisions, sharing hardware resources, concurrency problems, inefficient code and I/O bottlenecks among others).

As you may guess, these problems happen in nanoseconds. Therefore, it is very important to understand modern parallel hardware in order to create efficient parallelized code.

TMonitor, a new tool developed by the CPUID team and still in beta, displays the active clock of each individual hardware thread (logical core) of a multicore microprocessor. It draws a graph showing the maximum speed limit for each lane, as shown in the following picture for a quad-core microprocessor (four physical cores without Hyper-Threading technology, four logical cores, four hardware threads):

TMonitor displaying 4 idle hardware threads (all frequencies = 2,400 MHz = 2.4 GHz).

TMonitor uses a very high refresh rate (20 times per second), so it lets you see small clock variations for each hardware thread in real time.
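To get a feel for what a refresh rate of 20 times per second means, here is a minimal Python sketch of a polling loop that takes one reading every 50 milliseconds. It is not how TMonitor is implemented. The names sample_clocks and fake_reader are hypothetical, and the reader returns fixed values instead of querying real hardware.

```python
import time


def sample_clocks(read_mhz, samples, rate_hz=20.0, sleep=time.sleep):
    """Collect `samples` readings of per-thread clocks at roughly `rate_hz`."""
    interval = 1.0 / rate_hz  # 20 samples per second -> 0.05 s between samples
    history = []
    for i in range(samples):
        history.append(list(read_mhz()))
        if i < samples - 1:
            sleep(interval)  # wait before taking the next reading
    return history


def fake_reader():
    """Hypothetical stand-in for a real per-thread frequency source (MHz)."""
    return [2400, 2400, 2800, 2400]


history = sample_clocks(fake_reader, samples=3)
print(len(history), history[0])  # 3 [2400, 2400, 2800, 2400]
```

The sleep function is passed in as a parameter so the loop can be exercised without waiting. A real monitor would replace fake_reader with a platform-specific source of per-thread frequencies.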

TMonitor displaying 1 hardware thread with its increased clock (one of the frequencies = 2,800 MHz = 2.8 GHz).

TMonitor displaying 4 hardware threads with their increased clocks (all frequencies = 2,800 MHz = 2.8 GHz).

It can show you what's going on with the hardware threads. You can see the different maximum speed limits while parallelized applications are running. One of its interesting features is the possibility to detect Intel's Turbo Boost activation for each hardware thread.
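Using the frequencies from the screenshots described above (2,400 MHz idle, 2,800 MHz with the increased clock), a rough Python sketch can flag which hardware threads are running above the nominal clock and by how much. The function boosted_threads and its tolerance value are assumptions made for illustration. The sketch simply treats any clock clearly above nominal as boosted, which is not necessarily how TMonitor itself detects Turbo Boost.

```python
def boosted_threads(clocks_mhz, base_mhz, tolerance_mhz=10.0):
    """Return {thread_index: percent_above_base} for threads above the base clock."""
    boosted = {}
    for index, mhz in enumerate(clocks_mhz):
        # The tolerance (an assumed value) ignores tiny measurement jitter.
        if mhz > base_mhz + tolerance_mhz:
            boosted[index] = round((mhz - base_mhz) / base_mhz * 100.0, 1)
    return boosted


# One thread boosted, as in the second screenshot.
print(boosted_threads([2400, 2800, 2400, 2400], base_mhz=2400))  # {1: 16.7}
# No thread boosted, as in the idle screenshot.
print(boosted_threads([2400, 2400, 2400, 2400], base_mhz=2400))  # {}
```

Going from 2,400 MHz to 2,800 MHz raises the speed limit of a lane by about 16.7 percent, enough to change the results of a benchmark between two runs of the same code.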

The application is very simple to download and run. It comes in both 32-bit and 64-bit versions for Windows. This beta version has some limitations. It works only on Intel Core 2 and Core i3, i5 and i7 microprocessors. However, taking into account the other excellent free tools developed by the CPUID team, you can expect support for many other microprocessors soon.

The next time you want to understand what's going on with your parallelized code, you can use TMonitor to get more information about the underlying hardware. This way, you'll be able to understand why some small changes in the code can produce very different performance results.

Don't forget about maximum speed limits, packages, cars and lanes.

Real World Parallelism Webinar Series
  • November 17, 2009
    Visual Effects for Animation - presented by DreamWorks Animation
    Speaker: Ron Henderson

    Ron Henderson manages the FX Tools group at DreamWorks Animation, where he is responsible for developing physical simulation and procedural modeling tools. These systems have been used for key visual effects in recent films such as Kung Fu Panda and Monsters vs. Aliens (March 2009).

    Prior to joining DreamWorks in 2002 he was a senior scientist at Caltech with a joint appointment to the Applied Math and Aeronautics departments, where he worked on efficient techniques for the direct numerical simulation of fluid turbulence.

    Abstract:
    In this webinar, Ron Henderson will show examples of visual effects, from hair and feathers to smoke and fire, from a variety of DreamWorks Animation feature films. He will discuss in general terms the kinds of techniques used to achieve particular visual effects. Finally, Henderson will show a detailed breakdown of the dam-breaking scene from Madagascar: Escape 2 Africa, demonstrating how different elements of key frame animation, simulation, and rendering are combined in a real production shot.

  • December 1, 2009
    A Quick and Easy Way to Parallelize a Legacy Codebase with Intel® Threading Building Blocks (TBBs)
    Speaker: Bernard Laberge, Avid, Senior Principal Engineer

    Bernard Laberge is a senior principal engineer in the video editors division at Avid. During his seven years with the company he has been actively involved in the replacement of the legacy video processing engines used by Avid editors with a common hardware-abstracted, component-based video processing engine currently running on the CPU with SIMD optimized code, GPU, and dedicated hardware.

    Abstract:
    Learn how to overcome the limitations of a thread-based scheduler, including dealing with the absence of recursive parallelism support and the inefficient handling of unbalanced processing load. Bernard Laberge addresses how Avid resolved the expensive refactoring of their thread-based scheduler into a task-based solution by choosing Intel® Threading Building Blocks (TBBs). He explores how Avid was able to easily integrate the Intel TBBs into their video editor applications and more than 5 million lines of code.

  • December 15, 2009
    How to Use Intel® Parallel Studio to Streamline Code Development in a Multicore Environment
    Speaker: Matt Dunbar, Director for Performance Technology, SIMULIA

    Matt Dunbar is the director for performance technology at SIMULIA. Since joining the company in 1993, he has worked on parallelization of the Abaqus suite of products, initially for shared memory architectures and more recently for distributed memory architectures. Dunbar has also been intimately involved in selecting both the hardware and software tools used in the development of the Abaqus product line.

    Abstract:
    Resolve elusive, costly multithreading errors quickly and efficiently with Intel® Parallel Studio. While many coding problems that lead to bugs in software applications are typically straightforward logic errors, errors in managing memory and in multithreading code can sometimes take weeks to months to diagnose and fix. Matt Dunbar explores how and why taking advantage of multicore processors through multithreaded code is critical for compute-intensive applications. While spotlighting his work on SIMULIA's Abaqus finite element solver, Dunbar addresses the need for multicore execution and shares his experiences using Intel Parallel Studio to streamline code development in a multicore environment.