Coreinfo v2.0: A Simple Utility to Understand the Manycore Complexity in Windows

Windows Server 2008 R2 and the 64-bit versions of Windows 7 offer new NUMA (Non-Uniform Memory Access) support. It is therefore important for Windows developers to understand the differences in the underlying multicore and manycore hardware. Coreinfo is a simple yet powerful command-line utility that shows information about the processors, their organization, and the cache topology.

A few days ago, Mark Russinovich, a well-known member of the Windows Sysinternals team, made version 2.0 of Coreinfo available for download.

This command-line utility runs on most modern Windows versions and displays the mapping between logical cores (logical processors or hardware threads) and physical cores. It also shows information about NUMA nodes, groups, sockets, and all the cache levels. When you benchmark multicore performance, the large differences between multicore architectures can make it difficult to tune an application for a specific one. With Coreinfo, you can save the details of the underlying hardware before running your benchmarks and performance tests.

The new version supports Windows Server 2008 R2 systems with more than 64 logical processors (logical cores or hardware threads). It is also compatible with IA-64 architectures.
You don't need to run an installer. Just unzip the executable file and run it from the command line.

The utility uses the GetLogicalProcessorInformation Windows API function to obtain all the information displayed on the screen. Therefore, you can also obtain this information in your applications to tune performance according to the underlying hardware architecture. In fact, if you plan to create applications targeting manycore systems with multiple NUMA nodes, you'll have to take into account the detailed cache topology if you want to exploit the underlying hardware.

The results of running Coreinfo v2.0 on an Intel Atom N270 powered netbook are the following:
Logical to Physical Processor Map:
** Physical Processor 0 (Hyperthreaded)
Logical Processor to Socket Map:
** Socket 0
Logical Processor to NUMA Node Map:
** NUMA Node 0
Logical Processor to Cache Map:
** Data Cache 0, Level 1, 24 KB, Assoc 6, LineSize 64
** Instruction Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
** Unified Cache 0, Level 2, 512 KB, Assoc 8, LineSize 64

There is just one physical core. However, because this CPU offers Hyper-Threading technology, Coreinfo marks it as Hyperthreaded and maps both logical processors (the two asterisks) to Physical Processor 0.

The results of running Coreinfo v2.0 on an Intel Core 2 Duo P8600 powered notebook are the following:
Logical to Physical Processor Map:
*- Physical Processor 0
-* Physical Processor 1
Logical Processor to Socket Map:
** Socket 0
Logical Processor to NUMA Node Map:
** NUMA Node 0
Logical Processor to Cache Map:
*- Data Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
*- Instruction Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
-* Data Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
-* Instruction Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
** Unified Cache 0, Level 2, 3 MB, Assoc 12, LineSize 64

Coreinfo uses an asterisk "*" to represent a mapping. In this case, there are two physical cores and two logical cores because the CPU doesn't offer Hyper-Threading technology. There is also a unified 3 MB Level 2 cache. Both physical cores share this cache, so Coreinfo shows two asterisks "**" on the left side of the last line. This means that the cache is mapped to both processors:
*-=Physical Processor 0
-*=Physical Processor 1

Therefore, ** means Physical Processor 0 and Physical Processor 1.
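
The mapping column can also be decoded in a script. The following Python sketch assumes that each mapping line starts with a run of asterisks and hyphens, followed by a space and a description, as in the listings above. The function name decode_map is hypothetical and not part of Coreinfo.

```python
def decode_map(line):
    """Split a Coreinfo mapping line into (logical processor indices, description)."""
    mask, _, description = line.strip().partition(" ")
    # Reject headings such as "Logical Processor to Socket Map:" and empty lines.
    if not mask or set(mask) - {"*", "-"}:
        raise ValueError(f"not a mapping line: {line!r}")
    # Position i in the mask corresponds to logical processor i.
    indices = [i for i, ch in enumerate(mask) if ch == "*"]
    return indices, description.strip()
```

For example, decode_map("** Socket 0") returns ([0, 1], "Socket 0"), and a "Physical Processor" line that decodes to more than one index indicates a core with Hyper-Threading.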

The results of running Coreinfo v2.0 on an Intel Core 2 Quad Q6600 powered workstation are the following:
Logical to Physical Processor Map:
*--- Physical Processor 0
-*-- Physical Processor 1
--*- Physical Processor 2
---* Physical Processor 3
Logical Processor to Socket Map:
**** Socket 0
Logical Processor to NUMA Node Map:
**** NUMA Node 0
Logical Processor to Cache Map:
*--- Data Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
*--- Instruction Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
-*-- Data Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
-*-- Instruction Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
**-- Unified Cache 0, Level 2, 4 MB, Assoc 16, LineSize 64
--*- Data Cache 2, Level 1, 32 KB, Assoc 8, LineSize 64
--*- Instruction Cache 2, Level 1, 32 KB, Assoc 8, LineSize 64
---* Data Cache 3, Level 1, 32 KB, Assoc 8, LineSize 64
---* Instruction Cache 3, Level 1, 32 KB, Assoc 8, LineSize 64
--** Unified Cache 1, Level 2, 4 MB, Assoc 16, LineSize 64
Logical Processor to Group Map:
**** Group 0

In this case, there are four physical cores and four logical cores because the CPU doesn't offer Hyper-Threading technology. There are also two unified 4 MB Level 2 caches. Each pair of physical cores shares one of these caches, so Coreinfo shows asterisks to identify the processors mapped to each cache:
*---=Physical Processor 0
-*--=Physical Processor 1
--*-=Physical Processor 2
---*=Physical Processor 3

Therefore, **-- means Physical Processor 0 and Physical Processor 1, and --** means Physical Processor 2 and Physical Processor 3.

These are the two lines that display the information about each unified cache mapped to each pair of physical processors:
**-- Unified Cache 0, Level 2, 4 MB, Assoc 16, LineSize 64
--** Unified Cache 1, Level 2, 4 MB, Assoc 16, LineSize 64
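
If you save the Coreinfo output, you can extract which processors share a cache. The following Python sketch assumes the cache line format shown in the listings above (map, cache type and number, level, size in KB or MB, associativity, line size). The names parse_cache_line and shared_caches are hypothetical helpers, not part of Coreinfo.

```python
import re

# Matches cache lines such as:
# "**-- Unified Cache 0, Level 2, 4 MB, Assoc 16, LineSize 64"
CACHE_LINE = re.compile(
    r"^(?P<mask>[*-]+)\s+(?P<name>(?:Data|Instruction|Unified) Cache\s+\d+),\s+"
    r"Level\s+(?P<level>\d+),\s+(?P<size>\d+)\s+(?P<unit>KB|MB),\s+"
    r"Assoc\s+(?P<assoc>\d+),\s+LineSize\s+(?P<line_size>\d+)$"
)


def parse_cache_line(line):
    """Return a dict describing one cache line of output, or None if it is not one."""
    match = CACHE_LINE.match(line.strip())
    if match is None:
        return None
    factor = 1024 if match["unit"] == "MB" else 1
    return {
        "name": " ".join(match["name"].split()),
        "level": int(match["level"]),
        "size_kb": int(match["size"]) * factor,
        "processors": [i for i, ch in enumerate(match["mask"]) if ch == "*"],
    }


def shared_caches(lines, level=2):
    """Map each cache of the given level to the processors that share it."""
    shared = {}
    for line in lines:
        cache = parse_cache_line(line)
        if cache and cache["level"] == level and len(cache["processors"]) > 1:
            shared[cache["name"]] = cache["processors"]
    return shared
```

With the Core 2 Quad Q6600 listing, shared_caches groups processors 0 and 1 under Unified Cache 0 and processors 2 and 3 under Unified Cache 1. A benchmark could use this grouping to decide which threads to place on cores that share a Level 2 cache.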

In the previous examples, there is just one NUMA node. The following is part of the output of running Coreinfo v2.0 on a server with a NUMA architecture, powered by two quad-core AMD Opteron 2379 HE microprocessors:
Logical to Physical Processor Map:
*------- Physical Processor 0
-*------ Physical Processor 1
--*----- Physical Processor 2
---*---- Physical Processor 3
----*--- Physical Processor 4
-----*-- Physical Processor 5
------*- Physical Processor 6
-------* Physical Processor 7
Logical Processor to Socket Map:
****---- Socket 0
----**** Socket 1
Logical Processor to NUMA Node Map:
****---- NUMA Node 0
----**** NUMA Node 1

In this case, Coreinfo shows that each socket is its own NUMA node: logical processors 0 to 3 belong to NUMA Node 0, and logical processors 4 to 7 belong to NUMA Node 1.
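
A script can turn this part of the output into a lookup table from logical processor to NUMA node. The following Python sketch assumes the section heading and line format shown in the listing above. The function name numa_nodes is hypothetical.

```python
def numa_nodes(output):
    """Map each logical processor index to its NUMA node number."""
    nodes = {}
    in_section = False
    for raw in output.splitlines():
        line = raw.strip()
        if line.endswith("Map:"):
            # Track whether we are inside the NUMA section of the output.
            in_section = line == "Logical Processor to NUMA Node Map:"
            continue
        if not in_section or not line:
            continue
        mask, _, name = line.partition(" ")
        if set(mask) - {"*", "-"}:
            continue  # Skip anything that is not a mapping line.
        # The description looks like "NUMA Node 1"; the last token is the node number.
        node = int(name.rsplit(" ", 1)[1])
        for index, ch in enumerate(mask):
            if ch == "*":
                nodes[index] = node
    return nodes
```

For the Opteron listing, the result maps logical processors 0 to 3 to node 0 and logical processors 4 to 7 to node 1.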

Because it is a command-line utility, it is simple to run it and redirect its output to a text file. For example:

coreinfo > cpudetails.txt

This command saves all the information to the cpudetails.txt file.
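
You can take the same snapshot from a benchmark script before the tests start. The following Python sketch assumes that the coreinfo executable is on the PATH. The function name save_hardware_snapshot and its parameters are hypothetical.

```python
import subprocess


def save_hardware_snapshot(path, command=("coreinfo",)):
    """Run `command` and write its standard output to the file at `path`."""
    # check=True raises CalledProcessError if the command exits with an error.
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(result.stdout)
    return result.stdout
```

Calling save_hardware_snapshot("cpudetails.txt") before a benchmark run keeps the hardware description next to the results. The command parameter lets you substitute a different executable path or add options.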

The application offers several parameters to select the information to dump:
-c Dump information on cores.
-g Dump information on groups.
-l Dump information on caches.
-n Dump information on NUMA nodes.
-s Dump information on sockets.
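
If you call Coreinfo from a script, a small helper can translate readable section names into the parameters listed above. The following Python sketch assumes that several parameters can be combined in one invocation. The section names and the build_command function are hypothetical.

```python
# Hypothetical section names mapped to the Coreinfo parameters listed above.
SECTION_FLAGS = {
    "cores": "-c",
    "groups": "-g",
    "caches": "-l",
    "numa": "-n",
    "sockets": "-s",
}


def build_command(sections, executable="coreinfo"):
    """Build an argument list that dumps only the requested sections."""
    unknown = [name for name in sections if name not in SECTION_FLAGS]
    if unknown:
        raise ValueError(f"unknown sections: {unknown}")
    return [executable] + [SECTION_FLAGS[name] for name in sections]
```

For example, build_command(["cores", "caches"]) returns ["coreinfo", "-c", "-l"], which can be passed to a process launcher such as subprocess.run.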

You can download Coreinfo v2.0 from the Windows Sysinternals website.
