Scaling Ambient Animations

Putting threaded animation to work

Mike Yi is a software engineer in the Intel Visual Computing Software Division. Orion Granatir is a senior software engineer in the Intel Visual Computing Software Division.


Game developers want to deliver the best experience possible for each player, but they also want a game that is fair to all players. A higher-performing machine for one player can and should lead to a better game experience, but not a gameplay advantage in a multi-player situation. Many solutions to this dilemma exist; one approach is to use the extra power to render more frames. Another approach is to take incidental effects and amplify them on multi-core machines. This leaves gameplay consistent across all computer platforms, but rewards those with higher-end systems.

This article introduces a demo called Horsepower that shows such a technique: enhanced ambient animation when run on a multicore CPU. The source code for Horsepower is free to use and improve.

Horsepower started with the code base of the Intel Smoke demo, which features a multi-threaded game framework with tremendous flexibility. Smoke is a free application and its source code can be downloaded. Most of the existing Smoke systems were carried over to Horsepower. Smoke already contained a threaded AI system so this capability was migrated to Horsepower. The Smoke framework also utilizes Havok Physics, which is threaded, and Horsepower benefits from this as well.

Figure 1: On an Intel Core i7 processor-based system using eight threads, we can maintain almost 600 horses.

Figure 2: On the same Intel Core i7 processor-based system using four threads (which means no Intel Hyper-Threading Technology), we drop to 570 horses. Simultaneous multi-threading therefore gives us about 20 additional horses by adding logical cores, without adding any physical cores.

Figure 3: When we drop to two threads, and therefore are only using two of the four physical cores and no Intel Hyper-Threading Technology, we can maintain roughly 235 horses.

Horsepower was created to showcase a perceptible difference for users with a multicore CPU. By threading to take advantage of multicore CPU power, a few horses running in a field are multiplied to hundreds of horses (Figures 1-3). The demo dynamically adjusts the number of horses drawn on screen to maintain a consistent frame rate of 30 frames per second (fps). When the frame rate is below the target, it removes horses until 30 fps is reached and held; when there is headroom, it adds horses until the frame rate settles at the 30 fps target. Although the horses are the main animated objects in this demo, you can use this technique on any ambient animations in a game for a better game experience.
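The feedback loop described above can be sketched in a few lines of C++. This is a minimal illustration under stated assumptions, not the Horsepower source: the class name HorseCountController, the one-horse step, and the 5 percent dead band around the target are all hypothetical choices made for the example.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical controller (not from the Horsepower source): nudges the number
// of animated objects up or down so the measured frame rate stays near a target.
class HorseCountController {
public:
    HorseCountController(int minCount, int maxCount, double targetFps)
        : minCount_(minCount), maxCount_(maxCount),
          targetFps_(targetFps), count_(minCount) {
        assert(minCount >= 0 && minCount <= maxCount && targetFps > 0.0);
    }

    // Call once per frame with the measured frame time in seconds.
    // Returns the number of objects to draw on the next frame.
    int update(double frameSeconds) {
        assert(frameSeconds > 0.0);
        const double fps = 1.0 / frameSeconds;
        const double tolerance = 0.05;  // assumed dead band to avoid oscillation
        const int step = 1;             // assumed adjustment per frame

        if (fps < targetFps_ * (1.0 - tolerance)) {
            count_ -= step;             // too slow: draw fewer objects
        } else if (fps > targetFps_ * (1.0 + tolerance)) {
            count_ += step;             // headroom: draw more objects
        }
        count_ = std::max(minCount_, std::min(count_, maxCount_));
        return count_;
    }

    int count() const { return count_; }

private:
    int minCount_;
    int maxCount_;
    double targetFps_;
    int count_;
};
```

A game would construct one controller, for example HorseCountController(1, 1000, 30.0), and pass it each frame's measured duration. A real implementation would probably average the frame time over several frames before reacting, so that a single slow frame does not remove objects.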

In this article, we introduce the technique of threading an animation system, and you'll see how to apply it to your ambient animations. We'll show you the code we built to explore the idea, show how we changed an animation system so it would scale as desired, and describe some issues you might find as you try this in your game. Your gamers will thank you for providing a richer gaming experience!

To achieve this enhanced performance, the entire animation process must be highly parallel. The code uses OGRE 3D, with a custom-threaded version of OGRE's animation system. Although some documentation exists that details the threaded animation system already available in OGRE, the performance of the animation system was not acceptable for this demo. Optimization of the existing code was necessary.

Figure 4: OGRE animation system, single-threaded case.

Let's start with a simple overview of how animation normally works in OGRE, shown in Figure 4.

The demo calls renderOneFrame in OGRE once per frame. This loops through all the entities in the scene and updates the animation by calling updateAnimation. When OGRE runs updateAnimation, it calculates all the vertex positions based on the current frame of animation. This update includes calculating bone positions, blending animations, and applying weight maps.

OGRE updates all of the entities in serial order, using a single thread as illustrated in Figure 5. This presents a unique opportunity to increase performance by introducing multi-threading. OGRE's animation is ideal for threading because:

  • There are many calculations
  • An entity's animation doesn't affect other entities
  • The work can be easily pulled out of OGRE

For Horsepower, OGRE's animation system was modified to distribute the updates of all the entities across multiple threads (the work shown in the OGRE block in Figure 4), increasing performance. All of the "adjustments" (changing or advancing animations) could also be threaded. Fortunately, because Horsepower is built on top of Smoke, it could reuse the threading support that was already in place.
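The distribution of entity updates across threads can be sketched in standard C++. This is an illustration rather than the Horsepower code: AnimatedEntity and updateAllParallel are hypothetical names, the body of updateAnimation is a placeholder for the real vertex work, and plain std::thread stands in for the scheduling that the Smoke framework provides. The sketch relies on the property listed above, that one entity's animation does not affect another's, so each thread can own a contiguous slice of entities without locking.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for an animated entity; this is NOT OGRE's Entity class.
struct AnimatedEntity {
    float time = 0.0f;
    int updates = 0;

    // Placeholder for the per-entity work (bones, blending, vertex positions).
    void updateAnimation(float dt) {
        time += dt;
        ++updates;
    }
};

// Hypothetical helper: splits the entities into contiguous chunks and updates
// each chunk on its own thread. No locking is needed because every entity is
// touched by exactly one thread.
void updateAllParallel(std::vector<AnimatedEntity>& entities, float dt,
                       unsigned workerCount) {
    if (workerCount == 0) workerCount = 1;
    const std::size_t n = entities.size();
    const std::size_t chunk = (n + workerCount - 1) / workerCount;  // ceil(n / workers)

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < workerCount; ++w) {
        const std::size_t begin = static_cast<std::size_t>(w) * chunk;
        if (begin >= n) break;
        const std::size_t end = std::min(begin + chunk, n);
        workers.emplace_back([&entities, begin, end, dt] {
            for (std::size_t i = begin; i < end; ++i) {
                entities[i].updateAnimation(dt);
            }
        });
    }
    for (std::thread& t : workers) t.join();  // wait before rendering the frame
}
```

Creating threads every frame is wasteful; a persistent worker pool or task scheduler would normally replace the std::thread calls. The chunking logic stays the same either way.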

Figure 5: In single-threaded mode, we can maintain only about 87 horses.

Figure 6 shows how things look once the update is parallelized.

Since the Adjust step is already parallel, the demo calls updateAnimation directly for each entity after it adjusts the animation. Previously, Adjust happened after the update, effectively moving the adjustments to the next frame. In the threaded case, Adjust needs to be called prior to updateAnimation, or there will be nothing to update and the call will return early. Because there are so many animated objects, both orderings produce the same net effect on screen, so moving Adjust ahead of the update causes no noticeable difference.
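The ordering constraint can be illustrated with a small C++ sketch. The Horse type, its dirty flag, and adjustThenUpdate are hypothetical; in particular, the dirty flag is an assumption used here to model "nothing to update", not a description of how OGRE decides to return early.

```cpp
#include <cassert>

// Hypothetical entity used only to illustrate call ordering.
struct Horse {
    float animTime = 0.0f;
    bool dirty = false;   // assumed marker: set when the animation state changes
    int skinned = 0;      // counts how often the expensive update actually ran

    // "Adjust": advance or change the animation, which creates work to do.
    void adjust(float dt) {
        animTime += dt;
        dirty = true;
    }

    // Returns false (early out) when nothing changed since the last update.
    bool updateAnimation() {
        if (!dirty) return false;
        ++skinned;        // placeholder for recomputing vertex positions
        dirty = false;
        return true;
    }
};

// Per-entity task body: adjust first, then update, both on the same thread.
inline bool adjustThenUpdate(Horse& h, float dt) {
    h.adjust(dt);
    return h.updateAnimation();
}
```

Calling updateAnimation before adjust in this model returns false and does no work, which mirrors the early return described above. Running both steps in one task also keeps an entity's data on a single thread for the whole frame.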

When first implemented, this change crashed the demo for two reasons:

  • The demo accessed OGRE and DirectX from multiple threads.
  • OGRE does not support multiple readers of a hardware buffer.

While exploring solutions to these problems, we discovered that OgreConfig.h contains a pre-processor macro called OGRE_THREAD_SUPPORT. If OGRE_THREAD_SUPPORT is defined, OGRE supports multi-threaded access and also initializes DirectX in multi-thread mode (the DX device is created with the D3DCREATE_MULTITHREADED flag). Defining this macro resolved the first issue.

Resolving the second issue required more insight. In Horsepower, all of the horses are based on the same mesh. The data for the mesh are loaded only once for all of the horses. To animate the horses, each entity has to access this shared mesh data. Because DirectX supports multiple readers and OGRE does not, OgreHardwareBuffer needed to be modified to support this functionality. The OgreHardwareBuffer changes can be viewed in the source code's OgreHardwareBuffer.h file (in the path code\extern\Ogre1_9\OgreMain\include). Those two changes to OGRE were sufficient to enable threaded animation in the demo.
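One way to picture multiple-reader support is a buffer that counts read-only locks instead of tracking a single locked state. The sketch below is an assumption about the general idea only; it is not the change made in OgreHardwareBuffer.h, and SharedReadBuffer with its lockForRead and unlockRead methods is a hypothetical class, not part of OGRE or DirectX. It is safe only because the data is never written after construction.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical read-only buffer shared by many animated entities.
class SharedReadBuffer {
public:
    explicit SharedReadBuffer(std::vector<float> data) : data_(std::move(data)) {}

    // Any number of threads may hold a read lock at the same time.
    const float* lockForRead() {
        readers_.fetch_add(1, std::memory_order_acquire);
        return data_.data();
    }

    // Every lockForRead must be paired with exactly one unlockRead.
    void unlockRead() {
        const int previous = readers_.fetch_sub(1, std::memory_order_release);
        assert(previous > 0);  // catches an unlock without a matching lock
        (void)previous;
    }

    int activeReaders() const { return readers_.load(std::memory_order_acquire); }
    std::size_t size() const { return data_.size(); }

private:
    std::vector<float> data_;      // immutable after construction
    std::atomic<int> readers_{0};  // number of outstanding read locks
};
```

With a single boolean lock flag, the second horse to read the shared mesh would be rejected while the first still held its lock. A counter lets all readers proceed. A buffer that also needs to be written would require a real reader-writer lock, for example std::shared_mutex in C++17.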

Horsepower uses a unique performance metric: horse count. On any system, with a locked frame rate of 30 fps, the number of horses displayed indicates the relative performance of the system at hand. On an Intel Core i7 processor with eight-thread capability, the data shown in Figure 7 was captured.

Possible future work on the Horsepower demo includes:

  • Instanced horses: Horsepower currently shares the same mesh data, but each object is its own entity with its own animation vertices; performance could be improved with instancing.
  • An optimized level of detail (LOD) system: LOD was removed for performance reasons; an optimized LOD system could improve performance further and allow a higher level of detail.
  • Possibly turning it into a herding game. The framework already has "fear" programmed into the demo (from its Smoke roots). Have the horses fear the camera and create a pen into which the horses are herded.
  • The possibilities are endless, and with the source code free for use, anyone can try anything out!

Horsepower's primary goal is to show a fair, perceptible difference in effects through the use of threaded animation. We hope this example will encourage developers to make use of the extra compute power on multi-core CPUs in real PC games.
