
Managing Multi-Core Projects, Part 1: HPC on the Parallelism Frontier



HPC was parallel when parallel wasn't cool. In high-performance computing, where developers have long experience with parallel computing and where large clusters are often the target platform, the multi-core driven concurrency revolution isn't catching anyone by surprise. From these pioneers, we can learn that parallelism makes a competitive difference -- and that it doesn't happen overnight.

In HPC, as in other software, all signs point to increasing parallelism as the surest path to improved system performance and to competitive advantage. Just as you warm up to programming SMP within the nodes of a cluster, asymmetry among core types adds further complexity. But whatever the architecture of the next-generation system, whether SMP (symmetric multiprocessing) based on multi-core processors, FPGA supercomputers, hybrid GPU designs, or other asymmetric configurations, all can be approached with proven principles for managing parallel software projects.

Parallelism is a defining feature of HPC, as are large data sets and long run times, sometimes measured in days or weeks. Typical HPC applications divide the dataset among multiple processors, achieving parallelism through data decomposition. Data decomposition can be an effective technique in games and video applications as well.
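
To make the idea concrete, here is a minimal data-decomposition sketch in C with OpenMP (the dataset and its size are illustrative assumptions, not from any particular HPC code): each thread sums its slice of the array, and a reduction combines the partial results.

/* Data-decomposition sketch: the loop iterations, and thus the data,
 * are divided among OpenMP threads; the reduction clause combines the
 * per-thread partial sums. Compile with, e.g., cc -fopenmp sum.c */
#include <stdio.h>

#define N 1000000

int main(void) {
    static double data[N];
    for (int i = 0; i < N; i++)
        data[i] = 1.0;                  /* placeholder dataset */

    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < N; i++)
        total += data[i];

    printf("total = %f\n", total);
    return 0;
}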

HPC platforms may be threaded, shared-memory systems, or they may rely on message passing for communication and coordination among large collections of more independent nodes; the two concurrency techniques may also be used together. Multithreaded enterprise applications, though not as performance-sensitive, face similar architectural complexity where program correctness is concerned.
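
As a hedged sketch of the message-passing style (the problem size and decomposition are illustrative assumptions), the minimal MPI program below gives each rank a share of the index space and combines partial results with MPI_Reduce. Wrapping the local loop in an OpenMP pragma would give the hybrid threads-plus-messages model just described.

/* Message-passing sketch: each MPI rank computes a partial sum over a
 * cyclic share of the index space; MPI_Reduce combines the results on
 * rank 0. Build with an MPI compiler wrapper, e.g., mpicc partial.c */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long n = 1000000;                /* illustrative problem size */
    double local = 0.0;
    for (long i = rank; i < n; i += size)  /* cyclic data decomposition */
        local += 1.0;

    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}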

Threaded SMP systems enjoy a tremendous bandwidth and latency advantage over distributed memory systems, but scaling is limited. Multi-core processors raise the scaling limit of SMP, increasing its applicability to parallel programming problems, whether in HPC or in the enterprise.

Starting Points

Revisions in HPC applications are frequently driven by a user requirement for greater capacity (the ability to handle larger datasets). That may mean more threading to improve performance. In addition, look for ways to apply parallelism to add new capability (the ability to solve new problems with additional resources). The new prevalence of multi-core will make it easier to find third-party parallel components that help in this regard. Both added capacity and added capability are desirable, but while the former keeps you ahead of your competition, the latter can put you in a whole new market.

For development managers, the challenge is not so much to introduce parallelism as to plan development approaches that continually target scalability. That challenge has to be met not only at the start of a new project, but through successive upgrades.

If you're planning the next version of an application, add additional parallelism along with other changes. One approach is to concentrate on the new modules that implement new features, making sure these make the best use of parallelism. For this to be effective, you need to make sure that new code is sufficiently isolated from old code. This won't always be possible, but it's a further argument for walling off new features in separate modules. By keeping new functions modular, you can aggressively add new parallel code while limiting its impact on existing code and limiting the scope of required regression testing. Modularity is a desirable goal unto itself. Because the interfaces between systems are so sharply defined, message-passing systems tend to be more modular than threaded programs.
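
One way to picture that isolation, using hypothetical names, is a deliberately narrow C interface: callers of the new module see only opaque handles and plain functions, so whether the implementation uses threads at all stays hidden behind the boundary.

/* new_feature.h -- hypothetical interface for a new parallel module.
 * Old code sees only this narrow surface; the module is free to use
 * threads, MPI, or neither internally, which limits both the impact
 * on existing code and the scope of regression testing. */
#ifndef NEW_FEATURE_H
#define NEW_FEATURE_H

#include <stddef.h>

typedef struct nf_context nf_context;  /* opaque: internals stay private */

nf_context *nf_create(size_t dataset_size);
int         nf_run(nf_context *ctx, const double *in, double *out);
void        nf_destroy(nf_context *ctx);

#endif /* NEW_FEATURE_H */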

Amdahl's law is a well-known principle that describes the benefit you can expect from moving portions of a program from serial to parallel execution. The most direct approach to better performance is to follow where Amdahl's law leads and go after the serial regions in existing code. Attacking this problem requires a thorough performance analysis, which historically has meant reading through code, but automated tools can improve the process by increasing coverage. Intel Performance Analyzer's call-graph profiling can help here.
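
For reference, Amdahl's law says that if a fraction p of the work runs in parallel on n processors, the expected speedup is 1 / ((1 - p) + p / n). The short sketch below (core count and fractions are illustrative) tabulates the consequence: at p = 0.90, 64 cores deliver only about an 8.8x speedup, because the remaining serial 10 percent dominates.

/* Amdahl's law: expected speedup when a fraction p of the work runs
 * in parallel on n processors. The table shows why shrinking the
 * serial region matters more than adding cores. */
#include <stdio.h>

static double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    const int n = 64;                       /* illustrative core count */
    const double p[] = {0.50, 0.75, 0.90, 0.95, 0.99};
    for (size_t i = 0; i < sizeof p / sizeof p[0]; i++)
        printf("p = %.2f -> speedup on %d cores = %5.1fx\n",
               p[i], n, amdahl(p[i], n));
    return 0;
}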

Other tools can help to measure the performance of large cluster systems. For applications using MPI (Message Passing Interface), Intel Trace Collector and Intel Trace Analyzer can analyze performance on cluster systems of over 1,000 processors.

There's no working around a bad design, but that doesn't mean that a good design can't be improved by some of the same tools and techniques that you might apply to legacy code. Testing modules for performance as well as correctness is important before introducing the complexity of a fully integrated system. Intel Thread Checker has unique capabilities for debugging threaded applications that are useful at this stage. For performance analysis, Intel Thread Profiler can compare threaded performance of several versions.

Real-World Conditions

There's no way to cover every case with simple rules. What might be a pragmatic solution that keeps a project on schedule might, under different conditions, be a shortsighted fix that hampers long-term performance. Tactical judgment needs to be applied in a strategic context, and there's no substitute for experience in developing that judgment.

Bob Kuhn, Intel's Technical Marketing Director for Advanced Parallel Software Platforms, is a parallel-computing expert and a veteran of many a parallel programming development effort. Kuhn says that many HPC projects at first sought to increase performance by optimizing away the current bottleneck, using the easiest mechanism, then attacking the next bottleneck that cropped up in a similar fashion. "For pragmatists," says Kuhn, "that may provide sufficient performance."

But Kuhn cautions that such an approach has a point of diminishing returns—what he terms the "project manager's version of Amdahl's law." Eventually, the most egregious bottlenecks are eliminated, and each successive target of optimization delivers a lower marginal performance benefit for the same amount of development resources.

Kuhn describes a more sustainable approach. "Analyzing the goal with Amdahl's Law, start by saying everything in the application must eventually be in the parallel region to reach your goal," he says. "What data structures must be parallel and without synchronization?" Improvements along these lines may show smaller short-term speedups per developer-hour, but they have a greater prospect for long-term gain, with the benefit coming from data decomposition.

Optimization often makes an application structurally more complex, making it harder to improve overall parallelization after optimizations have been made. "After many changes, you find you have to do much more to switch to data decomposition," says Kuhn.

On the other hand, according to Kuhn, sometimes you have to consider options other than data decomposition, even in HPC. For example, a workflow model might be a more practical first-pass way to quickly integrate third-party programs in your HPC application than a deep parallel integration. In this case, the clean stdin/stdout interface of a workflow approach avoids a number of bugs that would surely crop up in a shared-memory integration of two large, complex pieces of code from different sources.
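
A hedged sketch of that workflow style, with a hypothetical tool name, uses POSIX popen() to drive a third-party solver through its stdin rather than linking it into the same address space:

/* Workflow-style integration sketch: "solver" is an illustrative
 * stand-in for any third-party program that reads its problem on
 * stdin. Running it as a child process keeps its bugs out of our
 * address space. Requires a POSIX system for popen()/pclose(). */
#include <stdio.h>

int main(void) {
    FILE *child = popen("solver", "w");   /* feed the tool's stdin */
    if (child == NULL) {
        perror("popen");
        return 1;
    }
    fprintf(child, "grid 128 128\nsteps 1000\n");  /* hypothetical input */
    int status = pclose(child);                    /* wait for the tool */
    return status == 0 ? 0 : 1;
}

Capturing the tool's stdout would take a second pipe (or popen() in "r" mode), but even this one-way version gains the process isolation that makes the integration robust.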

