May 07, 2008
Performance Portable C++Conclusion
Performance portability is likely to become a necessary part of programming in the near future. Already, some compilers will only create good SSE code for the x86 processor architecture if stride one vectors are used (also known as Array-like layouts). At the same time, older superscalar machines usually perform best when Struct-like layouts are used as often as possible. Without the flexibility to switch memory layouts, performance almost certainly suffers as code is ported, and this is more likely to be so as we head into the future.
The technique I describe here is used in a large multiphysics project that encompasses hundreds of thousands of lines of code. The technique was originally used as a refactoring tool more than a performance portability tool. At one point, the project was able to quickly refactor the hydrodynamics portion of their physics code for a 42-100 percent speedup depending on the problem being solved and the machine being used. The hydrodynamics can be the dominant portion of the runtime for many physics applications, so doubling the performance with just a change of data structures is impressive. Part of the performance gain probably came from the compiler recognizing extra optimizations that could be applied (same compiler flags), and the rest came from different cache latency characteristics.
Thanks to Brian McCandless for questioning me during a presentation on the sidebar material, and inspiring me to write this article. His group uses a variant of this technique in their software.
References
Triangle and Quadrilateral areas (softsurfer.com/algorithms.htm).
Grandy, Jeff. "Efficient Computation of Volume of Hexahedral Cells," Lawrence Livermore National Laboratory UCRL-ID-128886, Oct 30, 1997.
|
|
|||||||||||||||||||||||||||||||
|
|
|
|