September 05, 2006

Inside the VSIPL++ API

(Page 5 of 7)

Performance

Of course, there are other high-productivity environments (such as Matlab) for prototyping numerical applications. However, VSIPL++ is designed to provide high performance in addition to a convenient programming style. Because VSIPL++ can be used on workstations, supercomputers, or embedded devices, it's easy to prototype an algorithm on a workstation, then move it to the target device.

The VSIPL++ API lets you manually specify the layout of data. For example, it is usually more efficient to arrange a matrix in row-major order if performing computations along the rows, but more efficient to arrange the matrix in column-major order if performing computations along the columns. VSIPL++ programmers can pick whichever arrangement is most convenient.
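To make the layout choice concrete, here is a minimal sketch in plain C++ rather than VSIPL++ itself. The names DenseMatrix, Layout, and row_sum are hypothetical and exist only for this illustration. The sketch shows how the same logical element (r, c) maps to different memory offsets under the two orderings, which is why a walk along a row is contiguous in one layout and strided in the other.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration types; these are NOT part of the VSIPL++ API.
enum class Layout { RowMajor, ColMajor };

// A dense matrix whose storage order is fixed at compile time.
template <Layout L>
class DenseMatrix {
public:
  DenseMatrix(std::size_t rows, std::size_t cols)
    : rows_(rows), cols_(cols), data_(rows * cols, 0.0f) {}

  // Map a logical (row, col) pair to an offset in the flat buffer.
  std::size_t offset(std::size_t r, std::size_t c) const {
    assert(r < rows_ && c < cols_);
    return L == Layout::RowMajor ? r * cols_ + c : c * rows_ + r;
  }

  float& operator()(std::size_t r, std::size_t c) { return data_[offset(r, c)]; }
  float operator()(std::size_t r, std::size_t c) const { return data_[offset(r, c)]; }

  std::size_t rows() const { return rows_; }
  std::size_t cols() const { return cols_; }

private:
  std::size_t rows_, cols_;
  std::vector<float> data_;
};

// Sum one row. With RowMajor the accesses are contiguous (stride 1);
// with ColMajor consecutive accesses are rows() elements apart.
template <Layout L>
float row_sum(const DenseMatrix<L>& m, std::size_t r) {
  float s = 0.0f;
  for (std::size_t c = 0; c < m.cols(); ++c) s += m(r, c);
  return s;
}
```

Both layouts give the same answer for row_sum; only the memory access pattern differs, and that pattern is what the layout choice lets you control.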

Sourcery VSIPL++ uses several additional techniques to obtain good performance. It can dispatch operations (like FFTs) to math libraries that have been carefully tuned for the target system. For example, on Intel processors, the Intel Integrated Performance Primitives (IPP; http://www.intel.com/cd/software/products/asmo-na/eng/perflib/ipp/index.htm) provide handwritten code for FFTs. If no library is available for a particular operation, Sourcery VSIPL++ falls back to generic routines.
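The idea of "use a tuned routine if there is one, otherwise fall back" can be sketched in a few lines of plain C++. This is only an illustration of the pattern, not a description of how Sourcery VSIPL++ is implemented. The names MulFn, find_tuned_mul, generic_mul, and multiply are all hypothetical, and no real vendor library is called.

```cpp
#include <cassert>
#include <cstddef>

// Common signature for every implementation of element-wise multiply.
using MulFn = void (*)(float* dst, const float* src, std::size_t n);

// Generic routine: portable and always available.
inline void generic_mul(float* dst, const float* src, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) dst[i] *= src[i];
}

// Hypothetical lookup for a vendor-tuned routine. A real build might
// return a wrapper around a library call here; this sketch has none,
// so it reports "not available" by returning nullptr.
inline MulFn find_tuned_mul() { return nullptr; }

// Prefer the tuned routine when one exists; otherwise use the generic one.
inline void multiply(float* dst, const float* src, std::size_t n) {
  assert(n == 0 || (dst != nullptr && src != nullptr));
  MulFn tuned = find_tuned_mul();
  MulFn impl = tuned ? tuned : &generic_mul;
  impl(dst, src, n);
}
```

Callers only ever see multiply, so the choice of backend stays an implementation detail that can change from one target system to the next.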

The generic routines for some computations (like the *= operator used to perform element-wise multiplication in the example) use expression templates to perform loop fusion and eliminate temporaries. In the *= case, this line of code:

tmp *= filters.row(i); 

is transformed into code like this:

for (length_type j=0; j<N; ++j)
  tmp[j] *= filters[i][j];

This technique is even more effective on code such as:

Vector<T> A, B, C;
A = B + cos(C);

which is transformed into code like:

for (length_type i=0; i<A.size(0); ++i)
  A[i]=B[i]+cos(C[i]);

which can be compiled to very efficient code.
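If you have not seen expression templates before, the following is a minimal sketch of the technique in plain C++. It is not the Sourcery VSIPL++ implementation. The namespace et and the types Vec, Add, and Cos are hypothetical names invented for this illustration. The point is that B + cos(C) builds a small tree of lightweight objects, and no arithmetic happens until the assignment walks that tree once per element.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Everything in this namespace is hypothetical; none of it is VSIPL++ API.
namespace et {

// Expression node: element-wise sum of two sub-expressions.
template <typename L, typename R>
struct Add {
  const L& lhs;
  const R& rhs;
  float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
  std::size_t size() const { return lhs.size(); }
};

// Expression node: element-wise cosine of a sub-expression.
template <typename E>
struct Cos {
  const E& arg;
  float operator[](std::size_t i) const { return std::cos(arg[i]); }
  std::size_t size() const { return arg.size(); }
};

struct Vec {
  std::vector<float> data;
  explicit Vec(std::size_t n, float v = 0.0f) : data(n, v) {}
  float operator[](std::size_t i) const { return data[i]; }
  float& operator[](std::size_t i) { return data[i]; }
  std::size_t size() const { return data.size(); }

  // Assigning an expression runs the single fused loop; no temporary
  // vector is ever created for the intermediate cos(C) result.
  template <typename E>
  Vec& operator=(const E& e) {
    assert(e.size() == size());
    for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
    return *this;
  }
};

// These builders only record references to their operands. The nodes
// must be consumed within the same full expression, as in A = B + cos(C),
// because they refer to temporaries that die at the end of it.
template <typename L, typename R>
Add<L, R> operator+(const L& l, const R& r) { return Add<L, R>{l, r}; }

template <typename E>
Cos<E> cos(const E& e) { return Cos<E>{e}; }

}  // namespace et
```

With these definitions, A = B + cos(C) for three et::Vec objects compiles down to one loop whose body is B[i] + std::cos(C[i]). The unconstrained operator+ is acceptable in a sketch because it is only found for types in namespace et; a production library would constrain it more carefully.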

Some compilers (the GNU C++ compiler, for instance) have extensions that can be used to explicitly request that Single-Instruction Multiple-Data (SIMD) units on the processor be used to perform several computations at once. Sourcery VSIPL++ uses these extensions when they are available.
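As a rough illustration of what such an extension looks like, the sketch below uses the GNU vector_size attribute to multiply four floats per operation, with a plain scalar loop when the extension is not available. The function name mul_inplace and the type name v4sf are chosen for this example, and the code is an assumption about how one might use the extension, not an excerpt from Sourcery VSIPL++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

#if defined(__GNUC__)
// GNU extension: a 16-byte vector type holding four floats.
typedef float v4sf __attribute__((vector_size(16)));

// dst[i] *= src[i], four elements at a time, with a scalar tail.
inline void mul_inplace(float* dst, const float* src, std::size_t n) {
  assert(n == 0 || (dst != nullptr && src != nullptr));
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    v4sf a, b;
    std::memcpy(&a, dst + i, sizeof a);  // load without alignment assumptions
    std::memcpy(&b, src + i, sizeof b);
    a = a * b;                           // element-wise multiply on the vector type
    std::memcpy(dst + i, &a, sizeof a);
  }
  for (; i < n; ++i) dst[i] *= src[i];   // leftover elements
}
#else
// Portable fallback for compilers without the extension.
inline void mul_inplace(float* dst, const float* src, std::size_t n) {
  assert(n == 0 || (dst != nullptr && src != nullptr));
  for (std::size_t i = 0; i < n; ++i) dst[i] *= src[i];
}
#endif
```

Whether the vector multiply actually maps to a single SIMD instruction depends on the target processor and compiler flags; the extension only expresses the intent.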

The net result of these techniques is that the Sourcery VSIPL++ performance for an application is usually within a hair's breadth of the performance attained by directly using the underlying low-level math libraries. So, VSIPL++ users get productivity and portability, without sacrificing performance.

On occasion, Sourcery VSIPL++ is able to achieve better performance than the underlying math libraries, due to the use of loop fusion. For example, in the cosine example, using a library like IPP, you would have to perform the addition and cosine operations separately. As a result, you would get code that looks more like this:

for (length_type i = 0; i < A.size(0); ++i)
  A[i] = B[i];
for (length_type i = 0; i < A.size(0); ++i)
  A[i] += cos(C[i]);

Because there are two separate loops, there is more overhead and an inferior cache access pattern. The loop fusion used by Sourcery VSIPL++ collapses these two loops into a single loop.
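The difference is easy to see when the two forms are written out as ordinary C++ functions. The sketch below uses std::vector and the hypothetical function names add_cos_two_pass and add_cos_fused; neither is an IPP or VSIPL++ routine. Both compute the same result, but the two-pass version touches every element of a twice.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Two passes, as when addition and cosine are separate library calls:
// a is written in the first loop, then read and written again in the second.
inline void add_cos_two_pass(std::vector<float>& a,
                             const std::vector<float>& b,
                             const std::vector<float>& c) {
  assert(a.size() == b.size() && b.size() == c.size());
  for (std::size_t i = 0; i < a.size(); ++i) a[i] = b[i];
  for (std::size_t i = 0; i < a.size(); ++i) a[i] += std::cos(c[i]);
}

// Fused: one pass, and each element of a is written exactly once.
inline void add_cos_fused(std::vector<float>& a,
                          const std::vector<float>& b,
                          const std::vector<float>& c) {
  assert(a.size() == b.size() && b.size() == c.size());
  for (std::size_t i = 0; i < a.size(); ++i) a[i] = b[i] + std::cos(c[i]);
}
```

For vectors too large to stay in cache between the two loops, the two-pass version has to bring a back from memory a second time, which is the extra cost that fusion avoids.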
