June 03, 2008

Appliances: Adaptable Parallel Computing for Mass Consumption

(Page 2 of 3)

Parallel Execution

Let's look at an image-processing example that does 2D FFTs on each plane of a 3D matrix (images). The original M code looks like this:

% images is an N x N x #images array
for i=1:size(images,3)
   fft_images(:,:,i) = fft2(images(:,:,i));
end

To run this in parallel, the code might look like:

fft_images = ppeval('fft2',images);

where ppeval is essentially a parallel loop over each plane of the input array(s). The Star-P infrastructure handles the details of distributing the data to the cores that are active for the session; the user continues to think in the high-productivity M language, but gains the performance advantages of many cores and more memory. The example above uses task parallelism. Star-P also supports data parallelism, in which a single operation, such as a matrix multiplication or a sort, is performed across an entire array that is distributed across the memory of the parallel nodes. Adapting M or Python code to run in parallel with Star-P typically takes a day or so, depending on the size of the code, and often yields speed-ups of more than 10X, or even 100X on large systems.
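To make the "parallel loop over each plane" idea concrete, here is a minimal C++ sketch of the same pattern on a single machine. It is not Star-P code and says nothing about how ppeval is implemented; the names ImageStack, PlaneKernel, and parallel_over_planes are hypothetical, and the per-plane kernel is left as a parameter so that any plane-wise routine (a 2D FFT, for instance) could be plugged in.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Hypothetical container: planes stored contiguously, one after another,
// which matches the column-major layout of an N x N x #images M array.
struct ImageStack {
    std::size_t rows, cols, planes;
    std::vector<double> data;  // rows * cols * planes elements
};

// Hypothetical per-plane kernel: reads one plane, writes one plane.
using PlaneKernel = std::function<void(const double* in, double* out,
                                       std::size_t rows, std::size_t cols)>;

// Task-parallel loop: launch one asynchronous task per plane, then wait.
ImageStack parallel_over_planes(const ImageStack& in, const PlaneKernel& kernel) {
    assert(in.data.size() == in.rows * in.cols * in.planes);
    ImageStack out{in.rows, in.cols, in.planes,
                   std::vector<double>(in.data.size())};
    const std::size_t rows = in.rows;
    const std::size_t cols = in.cols;
    const std::size_t plane_size = rows * cols;

    std::vector<std::future<void>> tasks;
    tasks.reserve(in.planes);
    for (std::size_t p = 0; p < in.planes; ++p) {
        const double* src = in.data.data() + p * plane_size;
        double* dst = out.data.data() + p * plane_size;
        // Each task touches only its own plane, so no locking is needed.
        tasks.push_back(std::async(std::launch::async,
            [&kernel, src, dst, rows, cols] { kernel(src, dst, rows, cols); }));
    }
    for (auto& t : tasks) t.get();  // wait; rethrows any kernel exception
    return out;
}
```

Because the planes are independent, the tasks need no coordination beyond the final wait. A production system would normally cap the number of concurrent tasks at the number of available cores rather than start one per plane.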

Readers familiar with C and Fortran may object that those languages often deliver much higher performance than the high-productivity languages for a given algorithm. That has historically been true, but new compilation techniques for the high-productivity languages are closing the gap rapidly while preserving their productivity benefits. Developers who want every last clock cycle of performance will usually be able to make C or Fortran run faster on a single core, but that advantage must be weighed against how easily the productivity languages can be parallelized to gain better multi-core performance.

Software Acceleration

Developers of existing appliances will often have created, with considerable effort, optimized versions of the key kernels of their algorithms, usually in C or Fortran. While they may value the greater adaptability of the productivity languages, they must be able to keep using those optimized kernels. Star-P supports this need through its SDK interface. In keeping with the FFT example above, assume that a developer has a single-core C-language FFT routine, appl_fft, that is customized to the appliance's special circumstances. Using this routine from Star-P involves three steps. The first step plugs the C routine into the Star-P infrastructure on the HPC server system. The wrapper code to do this might look like the following (error-checking details omitted for brevity).

static void appl_fft_wrapper(ppevalc_module_t& module,
  pearg_vector_t const& inargs, pearg_vector_t& outargs)
{
  // create an output argument of the same type and
  // shape as the input argument
  pearg_t outarg(inargs[0].element_type(), inargs[0].size_vector());

  // call appl_fft, telling it to read its input args
  // directly from inargs and to write its result
  // directly into outarg
  starp_double_t const* indata = inargs[0].data<starp_double_t>();
  starp_double_t *outdata = outarg.data<starp_double_t>();
  appl_fft(inargs[0].number_of_elements(), indata, outdata);

  // hand the result back to the caller
  outargs[0] = outarg;
}
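The wrapper above omits error checking. As a rough illustration of what such checks could look like, the following self-contained sketch repeats the same allocate-then-call pattern with mock types. MockArg and checked_wrapper are hypothetical stand-ins, not Star-P SDK types, and the appl_fft body is a placeholder that merely copies its input; its signature is an assumption inferred from the call in the wrapper above.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Mock argument container (hypothetical; NOT the Star-P pearg_t type).
struct MockArg {
    std::vector<std::size_t> shape;
    std::vector<double> values;

    MockArg() = default;
    // Allocate zero-filled storage for the given shape.
    explicit MockArg(std::vector<std::size_t> s) : shape(std::move(s)) {
        std::size_t n = 1;
        for (std::size_t d : shape) n *= d;
        values.assign(n, 0.0);
    }
    std::size_t number_of_elements() const { return values.size(); }
};

// Placeholder for the appliance kernel: copies input to output.
// A real appl_fft would compute a transform; this signature is assumed.
void appl_fft(std::size_t n, const double* in, double* out) {
    for (std::size_t i = 0; i < n; ++i) out[i] = in[i];
}

// Same pattern as the wrapper in the text, with argument checks added.
void checked_wrapper(const std::vector<MockArg>& inargs,
                     std::vector<MockArg>& outargs) {
    if (inargs.size() != 1)
        throw std::invalid_argument("expected exactly one input argument");
    if (outargs.empty())
        throw std::invalid_argument("expected room for one output argument");
    const MockArg& in = inargs[0];
    if (in.number_of_elements() == 0)
        throw std::invalid_argument("input argument is empty");

    MockArg out(in.shape);  // output has the same shape as the input
    appl_fft(in.number_of_elements(), in.values.data(), out.values.data());
    outargs[0] = std::move(out);
}
```

Validating the arguments before calling the kernel keeps a malformed call from reaching optimized code that assumes well-formed buffers.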

The second step calls the wrapper from the M language, via:

function out = fft(in)
out = appl_fft_wrapper(in);

The third step, executed in a set-up part of the appliance, links this wrapper to the productivity language and then places the application-specific FFT routine at the head of the MATLAB search path. It might look like:

pploadpackage('C','/path/to/package.so','fft');
addpath('path/to/appliance/')  % addpath prepends to the MATLAB search path by default

To preserve compatibility with the unaccelerated source, the core algorithm (the ppeval code above) does not change; it simply picks up the new fft function defined here.
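The reason the core algorithm can stay unchanged is that a function name is resolved against an ordered search path, and the first match wins. The C++ sketch below is only an analogy for that lookup rule, not a description of MATLAB or Star-P internals; SearchPath, Directory, and core_algorithm are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using Fn = std::function<std::vector<double>(const std::vector<double>&)>;
using Directory = std::map<std::string, Fn>;  // function name -> implementation

// Ordered list of directories; earlier entries shadow later ones.
class SearchPath {
public:
    void prepend(Directory dir) { dirs_.insert(dirs_.begin(), std::move(dir)); }
    void append(Directory dir) { dirs_.push_back(std::move(dir)); }

    // Return the first implementation found for `name`.
    const Fn& resolve(const std::string& name) const {
        for (const Directory& d : dirs_) {
            auto it = d.find(name);
            if (it != d.end()) return it->second;
        }
        throw std::out_of_range("no function named " + name);
    }

private:
    std::vector<Directory> dirs_;
};

// The "core algorithm" refers to fft by name only, so it never changes
// when a different implementation is placed ahead of the default one.
std::vector<double> core_algorithm(const SearchPath& path,
                                   const std::vector<double>& x) {
    return path.resolve("fft")(x);
}
```

Prepending a directory that defines its own fft changes which implementation core_algorithm reaches without touching core_algorithm itself, which is the compatibility property the article describes.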
