Parallel Execution
Let's look at an image-processing example that performs a 2D FFT on each plane of a 3D array (images). The original M code looks like this:
% images is an N x N x #images array
for i=1:size(images,3)
    fft_images(:,:,i) = fft2(images(:,:,i));
end
To run this in parallel, the code might look like:
fft_images = ppeval('fft2',images);
where ppeval is essentially a parallel loop over each plane of the input array(s). The Star-P infrastructure takes care of allocating the data to the cores that are active for the session; the user continues to think in the high-productivity M language but gets the performance advantages of many cores and more memory. In addition to the task parallelism in this example, Star-P supports data parallelism, in which a single operation, such as a matrix multiplication or a sort, is applied to an entire array distributed across the memory of the parallel nodes. Adapting M or Python code to run in parallel with Star-P typically takes a day or so, depending on the size of the code, and often yields speed-ups of more than 10X, and even 100X on large systems.
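For concreteness, the data-parallel style might look like the following sketch, which uses Star-P's *p notation to mark a dimension as distributed across the server's nodes (the sizes and operations here are illustrative, not taken from a real appliance):

% tagging a dimension with *p creates the array on the
% parallel server, distributed across its nodes
A = rand(4000*p);   % 4000 x 4000 distributed matrix
B = rand(4000*p);
C = A * B;          % the multiply executes in parallel on the server
s = sort(C(:));     % as does this sort of the distributed result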
Readers familiar with C and Fortran may interject that those languages often deliver much higher performance than the high-productivity languages for a given algorithm. While that has historically been true, new compilation techniques for the high-productivity languages are closing the gap rapidly while preserving their productivity benefits. Developers who want every last clock cycle of performance will usually be able to make C or Fortran run faster on a single core, but that advantage must be weighed against the ease of going parallel with the productivity languages and thereby gaining better multi-core performance.
Software Acceleration
Developers of existing appliances will often have created, with considerable effort, optimized versions of the key kernels of their algorithms, usually in C or Fortran. While they may value the greater adaptability of the productivity languages, they must retain the use of those optimized kernels. Star-P supports this need through its SDK interface. In keeping with the FFT example above, assume that a developer has a single-core C-language FFT routine, appl_fft, that is customized to the appliance's special circumstances. Using this routine from Star-P involves three steps. The first step plugs the C routine into the Star-P infrastructure on the HPC server system; the wrapper code to do this might look like the following (error-checking details omitted for brevity):
static void appl_fft_wrapper(ppevalc_module_t& module,
                             pearg_vector_t const& inargs,
                             pearg_vector_t& outargs)
{
    // create an output argument of the same type and
    // shape as the input argument
    pearg_t outarg(inargs[0].element_type(), inargs[0].size_vector());

    // call appl_fft, telling it to read its input args
    // directly from inargs and to write its result
    // directly into outarg
    starp_double_t const* indata = inargs[0].data<starp_double_t>();
    starp_double_t* outdata = outarg.data<starp_double_t>();
    appl_fft(inargs[0].number_of_elements(), indata, outdata);

    outargs[0] = outarg;
}
The second step calls the wrapper from the M language via a small shim:
function out = fft(in)
out = appl_fft_wrapper(in);
The third step, executed in a set-up part of the appliance, links this wrapper into the productivity language and places the application-specific FFT routine at the head of the MATLAB search path. It might look like:
pploadpackage('C', '/path/to/package.so', 'fft');
setpath('/path/to/appliance/')
To preserve compatibility with the unaccelerated source, the core algorithm (the ppeval code above) does not change; it simply picks up the new fft function defined here.
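Concretely, once the set-up code has run, the accelerated call site might look like the following sketch (assuming, as above, that the kernel is exposed under the name fft, which now shadows the builtin because the appliance directory heads the search path):

% unchanged core algorithm; 'fft' now resolves to the SDK
% wrapper, so each plane is transformed by appl_fft
fft_images = ppeval('fft', images);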