February 20, 2007
Programming and Optimizing C Code: Part 1
Alan Anderson, Analog Devices
This first of a five-part series introduces the basic principles of writing C code for a DSP processor. It also explains how to profile and optimize code.
[Editor's note: Part 2 of this series shows how to optimize DSP "kernels," i.e., inner loops. For more programming tips, see the DSP programmer's guide.]
Introduction
C enables you to bring in a portable program and quickly experiment with it in order to ascertain the performance potential. Management loves this because they see working—if slow—product elements from early in the development cycle. However, C has its own problems rooted in the semantic gaps between programming language, design and hardware:
For all of these reasons and more, the performance of compiled code may be far less than that of hand-written assembly—or at least this is the case if you take a simplistic approach to writing your C code. In this series we will look at ways to tune up the performance of the C programs so you can avoid the assembly option. First, some basics.
Understanding the Application and the Processor
You must also gain a low-level awareness of the DSP processor's capabilities. The machine's characteristics will ultimately determine the level of performance you can achieve, so you need to understand them in order to set targets for the performance of your code.
It is particularly important to understand your processor's specialized features. For example, the processor may have highly efficient operations for things like Viterbi decoding, bit multiplexing, or vectorized multiply-accumulates (MACs). You also need to consider the processor's memory system. For example, will the bus capacity support the amount of data you hope to process?
Once you understand the processor, you must evaluate whether your algorithm will map well onto the processor's low-level facilities. You must then look at the assembly code emitted by the compiler to decide whether you are actually using these facilities efficiently.
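As a rough illustration of checking how an algorithm maps onto the hardware, the sketch below writes a dot product so that each loop iteration is a single multiply-accumulate. The function name dot_i16 and the choice of 16-bit inputs with a 32-bit accumulator are assumptions made for this example; they are not taken from any particular processor. Whether the compiler actually turns the loop into MAC instructions is something to confirm in the emitted assembly.

```c
#include <stdint.h>

/* Hypothetical helper: dot product of two 16-bit vectors.
   Each iteration performs exactly one multiply and one accumulate,
   which is the shape a MAC unit is built to execute.
   The caller is assumed to keep n small enough that the sum fits
   in 32 bits. */
int32_t dot_i16(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;                      /* accumulator wider than the inputs */
    for (int i = 0; i < n; i++) {
        acc += (int32_t)a[i] * b[i];      /* one MAC per iteration */
    }
    return acc;
}
```

For example, with a = {1, 2, 3, 4} and b = {5, 6, 7, 8}, the function returns 70. Compiling it with optimization enabled and reading the generated inner loop is a quick way to see whether the processor's MAC and wide-load facilities are being used.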
Understanding C
C also assumes a large flat memory model. In reality, memory access costs can be highly irregular and can dominate application performance. Thus, blindly following C conventions can ruin performance.
Perhaps more importantly, "portable" C is surprisingly machine-dependent, even in such basics as what an "int" is. Depending on the machine, an int may be either 16 or 32 bits wide. Making the wrong assumption can greatly affect the performance of your code, and can even lead to incorrect operation.
There is often a poor match between C and the features of DSPs in the areas of accumulators, vectorization (that is, SIMD hardware), and fractional processing. These hardware features are essential to efficient processing, but they are not natively supported in ANSI C. So the message for the C programmer is that C programs can be ported with little difficulty, but if you want high efficiency, you can't ignore the underlying hardware.
Bear in mind that there is a conflict in program design between generality and explicitness. For instance, consider how you access data. If your goal is to maximize generality, you might use highly indirect pointers. The problem with this approach is that it forces the compiler to use a conservative strategy. For example, suppose you call memcpy as follows:
memcpy( struct->Ptr1, &(ShortArray[*PtrIndex]) , num );
In this scenario, the compiler cannot deduce the data addresses. Because it cannot rule out overlapping or misaligned data, it will generate very slow but safe code, perhaps transferring only one data word at a time:
Cycle 1: Load 16 bits
You can make the identity and alignment of the data more obvious by writing the call as:
memcpy( IntArray1, IntArray2 , num );
Now the compiler can deploy wide loads and vectorization. On a typical DSP, this will provide an eight-fold improvement in speed:
Cycle 1: Load 32 bits
Occasionally you can speed up a program simply by making the C code more elegant, as shown in the memcpy example. More often than not, however, the speedup comes from specializing the program for the hardware. This process leaves you with a faster program, but it also gives you a program that is larger, more complex, and less portable. In other words, there is a price to pay for performance. In order to minimize the price, you should target your optimization work to key areas only, and resist the temptation to write everything for maximum efficiency.
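The two memcpy calls discussed above can be placed side by side in a small sketch. Because struct is a reserved word in C, the example uses a hypothetical struct type named Holder; the function names copy_indirect and copy_direct, the array sizes, and the element types are likewise assumptions made for illustration. Both functions copy the same way at the source level; the difference lies only in how much the compiler can see about the addresses involved, and what it does with that knowledge depends on the toolchain.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define N 64   /* hypothetical element count */

/* Hypothetical container holding the destination pointer. */
struct Holder { int16_t *Ptr1; };

static int16_t ShortArray[2 * N];
static int32_t IntArray1[N];
static int32_t IntArray2[N];

/* Indirect form: the destination is read out of a struct and the source
   offset is read through another pointer, so neither address nor
   alignment is known when this function is compiled.
   num is a byte count, as memcpy requires. */
void copy_indirect(struct Holder *h, const int *PtrIndex, size_t num)
{
    memcpy(h->Ptr1, &ShortArray[*PtrIndex], num);
}

/* Direct form: both operands are named arrays of a known type, so their
   identity and alignment are visible at compile time. */
void copy_direct(size_t num)
{
    memcpy(IntArray1, IntArray2, num);
}
```

Comparing the assembly generated for the two functions is one way to check whether your compiler takes advantage of the extra information in the direct form.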