Despite the importance of power consumption and memory use, relatively little emphasis has been placed on optimizing power and memory for embedded applications. This paper will provide some guidelines on optimizing embedded applications for power.
Just as code size and speed impact cost, power consumption also affects cost. The more power an embedded application consumes, the larger the battery required to drive it. For a portable product, this means added expense, weight, and bulk. To reduce power, make the application run in as few cycles as possible, since each cycle consumes a measurable amount of energy. In this sense, performance and power optimization appear to be the same problem: consume the fewest cycles and both goals are met. The two strategies do share similar goals, but they have subtle differences, as will be shown shortly.
But the real power optimization gains come from how data is accessed before being processed by the embedded CPU. Most of the power consumed in an embedded application comes not from the CPU itself but from moving data from memory to the CPU. Each time the CPU accesses external memory, buses must be driven and other functional units powered on to deliver the data; this is where the majority of the power is consumed. If the programmer designs the application to minimize external memory accesses, move data into and out of the CPU efficiently, and use the cache well enough to prevent thrashing, overall power consumption drops significantly. Figure 16 shows the two main power contributors. The compute block contains the CPU and is where the algorithmic work is performed; the memory transfer block is where the memory subsystems are exercised by the application. The memory transfer block consumes the majority of the power in an embedded application.
Figure 16. The main power contributors for an embedded application are in the memory transfer functions, not in the compute block. (From PowerEscape)
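The memory-traffic argument above can be made concrete with a small sketch. The example below is illustrative, not from the original paper: the array dimensions and loop bodies are arbitrary, but the contrast is real. Both functions compute the same sum; the row-major version walks memory sequentially, so every cache line fetched from external memory is fully consumed, while the column-major version strides across rows and wastes most of each fetched line, generating far more external-memory activity for the same result.

```c
#include <stddef.h>

#define ROWS 64
#define COLS 64

/* Cache-unfriendly: strides down columns, touching a different cache
 * line on almost every access, which drives external-memory traffic
 * (and therefore power) up without doing any extra useful work. */
long sum_column_major(int m[ROWS][COLS])
{
    long sum = 0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            sum += m[i][j];
    return sum;
}

/* Cache-friendly: walks memory sequentially, so each cache line
 * fetched from external memory is fully used before eviction. */
long sum_row_major(int m[ROWS][COLS])
{
    long sum = 0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            sum += m[i][j];
    return sum;
}
```

On a processor with a single level of cache, a simple loop interchange like this can cut the number of external-memory line fills dramatically, which is exactly the memory-transfer power the figure identifies as dominant.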
LAST RESORT - ASSEMBLY LANGUAGE
Many times, the C code can be modified slightly to alleviate this situation, but it can take time and several iterations to get the optimal (or close to optimal) solution. The process of refining code in this manner is shown in Figure 17. The last resort is coding the algorithm in assembly language. Assembly language is harder to write, understand, and maintain. Tools have been developed that make it easier for assembly language programmers to write efficient code for superscalar and VLIW processors. Assembly language optimizers, for example, allow the programmer to write serial assembly language and then optimize it into software pipelined loops automatically.
Figure 17. Code optimization process
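Before falling back to assembly, it is often enough to give the compiler the information it needs to build a software-pipelined loop on its own. A common mechanism, sketched below under the assumption of a C99-capable compiler, is the `restrict` qualifier: by promising that the input arrays do not alias, the programmer frees the compiler to overlap loads, multiplies, and accumulates from different iterations.

```c
#include <stddef.h>

/* The `restrict` qualifiers promise the compiler that `a` and `b` do
 * not overlap. Without that promise, the compiler must assume a store
 * through one pointer could change data read through the other, which
 * blocks the iteration overlap a software-pipelined schedule needs. */
long dot_product(const short *restrict a, const short *restrict b, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (long)a[i] * b[i];   /* one multiply-accumulate per iteration */
    return acc;
}
```

Vendor compilers typically add their own hints on top of this (for example, pragmas that state a minimum trip count), so checking the compiler's software-pipelining feedback is usually faster than rewriting the loop in assembly.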
CONCLUSION
Real time programmers have always had to develop a library of tricks to make software run as fast as possible. As processors grow more complicated, this becomes a more difficult endeavor. For superscalar and VLIW processors, managing two separate pipelines and ensuring the highest degree of parallelism requires tool support. Optimizing compilers overcome many of the obstacles these powerful processors present, but even compilers have limitations. Real time programmers should not trust the compiler to perform all of the necessary optimizations for them; the compiler needs help. The main steps to follow are:
- Study the assembly language produced by the compiler. In many instances, subtle changes to the structure of the C code make a big difference in the assembly the compiler generates, and that can make the difference in the real time performance of the system.
- Use the DMA capabilities, especially for the data intensive number crunching applications common in DSP systems. The DMA can take a huge burden off the CPU and help manage data efficiently.
- Keep the pipelines full. The whole reason superscalar and VLIW processors were invented was to exploit parallelism. Look for areas of inefficiency in the assembly language and make modifications that allow both pipelines to run at full efficiency. This requires an understanding of what the compiler looks for in terms of pipelining opportunities, and an understanding of the application itself. Many times, simply rearranging an algorithm lets it run more efficiently on the processor.
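The DMA step above usually takes the form of double buffering: while the CPU works on one on-chip buffer, the DMA controller fills the other from external memory, so the transfer latency is hidden behind useful computation. The sketch below is illustrative only; `dma_copy` is a hypothetical stand-in (implemented here with `memcpy` so the example runs anywhere) for programming a real DMA controller, which would return immediately while the transfer proceeds in the background.

```c
#include <string.h>

#define BLOCK 8

/* Hypothetical stand-in for a DMA transfer: on a DSP this would program
 * the DMA controller and return at once; memcpy models the data movement. */
static void dma_copy(short *dst, const short *src, size_t n)
{
    memcpy(dst, src, n * sizeof *src);
}

/* Double buffering: the CPU processes one on-chip buffer while the DMA
 * fills the other from external memory, hiding transfer latency. */
long process_stream(const short *ext_data, size_t n_blocks)
{
    short buf[2][BLOCK];            /* two ping-pong buffers */
    long total = 0;

    dma_copy(buf[0], ext_data, BLOCK);   /* prime the first buffer */
    for (size_t blk = 0; blk < n_blocks; blk++) {
        const short *work = buf[blk & 1];
        if (blk + 1 < n_blocks)          /* kick off the next transfer early */
            dma_copy(buf[(blk + 1) & 1], ext_data + (blk + 1) * BLOCK, BLOCK);
        for (size_t i = 0; i < BLOCK; i++)  /* CPU work overlaps the "DMA" */
            total += work[i];            /* placeholder for real processing */
    }
    return total;
}
```

On real hardware the call that starts the next transfer would be asynchronous, and the loop would wait on a transfer-complete flag before swapping buffers; the ping-pong structure is the part that carries over.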