Embedded Systems

Programming High-Performance DSPs: Part 3

By Rob Oshana and Texas Instruments, November 30, 2006

This third part of a three-part series shows how you can help the compiler produce faster code. It explains the drawbacks of software pipelining. It also explains how to optimize for minimum power consumption.

[Part 1 of this series introduced VLIW pipelines, multi-level memory architectures, and Direct Memory Access (DMA).

Part 2 of this series explains how to maximize performance with loop unrolling and software pipelining.]

Figure 12a. C example (repeated for clarity)

In the simple 'for' loop, it is apparent that the inputs do not depend on the output; in other words, there are no dependencies between loop iterations. But the compiler does not know that. Compilers are generally pessimistic creatures: they will not optimize something unless the situation is fully understood. The compiler takes the conservative approach and assumes that the inputs may depend on the output of a previous pass through the loop. If we know that the inputs do not depend on the output, we can hint this to the compiler by declaring input1 and input2 as "const", indicating that these values will not change. This is a trigger for enabling software pipelining and improving throughput. The C code is shown in Figure 13 with the corresponding assembly language.
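As a rough illustration of the "const" hint described above, the following is a minimal sketch and not a copy of the article's Figure 13a. The function name `multiply_arrays`, the element-wise multiply, and the length parameter `n` are assumptions made for the example; only the parameter names input1 and input2 come from the text. Whether this declaration alone is enough to trigger software pipelining depends on the compiler and its options.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for the article's loop.
 * The const qualifiers tell the compiler that this function does not
 * write through input1 or input2; the only array written is out.
 */
void multiply_arrays(float *out,
                     const float *input1,
                     const float *input2,
                     size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Two loads, one multiply, one store per iteration. */
        out[i] = input1[i] * input2[i];
    }
}
```

The caller still has to pass arrays that do not overlap; the declaration states the intent but does not check it.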

Figure 13a. C example with "const" declaration

Figure 13b. Corresponding pipelined assembly language output

There are a few things to notice in this assembly language. First, the piped loop kernel has become smaller. In fact, the loop is now only two cycles long. Lines 44-47 all execute in one cycle (parallel instructions are indicated by the || symbol), and lines 48-50 execute in the second cycle of the loop. With the additional dependency information we supplied through the "const" declaration, the compiler has been able to take advantage of the parallelism in the execution units and schedule the inner part of the loop very efficiently.

But this comes at a price. The prolog and epilog portions of the code are now much larger. Tighter piped kernels require more priming operations to coordinate execution around the various instruction and branch delays. Once primed, however, the kernel loop executes extremely fast, performing operations on several iterations of the loop at once. The goal of software pipelining is, as we mentioned earlier, to make the common case fast. The kernel is the common case in this example, and we have made it very fast. Pipelining may not be worthwhile for loops with a small loop count, where the prolog and epilog dominate. But for loops with a large loop count, executing thousands of times, software pipelining is the only way to go.
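The trade-off between a fast kernel and a larger prolog and epilog can be made concrete with a simple cost model. This is a sketch under stated assumptions: the function name `pipelined_cycles` and all cycle counts passed to it are hypothetical, and the model ignores stalls, memory effects, and the fact that the prolog and epilog themselves complete some iterations.

```c
/*
 * Simplified cycle estimate for a software-pipelined loop:
 * a fixed prolog and epilog cost plus kernel_cycles per kernel pass.
 * All inputs are hypothetical values supplied by the caller.
 */
unsigned long pipelined_cycles(unsigned long prolog_cycles,
                               unsigned long kernel_cycles,
                               unsigned long kernel_passes,
                               unsigned long epilog_cycles)
{
    return prolog_cycles + kernel_cycles * kernel_passes + epilog_cycles;
}
```

With an assumed 8-cycle prolog, 8-cycle epilog, and a two-cycle kernel, 4 kernel passes cost 24 cycles, of which 16 are overhead. At 10,000 passes the total is 20,016 cycles and the same 16 cycles of overhead are negligible.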

In the two cycles the piped kernel takes to execute, there are a lot of things going on. The right-hand column in the assembly listing indicates which iteration each instruction is working on. Each "@" symbol is an iteration count. So, in this kernel, line 44 is performing a branch for iteration n+2, lines 45 and 46 are performing loads for iteration n+4, line 48 is storing a result for iteration n, line 49 is performing a multiply for iteration n+2, and line 50 is performing a subtraction for iteration n+3, all in two cycles! The epilog completes the remaining operations once the piped kernel stops executing. The compiler was able to make the loop two cycles long, which is what we predicted by looking at the inefficient version of the code.
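The idea of one kernel pass touching several iterations can be imitated by hand in C. The sketch below is not the compiler's output; it is a hypothetical three-stage version (load, multiply, store) of an element-wise multiply, with the name `multiply_pipelined` and the length parameter `n` assumed for illustration. In C the statements still run one after another; on a VLIW device the compiler would issue the stages of a kernel pass in parallel.

```c
#include <stddef.h>

/*
 * Hand-written imitation of a software-pipelined loop.
 * Each kernel pass stores iteration i, multiplies iteration i+1,
 * and loads iteration i+2. Valid only if out does not overlap
 * input1 or input2, because loads run ahead of stores.
 */
void multiply_pipelined(float *out,
                        const float *input1,
                        const float *input2,
                        size_t n)
{
    if (n < 2) {
        /* Too short to pipeline: fall back to the plain loop. */
        for (size_t i = 0; i < n; i++) {
            out[i] = input1[i] * input2[i];
        }
        return;
    }

    /* Prolog: prime the pipeline. */
    float a = input1[0];          /* load, iteration 0 */
    float b = input2[0];
    float product = a * b;        /* multiply, iteration 0 */
    a = input1[1];                /* load, iteration 1 */
    b = input2[1];

    /* Kernel: three iterations are in flight on every pass. */
    size_t i;
    for (i = 0; i + 2 < n; i++) {
        out[i] = product;         /* store, iteration i */
        product = a * b;          /* multiply, iteration i+1 */
        a = input1[i + 2];        /* load, iteration i+2 */
        b = input2[i + 2];
    }

    /* Epilog: drain the iterations still in flight. */
    out[i] = product;             /* store, iteration n-2 */
    out[i + 1] = a * b;           /* multiply and store, iteration n-1 */
}
```

The prolog and epilog exist only to fill and drain the pipeline, which is why they grow as more iterations are kept in flight. The reordering of loads ahead of stores is also the reason the compiler needs to know that the inputs do not depend on the output.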

The code size of a pipelined function becomes larger, as is obvious from the code produced. This is one of the tradeoffs for speed that the programmer must make.
