
Accelerating Compute Intensive Functions Using C


Joe is director of business development at Stretch Inc. He can be contacted at hanson@stretchinc.com.


Electronic devices, whether for embedded, industrial, consumer, entertainment, or communications applications, need to process more data in shorter periods of time than ever before. Developers often choose between a general-purpose processor and a digital signal processor (DSP). General-purpose processors have long been the architecture of choice for applications where flexibility is key, while DSPs are selected for their computational capabilities. In many cases, both qualities are needed. While increasing the clock rate of a general-purpose processor extends its performance, it also increases cost and power consumption. Alternatively, hardware acceleration or specialized auxiliary components can be added to meet the compute requirements, at the expense of increased programming complexity. In this article, I explore how to accelerate application processing using a software-configurable architecture to achieve hardware-accelerated performance in C. To illustrate, I present a real-world example of hardware-accelerating a Finite Impulse Response (FIR) filter.

Software-Configurable Processors

Most high-performance markets are in a constant state of flux, chasing evolving standards and ever-changing system requirements. ASIC-based architectures, while offering the needed performance, cannot quickly or cost-effectively meet changing application specs. As a result, many developers have begun to turn to software-configurable architectures, which merge programmable logic with an application processor.

The advantage of a software-configurable architecture is that programmable gates are organically integrated within the same pipeline as the application processor. This is in contrast to coprocessor-based architectures, which use an FPGA or ASSP to offload processing as a separate system component. The coprocessor approach introduces complex application partitioning, and requires the application processor to stall or implement complex scheduling mechanisms while waiting for results. Additionally, designing for an FPGA requires a separate development environment from the application software and often a second development team as well.

Because software-configurable processors implement programmable logic in the same pipeline as the application processor, the compiler can manage the partitioning of an algorithm into hardware and software, as well as manage any dependencies. In effect, you can treat "hardware as software," writing application code in a single development environment that results in hardware and software optimized to work together. Instead of spending significant development resources hand-tuning application code, you can highlight computational hotspots for the compiler to implement in hardware. These hotspots are implemented as extension instructions, which the application processor executes in the same pipeline as traditional instructions. The difference is that an extension instruction can represent hundreds to thousands of C instructions and computations, yet effectively executes in a single cycle.

The Basic FIR Filter

FIR filters contain computational hotspots that lend themselves to parallel implementations. While this discussion focuses on how to hardware-accelerate a C implementation of such a filter, the optimization concepts are general enough to apply to any algorithm.

A FIR filter can be described using the following equation:

$$y(n) = \sum_{t=0}^{T-1} h(t)\,x(n-t), \qquad n = 0, \ldots, N-1$$

where x(n) is the input signal, y(n) is the output signal, and h(t) is the FIR filter coefficient. For T=3, for example, y(2) = h(0)x(2) + h(1)x(1) + h(2)x(0). Figure 1 shows a useful way of visualizing the FIR function: every dot on the grid represents the product of a coefficient h(t) and a data point x(n-t), and every diagonal line connects the products that must be added together to obtain one output y(n). Listing One offers a straightforward C implementation of the FIR filter. A 64-tap filter with T=64 and N=80 requires approximately 27,230 cycles to execute using such an implementation. Given the algorithm's inherently parallel structure, it is an excellent candidate for hardware acceleration.

Key Acceleration Considerations

When determining how to exploit parallelism in an algorithm, it is important to consider several architectural characteristics:

Bus width: How data is passed between the main CPU and the accelerated hardware has a significant impact on overall efficiency and throughput because the bus limits how quickly data can be transferred, both as an input to the accelerator and as an output from it. Too narrow or slow a bus limits the flow of data to the hardware instantiation of the extension instruction and becomes a processing bottleneck; it doesn't matter how efficient the extension instruction is if it cannot be fed data fast enough to keep it fully utilized.

In general, the frequency of the bus determines the frequency with which extension instructions can be issued, while the width of the bus determines how much data each iteration of an extension instruction can process. Note that some architectures offer multiple buses between the main CPU and the accelerated hardware. For example, the S5 software-configurable processor from Stretch (the company I work for) has three read and two write ports between the application processor and the programmable fabric, each 128 bits wide, providing up to 384 read bits and 256 write bits per extension-instruction issue. Efficient utilization of these buses enables more efficient exploitation of parallelism.

When calculating bandwidth, be sure to include the bandwidth required to transfer results back to the main CPU, as well as that needed to handle intermediate results, if necessary.
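To put numbers on this, the following back-of-envelope sketch works out how many 16-bit samples the S5's quoted port widths can move per extension-instruction issue. The port counts come from the figures above; the program itself is purely illustrative.

#include <stdio.h>

int main(void)
{
    /* Port widths as quoted above for the S5: three 128-bit read
       ports and two 128-bit write ports per instruction issue. */
    const int read_bits_per_issue  = 3 * 128;   /* 384 */
    const int write_bits_per_issue = 2 * 128;   /* 256 */
    const int sample_bits = 16;                 /* short samples, as in Listing One */

    printf("16-bit samples in per issue:  %d\n",
           read_bits_per_issue / sample_bits);  /* 24 */
    printf("16-bit samples out per issue: %d\n",
           write_bits_per_issue / sample_bits); /* 16 */
    return 0;
}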

State registers: Accelerating a complex algorithm often calls for partitioning it so that the acceleration hardware computes only a portion of the result in each iteration. Because the filter executes as a partial function (that is, each extension instruction computes only a discrete segment of the overall result), there are often intermediate results that must be passed on to the next call of the function. For example, in the initial C implementation of the FIR filter, a running sum (an intermediate result) must be maintained between iterations of the main algorithm loop. As more advanced optimization techniques are applied, the number of intermediate results required increases.

To preserve these intermediate results between iterations of the accelerated function, the values are often passed back over the data bus to the main CPU, which then sends them right back to the function for the next iteration. This process consumes not only limited bus bandwidth but also precious main CPU cycles, reducing overall processing efficiency.

Such inefficiencies can be avoided through the use of state registers. State registers provide a mechanism for storing intermediate results between function iterations without involving the main CPU or passing data over the bus. As a result, architectures that utilize state registers can increase overall parallelism more efficiently. The number of available state resources further defines how extensive the exploitation of parallelism in the algorithm can be.
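As a minimal sketch of the idea, written in the Stretch C style of Listing Two (the se_sint and WR types and the bit-slice notation are taken from that listing; the instruction and variable names here are hypothetical), a variable declared outside the extension function persists in a state register inside the fabric, so the running value never crosses the bus between issues:

#include <stretch.h>

/* The running sum lives in a state register in the fabric; it
   survives from one instruction issue to the next without
   touching the bus or consuming main CPU cycles. */
static se_sint<32> running_sum;

SE_FUNC void accFunc(SE_INST ACC_ADD, WR X, WR *Y)
{
    se_sint<16> x;
    x = X(15, 0);                               /* one 16-bit input */
    running_sum = se_sint<32>(running_sum + x); /* accumulate in state */
    *Y = running_sum;                           /* result available on demand */
}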

Data size: Many architectures provide a limited subset of data-width options. For example, a Boolean variable requires only a single bit but is often implemented as at least a byte in size. While this generally makes the CPU more efficient (the cost of masking off a single bit every time the value is needed exceeds the benefit of conserving 7 bits), it can introduce inefficiencies when passing data to an extension instruction if the CPU must strip and pack bits directly. Additionally, to conserve programmable logic resources, registers must be definable at widths that eliminate unused bits. Finally, bus bandwidth is also maximized when unused bits are eliminated before data is passed to the bus. Ideally, these issues should be handled within the programmable logic, because requiring action from the CPU reduces the overall rate at which the CPU can issue extension instructions.
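For instance, here is a plain-C sketch of the packing side: eight Boolean flags squeezed into a single byte before crossing the bus, so that only 8 bits, not 8 bytes, are transferred. The function name is hypothetical, and in practice the matching unpacking would live in the programmable logic so the CPU spends no extra cycles on it.

#include <stdint.h>

/* Pack eight 0/1 flags into one byte before they cross the bus;
   the fabric unpacks them on the other side. */
static uint8_t pack_flags(const uint8_t flags[8])
{
    uint8_t packed = 0;
    for (int i = 0; i < 8; i++)
        packed |= (uint8_t)((flags[i] & 1u) << i);
    return packed;
}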

Parallel Computations

The first step in optimizing the hardware implementation of the FIR filter is to determine the number of data points that the bus can efficiently pass to the accelerated hardware. With a wide, fast data bus, multiple data values can be transferred simultaneously. For example, an extension instruction performing 8 multiply-accumulates (MACs) reduces the number of inner loop iterations by a factor of 8. Each pipelined extension instruction executes in effectively a single cycle, so there should be approximately an 8x boost in overall performance. Note that if a particular dataset does not divide evenly by 8, it can be padded with zeros to a multiple of 8 values so that a special end-of-loop instruction is not necessary.

The multiple data approach is illustrated in Figure 2, where the highlighted group of data points represents the calculations made by a single extension instruction. Each time the extension instruction is executed, it multiplies the 8 data values by the appropriate coefficients and accumulates all the products. The resulting partial sum is saved to a local state variable so that subsequent instructions can add their partial sums to it, thereby eliminating the need to pass the partial sum back to the CPU and then back to the extension instruction again.

Performing 8 MACs per extension instruction reduces the theoretical minimum number of cycles required to execute the example FIR filter to 1941, a performance gain of roughly 15x over the straight C implementation.

The optimal number of simultaneous computations depends on the number of programmable resources available. Conserving programmable resources is critical to maximizing overall system performance, as these resources are shared among all the extension instructions the CPU currently has access to. Therefore, it is important to reuse resources whenever possible.

The efficiency of reuse can be illustrated with two extension instructions, FIR_MUL and FIR_MAC (see Listing Two). The first time the FIR filter is called, the partial sum state variable needs to be reset; this is handled by FIR_MUL. In the case of a 64-tap filter, the seven subsequent iterations call FIR_MAC, which uses the existing partial sum. The primary difference between FIR_MUL and FIR_MAC is captured in the following line of code:

acc = FIR_MUL ? sum : se_sint<32>(sum + acc);

which effectively resets the partial sum if the extension instruction is called via FIR_MUL, and uses the partial sum if called via FIR_MAC.

The advantage of this type of implementation is the reuse of the programmable logic allocated for the rest of the function. Instead of doubling up the resources required for these two extension instructions, the same programmable resources are shared. The trade-off is the use of the conditional resource and the associated latency of resolving the conditional, which in this case has no material impact on performance. As a consequence, more resources are available to implement other extension instructions or to further optimize this one.

More of a Good Thing Is Better

Further optimization is possible by increasing the number of data points processed each time an extension instruction is executed. Because of the flexible implementation possible with integrated programmable logic, extension instructions can be revised to improve performance without substantial redesign. In contrast, ASIC- and FPGA-based architectures require that you completely rebuild, tune, and debug modified implementations. ASIC and FPGA designs take place in a separate and distinctly different development environment, one that requires hardware expertise and is often handled by a second development team. Consequently, any repartitioning or modification of the accelerated hardware meets resistance and must be justified before its results can be accurately evaluated through actual implementation.

When hardware is abstracted as software, the same compiler that generates application code also generates the configuration for the programmable logic. In this way, the compiler can efficiently manage resources and dependencies, maximizing throughput. Additionally, because the hardware implementation is generated automatically by the compiler, only a single development team is required. Not only does this eliminate the complexity of handing off a design to another team, but the automatic generation of the programmable configuration also gives you the freedom to test and evaluate multiple implementations without investing significant development hours in hand-optimized implementations.

For example, the FIR filter can be further optimized by reducing the number of times the extension instruction must be invoked. Not only does this eliminate the headache of scheduling extension instructions, it also reduces the overall overhead associated with calling extension instructions.

Consider increasing performance by increasing the number of simultaneous MACs to 16. Figure 3 illustrates the 4×4 block of data points that would be computed. Computing a block of data points, at least for a FIR filter, requires only 4 coefficients and 7 data points; compare this to the 16 data points and 16 coefficients required by an implementation that calculates the partial sums along a single diagonal line. This makes efficient use of bus bandwidth because more computations are performed per bit transferred. The primary trade-off is that 4 partial sums must now be stored in state registers for use in the next iteration of the extension instruction. Maximizing performance, then, is a balance among bus bandwidth, state storage, and available programmable resources.
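To make this concrete, here is a sketch of what such a 16-MAC extension function might look like, modeled on Listing Two's conventions (the se_sint types, WR bit-slicing, and the FIR_MUL/FIR_MAC reset idiom). The instruction names, bit-slice layout, and output handling are my assumptions for illustration, not Stretch's shipped code.

#include <stretch.h>

/* Four partial sums carried in state registers between issues */
static se_sint<32> acc0, acc1, acc2, acc3;

/* Performs 16 parallel MACs: a 4x4 block of products built from
   4 coefficients and 7 data samples (hypothetical sketch). */
SE_FUNC void
firFunc16(SE_INST FIR16_MUL, SE_INST FIR16_MAC, WR X, WR H, WR *Y)
{
    se_sint<16> h0, h1, h2, h3;
    se_sint<16> x0, x1, x2, x3, x4, x5, x6;
    se_sint<32> s0, s1, s2, s3;

    h0 = H(15, 0);    h1 = H(31, 16);    /* 4 coefficients */
    h2 = H(47, 32);   h3 = H(63, 48);
    x0 = X(15, 0);    x1 = X(31, 16);    /* 7 data samples: */
    x2 = X(47, 32);   x3 = X(63, 48);    /* xk = x(n-3+k)   */
    x4 = X(79, 64);   x5 = X(95, 80);
    x6 = X(111, 96);

    /* 16 MACs: each of the 4 partial sums accumulates 4 products */
    s0 = x3*h0 + x2*h1 + x1*h2 + x0*h3;   /* contributes to y(n)   */
    s1 = x4*h0 + x3*h1 + x2*h2 + x1*h3;   /* contributes to y(n+1) */
    s2 = x5*h0 + x4*h1 + x3*h2 + x2*h3;   /* contributes to y(n+2) */
    s3 = x6*h0 + x5*h1 + x4*h2 + x3*h3;   /* contributes to y(n+3) */

    acc0 = FIR16_MUL ? s0 : se_sint<32>(s0 + acc0);
    acc1 = FIR16_MUL ? s1 : se_sint<32>(s1 + acc1);
    acc2 = FIR16_MUL ? s2 : se_sint<32>(s2 + acc2);
    acc3 = FIR16_MUL ? s3 : se_sint<32>(s3 + acc3);

    /* Simplified output: a real version would pack all four finished
       sums into the 128-bit result register. */
    *Y = acc0 >> 14;
}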

Moving to 16 MACs per extension instruction doubles the performance of the 8-MAC implementation, reducing the minimum number of cycles to 981, a 30x gain over an unaccelerated implementation. Further improvement is possible by allocating more programmable resources to execute 32 MACs (see Figure 4). In this configuration, the extension instruction requires 4 coefficients and 11 data samples, and eight partial sums are kept from instruction to instruction. The theoretical minimum number of cycles for such an implementation is 501, approximately 58 times more efficient than a straight C implementation.

Loop Optimizations

While increasing the number of simultaneous computations offers the most significant means of accelerating performance, it is also possible to achieve substantial improvements through conventional loop optimizations. Consider, for example, an implementation of a FIR filter performing 8 simultaneous computations (see Listing Three). This example was chosen over the 32-computation version to keep the inner loop code fairly simple and better illustrate the optimization techniques.

In this implementation, the use of extension instructions provides a significant performance improvement for the function fir(), dropping from 27,230 cycles to 5065 cycles with no low-level assembly coding and no significant modification to the structure of the original C source algorithm. However, this implementation still wastes cycles: after invoking the last FIR_MAC instruction for a given outer loop iteration, the processor must wait a number of cycles equal to the issue rate times the extension-instruction latency before the results are available.

One way to recover these wasted cycles is by starting the computations for the next outer loop while waiting for the previous output. Because the number of processor cycles required to complete issuing the extension instruction is greater than the number of cycles required to wait for the output, the CPU is guaranteed that upon issuing the first extension instruction for the current samples, the output from previous samples will be available for the processor to use.

It is also possible to optimize across loop iterations. Given that the inner loop calculation for a particular outer loop iteration is independent of the next or previous outer loop iteration, the wait for a result can be hidden by starting the next outer loop iteration. After the last FIR_MAC is issued for outer loop iteration 0, the CPU can begin the inner loop calculations for outer loop iteration 1 without waiting for the extension instructions issued for iteration 0 to complete. In this implementation, the last FIR_MAC takes roughly 21 processor cycles before providing a result. By pipelining inner loop calculations across the outer loop in this fashion, 18 processor cycles are saved on every iteration except the last outer loop iteration, because there are no further computations with which to overlap the wait. Given that this occurs only once per fir() invocation, the overhead is negligible.

An implementation based on these techniques (Listing Four) drops the total number of cycles per call from 5065 to 3189. Thus, with minimal effort and modification to application code, the overall improvement over the straight C implementation grows to more than 8x.

Another technique for improving performance is to manually unroll the inner loop. Because the inner loop is managed using a variable (the FIR macro is written for a generic number of taps), the compiler cannot determine the number of times the loop will execute, so conditionals remain within the inner loop. Conditionals adversely affect the scheduling of extension instructions because they flush the processor pipeline, stalling the processor if it misses the cycle in which it could issue an extension instruction.
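As a plain-C illustration of why this matters, fixing the trip count turns the loop into straight-line code with no branch between MAC steps, which is exactly what keeps the extension-instruction issue slots full in Listing Five. Here mac() is just a stand-in for one FIR_MAC issue, not a Stretch API.

/* Sketch: one fully unrolled 8-tap partial sum. The caller passes x
   pointing at the newest sample, with at least 7 older samples
   stored before it (the same backward walk as Listing One). */
static int mac(int acc, short x, short h) { return acc + x * h; }

int fir_tap_sum(const short *x, const short *h)
{
    int acc = 0;
    /* Unrolled: eight straight-line steps, no loop test between them. */
    acc = mac(acc, x[0],  h[0]);
    acc = mac(acc, x[-1], h[1]);
    acc = mac(acc, x[-2], h[2]);
    acc = mac(acc, x[-3], h[3]);
    acc = mac(acc, x[-4], h[4]);
    acc = mac(acc, x[-5], h[5]);
    acc = mac(acc, x[-6], h[6]);
    acc = mac(acc, x[-7], h[7]);
    return acc;
}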

Implementing loop-unrolling optimizations (Listing Five) results in increased performance for little effort, bringing the total down to 2711 cycles per call, for an overall 10x improvement. Bringing all of these techniques together with a 32-MAC extension instruction yields a function that executes in 687 cycles, more than a 39x improvement over the original straight C code (Table 1).

By integrating programmable logic into the application processor's pipeline, software-configurable architectures provide substantial hardware acceleration of computationally intensive algorithms with little to no manual optimization. By designing "hardware as software," developers avoid the complexities of partitioning an algorithm between software and hardware. And because the details of the hardware implementation are handled by the compiler, developers can quickly design complex algorithms and evaluate the efficiency of various implementations to maximize performance and control cost.

DDJ



Listing One

void fir(short *X, short *H, short *Y, int N, int T)
{
     int n, t, acc;
     short *x, *h;
     /* Filter input: compute N outputs of a T-tap FIR */
     for (n = 0; n < N; n++) {
          x = X;                        /* newest sample for this output */
          h = H;
          acc = (*x--) * (*h++);        /* t = 0 */
          for (t = 1; t < T; t++) {
               acc += (*x--) * (*h++);  /* walk back through history */
          }
          *Y = acc >> 14;               /* Q14-style coefficient scaling */
          X++;
          Y++;
     }
}
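A hypothetical driver for Listing One, using the buffer sizes from the article's example (T=64 taps, N=80 outputs). Note that fir() walks x backward from X, so X must point at the newest sample of the first window, with T-1 older samples stored before it; the buffer and function names here are illustrative.

void fir(short *X, short *H, short *Y, int N, int T);

#define TAPS    64   /* T */
#define SAMPLES 80   /* N */

short coeff[TAPS];                 /* h(t), Q14 fixed point to match acc >> 14 */
short input[TAPS - 1 + SAMPLES];   /* T-1 history samples plus N new samples */
short output[SAMPLES];

void run_filter(void)
{
    /* The first output uses input[0 .. TAPS-1], so pass &input[TAPS-1]. */
    fir(&input[TAPS - 1], coeff, output, SAMPLES, TAPS);
}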


Listing Two
#include <stretch.h>
static se_sint<32> acc;    /* partial sum held in a state register */

/* Performs 8 parallel MACs */
SE_FUNC void
firFunc(SE_INST FIR_MUL, SE_INST FIR_MAC, WR X, WR H, WR *Y)
{
      se_sint<16> x, h;
      se_sint<32> sum;
      int i;
      sum = 0;
      for (i = 0; i < 128; i += 16) {    /* eight 16-bit lanes */
          h = H(i + 15, i);
          x = X(127 - i, 112 - i);
          sum += x * h;
      }
      acc = FIR_MUL ? sum : se_sint<32>(sum + acc);
      *Y = acc >> 14;
}


Listing Three
#include "fir8.h" 

#define   ST_DECR   1
#define   ST_INCR   0
 
void fir(short *X, short *H, short *Y, short N, short T)
{
   int n, t, t8;
   WR x, h, y;
   t8 = T/8;
   WRPUTINIT(ST_INCR, Y) ;
   for (n = 0; n < N; n++) { 
        WRGET0INIT(ST_INCR, H) ;
        X++ ;
        WRGET1INIT(ST_DECR, X) ;
        WRGET0I( &h, 16 );
        WRGET1I( &x, 16);
        FIR_MUL(x, h, &y);
        for (t = 1; t < t8; t++) {
             WRGET0I(&h, 16);
             WRGET1I(&x, 16);
             FIR_MAC(x, h, &y);
        }
        WRPUTI(y, 2) ;
  }
   WRPUTFLUSH0() ;
   WRPUTFLUSH1() ;
}


Listing Four
/* Include the Stretch instruction-specific header */
#include "fir8.h"

#define   ST_DECR   1    /* Decrement Indicator */
#define   ST_INCR   0    /* Increment Indicator */

/* Define a macro for the FIR ISEF instruction invocations */
#define FIR(H, X, h, x, t8, y)                          \
{                                                       \
     int t8m1 = (t8)-1;                                 \
     WRGET0INIT(ST_INCR, (H));                          \
     (X)++;                                             \
     WRGET1INIT(ST_DECR, (X));                          \
     WRGET0I( &(h), 8 * sizeof(short) );                \
     WRGET1I( &(x), 8 * sizeof(short) );                \
     FIR_MUL( (x), (h), &(y) );                         \
                                                        \
     for (t = 1; t < t8m1; t++)                         \
     {                                                  \
          WRGET0I( &(h), 16 );                          \
          WRGET1I( &(x), 16 );                          \
          FIR_MAC( (x), (h), &(y) );                    \
     }                                                  \
     WRGET0I( &(h), 16 );                               \
     WRGET1I( &(x), 16 );                               \
     FIR_MAC( (x), (h), &(y) );                         \
}
/*
 * - FIR using 8 multipliers in ISEF
 * - Loop optimized
 */
void fir(short *X, short *H, short *Y, short N, short T)
{
     int n, t, t8;
     WR x, h, y1, y2;
     t8 = T/8;
     WRPUTINIT(ST_INCR, Y);          /* init output stream */
     FIR(H, X, h, x, t8, y1);        /* compute Y[0] in y1 */
     /* loop ((N/2)-1) times */
     n = 0;
     do
     {
          FIR(H, X, h, x, t8, y2);   /* compute next output in y2 */
          WRPUTI(y1, 2);             /* put (y1) result */
          FIR(H, X, h, x, t8, y1);   /* compute next output in y1 */
          WRPUTI(y2, 2);             /* put (y2) result */
     } while ( ++n < ((N>>1)-1) );
     FIR(H, X, h, x, t8, y2);        /* compute Y[N-1] in y2 */
     WRPUTI(y1, 2);                  /* put (y1) result */
     WRPUTI(y2, 2);                  /* put (y2) result */
     WRPUTFLUSH0();                  /* flush output stream */
     WRPUTFLUSH1();                  /* flush output stream */
}


Listing Five
/* Include the Stretch Instruction Specific Header */
#include "fir8.h"

#define             ST_DECR             1         /* Decrement Indicator */
#define             ST_INCR             0         /* Increment Indicator */
#define             FIR(h1, h2, h3, h4, h5, h6, h7, h8, x1, x2, y1, X)   \
{                                                      \
   WRGET0I( &(h1), 8 * sizeof(short) );                \
   WRGET1I( &(x1), 16 );                               \
   X++ ;                                               \
   WRGET0I( &(h2), 16 );                               \
   WRGET1I( &(x2), 16 );                               \
   FIR_MUL( (x1), (h1), &(y1) );                       \
                                                       \
    WRGET0I( &(h3), 16 );                              \
    WRGET1I( &(x1), 16 );                              \
    FIR_MAC( (x2), (h2), &(y1) );                      \
    WRGET0I( &(h4), 16 );                              \
    WRGET1I( &(x2), 16 );                              \
    FIR_MAC( (x1), (h3), &(y1) );                      \
    WRGET0I( &(h5), 16 );                              \
    WRGET1I( &(x1), 16 );                              \
    FIR_MAC( (x2), (h4), &(y1) );                      \
    WRGET0I( &(h6), 16 );                              \
    WRGET1I( &(x2), 16 );                              \
    FIR_MAC( (x1), (h5), &(y1) );                      \
    WRGET0I( &(h7), 16 );                              \
    WRGET1I( &(x1), 16 );                              \
    FIR_MAC( (x2), (h6), &(y1) );                      \
    WRGET0I( &(h8), 16 );                              \
    WRGET1I( &(x2), 16 );                              \
    FIR_MAC( (x1), (h7), &(y1) );                      \
    WRGET1INIT(ST_DECR, X);                            \
    FIR_MAC( (x2), (h8), &(y1) );                      \
}
#define  FIR1(h1, h2, h3, h4, h5, h6, h7, h8, x1, x2, y1, y2, X)  \
{                                                      \
    WRGET1I( &(x1), 16 );                              \
    FIR_MUL( (x1), (h1), &(y2) );                      \
    WRGET1I( &(x1), 16 );                              \
    FIR_MAC( (x1), (h2), &(y2) );                      \
    WRGET1I( &(x1), 16 );                              \
    FIR_MAC( (x1), (h3), &(y2) );                      \
    WRGET1I( &(x1), 16 );                              \
    FIR_MAC( (x1), (h4), &(y2) );                      \
    WRGET1I( &(x1), 16 );                              \
    X++ ;                                              \
    FIR_MAC( (x1), (h5), &(y2) );                      \
    WRGET1I( &(x1), 16 );                              \
    WRGET1I( &(x2), 16 );                              \
    FIR_MAC( (x1), (h6), &(y2) );                      \
    WRGET1I( &(x1), 16 );                              \
    WRGET1INIT0(ST_DECR, X);                           \
    FIR_MAC( (x2), (h7), &(y2) );                      \
    WRGET1INIT1();                                     \
    WRPUTI(y1, 2);                                     \
    FIR_MAC( (x1), (h8), &(y2) );                      \
}
#define   FIR2(h1, h2, h3, h4, h5, h6, h7, h8, x1, x2, y1, y2, X)  \
{                                                      \
    WRGET1I( &(x1), 16 );                              \
    FIR_MUL( (x1), (h1), &(y1) );                      \
    WRGET1I( &(x1), 16 );                              \
    FIR_MAC( (x1), (h2), &(y1) );                      \
    WRGET1I( &(x1), 16 );                              \
    FIR_MAC( (x1), (h3), &(y1) );                      \
    WRGET1I( &(x1), 16 );                              \
    FIR_MAC( (x1), (h4), &(y1) );                      \
    WRGET1I( &(x1), 16 );                              \
    X++ ;                                              \
    FIR_MAC( (x1), (h5), &(y1) );                      \
    WRGET1I( &(x1), 16 );                              \
    WRGET1I( &(x2), 16 );                              \
    FIR_MAC( (x1), (h6), &(y1) );                      \
    WRGET1I( &(x1), 16 );                              \
    WRGET1INIT0(ST_DECR, X) ;                          \
    FIR_MAC( (x2), (h7), &(y1) );                      \
    WRGET1INIT1();                                     \
    WRPUTI(y2, 2);                                     \
    FIR_MAC( (x1), (h8), &(y1) );                      \
}
/*
 * - FIR using 8 multipliers in ISEF
 * - Loop optimized / Hand unrolled
 */
void fir(short *X, short *H, short *Y, short N, short T)
{
     int n, t, t8 ;
     WR  h1, h2, h3, h4, h5, h6, h7, h8 ;
     WR  x1, x2;    
     WR  y1;
     WR  y2;
     // (these alternative "register" declarations make no difference:)
     //  register WR y1 SE_REG("wra1") ;
     //  register WR y2 SE_REG("wra2") ;
     WRPUTINIT(ST_INCR, Y);              /* init output stream */
     WRGET0INIT(ST_INCR, H);             /* init coefficient stream */
     X++ ;
     WRGET1INIT(ST_DECR, X);             /* init input stream */
     /* compute Y[0] in y1 */
     FIR(h1, h2, h3, h4, h5, h6, h7, h8, x1, x2, y1, X) ;
     /* loop ((N/2)-1) times */    
     for (n = 0; n < ((N>>1)-1); n++)
     {
/* FIR1 writes previous output (y1) and computes current output (y2) */
          FIR1(h1, h2, h3, h4, h5, h6, h7, h8, x1, x2, y1, y2, X) ;
/* FIR2 writes previous output (y2) and computes current output (y1) */
          FIR2(h1, h2, h3, h4, h5, h6, h7, h8, x1, x2, y1, y2, X) ;
     }
     /* compute Y[N-1] in y2 and write Y[N-2] from y1 */
     FIR1(h1, h2, h3, h4, h5, h6, h7, h8, x1, x2, y1, y2, X) ;
     WRPUTI(y2, 2) ;                     /* write Y[N-1] */
     WRPUTFLUSH0() ;                     /* flush output stream */
     WRPUTFLUSH1() ;                     /* flush output stream */    
}