August 09, 2006
How to accelerate algorithms by automatically generating FPGA coprocessors
Recent advances in C-to-FPGA design methodologies and tools facilitate the rapid creation of hardware-accelerated embedded systems.
Glenn Steiner, Kunal Shenoy, Dan Isaacs (Xilinx), and David Pellerin (ImpulseC)
Today's designers are constrained by space, power, and cost, and they simply cannot afford to implement embedded designs with gigahertz-class computers. Fortunately, in embedded systems, the greatest computational requirements are frequently determined by a relatively small number of algorithms. These algorithms, identified through profiling techniques, can be rapidly converted into hardware coprocessors using design automation tools. The coprocessors can then be efficiently interfaced to the offloaded processor, yielding "gigahertz-class" performance.
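The reasoning above, that offloading a few profiled hot spots can lift the whole system, can be made concrete with Amdahl's law. The following is a minimal sketch, not taken from the case study: the function name `overall_speedup` and both of its parameters are illustrative assumptions.

```c
#include <assert.h>

/*
 * Amdahl's law: overall speedup when only part of the run time is offloaded.
 *   hot_fraction   - share of total run time spent in the profiled hot spot (0..1)
 *   kernel_speedup - how many times faster the coprocessor runs that hot spot
 * Both names are hypothetical; they do not come from any tool in this article.
 */
double overall_speedup(double hot_fraction, double kernel_speedup)
{
    assert(hot_fraction >= 0.0 && hot_fraction <= 1.0);
    assert(kernel_speedup > 0.0);

    /* Time left on the CPU plus the shortened time on the coprocessor. */
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / kernel_speedup);
}
```

With illustrative numbers, a routine that takes 90% of the run time and runs 10 times faster in hardware gives an overall speedup of about 5.3X. The remaining 10% of the code sets the ceiling, which is why profiling comes first.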
In this article, we explore code acceleration and techniques for code conversion to hardware coprocessors. We also demonstrate the process for making trade-off decisions with benchmark data through an actual image-rendering case study involving an auxiliary processor unit (APU)-based technique. The design uses an immersed PowerPC implemented in a platform FPGA.
The value of a coprocessor
The most frequently used coprocessor is the floating-point unit (FPU), the only common coprocessor that is tightly coupled to the CPU. There are no general-purpose libraries of coprocessors. Even if there were, it would still be difficult to couple a coprocessor to a CPU such as a Pentium 4. As shown in Fig 1, the Xilinx Virtex-4 FX FPGA has one or two PowerPCs, each with an APU interface. By embedding a processor within an FPGA, you now have the opportunity to implement complete processing systems of your own design within a single chip.
Fig 1. Virtex-4 FX processor with APU interface and EMAC blocks.
The integrated PowerPC with APU interface enables a tightly coupled coprocessor that can be implemented within the FPGA. Frequency requirements and pin number limits make an external coprocessor less capable. Thus, you can now create application-specific coprocessors attached directly to the PowerPC, providing significant software acceleration. Because FPGAs are reprogrammable, you can rapidly develop and test CPU-attached coprocessor solutions.
Coprocessor connection models
As a directly connected interface, the instruction-pipeline-connected accelerators can be clocked faster than a processor bus. The Xilinx implementation for this type of coprocessor connection model through the APU interface demonstrates a 10x clock cycle reduction in the control and movement of data for a typical double-operand instruction. The APU controller is also connected to the data-cache controller and can perform data load/store operations through it. Thus, the APU interface is capable of moving hundreds of millions of bytes per second, approaching DMA speeds.
Either I/O-connected accelerators or instruction-pipeline-connected accelerators can be combined with bus-connected accelerators. At the cost of additional logic, you can create an accelerator that receives commands and returns status through a fast, low-latency interface while operating on blocks of data located in bus-connected memory.
The C-to-HDL tool set described in this article is capable of implementing bus-connected and I/O-connected accelerators. It is also capable of implementing an accelerator connected to the APU interface of the PowerPC. Although the APU connection is instruction-pipeline-based, the C-to-HDL tool set implements an I/O pipeline interface with a resulting behavior more typical of an I/O-connected accelerator.
FPGA / PowerPC / APU interface
FPGAs, being reprogrammable elements, allow you to program parts and test them at any stage during the design process. If you find a design flaw, you can immediately reprogram a part. FPGAs also allow you to implement hardware computing functions that were previously cost-prohibitive. The tight coupling of a CPU pipeline to FPGA logic, as in the Virtex-4 FX PowerPC, enables you to create high-performance software accelerators.
Fig 2 is a block diagram showing the PowerPC, the integrated APU controller, and an attached coprocessor. Instructions from cache or memory are presented simultaneously to the CPU decoder and the APU controller. If the CPU recognizes the instruction, the CPU executes it. If not, the APU controller or the user-created coprocessor has the opportunity to acknowledge the instruction and execute it. Optionally, one or two operands can be passed to the coprocessor and a result or status can be returned. The APU interface can also transfer a data element with a single instruction. The data element ranges in size from one byte to four 32-bit words.
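The decode flow described above can be sketched as a toy software model. This is only an illustration of the ordering (CPU first, then the APU controller or coprocessor). The opcode ranges, the type `exec_unit`, and every function name here are hypothetical and do not reflect the real PowerPC or APU instruction encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Who ends up executing an instruction in this toy model. */
typedef enum { EXEC_CPU, EXEC_COPROCESSOR, EXEC_UNCLAIMED } exec_unit;

/* Hypothetical opcode ranges, chosen only for illustration. */
static int cpu_recognizes(uint8_t opcode)
{
    return opcode < 0x80;
}

static int coprocessor_acknowledges(uint8_t opcode)
{
    return opcode >= 0x80 && opcode < 0x90;
}

/*
 * The instruction is presented to both decoders at once; the CPU wins if it
 * recognizes the opcode, otherwise the coprocessor may acknowledge it.
 */
exec_unit dispatch(uint8_t opcode)
{
    if (cpu_recognizes(opcode)) {
        return EXEC_CPU;
    }
    if (coprocessor_acknowledges(opcode)) {
        return EXEC_COPROCESSOR;
    }
    return EXEC_UNCLAIMED; /* neither side claimed the instruction */
}

/* A coprocessor operation taking two operands and returning one result. */
uint32_t coprocessor_execute(uint8_t opcode, uint32_t a, uint32_t b)
{
    assert(coprocessor_acknowledges(opcode));
    return a * b + a; /* placeholder for a user-defined hardware function */
}
```

In this model, `dispatch(0x85)` selects the coprocessor, and `coprocessor_execute(0x85, 3, 4)` stands in for the two-operand, one-result exchange the text describes. How an unclaimed instruction is handled is left out of the sketch.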
Fig 2. PowerPC, integrated APU controller, and coprocessor.
One or more coprocessors can be attached to the APU interface through a fabric coprocessor bus (FCB). Coprocessors attached to the bus range from off-the-shelf cores, such as an FPU, to user-created coprocessors. A coprocessor can connect to the FCB for control and status operations and to a processor bus, enabling direct access to memory data blocks and DMA data passing. A simplified connection scheme, such as the Fast Simplex Link (FSL), can also be used between the FCB and coprocessor, enabling FIFO data and control communication at the cost of some performance.
To demonstrate the performance advantage of an instruction-pipeline-connected accelerator, we first implemented a design with a processor bus-connected FPU and then with an APU/FCB-connected FPU. Table 1 summarizes the performance for a finite impulse response (FIR) filter for each case.
Table 1. Non-accelerated vs. accelerated floating-point performance.
As the table shows, an FPU connected to the instruction pipeline accelerates software floating-point operations by 30X, while the APU interface provides a nearly 4X improvement over a bus-connected FPU.
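For readers unfamiliar with the benchmark kernel, the following is a minimal sketch of a single-precision FIR filter in C. It is not the authors' benchmark code, which the article does not show. The function name `fir_filter` and its parameters are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Direct-form FIR filter:
 *   output[n] = sum over k of coeffs[k] * input[n - k]
 * Samples before the start of the input are treated as zero.
 * Each multiply and add below is a floating-point operation, so on a core
 * without a hardware FPU every one of them falls back to software emulation.
 */
void fir_filter(const float *coeffs, size_t num_taps,
                const float *input, float *output, size_t num_samples)
{
    assert(coeffs != NULL && input != NULL && output != NULL);
    assert(num_taps > 0);

    for (size_t n = 0; n < num_samples; ++n) {
        float acc = 0.0f;
        /* Stop at k == n so that input[n - k] never reads before the buffer. */
        for (size_t k = 0; k < num_taps && k <= n; ++k) {
            acc += coeffs[k] * input[n - k];
        }
        output[n] = acc;
    }
}
```

The inner loop is dominated by multiply-accumulate operations, which is why the cost of each floating-point operation, and of moving operands to and from the FPU, shows up so directly in a benchmark of this kind.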