Efficient Multi-Core Partitioning
Efficient partitioning of complex algorithms such as video encoders requires a combination of the two partitioning techniques described above, plus the ability to assign tasks to processors at run time rather than compile time whenever appropriate. To overcome the limitations of data partitioning, the granularity of the blocks processed by individual processors needs to be smaller than a slice, which introduces data dependencies that must be dealt with. This granularity level may be a single macroblock (MB) or a small group of MBs. Bringing data partitioning down to this finer granularity and combining it with data pipelining creates a large pool of individual tasks that can be allocated to processors at run time, and this task pool is the key to efficient use of multi-core architecture resources.
This approach raises many challenges. How do you define tasks to minimize data dependencies? How do you decide the order in which tasks are processed, so that a new task is always available when a processor frees up, even though the processing requirements of some tasks may vary drastically with the data being processed? How do you keep the task-switching overhead -- the time between a processor completing one task and starting the next -- small? How do you make the partitioning scalable, so that a variable number of processors can be assigned to one algorithm depending on the other algorithms running in parallel and their respective processing requirements? And how do you ensure that each processor has enough fast memory to process its tasks efficiently, given that some tasks have much higher memory requirements than others and that the amount of fast memory is limited and shared across many processors?
The answers to many of these questions depend on the application being targeted, the multi-core architecture being used, and the software libraries and tools the processor vendor provides for developing and debugging code on that architecture. In this article we focus on how Cradle implemented the MPEG-4 encoder on the CT3600 chip, and in doing so offer answers to many of these questions. We start with a brief overview of the Cradle CT3600 architecture and the structure of video encoders such as MPEG-4, then discuss in detail how the MPEG-4 encoder was partitioned on the CT3600 architecture.
The CT3600 MDSP family
The Cradle CT3600 family of Multi-core DSP (MDSP) processors is a family of heterogeneous multi-core chips, accompanied by an easy-to-use multi-core programming system that provides development, debug and profiling capabilities. A single platform can be reprogrammed to support any or all of a vendor's multi-channel, multi-application products.
The Cradle CT3600 architecture has up to 8 RISC processors and 16 DSPs. It is a shared-data-memory architecture in which each element has its own instruction memory and 32-bit-wide register files. Cradle defines a group of 4 RISC processors as a Quad. Associated with each Quad are 8 DSPs, 128 KB of shared data memory, and nine 8-bit programmable I/O ports, each embedding a CPLD and state machine (Figure 1).
Figure 1: CT3616 architecture block diagram
Global resources include a PCI Bus interface and DDR-SDRAM controller with multiple DMA channels, Global Semaphores and bus-performance monitors.
Co-designed with the processor architecture is the Cradle SDK, a Software Development Kit that includes a multi-core simulator and debugger. All 24 processors and all I/Os can either be simulated or accessed directly in hardware through a JTAG or PCI interface.
Next: MPEG-4 Encoder Structure