Implementation Validation
The partitioning of an algorithm such as an MPEG-4 encoder on a multi-core architecture results in a large number of tasks running on multiple processors in parallel. Because of the data dependencies that exist between these tasks, it is important to ensure that the resulting implementation runs efficiently and that no processor remains inactive for extended periods of time waiting for tasks to complete and data to become available. This validation step is critical for any multi-core implementation but can be challenging.
Figure 3: example view of processor activity obtained using Cradle's Run-Time Analysis Tool (RTA)
In the case of the CT3600, a number of sophisticated multi-core profiling tools are available to help understand the real-time activity of the processor resources being used. These tools can automatically generate chronological views (see the example in Figure 3) or average views of processor and memory resource utilization. Such tools are critical for fine-tuning the definition of the tasks and the order in which they are processed, and for understanding the impact of varying the frame resolution or the number of channels being processed.
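To make the idea of an "average view" concrete, here is a minimal sketch of how per-processor utilization could be derived from a chronological trace. It does not reflect the actual output format or API of Cradle's RTA tool. The trace layout (processor name, busy start, busy end), the function name `utilization`, and the processor names `dsp0` and `dsp1` are all assumptions made for illustration.

```python
from collections import defaultdict


def utilization(events, window_start, window_end):
    """Return {processor: fraction of the window spent busy}.

    events is an iterable of (processor, start, end) busy intervals,
    assumed not to overlap on a given processor. This trace format is
    hypothetical, not the format produced by any real profiling tool.
    """
    busy = defaultdict(float)
    for proc, start, end in events:
        # Clip each busy interval to the analysis window.
        lo, hi = max(start, window_start), min(end, window_end)
        if hi > lo:
            busy[proc] += hi - lo
    span = window_end - window_start
    return {proc: t / span for proc, t in busy.items()}


# Hypothetical trace: dsp1 sits idle from t=25 to t=75 waiting for data.
trace = [
    ("dsp0", 0, 40), ("dsp0", 50, 100),
    ("dsp1", 0, 25), ("dsp1", 75, 100),
]
result = utilization(trace, 0, 100)
for proc in sorted(result):
    print(proc, result[proc])
```

A processor whose average utilization is well below that of its peers, as `dsp1` is in this made-up trace, is the kind of imbalance that a chronological view helps trace back to a specific data dependency.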
Conclusion
A good partitioning takes into account the specificity of the algorithm being implemented and the target architecture. A flexible architecture allows the software developer to use a combination of data and functional partitioning approaches in order to realize the full potential of the architecture. A robust set of development and profiling tools is also key in helping the software developer with the implementation process and ensuring that all resources are used efficiently in the final implementation.
In the case of the MPEG-4 encoder implemented on the CT3600, an arbitrary number of processors within one of the quads is assigned to the ME processing block. The various restrictions imposed by the ME on memory utilization make data partitioning with compile-time processor allocation the most appropriate approach. The other compute-intensive processing block, the TE, has fewer restrictions and allows for a hybrid approach combining data and functional partitioning. Processors are allocated to TE tasks at run-time, which balances the processor load by keeping all processors busy as long as there are frames to be processed, regardless of the type of video stream. This approach leads to an efficient use of the computational resources, which can be shared with other applications running in parallel on the chip. The control tasks and other non-compute-intensive tasks are handled by a small number of RISC processors. The resulting implementation is fully scalable, supporting a variable number of channels depending on the resolution and the number of processors available.
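The load-balancing effect of run-time allocation can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the scheduler used on the CT3600: tasks are independent, each has a known cost in arbitrary time units, and the function names and the cost list are hypothetical. It compares a fixed round-robin assignment decided ahead of time with an assignment in which each task goes to whichever processor becomes free first.

```python
import heapq


def static_makespan(costs, n_procs):
    """Finish time when task i is always bound to processor i % n_procs."""
    loads = [0] * n_procs
    for i, cost in enumerate(costs):
        loads[i % n_procs] += cost
    return max(loads)


def dynamic_makespan(costs, n_procs):
    """Finish time when each task goes to the processor that frees up first."""
    free_at = [0] * n_procs  # time at which each processor becomes idle
    heapq.heapify(free_at)
    for cost in costs:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + cost)
    return max(free_at)


# Hypothetical task costs: a few expensive tasks mixed with cheap ones.
costs = [8, 1, 1, 1, 8, 1, 1, 1]
print(static_makespan(costs, 2))   # 18
print(dynamic_makespan(costs, 2))  # 11
```

With these made-up costs, the fixed assignment leaves one processor idle for most of the run, while the run-time assignment finishes in 11 units, which equals the total work (22) divided evenly across the two processors. When task costs are uniform and memory placement is constrained, as described for the ME, a fixed assignment gives up little and avoids the run-time bookkeeping.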
About the authors
Laurent Bonetto is the Tools Product Marketing Manager at Cradle Technologies, focusing on defining Cradle's next-generation multi-core development tools, as well as helping application firmware developers take full advantage of the processing resources available on Cradle's multi-core DSPs. Mr. Bonetto received his M.E. degree from the Georgia Institute of Technology, GA, and from the Computer Engineering School, Supelec, France. He can be reached at [email protected].
Ram Natarajan manages codec development efforts at Cradle Technologies Inc. Ram's prior experience in the field of digital video includes stints at Equator Technologies and at IBM India. He holds an M.E. degree in Electrical Engineering and two patents. He can be reached at [email protected].
R.K. Singh is the Director of India operations for Cradle Technologies. Dr. Singh's work has mainly involved parallel processors and algorithms for parallelizing multimedia applications. He has worked with CDAC, Pune, on image processing algorithms on Transputers. He holds a PhD in Computer Vision. He can be reached at [email protected].