October 06, 2006
Multi-core MPEG-4 Video Encode Partitioning
Partitioning a video-encoding algorithm onto a multi-core architecture can utilize a variety of techniques, including data partitioning and pipelining. Cradle Technologies explains them, and how to do MPEG-4 Baseline Profile implementation on their multi-core CT3600 processor family.
Laurent Bonetto, Ram Natarajan, and Dr. R K Singh,
Cradle Technologies
Partitioning video processing algorithms onto multi-core architectures has been researched for decades, and over this time several techniques of varying efficiency have been developed to divide up the work among the processors. Let's take a closer look at some of these techniques, and see how video processing poses unique challenges to the multi-core processor.
Data Partitioning
One commonly used partitioning technique is data partitioning, which relies on the ability to process data blocks in parallel. This technique is most commonly and easily applied at a high granularity, each data block being a channel, a frame, or a slice, where a slice refers to a large area of a frame that is processed independently from the rest of the frame. Each data block is processed in parallel on a different processor. A master processor is generally responsible for ensuring synchronization among the processors and combining the results as needed.
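The slice-level form of data partitioning can be sketched as follows. This is a minimal illustration, not the CT3600 implementation: `split_into_slices`, `encode_slice`, and `encode_frame` are hypothetical names, the "encoder" is a stand-in that merely sums each pixel row, and Python threads are used only to show the master/worker structure rather than to obtain real parallel speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_slices(frame, num_slices):
    """Split a frame (a list of pixel rows) into contiguous horizontal slices."""
    base, extra = divmod(len(frame), num_slices)
    slices, start = [], 0
    for i in range(num_slices):
        height = base + (1 if i < extra else 0)  # spread leftover rows over the first slices
        slices.append(frame[start:start + height])
        start += height
    return slices

def encode_slice(slice_rows):
    """Hypothetical stand-in for a slice encoder: returns one value per row."""
    return [sum(row) for row in slice_rows]

def encode_frame(frame, num_workers):
    """Master role: hand one slice to each worker, then merge results in slice order."""
    slices = split_into_slices(frame, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(encode_slice, slices))  # map() preserves input order
    return [row for part in results for row in part]

# A tiny 6-row by 8-column synthetic frame, encoded by 4 workers.
frame = [[x + y for x in range(8)] for y in range(6)]
encoded = encode_frame(frame, num_workers=4)
```

Because each slice is handled without reference to its neighbors, the only coordination the master performs is the initial split and the ordered merge at the end.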
Applying this partitioning approach at a high granularity has the advantage of requiring only a minimal amount of inter-processor communication, since each block can be processed independently of the others. This approach is also easy to implement, as only a few modifications to the existing single-core reference code are required for it to run in parallel on all processors and produce functional output. However, this technique also has inherent problems. First, it is difficult to ensure proper load balancing among processors, as video codec algorithms have data-dependent processing requirements. For example, one slice of a frame (or video sequence) may contain scenes with little movement and detail (e.g., a uniform background such as a wall or the sky), resulting in much lower processing requirements than another slice containing high-motion scenes with finer details (e.g., the face of a person talking).
Because of these discrepancies, some processors remain idle for long periods of time while others are busy processing the most computationally intensive scenes: this imbalance translates into wasted processing resources and suboptimal performance. Second, simple data partitioning results in non-scalable implementations, since there is little flexibility in the number of blocks into which the data can be divided. For example, a 16-channel encoder may fit nicely on a 16-processor architecture by assigning one processor to each channel, but the code will need to be reworked significantly if another application needs to run in parallel on that architecture and occupies one or more processors for extended periods of time. Dividing frames into slices offers slightly more flexibility, as the size and number of slices can generally be adjusted without requiring extensive code changes. Unfortunately, dividing a frame into too many slices degrades the efficiency of the compression algorithms.
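The cost of this load imbalance can be made concrete with a small calculation. The sketch below assumes one slice per processor and uses made-up per-slice costs in arbitrary time units; `static_schedule` is a hypothetical helper, not part of any encoder.

```python
def static_schedule(slice_costs):
    """One slice per processor: the frame is done only when the slowest slice is done."""
    frame_time = max(slice_costs)
    idle = [frame_time - cost for cost in slice_costs]  # time each processor waits
    utilization = sum(slice_costs) / (frame_time * len(slice_costs))
    return frame_time, idle, utilization

# Illustrative costs: two quiet background slices, one high-motion slice, one in between.
costs = [2, 3, 10, 5]
frame_time, idle, utilization = static_schedule(costs)
print(frame_time, idle, utilization)
```

With these assumed numbers the high-motion slice alone dictates the frame time, and the four processors together spend half of their available time idle.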
Unfortunately, this second partitioning technique also introduces multiple challenges. First, the assignment of each processing block to a processor is generally done at compile time: this is the simplest approach, and sometimes the only possible one, because of the limitations of the architecture or the lack of a multi-core operating system running on it. Assigning roles to each processor at compile time does not allow for proper load balancing. For example, motion estimation processing requirements vary greatly depending on the video streams being processed: still and low-motion sequences result in lower processing requirements for the motion estimation block than high-motion ones.
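The effect of fixing roles at compile time can be sketched as follows, under the assumption that each processor is given exactly one processing block and frames flow from one block to the next. Only motion estimation is named in the text above; the other stage names, the `ROLE_OF_PROCESSOR` table, and all cost figures are hypothetical and chosen purely for illustration.

```python
# Fixed ahead of time: one processing block per processor (hypothetical layout).
ROLE_OF_PROCESSOR = {
    0: "motion_estimation",
    1: "transform_quant",
    2: "entropy_coding",
}

def pipeline_report(stage_cost):
    """Return (frame period, utilization per processor) for a static assignment.

    In steady state every stage handles one frame per period, so the period
    equals the cost of the slowest stage and faster stages wait for it.
    """
    period = max(stage_cost.values())
    utilization = {cpu: stage_cost[role] / period
                   for cpu, role in ROLE_OF_PROCESSOR.items()}
    return period, utilization

# Illustrative per-frame costs in arbitrary time units.
low_motion = {"motion_estimation": 2, "transform_quant": 4, "entropy_coding": 4}
high_motion = {"motion_estimation": 12, "transform_quant": 4, "entropy_coding": 3}

low_period, low_util = pipeline_report(low_motion)
high_period, high_util = pipeline_report(high_motion)
```

With these assumed costs, the motion estimation processor is half idle on the low-motion stream yet becomes the bottleneck on the high-motion one, and a role table that cannot change at run time has no way to shift work between the two cases.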