September 06, 2006
Development and Optimization Techniques for Multicore ProcessorsTutorial on multi-core threading techniques such as OpenMP, and how to avoid problems like data races and cache conflicts.Max Domeika, Intel Corp.
In this article Max Domeika surveys various multi-core threading techniques such as OpenMP, and discusses some of the challenges when applying threading such as data races and cache conflicts.
Microprocessor design is experiencing a shift away from a predominant
focus on pure performance to a balanced approach that optimizes for
power as well as performance.
Multi-core processors continue this trend and are capable of sharing work and executing tasks on independent execution cores concurrently. In many cases, taking full advantage of the performance benefits of these processors will require developers to thread their applications. Multi-core processors are comprised of multiple processor cores in the same package and offer increased performance, but at a cost to the embedded developer. In many cases, taking advantage of the performance benefits requires developers to thread their applications. Effectively threading an application is a nontrivial task that requires domain knowledge in multi-core architecture, parallelism fundamentals, and a threading methodology. This article is focused on three main topics: (1) an overview of multi-core processor architecture and the benefit that multi-core processors offer; (2) a primer on parallelism and discusses key topics such as scalability, parallelism & threading techniques, and, (3) a threading methodology which can be employed to effectively thread, optimize, and debug an application.
Multi-core Processors
Multi-core Architecture 1) More transistors available per die " As transistor process technology shrinks, more transistors are available for use in a processor design. It is now feasible to place multiple processor cores in the same package. 2) Clock scaling reaching limits due to power constraints " An empirical study suggests that a 1% clock increase results in a 3% power increase. Space, power, and fan noise of cooling is a key customer constraint in many market segments. Figure 1 below provides insight into why multi-core is particularly attractive today. The power and performance of three different configurations running a set of applications are compared. These configurations from left-to-right are:
1. Standard processor
over-clocked 20% The observation for configuration 1 compared to configuration 2 is that performance is increased 1.13x with a power increase of 1.73x. This compares with configuration 3 where performance is increased 1.73x with a power increase of 1.02x. On the right class of applications, multi-core processors offer higher performance with smaller increase in power compared to straight frequency scaling. Two caveats from this observation is that not all applications will be able to take equal advantage of multi-core processors and that more work is required from the developer to realize these performance gains.
Figure 2, below shows two typical multi-core processor architectures. A somewhat simple manner of envisioning a multi-core processor is as two processors connected together inside the chip packaging as opposed to a multiprocessor where multiple processors may be in a system, but connected via the system bus. The two multi-core architectures in Figure 2 differ in the location of connection between the two processor cores.
Figure 2a above shows two independent processor cores, each with a level 2 cache, and sharing system bus and connection to memory. Figure 2b below displays two independent processor cores sharing the level 2 cache, system bus, and connection to memory.
There are advantages and disadvantages of the two layouts; however the current trend is toward shared level 2 cache which offers the following advantages:
1) Size allocated
per core can be dynamically adjusted with the potential of the
total level 2 cache being available to applications that require it.
Benefits of Multi-core Processors The time it takes to process a transaction may be improved by taking advantage of multicore processors. While one processor core is updating the terminal display, another processor core could be tasked with processing the user input. A larger problem domain example would be the servers that handle backend processing of the transactions arriving from the various Point-Of-Sale terminals. By taking advantage of multi-core processor, any one server could handle a greater number of transactions in the desired response time.
Figure 3, above shows possible performance scaling improvements derived from multi-core processors. The area labeled "Supralinear" indicates a performance improvement that scales greater than the number of processors. Supralinear speedup is rare, but possible in cases where the application benefits from the collective increased cache size and layout. The first line shows the theoretical case where the performance scales with the number of processors. The "Typical scaling" curve shows what a sample application may return performance-wise as the number of processors used to run the application increases. This curve displays where there is diminishing performance as the number of processors employed is increased. Less than linear scaling is indicative of either insufficient parallelism or communication overhead of the parallelism. The goal of your effort to employ parallelism is to increase the amount of scaling possible in your application.
Parallelism Primer The implementation of parallelism in a system can take many forms; one commonly used type is shared memory parallelism which implies the following:
1) Multiple
threads execute concurrently. To illustrate the keys to effective parallelism, consider an example in the physical world of multiple workers mowing a lawn. The first consideration is how to divide the work evenly. This even division of labor has the effect of keeping each worker as active as possible. Second, the workers should each have their own lawn mower; not doing so would significantly reduce the effectiveness of the multiple workers. Finally, access to items such as the fuel can and clipping container needs to be coordinated. The keys to parallelism illustrated through this example are generalized as follows:
1) Identify the
concurrent work.
Scalability Scalability with regards to parallelism is a measure of how the performance of an application increases as the number of processor cores in the system increases. Amdahl's law provides the key insight into determining the scalability of an application. In brief, Amdahl's law states that the performance benefit of parallelism is limited by the amount of the application that must run serially, e.g. is not or cannot be parallelized. For example, take an image processing application where 300 images are to be processed with the following characteristics:
1) 10
minutes of
initialization that must run serially Table 1 below shows the calculations of scalability and efficiency for a number of different processing cores. Efficiency is a measure of how effectively the total number of processors is being employed in running the application in parallel; the goal is a 100% measure of efficiency. In other words, to increase the benefits of parallelism and take advantage of the available processor cores as much of the application should be parallelized as possible.
Parallelism Techniques
1. No effort The first level specifies no explicit effort to take advantage of future multi-core processors. Some benefit can be reaped in this case from the ever increasing single thread processor performance being offered in successive generations of processor cores. Second, parallelism can be derived in many cases just by employing a symmetric multiprocessor and multi-core aware operating system such as Linux. System processes have the ability to run concurrently with your application process in this case. Scalability may be limited as the number of cores made available increases; there are only a limited number of tasks an operating system needs to accomplish and can accomplish in parallel. The third level is true multithreading on a symmetric multiprocessor & multi-core aware operating system. This level can take advantage of the parallelism made available at the process level (level 2) as well as parallelism that is expressed in the multiple threads in an application. In order to derive this benefit, it will be required to thread the application.
Decomposition
* Functional
decomposition - division based upon the type of work Functional decomposition attempts to find parallelism between independent steps in your application. For example, consider an intrusion detection system that performs the following checks on a packet stream:
1. Check for
scanning
attacks. As long as each step above was an independent task, it would be possible to apply functional decomposition and execute each step concurrently. Data decomposition involves finding parallelism between data items. For example, the image processing algorithm was an example of data decomposition, where each image (the data) could be processed independently of the other images. In general, it is easier to scale applications with data decomposition as opposed to functional decomposition. Lastly, it is also possible to apply both types of decomposition to a particular problem.
Parallelism Limiters
1. Data dependency - data
values that have dependencies between them. The code shown below (removing the OpenMP pragmas) has a data dependency between iterations on the sum variable and would typically limit running the code in parallel; however there are methods such as using the OpenMP reduction clause that allow parallelism.
An example of state in routines is a routine that maintains data between invocations of the routine using statically defined variables. Examples of routine with state include typical memory allocation routines, random number generators, and I/O routines. There are two general techniques for creating routines and loops that can be parallelized. Routines that are made reentrant are capable of being run in parallel. These routines would not have dependencies between other invocations of the same routines. A method of testing if a particular loop can be parallelized is to run the loop backwards in terms of its loop indexes.
Threading Techniques Table 2 below shows a comparison between explicit threading technologies and compiler-based technologies. Explicit threading tends to be more flexible than compiler-based threading in adapting to the particular needs required to parallelize an application. Explicit threading can parallelize all of the same applications that compiler-based threading can handle and more. Explicit threading tends to be more difficult to use than compiler-based threading. In compiler-based threading, the compiler handles threading of the application with input from the user coming from command line arguments and/or language directives. Explicit threading can perform both functional and data decomposition whereas compiler-based threading is more suited for data decomposition. Finally, explicit threading models such as POSIX have been ported to several operating systems. The availability of compiler-based threading is dependent on compiler support and the underlying threading support.
OpenMP Developers specify parallel regions of code by adding pragmas to the source code. In addition, these pragmas communicate other information such as properties of variables and simple synchronization. The sample code in the appendix is an OpenMP program that calculates the value of pi by summing the area under a curve. The program is very similar to the original serial version of the code except for the addition of a few lines of code. The key line is the following pragma, #pragma omp parallel for reduction (+:sum) private (x) which specifies the for loop should be executed by a team of threads, temporary partial results represented by the sum variable should be aggregated or reduced at the end of the parallel region by addition, and finally the variable x is private, meaning each thread gets its own private copy. The keys in parallelizing the code are summed up as follows:
* Identify the
concurrent work.
The concurrent work is the area calculation
encompassing different parts of the curve
Automatic Parallelization 1) Expose the trip count of loops whenever possible. The compiler has a greater chance of parallelizing loops whose trip counts are statically determinable. 2) Avoid placing function calls inside loop bodies. Function calls may have effects on the loop that cannot be determined at compile time and may prevent parallelization. 3) If available, use optimization reporting options. This option provides a summary of the compiler's analysis of every loop and in cases where a loop cannot be parallelized, a reason as to why not. This is useful in that even if the compiler cannot parallelize the loop, the developer can use the information gained in the report to identify regions for manual threading. 4) If available, adjust the command line option threshold needed for autoparallelization. The compiler estimates how much computation is occurring not occur. This can be overridden by the threshold option. Reduce use of global and static variables. These types of variables make it more difficult for the compiler to identify and parallelize code.
Explicit Threading Typically, a fair amount of modification is required to thread an existing application using explicit threading. The process for threading involves identifying regions to thread, encapsulating the code into functions, and spawning threads based upon these functions. This process makes explicit threading very amenable to functional decomposition. Other advantages over compiler-based threading include the ability to create thread pools that make it possible to queue independent and dynamic tasks without explicit thread creation. In addition, explicit threading allows threads to have priority and affinity which provides a finer level of control for performance and tuning. For example, the supra-linear speedup observed in the threading of an intrusion detection system was partially due to cache benefits made possible by processor affinity.iv Explicit threading is very general and can express many types of concurrency over compiler-based techniques, but does require some knowledge to know when it is best to apply explicit threadingv.
Threading Methodology Serial tuning is less labor intensive and can lead to dramatic performance improvements that can enhance the benefits of parallelizing if done properly. There is a caveat; some serial optimization can limit the ability to parallelize. The best way to consider if overoptimization may inhibit parallelism is to determine if your optimizations are adding the elements that inhibit parallelism defined in the previous Parallelism Limiters section. A classic example of loop optimization that limits parallelism via threading is stripmining which adds dependencies across iterations of a loop. Figure 4 below details the threading development process which is summarized by these four steps:
*Analysis " Determining which portion
of the application to thread.
Step 1: Analysis
* What areas of
the code should be targeted for threading? The target area of code can be determined by using a profiling tool to determine the hot spots in the code. A hot spot is a region of code that executes relatively more frequently than other regions of code. Examples of performance analysis tools that can provide hot spot information include GNU gprof and Intel VTune Performance Analyzer. These tools monitor the execution of an application with a representative input set and provide a correlation between source code and the frequency of execution. Once the most time consuming functions are identified, the next step is to determine if these hot spots can be effectively threaded. Some functions may be very resource intensive, but do not lend themselves to being threaded. This could be due to many reasons like a lack of parallelizable loops, or perhaps the code is so simple it cannot be split into multiple threads. If a region of code identified as a hot spot cannot be threaded, a call graph which provides a visual depiction of the function callers and callees of the application can help. If a specific hot spot is not amenable to threading, the call graph may help identify a function further up the call tree that can be threaded. Threading a function further up the call tree may improve performance by allowing multiple threads to call the hot function simultaneously. The last step in analysis is to estimate the expected speedup from threading the identified regions. For example, suppose a region of code is determined to take 10% of the application runtime. If that region was threaded to execute on a multi-core processor with 4 processor cores, the estimated overall application speedup is 1.08.
Step 2: Design
* What threading technique
should be employed to thread
the application? Guidance on answering these questions is provided in the previous Decomposition section and Threading Techniques section. Once these questions are answered, common good programming practices are employed to implement the code. One word of caution " the threading design should take into account the likelihood of a larger number of processor cores being available in the future " in other words, the design should scale. One common mistake is hard coding the application for the specific number of threads or processor cores available in the current generation of processors.
Step 3: Debug
* What
non-threading related bugs exist in the code and how will they be
addressed? The non-thread related bugs can be handled using common development and debug techniques. Thread related bugs require first an understanding of what they are and then some understanding of how to address them. Many of these are difficult to detect and require extra time and care to ensure a correctly running program. A few of the more common threading issues covered in this article, include:
* Data Race A data race occurs when two or more threads are trying to access the same resource at the same time. If the threads are not communicating effectively, it is impossible to know which thread will access the resource first. This leads to inconsistent results in the running program. For example, in a read/write data race, one thread is attempting to write to a variable at the same time another thread is trying to read the variable. The thread that is reading the variable will get a different result depending on whether or not the write has already occurred. The challenge with a data race is that it is nondeterministic. A program could run correctly one hundred times in a row, but when moved onto the customer's system, which has slightly different system properties, the threads do not line up as they did on the test system and the program fails. The technique to correct a data race is to add synchronization. One way to synchronize access to a common resource is through a critical section. Placing a critical section around a block of code alerts the threads that only one may enter that block of code at a time. This ensures that threads will access that resource in an organized fashion. Synchronization is a necessary and useful technique, but care should be taken to limit unnecessary synchronization as it will slow down performance of the application. Since only one thread is allowed to access a critical section at a time, any other threads needing to access that section are forced to wait. This means precious resources are sitting idle, negatively impacting performance. Another method of ensuring shared resources are correctly accessed is through a lock. In this case, a thread will lock a specific resource while it is using that resource, which also denies access to other threads. Two common threading errors can occur when using locks. The first is a thread stall. This happens when you have one thread that has locked a certain resource and then moves on to other work in the program without first releasing the lock. When a second thread tries to access that resource it is forced to wait for an infinite amount of time, causing a stall. A developer should ensure that threads release their locks before continuing through the program. A deadlock is similar to a stall, but occurs when using a locking hierarchy. If, for example, Thread 1 locks variable A and then wants to lock variable B while Thread 2 is simultaneously locking variable B and then trying to lock variable A, the threads are going to deadlock. Both are trying to access a variable that the other has locked. In general, you should avoid complex locking hierarchies if possible as well as insuring that locks are acquired and released in the same order. The final issue this article covers is false sharing. This is not necessarily an error in the program, but something that is likely to affect performance. False sharing occurs when two threads are manipulating data that lie on the same cache line. On an SMP system, the memory system must ensure cache coherency. This means that modifications to shared cache must be flagged to the memory system so each processor is aware the cache has been modified. When one thread has changed data on that line of cache it causes the cache line to become invalidated. The second thread will have to wait while the cache is reloaded from memory. If this happens repeatedly, for example inside a loop, it will severely affect performance. One way to detect false sharing is to sample on L2 cache misses using a profile tool that supports hardware event-based sampling. If this event occurs frequently in a threaded program, it is likely that false sharing is at fault.
Step 4: Tune
* How much of the program is
running in parallel? The amount of the program which runs in parallel is important because of overall speedup due to parallelization is limited by the amount of serialization. Using a system monitoring tool such as Perfmon or the mpstat command helps provide an estimate of this amount; the developer can observe how much and when the application is taking advantage of the processor cores. Workload balance attempts to keep equal the work shared by the multiple threads.
A tool that can help measure workload balance is the Intel Thread Profiler. Based
upon
these measurements, the developer can determine if another scheduling
method will help keep the workload balanced. The classic example is a
triangular matrix operation where a sharing scheme based upon division
of loop iterations may disadvantage threads with larger data ranges. Synchronization and lock contention is a bit harder to determine using mpstat, however some clues would be a proportion of kernel time and a large number of context switches per second on a multicore system. If synchronization time was observed to be excessive, you could analyze your code to see how to simplify or safely remove some of the synchronization. Max Domeika is a senior staff software engineer in the Developer Products Division at Intel, creating software tools targeting the Intel Architecture market. Max currently provides technical consulting for a variety of products targeting Embedded Intel Architecture.
References
To read other articles on this
issues at Embedded.com, go to More
about Multicore and Multithreading.
|
|
|||||||||||||||||||||||||||||||||||||||||||
|
|
|
|