September 24, 2009
How Theory of Constraints Can Help in Software Optimization
Core Architecture Cycle Analysis in Detail
Figure 2 gives you an idea of how to approach a cycle decomposition and performance analysis exercise, while Figure 3 is one interpretation of Figure 2 for the Core architecture, with some of the microarchitectural event names given. This is clearly not a complete breakdown of all the events, but it is nonetheless a good starting point for further analysis.
Figure 2: Performance Events Drill-Down and Software Tuning Feedback Loop
The idea behind Figure 3 is to use VTune Performance Analyzer to identify two kinds of cycles: 1) cycles in which no μops are dispatched for execution, and 2) cycles in which μops are executed but not retired (due to the speculative nature of the processor). The non-retired cycles are non-productive and make ineffective use of the execution unit.
Figure 3: One interpretation of Figure 2 with some of the microarchitectural event names.
Cycles dispatching μops can be counted with the RS_UOPS_DISPATCHED.CYCLES_ANY event, while cycles in which no μops were dispatched (stalls) can be counted with the RS_UOPS_DISPATCHED.CYCLES_NONE event. Therefore the equation given earlier in Formula 1 can be rewritten as given in Formula 2. The ratio of RS_UOPS_DISPATCHED.CYCLES_NONE to CPU_CLK_UNHALTED.CORE tells you the percentage of cycles wasted due to stalls. These very stalls can turn the execution unit of a processor into a major bottleneck. The execution unit is by definition always the bottleneck because it defines the throughput, and an application will perform only as fast as its bottleneck. It is therefore critical to identify the causes of the stall cycles and remove them where possible.
Formula 2
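The stall ratio described above can be sketched in a few lines of Python. This is only an illustration: the function name and the sample counts are hypothetical, and it assumes you have already exported the two event totals for the same measurement interval from your profiler.

```python
def stall_ratio(cycles_none: int, clk_unhalted_core: int) -> float:
    """Fraction of unhalted core cycles in which no uops were dispatched.

    cycles_none       -- total for RS_UOPS_DISPATCHED.CYCLES_NONE
    clk_unhalted_core -- total for CPU_CLK_UNHALTED.CORE
    """
    if clk_unhalted_core <= 0:
        raise ValueError("CPU_CLK_UNHALTED.CORE must be positive")
    return cycles_none / clk_unhalted_core


# Hypothetical event totals, not measured data.
sample = {
    "RS_UOPS_DISPATCHED.CYCLES_NONE": 400_000,
    "CPU_CLK_UNHALTED.CORE": 1_000_000,
}

ratio = stall_ratio(
    sample["RS_UOPS_DISPATCHED.CYCLES_NONE"],
    sample["CPU_CLK_UNHALTED.CORE"],
)
print(f"Stall cycles: {ratio:.1%} of unhalted core cycles")
```

A high ratio by itself does not say where the stalls come from; it only indicates how much there is to gain from drilling down further.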
Our goal is to determine how we can minimize the causes of the stalls and let the "bottleneck" (i.e., the execution unit, held back by stalls) do what it is designed to do. In sum, the execution unit should not sit idle and wait, whatever the reason.
Many factors contribute to stall cycles and sub-optimal usage of the execution unit. Some of them are memory accesses (e.g., cache misses), branch mispredictions (and the resulting pipeline flushes), floating-point (FP) operations (e.g., long-latency operations such as division, or FP control word changes), and μops not retiring due to the out-of-order (OOO) engine.
Some of the key events that are used in breaking down the stall cycles in Figure 3 are given below. VTune can help to sample these events not only on your application but also on the entire system.
Even though the OOO engine takes care of small stall penalties (usually anything less than 10 cycles), it can be a good exercise to identify the locations of these events to find a correlation with the un-dispatched cycles.
As mentioned above, non-productive cycles (non-retired instructions) utilize the execution unit unnecessarily; thus, identifying and eliminating those instructions is crucial. Although there isn't a direct event to measure the cycles associated with non-retiring μops, Formulas 3 and 4 can be used to estimate non-retired cycles by using existing performance events.
RS_UOPS_DISPATCHED - (UOPS_RETIRED.ANY + UOPS_RETIRED.FUSED - UOPS_RETIRED.MACRO_FUSION)
Formula 3
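As a minimal sketch of the Formula 3 estimate, the snippet below computes the number of dispatched-but-not-retired μops from four event totals. It assumes that UOPS_RETIRED.MACRO_FUSION is subtracted inside the parentheses; the function name and the sample counts are hypothetical and serve only to show the arithmetic.

```python
def non_retired_uops(
    dispatched: int,
    retired_any: int,
    retired_fused: int,
    retired_macro_fusion: int,
) -> int:
    """Estimate uops that were dispatched but never retired.

    dispatched           -- total for RS_UOPS_DISPATCHED
    retired_any          -- total for UOPS_RETIRED.ANY
    retired_fused        -- total for UOPS_RETIRED.FUSED
    retired_macro_fusion -- total for UOPS_RETIRED.MACRO_FUSION
    """
    # Assumption: the macro-fusion count is subtracted from the retired sum.
    retired = retired_any + retired_fused - retired_macro_fusion
    return dispatched - retired


# Hypothetical event totals, not measured data.
estimate = non_retired_uops(
    dispatched=1_200_000,
    retired_any=900_000,
    retired_fused=150_000,
    retired_macro_fusion=50_000,
)
print(f"Estimated non-retired uops: {estimate}")
```

Because the events are sampled, the result is an estimate; a small negative value on real data would point to sampling noise rather than a meaningful count.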
Formula 4