Increasing the Efficiency of Out-of-order Processing
A traditional challenge in speeding up memory access is the ambiguity of memory dependences: when instructions are scheduled, the processor often can't yet tell whether a load and an earlier store touch the same address. This ambiguity is one of the main sources of latency in out-of-order processing.
Advanced memory disambiguation resolves this by giving execution cores the built-in intelligence to speculatively load data for instructions that are about to execute--before all previous store instructions have completed.
In implementations without memory disambiguation, each load instruction that needs to read data from memory must wait until all previous store instructions have completed before it can read that data. Loads can't be rescheduled ahead of stores because the microprocessor doesn't know whether a pending store will write the very location the load is about to read. Yet in many cases, a load doesn't depend on any previous store.
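The hazard described above can be illustrated with a minimal sketch. The functions below are purely illustrative (not a real pipeline model): memory is an array, and hoisting a load above an earlier store is only safe when the two access different locations.

```python
# Minimal sketch of why loads can't blindly be hoisted above stores.
# Names and structure here are illustrative, not a real pipeline model.

def run_in_order(mem, store_idx, load_idx):
    """Store to mem[store_idx], then load mem[load_idx] -- the safe order."""
    mem[store_idx] = 42       # earlier store
    return mem[load_idx]      # later load

def run_load_first(mem, store_idx, load_idx):
    """Hoist the load above the store -- only safe when the indices differ."""
    value = mem[load_idx]     # speculative early load
    mem[store_idx] = 42       # earlier store completes afterward
    return value

mem = [0] * 8
# No aliasing: both orders return the same value, so reordering was safe.
assert run_in_order(list(mem), 1, 2) == run_load_first(list(mem), 1, 2)
# Aliasing (same index): the hoisted load reads stale data -- the very
# dependence violation the scheduler must guard against.
assert run_in_order(list(mem), 3, 3) == 42
assert run_load_first(list(mem), 3, 3) == 0
```

The aliasing case is exactly what makes the problem "ambiguous": at schedule time the indices are not yet known, so without extra intelligence the processor must assume the worst and serialize.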
Memory disambiguation uses intelligent algorithms to evaluate whether a load can be executed ahead of a preceding store. If the processor speculates that the load is independent of the pending stores, the load is scheduled before them. The processor spends less time waiting and more time processing. To avoid putting additional demands on the platform, this speculation is done during periods when the system bus and memory subsystem have spare bandwidth available.
In the rare event that a speculative load turns out to conflict with an earlier store, memory disambiguation has built-in intelligence to detect the conflict, reload the correct data, and reexecute the instruction.
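The speculate-then-verify mechanism can be sketched as follows. This is a hedged, simplified model under assumed names (`execute_load`, `pending_stores`), not the actual hardware algorithm: the load executes early, the older stores' addresses are checked afterward, and a conflict triggers a replay with the correct data.

```python
# Hypothetical sketch: speculate a load past pending older stores, then
# check the store addresses once they resolve and replay on a conflict.

def execute_load(mem, load_addr, pending_stores):
    """pending_stores: (addr, value) stores that precede the load in
    program order but whose addresses were unknown at schedule time."""
    speculative = mem[load_addr]            # load executes early
    for addr, value in pending_stores:      # older stores now complete
        mem[addr] = value
    # Disambiguation check: did any older store write the loaded address?
    conflict = any(addr == load_addr for addr, _ in pending_stores)
    if conflict:
        return mem[load_addr], True         # replay: reload correct data
    return speculative, False               # speculation was correct

mem = [0] * 8
# Independent store: the early load was safe, no replay needed.
value, replayed = execute_load(mem, 5, [(2, 99)])    # value 0, no replay
# Conflicting store: conflict detected, load replayed with fresh data.
value, replayed = execute_load(mem, 5, [(5, 77)])    # value 77, replayed
```

The common case (no conflict) pays no penalty, which is why the technique wins on balance: replays are rare, while the wait states avoided are frequent.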
Memory disambiguation is a sophisticated technique that helps avoid the wait states imposed by less capable microarchitectures. The result is faster execution and more efficient use of processor resources.
Doubling the Number of Prefetchers
Microarchitectures based on the new 65-nm process also double the number of advanced prefetchers available per cache. Prefetchers do just that--they "prefetch" memory contents before the data is requested, so the data can be placed in cache and readily accessed when needed. By increasing the fraction of loads served from cache rather than main memory, these microarchitectures reduce memory latency and improve performance.
Specifically, to ensure data is where each execution core needs it, there are now two prefetchers per L1 cache and two prefetchers per L2 cache. These prefetchers detect multiple streaming and strided access patterns simultaneously, letting them ready data in the L1 cache for "just-in-time" execution. The prefetchers for the L2 cache analyze accesses from the cores to help ensure the L2 cache holds the data that the cores may need in the future.
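Stride detection, one of the patterns mentioned above, can be sketched in a few lines. This is an assumed simplification (the class name and single-stream tracking are illustrative; real hardware tracks many streams per cache): once two consecutive accesses show the same stride, the prefetcher predicts and fetches the next address.

```python
# Hedged sketch of a stride-detecting prefetcher. Illustrative only:
# real hardware tracks many concurrent streams and prefetch depths.

class StridePrefetcher:
    def __init__(self):
        self.last_addr = None
        self.last_stride = None
        self.prefetched = []     # addresses fetched into cache ahead of use

    def access(self, addr):
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                # Pattern confirmed: prefetch the predicted next address.
                self.prefetched.append(addr + stride)
            self.last_stride = stride
        self.last_addr = addr

pf = StridePrefetcher()
for addr in [100, 108, 116, 124]:    # stride-8 stream, e.g. an array walk
    pf.access(addr)
# Once the stride is confirmed at the third access, every further access
# prefetches one step ahead: pf.prefetched == [124, 132]
```

A demand load that hits a prefetched address is served from cache instead of main memory, which is exactly how these units reduce observed latency.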
The combination of advanced prefetchers and memory disambiguation delivers significantly improved execution throughput. The result is better performance through greater instruction-level parallelism.