July 25, 2008
CUDA, Supercomputing for the Masses: Part 6Rob Farber
Global memory and the CUDA profiler
Rob Farber is a senior scientist at Pacific Northwest National Laboratory. He has worked in massively parallel computing at several national laboratories and as co-founder of several startups. He can be reached at rmfarber@gmail.com.
In Part 5 of this article series on CUDA (short for "Compute Unified Device Architecture"), I discussed memory performance and the use of shared memory in reverseArray_multiblock_fast.cu. In this installment, I examine global memory using the CUDA profiler
Astute readers of this series timed the two versions of the reverse array example discussed in Part 4 and Part 5 and were puzzled about how the shared memory version is faster than the global memory version. Recall that the shared memory version, reverseArray_multiblock_fast.cu, kernel copies array data from the global memory to the shared memory, then back to global memory while the slower kernel, reverseArray_multiblock.cu, only copies data from global memory to global memory. Since global memory performance is between 100x-150x slower than shared memory, shouldn't the significantly slower global memory performance dominate the runtime of both examples? Why is the shared memory version faster?
Answering this question requires understanding more about global memory plus the use of additional tools from the CUDA development environment -- specifically the CUDA profiler. Profiling CUDA software is fast and easy, as both the text and visual versions of the profiler read hardware profile counters on CUDA-enabled devices. Enabling text profiling is as easy as setting the environmental variables that start and control the profiler. Using the visual profiler is equally easy: Just start cudaprof and start clicking in the GUI. Profiling provides valuable insight. The collection of profile events is handled entirely by hardware within CUDA enabled devices. However, profiled kernels are no longer asynchronous. Reporting of results to the host only occurs after each kernel completes, which minimizes any communications impact.
Global Memory
Understanding how to efficiently use global memory is an essential requirement to becoming an adept CUDA programmer. Following is a brief discussion about global memory that should be sufficient to understand the performance difference between reverseArray_multiblock.cu and reverseArray_multiblock_fast.cu. Future columns will, of necessity, continue to explore efficient uses of global memory. In the meantime, a detailed discussion on global memory, with illustrations, can be found in Section 5.1.2.1 of the CUDA Programming Guide.
Global memory delivers the highest memory bandwidth only when the global memory accesses can be coalesced within a half-warp so the hardware can then fetch (or store) the data in the fewest number of transactions. CUDA Compute Capability devices (1.0 and 1.1) can fetch data in a single 64-byte or 128-byte transaction. If the memory transaction cannot be coalesced, then a separate memory transaction will be issued for each thread in the half-warp, which is undesirable. The performance penalty for non-coalesced memory operations varies according to the size of the data type. The CUDA documentation provides some rough guidelines for the expected performance degradation to expect for various size data types:
Global memory access by all threads in the half-warp of a block can be coalesced into efficient memory transactions on a G80 architecture when:
Newer architectures such as the GT200 family of devices have more relaxed coalescing requirements than those just discussed. I will discuss architectural differences more deeply in a future column. For purposes here, suffice to say that if you tune your code to coalesce well on a G80 CUDA-enabled device, it will coalesce well on a GT200 device.
|
|
||||||||||||||||||||||||||||||
|
|
|
|