June 23, 2009
CUDA, Supercomputing for the Masses: Part 13Linear Memory
It is important to distinguish between "linear memory" created with cudaMalloc() and "pitch linear" memory created with cudaMallocPitch(). In a nutshell, both methods create linear memory but cudaMallocPitch() pads the allocation to get best performance for the memory subsystem of a given piece of hardware. A programmer can create memory with cudaMalloc() and manually set the pitch, but best memory performance may not be achieved. Aside from the ability to update, there is little difference between a texture bound to a 2D CUDA array and pitch linear memory. NVIDIA has indicated there is effectively no difference between textures bound to these two types of memory.
There are two cases to consider when binding the texture to global memory that necessitated distinguishing between pitch linear memory and linear memory:
The use of cudaMallocPitch() is generally recommended because it "knows" what pitch is appropriate for a given piece of hardware to get the best performance and is a good way to future-proof your code.
Also note that CUDA arrays are an opaque data storage mechanism, and are composed of elements, each of which has 1, 2 or 4 components that may be signed or unsigned 8-, 16- or 32-bit integers, 16-bit floats (CUDA driver only), or 32-bit floats. You can also use int2 and hiloint2double to use double-precision values. Note that CUDA arrays may get reordered for locality on the GPU.
Binding memory to a texture is quite fast and unlikely to have an appreciable impact on program performance. There are some restrictions and additional caveats:
As noted, the texture cache is optimized for 2D spatial locality with streaming behavior when bound to pitch linear memory or CUDA arrays. However, this does not tell us how to order the data to get the best performance out of the texture unit when using it as a cache. There is a good thread on the CUDA Zone forums discussing how to order 3D data for best performance. One of the suggestions is to use a Z-order curve to map multidimensional data to 1D while preserving locality. The challenging question of how to best order your data for cache locality is further complicated because the methods used by the graphics hardware may change at some point in the future to better meet customer needs.
Depending on how the global memory bound to the texture was created, there are several possible ways to fetch from the texture that might also invoke some form of texture processing by the texture.
The simplest way to fetch data from a texture is by using tex1Dfetch() because:
Use of the methods tex1D(), tex2D(), and tex3D() are more complicated because the interpretation of the texture coordinates, what processing occurs during the texture fetch, and the return value delivered by the texture fetch are all controlled by setting the texture reference's mutable (runtime) and immutable (compile time) attributes:
For more detailed information, please consult the CUDA Programming Guide.
By default, textures are referenced using floating-point coordinates in the range [0, N) where N is the size of the texture in the dimension corresponding to the coordinate. Specifying normalized texture coordinates will be used implies all references will be in the range [0,1).
The wrap mode specifies what happens for out-of-bounds addressing:
|
|
||||||||||||||||||||||||||||||
|
|
|
|