Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Parallel

CUDA, Supercomputing for the Masses: Part 5


In Part 4 of this article series on CUDA, I discussed how the execution model and kernel launch execution configuration affects the number of registers and amount of local multiprocessor resources such as shared memory. In this installment, I continue with a discussion of memory performance and the use of shared memory in reverseArray_multiblock_fast.cu.

CUDA Memory Performance

The local and global memory spaces are not cached which means each memory access to global memory (or local memory) generates an explicit memory access. So what does it cost to access (read or write, for example) each of the different memory types?

A multiprocessor takes four clock cycles to issue one memory instruction for a "warp". Accessing local or global memory incurs an additional 400 to 600 clock cycles of memory latency. As an example, the assignment operator in the code snippet below takes four clock cycles to issue a read from global memory, four clock cycles to issue a write to shared memory, and 400 to 600 clock cycles to read a float from global memory. Note: the __device__ variable type qualifier is used to denote a variable that resides in global memory (among other variable characteristics; see section 4.2.2.1 of the CUDA Programming Guide for more information). Variables of type __device__ cannot be accessed by host code.

	
__shared__ float shared[32];
__device__ float device[32];
shared[threadIdx.x] = device[threadIdx.x];

With a factor of 100x-150x difference in access time, it is no surprise that developers need to minimize accesses to global memory and reuse data within the local multiprocessor memories. The CUDA designers have done a good job with the thread scheduler; so much of the global memory latency can be transparently hidden just by specifying large numbers of blocks in the execution configuration and working as much as possible with variables with register, __shared__, and __constant__ memory types in the kernel.

Since shared memory is on chip accesses are significantly faster than accesses to global memory and the main optimization is avoiding bank conflicts. Shared memory is fast (some documentation indicates it is as fast as register accesses). However, recent large improvements in CUBLAS and CUFFT performance were achieved by avoiding shared memory in favor of registers -- so try to use registers whenever possible. CUDA shared memory is divided into equally-sized memory modules that are called memory banks. Each memory bank holds a successive 32-bit value (like an int or float) so consecutive array accesses by consecutive threads are very fast. Bank conflicts occur when multiple requests are made for data from the same bank (either the same address or multiple addresses that map to the same bank). When this happens, the hardware effectively serializes the memory operations, which forces all the threads to wait until all the memory requests are satisfied. If all threads read from the same shared memory address then a broadcast mechanism is automatically invoked and serialization is avoided. Shared memory broadcasts are an excellent and high-performance way to get data to many threads simultaneously. It is worthwhile trying to exploit this feature whenever you use shared memory.

I will discuss bank conflicts in greater detail in a future column. For the moment, suffice it to say that reverseArray_multiblock_fast.cu has no bank conflicts because consecutive threads access consecutive values.

A quick summary of local multiprocessor memory types with read/write capability follows:

  • Registers:
    • The fastest form of memory on the multi-processor.
    • Is only accessible by the thread.
    • Has the lifetime of the thread.
  • Shared Memory:
    • Can be as fast as a register when there are no bank conflicts or when reading from the same address.
    • Accessible by any thread of the block from which it was created.
    • Has the lifetime of the block.
  • Global memory:
    • Potentially 150x slower than register or shared memory -- watch out for uncoalesced reads and writes which will be discussed in the next column.
    • Accessible from either the host or device.
    • Has the lifetime of the application.
  • Local memory:
    • A potential performance gotcha, it resides in global memory and can be 150x slower than register or shared memory.
    • Is only accessible by the thread.
    • Has the lifetime of the thread.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.