CUDA Constant Memory Best Practices

Psypher

I present here some code:

#include <cstring>

__constant__ int array[1024];

__global__ void kernel1(int *d_dst) {
   int tId = threadIdx.x + blockIdx.x * blockDim.x;
   d_dst[tId] = array[tId];
}

__global__ void kernel2(int *d_dst, int *d_src) {
   int tId = threadIdx.x + blockIdx.x * blockDim.x;
   d_dst[tId] = d_src[tId];
}

int main(int argc, char **argv) {
   int *d_array;
   int *d_src;
   cudaMalloc((void**)&d_array, sizeof(int) * 1024);
   cudaMalloc((void**)&d_src, sizeof(int) * 1024);

   int *test = new int[1024];
   memset(test, 0, sizeof(int) * 1024);

   for (int i = 0; i < 1024; i++) {
     test[i] = 100;
   }

   cudaMemcpyToSymbol(array, test, sizeof(int) * 1024);
   kernel1<<< 1, 1024 >>>(d_array);

   cudaMemcpy(d_src, test, sizeof(int) * 1024, cudaMemcpyHostToDevice);
   kernel2<<<1, 32 >>>(d_array, d_src);

   delete[] test;
   cudaFree(d_array);
   cudaFree(d_src);

   return 0;
}

This simply shows constant memory and global memory usage. When executed, "kernel2" runs about 4 times faster (in wall-clock time) than "kernel1".
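The measured ratio depends on how the launches are timed; a minimal CUDA-event timing sketch is below (the helper name is made up; note also that the two kernels above use different launch configurations, so a like-for-like comparison should launch both with the same grid and block sizes):

```cuda
#include <cuda_runtime.h>

// Times one kernel launch in milliseconds using CUDA events.
// Usage: float ms = timeLaunch([&]{ kernel1<<<1, 1024>>>(d_array); });
template <typename Launch>
float timeLaunch(Launch launch) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    launch();                       // enqueue the kernel
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);     // wait for the kernel to finish
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```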

I understand from the CUDA C Programming Guide that this is because accesses to constant memory are serialized. This brings me to the idea that constant memory is best utilized when a warp accesses a single constant value such as an integer, float, or double, but that accessing an array is not beneficial at all. In other words, a warp must access a single address in order to gain any optimization/speedup from constant memory access. Is this correct?

I also want to know: if I keep a structure instead of a simple type in my constant memory, is an access to the structure by a thread within a warp considered a single memory access, or more? A structure might contain multiple simple types and an array, for example; when accessing these simple types, are those accesses also serialized or not?

My last question: in case I do have an array with constant values which needs to be accessed by different threads within a warp, should it be kept in global memory instead of constant memory for faster access? Is that correct?

Can anyone refer me to some example code where efficient constant memory usage is shown?

regards,

Robert Crovella

I can say a warp must access a single address in order to gain any optimization/speedup from constant memory access. Is this correct?

Yes, this is generally correct, and it is the principal intended usage of constant memory/the constant cache. The constant cache can serve up one quantity per SM "at a time". The precise wording from the programming guide is as follows:

The constant memory space resides in device memory and is cached in the constant cache.

A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.

The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.

An important takeaway from the text above is the desire for uniform access across a warp to achieve best performance. If a warp makes a request to __constant__ memory where different threads in the warp are accessing different locations, those requests will be serialized. Therefore if each thread in a warp is accessing the same value:

int i = array[20];

then you will have the opportunity for good benefit from the constant cache/memory. If each thread in a warp is accessing a unique quantity:

int i = array[threadIdx.x]; 

then the accesses will be serialized, and the constant data usage will be disappointing, performance-wise.
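To make the contrast concrete, the two access patterns could be written out as full kernels, sketched below (kernel and array names are made up for illustration):

```cuda
__constant__ int carr[1024];

// Uniform access: every thread in the warp reads carr[20], so the
// constant cache can broadcast one value to the entire warp.
__global__ void uniformAccess(int *out) {
    int tId = threadIdx.x + blockIdx.x * blockDim.x;
    out[tId] = carr[20];
}

// Divergent access: each thread reads a different address, so the
// warp's request is split into up to 32 serialized requests.
__global__ void divergentAccess(int *out) {
    int tId = threadIdx.x + blockIdx.x * blockDim.x;
    out[tId] = carr[tId % 1024];
}
```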

I also want to know: if I keep a structure instead of a simple type in my constant memory, is an access to the structure by a thread within a warp considered a single memory access, or more?

You can certainly put structures in constant memory. The same rules apply:

int i = constant_struct_ptr->array[20]; 

has the opportunity to benefit, but

int i = constant_struct_ptr->array[threadIdx.x];

does not. If you access the same simple type structure element across threads, that is ideal for constant cache usage.
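A common pattern along these lines is a small parameter struct that every thread reads uniformly; a sketch follows (struct and kernel names are hypothetical):

```cuda
struct Params {
    float scale;
    float offset;
};

__constant__ Params d_params;   // one struct instance in constant memory

// Every thread reads the same struct members, so each member access
// is a single broadcast through the constant cache.
__global__ void apply(float *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        data[i] = data[i] * d_params.scale + d_params.offset;
}
```

On the host side, the struct is filled in and copied with `cudaMemcpyToSymbol(d_params, &h_params, sizeof(Params));` before launching the kernel.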

My last question: in case I do have an array with constant values which needs to be accessed by different threads within a warp, should it be kept in global memory instead of constant memory for faster access? Is that correct?

Yes, if you know that in general your accesses will break the constant memory one 32-bit quantity per cycle rule, then you'll probably be better off leaving the data in ordinary global memory.
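When such per-thread-indexed data stays in global memory, one possible refinement (a sketch, not required for correctness) is to qualify the pointer `const __restrict__`, which can allow the compiler to route the loads through the read-only data cache on devices that support it:

```cuda
// Per-thread indexed read-only data is better served from global memory.
// const + __restrict__ hints that src is read-only and non-aliased, so
// loads may go through the read-only data cache where supported.
__global__ void copyFromGlobal(int *dst, const int * __restrict__ src) {
    int tId = threadIdx.x + blockIdx.x * blockDim.x;
    dst[tId] = src[tId];   // coalesced: adjacent threads read adjacent words
}
```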

There are a variety of CUDA sample codes that demonstrate usage of __constant__ data. Here are a few:

  1. graphics volumeRender
  2. imaging bilateralFilter
  3. imaging convolutionTexture
  4. finance MonteCarloGPU

and there are others.

EDIT: responding to a question in the comments, if we have a structure like this in constant memory:

struct Simple { int a; int b; int c; } s;

And we access it like this:

int p = s.a + s.b + s.c;
          ^     ^     ^
          |     |     |
cycle:    1     2     3

We will have good usage of the constant memory/cache. When the C code is compiled, under the hood it generates machine-code accesses corresponding to 1, 2, and 3 in the diagram above. Suppose access 1 occurs first. Since access 1 is to the same memory location regardless of which thread in the warp issues it, during cycle 1 all threads receive the value of s.a, taking full advantage of the cache. Likewise for accesses 2 and 3. If, on the other hand, we had:

struct Simple { int a[32]; int b[32]; int c[32]; } s;
...
int idx = threadIdx.x + blockDim.x * blockIdx.x;
int p = s.a[idx] + s.b[idx] + s.c[idx];

This would not give good usage of constant memory/cache. Instead, if this were typical of our accesses to s, we'd probably have better performance locating s in ordinary global memory.
