Why is the first cudaMalloc the only bottleneck?

user3910910 Published at Dev

user3910910

I defined this function:

void cuda_entering_function(...)
{
    StructA *host_input, *dev_input;
    StructB *host_output, *dev_output;

    host_input = (StructA*)malloc(sizeof(StructA));
    host_output = (StructB*)malloc(sizeof(StructB));
    cudaMalloc(&dev_input, sizeof(StructA));
    cudaMalloc(&dev_output, sizeof(StructB));

    // ... some more cudaMalloc() and cudaMemcpy() calls ...

    cudaKernel<<< ... >>>(dev_input, dev_output);

    ...
}

This function is called several times (about 5 to 15 times) throughout my program, and I measured the program's performance using gettimeofday().

I found that the bottleneck of cuda_entering_function() is the first cudaMalloc(), that is, the very first cudaMalloc() in the whole program. More than 95% of the total execution time of cuda_entering_function() is spent in that first call, and this holds even when I change the size of the first allocation or reorder the cudaMalloc() calls.

What is the reason, and is there any way to reduce the time taken by the first CUDA allocation?

Etienne Pellegrini

The first cudaMalloc is also responsible for initializing the device, because it is the first call to any function involving the device. That is why you take such a hit: it is a one-time overhead of using CUDA and your GPU, not a cost of the allocation itself, which is consistent with the time not changing when you resize or reorder the allocations. You should make sure that your application gains enough speedup to compensate for this overhead.
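One way to see this overhead directly is to time two identical allocations back to back. The following is a minimal sketch, not the asker's code: the helper names timed_malloc_ms and compare_first_two_mallocs are hypothetical, the 1024-byte size is arbitrary, and it assumes no other CUDA call has been made earlier in the process.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: time one cudaMalloc() of `bytes` bytes with a host
// wall clock and return the elapsed milliseconds, or -1.0 on failure.
static double timed_malloc_ms(void **ptr, size_t bytes)
{
    auto start = std::chrono::steady_clock::now();
    cudaError_t err = cudaMalloc(ptr, bytes);
    auto stop = std::chrono::steady_clock::now();

    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return -1.0;
    }
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical driver: allocate the same size twice and print both timings.
// If this is the first CUDA call in the process, the first figure also
// includes the one-time device initialization.
static void compare_first_two_mallocs()
{
    void *a = nullptr;
    void *b = nullptr;

    double first_ms  = timed_malloc_ms(&a, 1024);
    double second_ms = timed_malloc_ms(&b, 1024);

    std::printf("first  cudaMalloc: %.3f ms\n", first_ms);
    std::printf("second cudaMalloc: %.3f ms\n", second_ms);

    cudaFree(a);  // cudaFree() on a null pointer is a no-op
    cudaFree(b);
}
```

Calling compare_first_two_mallocs() at the very start of main() should show the gap between the two calls; the actual numbers depend on the GPU, driver, and system, so none are quoted here.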

In general, people call an initialization function to set up the device before doing any real work. In this answer, you can see that a call to cudaFree(0) is apparently the canonical way to do so. This sample shows the use of cudaSetDevice, which is a good habit if you ever work on machines with several CUDA-capable devices.
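A warm-up step along these lines could look like the sketch below. The function name warm_up_cuda and its bool return convention are assumptions made for illustration; only cudaSetDevice, cudaFree, and cudaGetErrorString are actual CUDA runtime calls. Note that this does not remove the initialization cost. It only moves it to a point you choose, for example program startup, so that it no longer shows up inside cuda_entering_function().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: select a device and force the CUDA runtime to
// initialize now, instead of inside the first "real" cudaMalloc().
// Returns true on success, false if either call fails.
static bool warm_up_cuda(int device)
{
    // Pick the device explicitly; useful on machines with several GPUs.
    cudaError_t err = cudaSetDevice(device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaSetDevice failed: %s\n", cudaGetErrorString(err));
        return false;
    }

    // Freeing a null pointer does no work itself, but as a runtime call
    // it triggers the one-time initialization.
    err = cudaFree(0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFree(0) failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    return true;
}
```

With this in place, you would call warm_up_cuda(0) once at the start of main(), outside the region timed with gettimeofday(), and then measure cuda_entering_function() as before.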


Edited at 2021-02-11
