CUDAの合計削減のこの実装で間違った結果が得られるのはなぜですか？

debugcn 投稿 Dev

jack87

CUDA C ++ APIを使用して実装されたvector_reductionアルゴリズムの1つのチュートリアルに取り組んでいますが、結果が（デバイス：4386.0000000ホスト：260795.000000）であるため、何が間違っているのか本当に理解できないために苦労しています。

私が使用しているコードは次のとおりです（問題のサイズは512に固定されています）。

編集：残念ながら、問題は解決されておらず、同じ結果が得られます。完全なコードを提供するコードを更新しました。目標は同じで、512要素のfloatの配列のすべての要素を合計します。

    #define NUM_ELEMENTS 512
__global__ void reduction(float *g_data, int n)
{
        __shared__ float s_data[NUM_ELEMENTS];
      int tid = threadIdx.x;
      int index = tid + blockIdx.x*blockDim.x;
      s_data[tid] = 0.0;
      if (index < n){
        s_data[tid] = g_data[index];
      }
      __syncthreads();

      for (int s = 2; s <= blockDim.x; s = s * 2){
        if ((tid%s) == 0){
          s_data[tid] += s_data[tid + s / 2];
        }
        __syncthreads();
      }

      if (tid == 0){
        g_data[blockIdx.x] = s_data[tid];
      }
}


    // includes, system
#include <cuda_runtime.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <float.h>

// includes, kernels
#include "vector_reduction_kernel.cu"

// For simplicity, just to get the idea in this MP, we're fixing the problem size to 512 elements.
#define NUM_ELEMENTS 512

////////////////////////////////////////////////////////////////////////////////
// declaration, forward
void runTest( int argc, char** argv);

float computeOnDevice(float* h_data, int array_mem_size);

extern "C" 
void computeGold( float* reference, float* idata, const unsigned int len);

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main( int argc, char** argv) 
{

cudaSetDevice(0);
    runTest( argc, argv);
    return EXIT_SUCCESS;
}

////////////////////////////////////////////////////////////////////////////////
//! Run naive scan test
////////////////////////////////////////////////////////////////////////////////
void runTest( int argc, char** argv) 
{
    int num_elements = NUM_ELEMENTS;

    const unsigned int array_mem_size = sizeof( float) * num_elements;

    // Allocate host memory to store the input data
    float* h_data = (float*) malloc( array_mem_size);

    // initialize the input data on the host to be integer values
    // between 0 and 1000
    for( unsigned int i = 0; i < num_elements; ++i) 
        h_data[i] = floorf(1000*(rand()/(float)RAND_MAX));

    // Function to compute the reference solution on CPU using a C sequential version of the algorithm
    // It is written in the file "vector_reduction_gold.cpp". The Makefile compiles this file too.
    float reference = 0.0f;  
    computeGold(&reference , h_data, num_elements);

    // Function to compute the solution on GPU using a call to a CUDA kernel (see body below)
    // The kernel is written in the file "vector_reduction_kernel.cu". The Makefile also compiles this file.
    float result = computeOnDevice(h_data, num_elements);

    // We can use an epsilon of 0 since values are integral and in a range that can be exactly represented
    float epsilon = 0.0f;
    unsigned int result_regtest = (abs(result - reference) <= epsilon);
    printf( "Test %s\n", (1 == result_regtest) ? "Ok." : "No.");
    printf( "device: %f  host: %f\n", result, reference);
    // cleanup memory
    free( h_data);
}

// Function to call the CUDA kernel on the GPU.
// Take h_data from host, copies it to device, setup grid and thread 
// dimensions, excutes kernel function, and copy result of scan back
// to h_data.
// Note: float* h_data is both the input and the output of this function.
float computeOnDevice(float* h_data, int num_elements)
{
  float* d_data = NULL;
  float result;

  // Memory allocation on device side
  cudaMalloc((void**)&d_data, sizeof(float)*num_elements);

  // Copy from host memory to device memory
  cudaMemcpy((void**)&d_data, h_data, num_elements * sizeof(float), cudaMemcpyHostToDevice );

  //int threads = (num_elements/2) + num_elements%2;
  int threads = (num_elements);
  // Invoke the kernel
  reduction<<< 1 ,threads >>>(d_data,num_elements);

  // Copy from device memory back to host memory
  cudaMemcpy(&result, d_data, sizeof(float), cudaMemcpyDeviceToHost);

  cudaFree(d_data);
  cudaDeviceReset();
  return result;
}

float computeOnDevice(float* h_data, int num_elements)
    {
      float* d_data = NULL;
      float result;

      // Memory allocation on device side
      cudaMalloc((void**)&d_data, sizeof(float)*num_elements);

      // Copy from host memory to device memory
      cudaMemcpy(d_data, h_data, num_elements * sizeof(float), cudaMemcpyHostToDevice );

      int threads = (num_elements);

      // Invoke the kernel
      reduction<<< 1 ,threads >>>(d_data,num_elements);

      // Copy from device memory back to host memory
      cudaMemcpy(&result, d_data, sizeof(float), cudaMemcpyDeviceToHost);
      cudaFree(d_data);
      cudaDeviceReset();
      return result;
    }

ロバート・クロベラ

このような質問には、完全なコードを提供する必要があります。また、適切なCUDAエラーチェックを使用し、でコードを実行する必要がありますcuda-memcheck。コードに少なくとも2つのエラーがあります。

私たちはcudaMemcpyこのようなことをしません：
```
  cudaMemcpy((void**)&d_data, h_data, num_elements * sizeof(float), cudaMemcpyHostToDevice );
```
そのはず：
```
  cudaMemcpy(d_data, h_data, num_elements * sizeof(float), cudaMemcpyHostToDevice );
```
最初のパラメーターは単なるポインターであり、ポインターからポインターへのポインターではありません。cuda-memcheckまたは、適切なCUDAエラーチェックにより、この行に注意が向けられます。
十分なスレッドを起動していません。カーネルはスレッドごとに1つの要素をロードします。問題のサイズが512の場合、512のスレッドが必要になります。これは次のとおりです。
```
  int threads = (num_elements/2) + num_elements%2;
```
あなたにそれを取得していません。あなたがそこで何を考えているのかわからない。しかし、これは512の場合にそれを修正することができます：
```
  int threads = (num_elements);
```
削減方法には、2の累乗のスレッドブロックサイズが必要です。

これが完全に機能するテストケースですcuda-memcheck。以下の使用に注意してください。

$ cat t27.cu
#include <stdio.h>
        #define NUM_ELEMENTS 512
    __global__ void reduction(float *g_data, int n)
    {
        __shared__ float s_data[NUM_ELEMENTS];
      int tid = threadIdx.x;
      int index = tid + blockIdx.x*blockDim.x;
      s_data[tid] = 0.0;
      if (index < n){
        s_data[tid] = g_data[index];
      }
      __syncthreads();

      for (int s = 2; s <= blockDim.x; s = s * 2){
        if ((tid%s) == 0){
          s_data[tid] += s_data[tid + s / 2];
        }
        __syncthreads();
      }

      if (tid == 0){
        g_data[blockIdx.x] = s_data[tid];
      }
    }

float computeOnDevice(float* h_data, int num_elements)
    {
      float* d_data = NULL;
      float result;

      // Memory allocation on device side
      cudaMalloc((void**)&d_data, sizeof(float)*num_elements);

      // Copy from host memory to device memory
      cudaMemcpy(d_data, h_data, num_elements * sizeof(float), cudaMemcpyHostToDevice );

      int threads = (num_elements);

      // Invoke the kernel
      reduction<<< 1 ,threads >>>(d_data,num_elements);

      // Copy from device memory back to host memory
      cudaMemcpy(&result, d_data, sizeof(float), cudaMemcpyDeviceToHost);
      cudaFree(d_data);
      cudaDeviceReset();
      return result;
    }


int main(){

   float *data = new float[NUM_ELEMENTS];
   for (int i = 0; i < NUM_ELEMENTS; i++) data[i] = 1;
   float r = computeOnDevice(data, NUM_ELEMENTS);
   printf(" result = %f\n" , r);
}
$ nvcc -arch=sm_35 -o t27 t27.cu
$ cuda-memcheck ./t27
========= CUDA-MEMCHECK
 result = 512.000000
========= ERROR SUMMARY: 0 errors

これがあなたが今投稿したコードの修正版です（これはいくつかの新しい/異なる方法で壊れています）、それは私にとって正しく実行されているようです：

$ cat t30.cu
    #define NUM_ELEMENTS 512
__global__ void reduction(float *g_data, int n)
{
        __shared__ float s_data[NUM_ELEMENTS];
      int tid = threadIdx.x;
      int index = tid + blockIdx.x*blockDim.x;
      s_data[tid] = 0.0;
      if (index < n){
        s_data[tid] = g_data[index];
      }
      __syncthreads();

      for (int s = 2; s <= blockDim.x; s = s * 2){
        if ((tid%s) == 0){
          s_data[tid] += s_data[tid + s / 2];
        }
        __syncthreads();
      }

      if (tid == 0){
        g_data[blockIdx.x] = s_data[tid];
      }
}


    // includes, system
#include <cuda_runtime.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <float.h>

// includes, kernels

// For simplicity, just to get the idea in this MP, we're fixing the problem size to 512 elements.
#define NUM_ELEMENTS 512

////////////////////////////////////////////////////////////////////////////////
// declaration, forward
void runTest( int argc, char** argv);

float computeOnDevice(float* h_data, int array_mem_size);

extern "C"
void computeGold( float* reference, float* idata, const unsigned int len)
{
  for (int i = 0; i<len; i++) *reference += idata[i];
};

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main( int argc, char** argv)
{

cudaSetDevice(0);
    runTest( argc, argv);
    return EXIT_SUCCESS;
}

////////////////////////////////////////////////////////////////////////////////
//! Run naive scan test
////////////////////////////////////////////////////////////////////////////////
void runTest( int argc, char** argv)
{
    int num_elements = NUM_ELEMENTS;

    const unsigned int array_mem_size = sizeof( float) * num_elements;

    // Allocate host memory to store the input data
    float* h_data = (float*) malloc( array_mem_size);

    // initialize the input data on the host to be integer values
    // between 0 and 1000
    for( unsigned int i = 0; i < num_elements; ++i)
        h_data[i] = floorf(1000*(rand()/(float)RAND_MAX));

    // Function to compute the reference solution on CPU using a C sequential version of the algorithm
    // It is written in the file "vector_reduction_gold.cpp". The Makefile compiles this file too.
    float reference = 0.0f;
    computeGold(&reference , h_data, num_elements);

    // Function to compute the solution on GPU using a call to a CUDA kernel (see body below)
    // The kernel is written in the file "vector_reduction_kernel.cu". The Makefile also compiles this file.
    float result = computeOnDevice(h_data, num_elements);

    // We can use an epsilon of 0 since values are integral and in a range that can be exactly represented
    float epsilon = 0.0f;
    unsigned int result_regtest = (abs(result - reference) <= epsilon);
    printf( "Test %s\n", (1 == result_regtest) ? "CORRECTO: Coinciden los resultados de la CPU y la GPU" : "INCORRECTO: Los resultados calculados en paralelo en la GPU no coinciden con los obtenidos secuencialmente en la CPU");
    printf( "device: %f  host: %f\n", result, reference);
    // cleanup memory
    free( h_data);
}

// Function to call the CUDA kernel on the GPU.
// Take h_data from host, copies it to device, setup grid and thread
// dimensions, excutes kernel function, and copy result of scan back
// to h_data.
// Note: float* h_data is both the input and the output of this function.
#if 0
float computeOnDevice(float* h_data, int num_elements)
{
  float* d_data = NULL;
  float result;

  // Memory allocation on device side
  cudaMalloc((void**)&d_data, sizeof(float)*num_elements);

  // Copy from host memory to device memory
  cudaMemcpy((void**)&d_data, h_data, num_elements * sizeof(float), cudaMemcpyHostToDevice );

  //int threads = (num_elements/2) + num_elements%2;
  int threads = (num_elements);
  // Invoke the kernel
  reduction<<< 1 ,threads >>>(d_data,num_elements);

  // Copy from device memory back to host memory
  cudaMemcpy(&result, d_data, sizeof(float), cudaMemcpyDeviceToHost);

  cudaFree(d_data);
  cudaDeviceReset();
  return result;
}
#endif
float computeOnDevice(float* h_data, int num_elements)
    {
      float* d_data = NULL;
      float result;

      // Memory allocation on device side
      cudaError_t err = cudaMalloc((void**)&d_data, sizeof(float)*num_elements);
      if (err != cudaSuccess) {printf("CUDA error: %s\n", cudaGetErrorString(err)); exit(0);}
      // Copy from host memory to device memory
      cudaMemcpy(d_data, h_data, num_elements * sizeof(float), cudaMemcpyHostToDevice );

      int threads = (num_elements);

      // Invoke the kernel
      reduction<<< 1 ,threads >>>(d_data,num_elements);

      // Copy from device memory back to host memory
      cudaMemcpy(&result, d_data, sizeof(float), cudaMemcpyDeviceToHost);
      cudaFree(d_data);
      err = cudaGetLastError();
      if (err != cudaSuccess) {printf("CUDA error: %s\n", cudaGetErrorString(err)); exit(0);}
      cudaDeviceReset();
      return result;
    }
$ nvcc -arch=sm_35 -o t30 t30.cu
$ cuda-memcheck ./t30
========= CUDA-MEMCHECK
Test CORRECTO: Coinciden los resultados de la CPU y la GPU
device: 260795.000000  host: 260795.000000
========= ERROR SUMMARY: 0 errors
$

コードに適切なCUDAエラーチェックをまだ追加していないため、マシンのセットアップに問題がある可能性があります。それでも問題が解決しない場合は、基本的なエラーチェックを入れているので、上記で投稿した正確なコードを実行することをお勧めします。

この記事はインターネットから収集されたものであり、転載の際にはソースを示してください。

侵害の場合は、連絡してください[email protected]

編集2021-06-2

コメントを追加

サインイン

分類Dev

Related 関連記事

記事

CUDAの合計削減のこの実装で間違った結果が得られるのはなぜですか？

CUDAの合計削減のこの実装で間違った結果が得られるのはなぜですか？

（Leetcode＃231）Power of Twoの実装で間違った結果が得られるのはなぜですか？

合計時間の計算式で間違った結果が生成されるのはなぜですか？

各桁の4乗の合計を計算すると、なぜ間違った結果が得られるのですか？

cudaに2つの数値を追加すると間違った結果が得られるのはなぜですか？

この計算で間違った結果が生成されるのはなぜですか？

なぜ間違った出力が得られるのですか？

Haskellで16進数から12進数に変換しようとすると、間違った結果が得られるのはなぜですか？

JavaScriptのRegExpでこの結果が得られるのはなぜですか？

なぜここで異なる結果が得られたのですか？

無限ループを使用して符号なし整数を追加すると、間違った結果が得られるのはなぜですか？

合計と統合で異なる結果が得られるのはなぜですか？

統計が間違った結果を返すのはなぜですか？

この結合を理解できず、間違った結果が得られます

このSPARQLクエリが間違った結果をもたらすのはなぜですか？

ペアワイズ合計では、かなり間違った結果を得るにはいくつの項が必要ですか？

RavenDBが間違った合計結果カウントを返すのはなぜですか？

この同期されたプログラムが間違った結果を返すのはなぜですか？

12時間から24時間の変換では、間違った結果が得られます

このJavaアプリケーションで年、月、日から始まる新しい日付を作成すると、間違った結果が得られるのはなぜですか？

このHaskell式で結果が得られないのはなぜですか？

OpenMPプログラムから誤った結果が得られるのはなぜですか？

ダート-いくつかのdouble値を減算すると、間違った結果が得られます

マチンの公式を使用して円周率の値を計算すると、間違った値が得られるのはなぜですか？

LongStreamの削減と合計のパフォーマンスに違いがあるのはなぜですか？

Excel で別の日付と時刻から日付と時刻を減算すると、間違った結果が得られる

プロローグの `length / 2`実装で重複した結果が得られるのはなぜですか？

正規表現で目的の結果が得られなかったのはなぜですか

ループを実行するたびに同じ結果が得られないのはなぜですか？

このAJAXとPHPの検証で期待した結果が得られないのはなぜですか