How to associate the input data structure of a CUDA kernel with the arguments passed from PyCUDA

ted930511

I am writing a CUDA kernel in PyCUDA to convert an RGBA image to a greyscale image. Here is the PyCUDA code:

import numpy as np
import matplotlib.pyplot as plt
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
kernel = SourceModule("""
#include <stdio.h>
__global__ void rgba_to_greyscale(const uchar4* const rgbaImage,
                   unsigned char* const greyImage,
                   int numRows, int numCols)
{
  int y = threadIdx.y+ blockIdx.y* blockDim.y;
  int x = threadIdx.x+ blockIdx.x* blockDim.x;
  // x indexes columns and y indexes rows of the row-major image
  if (x < numCols && y < numRows) {
    int index = numCols*y + x;
    uchar4 color = rgbaImage[index];
    unsigned char grey = (unsigned char)(0.299f*color.x+ 0.587f*color.y + 
    0.114f*color.z);
    greyImage[index] = grey;
 }
}
""")

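The problem, however, is how to associate uchar4* with a numpy array. I know I could modify my kernel function to accept int* or float* and make it work, but I would like to know how to make the kernel function above work in pycuda.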

Here is the host code.

def gpu_rgb2gray(image):
    shape = image.shape
    n_rows, n_cols, _ = np.array(shape, dtype=int)
    image_gray = np.empty((n_rows, n_cols), dtype=int)
    ## HERE is the confusing part: how to rearrange image to match uchar4*?
    image = image.reshape(1, -1, 4)
    # Get kernel function
    rgba2gray = kernel.get_function("rgba_to_greyscale")
    # Define block, grid and compute
    blockDim = (32, 32, 1) # 1024 threads in total
    dx, mx = divmod(shape[1], blockDim[0])
    dy, my = divmod(shape[0], blockDim[1])
    gridDim = ((dx + (mx>0)), (dy + (my>0)), 1)
    # Kernel function
    # HERE doesn't work because of mismatch
    rgba2gray (
        cuda.In(image), cuda.Out(image_gray), n_rows, n_cols,
        block=blockDim, grid=gridDim)
    return image_gray

Does anyone have an idea? Thanks!

看门人

The gpuarray class has native support for CUDA's built-in vector types (including uchar4).
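
To make the memory layout concrete, here is a small NumPy-only sketch. It assumes that `gpuarray.vec.uchar4` behaves like a 4-byte structured dtype with `uint8` fields `x`, `y`, `z` and `w`. The `uchar4_like` dtype and the `as_uchar4` helper are hypothetical names made up for illustration; they are not part of the PyCUDA API.

```python
import numpy as np

# Assumption: stand-in for gpuarray.vec.uchar4 (four uint8 fields, 4 bytes total)
uchar4_like = np.dtype([("x", np.uint8), ("y", np.uint8),
                        ("z", np.uint8), ("w", np.uint8)])

def as_uchar4(image):
    """View an (n_rows, n_cols, 4) uint8 image as one 4-byte record per pixel."""
    image = np.ascontiguousarray(image, dtype=np.uint8)
    if image.ndim != 3 or image.shape[2] != 4:
        raise ValueError("expected an (n_rows, n_cols, 4) RGBA image")
    # The view reinterprets the same bytes; the view itself copies no data
    return image.view(uchar4_like).reshape(image.shape[:2])

rgba = np.arange(2 * 3 * 4, dtype=np.uint8).reshape(2, 3, 4)
pixels = as_uchar4(rgba)
print(pixels.shape, pixels.dtype.itemsize)
print(pixels[1, 2])
```

Under that assumption, a contiguous RGBA image already has the byte layout a `uchar4*` kernel argument expects, so only the dtype on the device side needs to change.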

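So you can create a gpuarray instance with the correct dtype for the kernel, copy the host image into that gpuarray through the buffer interface, and then pass the gpuarray as the kernel input argument. As an example (if I have understood your code correctly), something like this should work: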

import pycuda.gpuarray as gpuarray

....

def gpu_rgb2gray(image):
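    # Keep only (n_rows, n_cols): each uchar4 element already holds all four channels
    shape = image.shape[:2]
    image_rgb = gpuarray.empty(shape, dtype=gpuarray.vec.uchar4)
    # The host image must be C-contiguous with 4 bytes per pixel
    cuda.memcpy_htod(image_rgb.gpudata, np.ascontiguousarray(image))
    image_gray = gpuarray.empty(shape, dtype=np.uint8)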

    # Get kernel function
    rgba2gray = kernel.get_function("rgba_to_greyscale")
    # Define block, grid and compute
    blockDim = (32, 32, 1) # 1024 threads in total
    dx, mx = divmod(shape[1], blockDim[0])
    dy, my = divmod(shape[0], blockDim[1])
    gridDim = ((dx + (mx>0)), (dy + (my>0)), 1)
    rgba2gray(image_rgb, image_gray, np.int32(shape[0]), np.int32(shape[1]),
              block=blockDim, grid=gridDim)

    # Copy the result back to the host, then upcast uint8 to int
    img_gray = np.array(image_gray.get(), dtype=int)

    return img_gray

This takes an image of 32-bit unsigned integers, copies it to a uchar4 array on the GPU, and then upcasts the resulting uchar array back to integers on the host.
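
To sanity-check the GPU result, a CPU reference can be sketched in plain NumPy. `cpu_rgb2gray` is a hypothetical helper name. It assumes an (n_rows, n_cols, 4) `uint8` image and uses the same weights and truncating cast as the kernel, although float rounding on the GPU may occasionally differ by one grey level.

```python
import numpy as np

def cpu_rgb2gray(image):
    """CPU reference: same weights and truncating cast as the kernel."""
    rgb = np.asarray(image, dtype=np.float32)
    grey = (np.float32(0.299) * rgb[..., 0]
            + np.float32(0.587) * rgb[..., 1]
            + np.float32(0.114) * rgb[..., 2])
    # Values stay within [0, 255], so the cast truncates like (unsigned char)
    return grey.astype(np.uint8)

sample = np.array([[[100, 0, 0, 255], [0, 100, 0, 255]],
                   [[0, 0, 100, 255], [0, 0, 0, 255]]], dtype=np.uint8)
reference = cpu_rgb2gray(sample)
print(reference)
```

The output of `gpu_rgb2gray(image)` could then be compared against this reference with a tolerance of one grey level instead of exact equality.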
