How to Speed Up Metal Code for iOS/Mac OS

debugcn 投稿 Dev

Epsilon

I'm trying to implement code in Metal that performs a 1D convolution between two vectors with lengths. I've implemented the following which works correctly

kernel void convolve(const device float *dataVector [[ buffer(0) ]],
                     const device int& dataSize [[ buffer(1) ]],
                     const device float *filterVector [[ buffer(2) ]],
                     const device int& filterSize [[ buffer(3) ]],
                     device float *outVector [[ buffer(4) ]],
                     uint id [[ thread_position_in_grid ]]) {
    int outputSize = dataSize - filterSize + 1;
    for (int i=0;i<outputSize;i++) {
        float sum = 0.0;
        for (int j=0;j<filterSize;j++) {
            sum += dataVector[i+j] * filterVector[j];
        }
        outVector[i] = sum;
    }
}

My problem is it takes about 10 times longer to process (computation + data transfer to/from GPU) the same data using Metal than in Swift on a CPU. My question is how do I replace the inner loop with a single vector operation or is there another way to speed up the above code?

warrenm

The key to taking advantage of the GPU's parallelism in this case is to let it manage the outer loop for you. Instead of invoking the kernel once for the entire data vector, we'll invoke it for each element in the data vector. The kernel function simplifies to this:

kernel void convolve(const device float *dataVector [[ buffer(0) ]],
                     const constant int &dataSize [[ buffer(1) ]],
                     const constant float *filterVector [[ buffer(2) ]],
                     const constant int &filterSize [[ buffer(3) ]],
                     device float *outVector [[ buffer(4) ]],
                     uint id [[ thread_position_in_grid ]])
{
    float sum = 0.0;
    for (int i = 0; i < filterSize; ++i) {
        sum += dataVector[id + i] * filterVector[i];
    }
    outVector[id] = sum;
}

In order to dispatch this work, we select a threadgroup size based on the thread execution width recommended by the compute pipeline state. The one tricky thing here is making sure that there's enough padding in the input and output buffers so that we can slightly overrun the actual size of the data. This does cause us to waste a small amount of memory and computation, but saves us the complexity of doing a separate dispatch just to compute the convolution for the elements at the end of the buffer.

// We should ensure here that the data buffer and output buffer each have a size that is a multiple of
// the compute pipeline's threadExecutionWidth, by padding the amount we allocate for each of them.
// After execution, we ignore the extraneous elements in the output buffer beyond the first (dataCount - filterCount + 1).

let iterationCount = dataCount - filterCount + 1
let threadsPerThreadgroup = MTLSize(width: min(iterationCount, computePipeline.threadExecutionWidth), height: 1, depth: 1)
let threadgroups = (iterationCount + threadsPerThreadgroup.width - 1) / threadsPerThreadgroup.width
let threadgroupsPerGrid = MTLSize(width: threadgroups, height: 1, depth: 1)

let commandEncoder = commandBuffer.computeCommandEncoder()
commandEncoder.setComputePipelineState(computePipeline)
commandEncoder.setBuffer(dataBuffer, offset: 0, at: 0)
commandEncoder.setBytes(&dataCount, length: MemoryLayout<Int>.stride, at: 1)
commandEncoder.setBuffer(filterBuffer, offset: 0, at: 2)
commandEncoder.setBytes(&filterCount, length: MemoryLayout<Int>.stride, at: 3)
commandEncoder.setBuffer(outBuffer, offset: 0, at: 4)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()

In my experiments, this parallelized approach runs 400-1000x faster than the serial version in the question. I'm curious to hear how it compares to your CPU implementation.

この記事はインターネットから収集されたものであり、転載の際にはソースを示してください。

侵害の場合は、連絡してください[email protected]

編集2021-05-29

コメントを追加

サインイン

分類Dev

Related 関連記事

記事

How to Speed Up Metal Code for iOS/Mac OS

How to Speed Up Metal Code for iOS/Mac OS

Speed up the code

How to speed up Krita?

How to speed up UIImage drawing?

How to speed up MATLAB integration?

Is this lack of speed-up related to the processor or the code?

How to speed up my memory scan program?

how to speed up cross_val_score?

How to speed up intersection of dict of sets in Python

How to speed up JavaScript image loading

How to speed up nested for loop in python

how to speed up uniqBy method in javascript?

How to speed up the deployment to a local glassfish instance?

How do I speed up this for loop in r

How to speed up jQuery on click on the iPad?

How to speed up the restlet client to get response?

How to speed up the web service in Android?

How to speed up functions on numpy arrays

How to speed up (fasta) subsampling program for Python?

How to speed up apply method with lambda in pandas with datetime

How can I speed up nearest neighbor search with python?

How to speed up row deleting function in Google App Script

How do I speed up this nested for loop in Python?

How to speed up debugging C# Azure Function locally? Is

How to speed up query execution in oracle when the tables are joined

How to speed up query execution in oracle when the tables are joined

How to speed up Pandas apply function to create a new column in the dataframe?

How do I speed up scrolling in Windows 7?

How to speed up 'for loop' for searching Pixel value in Image in python?

How can I speed up a Group By statement with multiple Joins?