How to Speed Up Metal Code for iOS/Mac OS

Epsilon

I'm trying to implement code in Metal that performs a 1D convolution between two vectors, a data vector and a filter vector, whose lengths are passed in as parameters. I've implemented the following, which works correctly:

kernel void convolve(const device float *dataVector [[ buffer(0) ]],
                     const device int& dataSize [[ buffer(1) ]],
                     const device float *filterVector [[ buffer(2) ]],
                     const device int& filterSize [[ buffer(3) ]],
                     device float *outVector [[ buffer(4) ]],
                     uint id [[ thread_position_in_grid ]]) {
    int outputSize = dataSize - filterSize + 1;
    for (int i=0;i<outputSize;i++) {
        float sum = 0.0;
        for (int j=0;j<filterSize;j++) {
            sum += dataVector[i+j] * filterVector[j];
        }
        outVector[i] = sum;
    }
}

My problem is that processing the same data (computation plus data transfer to and from the GPU) takes about 10 times longer with Metal than with Swift on the CPU. How can I replace the inner loop with a single vector operation, or is there another way to speed up the code above?

warrenm

The key to taking advantage of the GPU's parallelism in this case is to let it manage the outer loop for you. Instead of invoking the kernel once for the entire data vector, we'll invoke it once per output element, with each thread using its position in the grid (id) as the index of the element it computes. The kernel function simplifies to this:

kernel void convolve(const device float *dataVector [[ buffer(0) ]],
                     const constant int &dataSize [[ buffer(1) ]],
                     const constant float *filterVector [[ buffer(2) ]],
                     const constant int &filterSize [[ buffer(3) ]],
                     device float *outVector [[ buffer(4) ]],
                     uint id [[ thread_position_in_grid ]])
{
    // Each thread computes exactly one output element, indexed by its grid position.
    float sum = 0.0;
    for (int i = 0; i < filterSize; ++i) {
        sum += dataVector[id + i] * filterVector[i];
    }
    // No bounds check: this relies on the buffers being padded to cover every dispatched thread.
    outVector[id] = sum;
}
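
To check the GPU output, it can help to have a CPU version with the same one-element-per-invocation shape. The following is a minimal sketch in plain Swift; the names convolveElement and convolveReference are hypothetical and not part of the original answer. The first function mirrors the body of the kernel for a single id, and the second simply evaluates it for every valid output index.

```swift
// Hypothetical helper: the work done by a single GPU thread, written as plain Swift.
func convolveElement(_ data: [Float], _ filter: [Float], at id: Int) -> Float {
    var sum: Float = 0
    for i in 0..<filter.count {
        sum += data[id + i] * filter[i]
    }
    return sum
}

// Hypothetical CPU reference: evaluates every valid output index in turn.
func convolveReference(_ data: [Float], _ filter: [Float]) -> [Float] {
    let outputCount = data.count - filter.count + 1
    guard outputCount > 0 else { return [] }
    return (0..<outputCount).map { convolveElement(data, filter, at: $0) }
}

let sampleData: [Float] = [1, 2, 3, 4, 5]
let sampleFilter: [Float] = [1, 0, -1]
print(convolveReference(sampleData, sampleFilter))  // prints "[-2.0, -2.0, -2.0]"
```

Because each output element depends only on its own index, the iterations are independent, which is what allows the GPU to run them as separate threads.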

In order to dispatch this work, we select a threadgroup size based on the thread execution width recommended by the compute pipeline state. The one tricky thing here is making sure that there's enough padding in the input and output buffers so that we can slightly overrun the actual size of the data. This does cause us to waste a small amount of memory and computation, but saves us the complexity of doing a separate dispatch just to compute the convolution for the elements at the end of the buffer.

// We should ensure here that the buffers are padded to cover every thread we dispatch, not just the valid outputs.
// The output buffer needs at least (threadgroups * threadsPerThreadgroup.width) elements, and the data buffer
// needs at least that many plus (filterCount - 1), because the last thread reads filterCount elements from its index.
// After execution, we ignore the extraneous elements in the output buffer beyond the first (dataCount - filterCount + 1).

let iterationCount = dataCount - filterCount + 1
let threadsPerThreadgroup = MTLSize(width: min(iterationCount, computePipeline.threadExecutionWidth), height: 1, depth: 1)
let threadgroups = (iterationCount + threadsPerThreadgroup.width - 1) / threadsPerThreadgroup.width
let threadgroupsPerGrid = MTLSize(width: threadgroups, height: 1, depth: 1)

// The kernel declares both sizes as 32-bit int, so pass Int32 values rather than Swift's 64-bit Int.
var dataSize = Int32(dataCount)
var filterSize = Int32(filterCount)

let commandEncoder = commandBuffer.makeComputeCommandEncoder()!
commandEncoder.setComputePipelineState(computePipeline)
commandEncoder.setBuffer(dataBuffer, offset: 0, index: 0)
commandEncoder.setBytes(&dataSize, length: MemoryLayout<Int32>.stride, index: 1)
commandEncoder.setBuffer(filterBuffer, offset: 0, index: 2)
commandEncoder.setBytes(&filterSize, length: MemoryLayout<Int32>.stride, index: 3)
commandEncoder.setBuffer(outBuffer, offset: 0, index: 4)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()
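
Since the dispatch launches whole threadgroups, some threads may run past the last valid output, and each of them still reads filterCount elements starting at its own index. As a rough sketch of how the required element counts could be worked out before allocating the buffers (PaddedCounts and paddedCounts are hypothetical names, and executionWidth stands in for the pipeline's threadExecutionWidth):

```swift
// Hypothetical container for the element counts a padded dispatch needs.
struct PaddedCounts {
    let threadCount: Int   // total threads launched across all threadgroups
    let outputCount: Int   // elements the output buffer must hold
    let dataCount: Int     // elements the data buffer must hold
}

// Hypothetical helper mirroring the threadgroup arithmetic used for the dispatch.
func paddedCounts(dataCount: Int, filterCount: Int, executionWidth: Int) -> PaddedCounts {
    let iterationCount = dataCount - filterCount + 1
    precondition(iterationCount > 0 && executionWidth > 0)
    let width = min(iterationCount, executionWidth)
    let threadgroups = (iterationCount + width - 1) / width
    let threadCount = threadgroups * width
    // The last thread (index threadCount - 1) reads filterCount consecutive elements,
    // so the data buffer must extend filterCount - 1 elements past the thread count.
    return PaddedCounts(threadCount: threadCount,
                        outputCount: threadCount,
                        dataCount: threadCount + filterCount - 1)
}

let counts = paddedCounts(dataCount: 101, filterCount: 5, executionWidth: 32)
print(counts.threadCount, counts.outputCount, counts.dataCount)  // prints "128 128 132"
```

The byte length for each buffer would then be the element count multiplied by MemoryLayout<Float>.stride. Zero-filling the padding keeps the extra sums well defined, although they are discarded anyway.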

In my experiments, this parallelized approach runs 400-1000x faster than the serial version in the question. I'm curious to hear how it compares to your CPU implementation.
