如何加速 iOS/Mac OS 的 Metal 代码
How to Speed Up Metal Code for iOS/Mac OS
我正在尝试在 Metal 中实现代码,该代码在两个具有长度的向量之间执行一维卷积。我已经实施了以下正常工作
kernel void convolve(const device float *dataVector [[ buffer(0) ]],
const device int& dataSize [[ buffer(1) ]],
const device float *filterVector [[ buffer(2) ]],
const device int& filterSize [[ buffer(3) ]],
device float *outVector [[ buffer(4) ]],
uint id [[ thread_position_in_grid ]]) {
int outputSize = dataSize - filterSize + 1;
for (int i=0;i<outputSize;i++) {
float sum = 0.0;
for (int j=0;j<filterSize;j++) {
sum += dataVector[i+j] * filterVector[j];
}
outVector[i] = sum;
}
}
我的问题是使用 Metal 处理(计算 + 数据传输 to/from GPU)相同数据的时间比 CPU 上的 Swift 多 10 倍。我的问题是如何用单个向量运算替换内部循环,或者是否有其他方法可以加速上述代码?
在这种情况下利用 GPU 并行性的关键是让它为您管理外循环。我们不会为整个数据向量调用一次内核,而是为数据向量中的每个元素调用它。内核函数简化为:
kernel void convolve(const device float *dataVector [[ buffer(0) ]],
const constant int &dataSize [[ buffer(1) ]],
const constant float *filterVector [[ buffer(2) ]],
const constant int &filterSize [[ buffer(3) ]],
device float *outVector [[ buffer(4) ]],
uint id [[ thread_position_in_grid ]])
{
float sum = 0.0;
for (int i = 0; i < filterSize; ++i) {
sum += dataVector[id + i] * filterVector[i];
}
outVector[id] = sum;
}
为了分派这项工作,我们 select 基于计算管道状态推荐的线程执行宽度的线程组大小。这里有一件棘手的事情是确保输入和输出缓冲区中有足够的填充,以便我们可以略微超出数据的实际大小。这确实导致我们浪费了少量的内存和计算,但为我们节省了进行单独分派的复杂性,只是为了计算缓冲区末尾元素的卷积。
// We should ensure here that the data buffer and output buffer each have a size that is a multiple of
// the compute pipeline's threadExecutionWidth, by padding the amount we allocate for each of them.
// After execution, we ignore the extraneous elements in the output buffer beyond the first (dataCount - filterCount + 1).
let iterationCount = dataCount - filterCount + 1
let threadsPerThreadgroup = MTLSize(width: min(iterationCount, computePipeline.threadExecutionWidth), height: 1, depth: 1)
let threadgroups = (iterationCount + threadsPerThreadgroup.width - 1) / threadsPerThreadgroup.width
let threadgroupsPerGrid = MTLSize(width: threadgroups, height: 1, depth: 1)
let commandEncoder = commandBuffer.computeCommandEncoder()
commandEncoder.setComputePipelineState(computePipeline)
commandEncoder.setBuffer(dataBuffer, offset: 0, at: 0)
commandEncoder.setBytes(&dataCount, length: MemoryLayout<Int>.stride, at: 1)
commandEncoder.setBuffer(filterBuffer, offset: 0, at: 2)
commandEncoder.setBytes(&filterCount, length: MemoryLayout<Int>.stride, at: 3)
commandEncoder.setBuffer(outBuffer, offset: 0, at: 4)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()
在我的实验中,这种并行化方法比问题中的串行版本运行 400-1000 倍。我很想知道它与您的 CPU 实施相比如何。
下面的代码展示了如何使用Objective-C Metal API在GPU上并行渲染编码命令(上面的线程代码只是将输出的渲染分成网格部分进行并行处理;计算仍未并行执行)。这就是您在问题中所指的内容,即使它并不完全是您想要的。我提供这个答案是为了帮助那些可能偶然发现这个问题的人,认为它会提供与并行渲染相关的答案(实际上,它没有):
- (void)drawInMTKView:(MTKView *)view
{
dispatch_async(((AppDelegate *)UIApplication.sharedApplication.delegate).cameraViewQueue, ^{
id <CAMetalDrawable> drawable = [view currentDrawable]; //[(CAMetalLayer *)view.layer nextDrawable];
MTLRenderPassDescriptor *renderPassDesc = [view currentRenderPassDescriptor];
renderPassDesc.colorAttachments[0].loadAction = MTLLoadActionClear;
renderPassDesc.colorAttachments[0].clearColor = MTLClearColorMake(0.0,0.0,0.0,1.0);
renderPassDesc.renderTargetWidth = self.texture.width;
renderPassDesc.renderTargetHeight = self.texture.height;
renderPassDesc.colorAttachments[0].texture = drawable.texture;
if (renderPassDesc != nil)
{
dispatch_semaphore_wait(self._inflight_semaphore, DISPATCH_TIME_FOREVER);
id <MTLCommandBuffer> commandBuffer = [self.metalContext.commandQueue commandBuffer];
[commandBuffer enqueue];
// START PARALLEL RENDERING OPERATIONS HERE
id <MTLParallelRenderCommandEncoder> parallelRCE = [commandBuffer parallelRenderCommandEncoderWithDescriptor:renderPassDesc];
// FIRST PARALLEL RENDERING OPERATION
id <MTLRenderCommandEncoder> renderEncoder = [parallelRCE renderCommandEncoder];
[renderEncoder setRenderPipelineState:self.metalContext.renderPipelineState];
[renderEncoder setVertexBuffer:self.metalContext.vertexBuffer offset:0 atIndex:0];
[renderEncoder setVertexBuffer:self.metalContext.uniformBuffer offset:0 atIndex:1];
[renderEncoder setFragmentBuffer:self.metalContext.uniformBuffer offset:0 atIndex:0];
[renderEncoder setFragmentTexture:self.texture
atIndex:0];
[renderEncoder drawPrimitives:MTLPrimitiveTypeTriangleStrip
vertexStart:0
vertexCount:4
instanceCount:1];
[renderEncoder endEncoding];
// ADD SECOND, THIRD, ETC. PARALLEL RENDERING OPERATION HERE
.
.
.
// SUBMIT ALL RENDERING OPERATIONS IN PARALLEL HERE
[parallelRCE endEncoding];
__block dispatch_semaphore_t block_sema = self._inflight_semaphore;
[commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> buffer) {
dispatch_semaphore_signal(block_sema);
}];
if (drawable)
[commandBuffer presentDrawable:drawable];
[commandBuffer commit];
[commandBuffer waitUntilScheduled];
}
});
}
在上面的示例中,您将为要并行执行的每个计算复制与 renderEncoder 相关的内容。在您的代码示例中,我看不出这对您有什么好处,因为一个操作似乎依赖于另一个操作。那么,您可能希望得到的最好结果就是 warrenm 提供给您的代码,尽管这并不真正符合并行渲染的条件。
我正在尝试在 Metal 中实现代码,该代码在两个具有长度的向量之间执行一维卷积。我已经实施了以下正常工作
kernel void convolve(const device float *dataVector [[ buffer(0) ]],
const device int& dataSize [[ buffer(1) ]],
const device float *filterVector [[ buffer(2) ]],
const device int& filterSize [[ buffer(3) ]],
device float *outVector [[ buffer(4) ]],
uint id [[ thread_position_in_grid ]]) {
int outputSize = dataSize - filterSize + 1;
for (int i=0;i<outputSize;i++) {
float sum = 0.0;
for (int j=0;j<filterSize;j++) {
sum += dataVector[i+j] * filterVector[j];
}
outVector[i] = sum;
}
}
我的问题是使用 Metal 处理(计算 + 数据传输 to/from GPU)相同数据的时间比 CPU 上的 Swift 多 10 倍。我的问题是如何用单个向量运算替换内部循环,或者是否有其他方法可以加速上述代码?
在这种情况下利用 GPU 并行性的关键是让它为您管理外循环。我们不会为整个数据向量调用一次内核,而是为数据向量中的每个元素调用它。内核函数简化为:
kernel void convolve(const device float *dataVector [[ buffer(0) ]],
const constant int &dataSize [[ buffer(1) ]],
const constant float *filterVector [[ buffer(2) ]],
const constant int &filterSize [[ buffer(3) ]],
device float *outVector [[ buffer(4) ]],
uint id [[ thread_position_in_grid ]])
{
float sum = 0.0;
for (int i = 0; i < filterSize; ++i) {
sum += dataVector[id + i] * filterVector[i];
}
outVector[id] = sum;
}
为了分派这项工作,我们 select 基于计算管道状态推荐的线程执行宽度的线程组大小。这里有一件棘手的事情是确保输入和输出缓冲区中有足够的填充,以便我们可以略微超出数据的实际大小。这确实导致我们浪费了少量的内存和计算,但为我们节省了进行单独分派的复杂性,只是为了计算缓冲区末尾元素的卷积。
// We should ensure here that the data buffer and output buffer each have a size that is a multiple of
// the compute pipeline's threadExecutionWidth, by padding the amount we allocate for each of them.
// After execution, we ignore the extraneous elements in the output buffer beyond the first (dataCount - filterCount + 1).
let iterationCount = dataCount - filterCount + 1
let threadsPerThreadgroup = MTLSize(width: min(iterationCount, computePipeline.threadExecutionWidth), height: 1, depth: 1)
let threadgroups = (iterationCount + threadsPerThreadgroup.width - 1) / threadsPerThreadgroup.width
let threadgroupsPerGrid = MTLSize(width: threadgroups, height: 1, depth: 1)
let commandEncoder = commandBuffer.computeCommandEncoder()
commandEncoder.setComputePipelineState(computePipeline)
commandEncoder.setBuffer(dataBuffer, offset: 0, at: 0)
commandEncoder.setBytes(&dataCount, length: MemoryLayout<Int>.stride, at: 1)
commandEncoder.setBuffer(filterBuffer, offset: 0, at: 2)
commandEncoder.setBytes(&filterCount, length: MemoryLayout<Int>.stride, at: 3)
commandEncoder.setBuffer(outBuffer, offset: 0, at: 4)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()
在我的实验中,这种并行化方法比问题中的串行版本运行 400-1000 倍。我很想知道它与您的 CPU 实施相比如何。
下面的代码展示了如何使用Objective-C Metal API在GPU上并行渲染编码命令(上面的线程代码只是将输出的渲染分成网格部分进行并行处理;计算仍未并行执行)。这就是您在问题中所指的内容,即使它并不完全是您想要的。我提供这个答案是为了帮助那些可能偶然发现这个问题的人,认为它会提供与并行渲染相关的答案(实际上,它没有):
- (void)drawInMTKView:(MTKView *)view
{
dispatch_async(((AppDelegate *)UIApplication.sharedApplication.delegate).cameraViewQueue, ^{
id <CAMetalDrawable> drawable = [view currentDrawable]; //[(CAMetalLayer *)view.layer nextDrawable];
MTLRenderPassDescriptor *renderPassDesc = [view currentRenderPassDescriptor];
renderPassDesc.colorAttachments[0].loadAction = MTLLoadActionClear;
renderPassDesc.colorAttachments[0].clearColor = MTLClearColorMake(0.0,0.0,0.0,1.0);
renderPassDesc.renderTargetWidth = self.texture.width;
renderPassDesc.renderTargetHeight = self.texture.height;
renderPassDesc.colorAttachments[0].texture = drawable.texture;
if (renderPassDesc != nil)
{
dispatch_semaphore_wait(self._inflight_semaphore, DISPATCH_TIME_FOREVER);
id <MTLCommandBuffer> commandBuffer = [self.metalContext.commandQueue commandBuffer];
[commandBuffer enqueue];
// START PARALLEL RENDERING OPERATIONS HERE
id <MTLParallelRenderCommandEncoder> parallelRCE = [commandBuffer parallelRenderCommandEncoderWithDescriptor:renderPassDesc];
// FIRST PARALLEL RENDERING OPERATION
id <MTLRenderCommandEncoder> renderEncoder = [parallelRCE renderCommandEncoder];
[renderEncoder setRenderPipelineState:self.metalContext.renderPipelineState];
[renderEncoder setVertexBuffer:self.metalContext.vertexBuffer offset:0 atIndex:0];
[renderEncoder setVertexBuffer:self.metalContext.uniformBuffer offset:0 atIndex:1];
[renderEncoder setFragmentBuffer:self.metalContext.uniformBuffer offset:0 atIndex:0];
[renderEncoder setFragmentTexture:self.texture
atIndex:0];
[renderEncoder drawPrimitives:MTLPrimitiveTypeTriangleStrip
vertexStart:0
vertexCount:4
instanceCount:1];
[renderEncoder endEncoding];
// ADD SECOND, THIRD, ETC. PARALLEL RENDERING OPERATION HERE
.
.
.
// SUBMIT ALL RENDERING OPERATIONS IN PARALLEL HERE
[parallelRCE endEncoding];
__block dispatch_semaphore_t block_sema = self._inflight_semaphore;
[commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> buffer) {
dispatch_semaphore_signal(block_sema);
}];
if (drawable)
[commandBuffer presentDrawable:drawable];
[commandBuffer commit];
[commandBuffer waitUntilScheduled];
}
});
}
在上面的示例中,您将为要并行执行的每个计算复制与 renderEncoder 相关的内容。在您的代码示例中,我看不出这对您有什么好处,因为一个操作似乎依赖于另一个操作。那么,您可能希望得到的最好结果就是 warrenm 提供给您的代码,尽管这并不真正符合并行渲染的条件。