Metal Compute pipeline not working on macOS, but working on iOS
I'm trying to perform some GPGPU computation with Metal. I have a basic pipeline that:
- creates the required MTLComputePipelineState pipeline and all the associated objects (MTLComputeCommandEncoder, command queue, etc.);
- creates the target texture to write to (with desc.usage = MTLTextureUsageShaderWrite;);
- launches a basic shader that fills this texture with some values (in my experiments, either setting one of the color components to 1, or creating a gradient of grayscale values based on the thread coordinates);
- reads the contents of this texture back from the GPU.
I'm testing this code in 2 setups:
- an early-2013 MacBook Pro on OS X 10.11;
- an iPhone 6 on iOS 9.
The iOS version works just fine, and I get what I asked the shader to do. On OS X, however, I get a valid (non-nil, correctly sized) output texture, but when fetching the data back all I get is 0s everywhere.
Am I missing a step specific to the OS X implementation? This happens on both the NVIDIA GT650M and the Intel HD4000, or could it be a bug in the runtime?
Since I currently have no idea how to investigate this further, any help would be greatly appreciated :-)
Edit - my current implementation
Here is the initial (failing) state of my implementation. It is a bit long, but it is mostly boilerplate code to create the pipeline:
id<MTLDevice> device = MTLCreateSystemDefaultDevice();
id<MTLLibrary> library = [device newDefaultLibrary];
id<MTLCommandQueue> commandQueue = [device newCommandQueue];

NSError *error = nil;
id<MTLComputePipelineState> pipeline = [device newComputePipelineStateWithFunction:[library newFunctionWithName:@"dummy"]
                                                                             error:&error];
if (error)
{
    NSLog(@"%@", [error localizedDescription]);
}

MTLTextureDescriptor *desc = [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormatRGBA8Unorm
                                                                                width:16
                                                                               height:1
                                                                            mipmapped:NO];
desc.usage = MTLTextureUsageShaderWrite;
id<MTLTexture> texture = [device newTextureWithDescriptor:desc];

MTLSize threadGroupCounts = MTLSizeMake(8, 1, 1);
MTLSize threadGroups = MTLSizeMake([texture width] / threadGroupCounts.width,
                                   [texture height] / threadGroupCounts.height,
                                   1);

id<MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer];
id<MTLComputeCommandEncoder> commandEncoder = [commandBuffer computeCommandEncoder];
[commandEncoder setComputePipelineState:pipeline];
[commandEncoder setTexture:texture atIndex:0];
[commandEncoder dispatchThreadgroups:threadGroups threadsPerThreadgroup:threadGroupCounts];
[commandEncoder endEncoding];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];
The code used to fetch the data back is the following (I split the file in two to get smaller chunks of code):
// Get the data back
uint8_t* imageBytes = malloc([texture width] * [texture height] * 4);
assert(imageBytes);
MTLRegion region = MTLRegionMake2D(0, 0, [texture width], [texture height]);
[texture getBytes:imageBytes bytesPerRow:[texture width]*4 fromRegion:region mipmapLevel:0];
for (int i = 0; i < 16; ++i)
{
    NSLog(@"Pix = %d %d %d %d",
          *((uint8_t*)imageBytes + 4 * i),
          *((uint8_t*)imageBytes + 4 * i + 1),
          *((uint8_t*)imageBytes + 4 * i + 2),
          *((uint8_t*)imageBytes + 4 * i + 3));
}
And here is the shader code (which writes 1 to red and alpha; this should become 0xff in the output buffer when read back on the host):
#include <metal_stdlib>
using namespace metal;

kernel void dummy(texture2d<float, access::write> outTexture [[ texture(0) ]],
                  uint2 gid [[ thread_position_in_grid ]])
{
    outTexture.write(float4(1.0, 0.0, 0.0, 1.0), gid);
}
I guess you are not calling -synchronizeTexture:slice:level:. On OS X, textures are in managed storage by default, so GPU writes have to be synchronized back to host memory through a blit encoder before the CPU can see them; on iOS, memory is shared, which is why your code works there. Maybe the following sample (part of a jpeg-turbo writer class implementation) can solve your problem:
row_stride = (int)cinfo.image_width * cinfo.input_components; /* JSAMPLEs per row in image_buffer */

uint counts = cinfo.image_width * 4;
uint componentSize = sizeof(uint8);
uint8 *tmp = NULL;
if (texture.pixelFormat == MTLPixelFormatRGBA16Unorm) {
    tmp = malloc(row_stride);
    row_stride *= 2;
    componentSize = sizeof(uint16);
}

//
// Synchronize texture with host memory
//
id<MTLCommandQueue> queue = [texture.device newCommandQueue];
id<MTLCommandBuffer> commandBuffer = [queue commandBuffer];
id<MTLBlitCommandEncoder> blitEncoder = [commandBuffer blitCommandEncoder];
[blitEncoder synchronizeTexture:texture slice:0 level:0];
[blitEncoder endEncoding];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];

void *image_buffer = malloc(row_stride);
int j = 0;

while (cinfo.next_scanline < cinfo.image_height) {
    MTLRegion region = MTLRegionMake2D(0, cinfo.next_scanline, cinfo.image_width, 1);
    [texture getBytes:image_buffer
          bytesPerRow:cinfo.image_width * 4 * componentSize
           fromRegion:region
          mipmapLevel:0];

    if (texture.pixelFormat == MTLPixelFormatRGBA16Unorm) {
        uint16 *s = image_buffer;
        for (int i = 0; i < counts; i++) {
            tmp[i] = (s[i] >> 8) & 0xff;
            j++;
        }
        row_pointer[0] = tmp;
    }
    else {
        row_pointer[0] = image_buffer;
    }

    (void) jpeg_write_scanlines(&cinfo, row_pointer, 1);
}

free(image_buffer);
if (tmp != NULL) free(tmp);
This was tested on a mid-2012 MacBook Pro with an NVIDIA GeForce GT 650M 1024 MB.
Here is an example of how a blit command encoder and a compute command encoder can be used together. You can apply the blit operations after any computation, or in whatever context you want.
self.context.execute { (commandBuffer) -> Void in
    //
    // clear buffer; end this encoder before starting the compute encoder,
    // since only one encoder may be active on a command buffer at a time
    //
    let blitEncoder = commandBuffer.blitCommandEncoder()
    blitEncoder.fillBuffer(buffer, range: NSMakeRange(0, buffer.length), value: 0)
    blitEncoder.endEncoding()

    //
    // create compute pipe
    //
    let commandEncoder = commandBuffer.computeCommandEncoder()
    commandEncoder.setComputePipelineState(self.kernel.pipeline!)
    commandEncoder.setTexture(texture, atIndex: 0)
    commandEncoder.setBuffer(buffer, offset: 0, atIndex: 0)
    commandEncoder.setBuffer(self.channelsToComputeBuffer, offset: 0, atIndex: 1)
    commandEncoder.setBuffer(self.regionUniformBuffer, offset: 0, atIndex: 2)
    commandEncoder.setBuffer(self.scaleUniformBuffer, offset: 0, atIndex: 3)
    self.configure(self.kernel, command: commandEncoder)

    //
    // compute
    //
    commandEncoder.dispatchThreadgroups(self.threadgroups, threadsPerThreadgroup: threadgroupCounts)
    commandEncoder.endEncoding()

    //
    // synchronize texture state
    //
    let syncEncoder = commandBuffer.blitCommandEncoder()
    syncEncoder.synchronizeTexture(texture, slice: 0, level: 0)
    syncEncoder.endEncoding()
}
Synchronization code:
[commandEncoder setTexture:texture atIndex:0];
[commandEncoder dispatchThreadgroups:threadGroups threadsPerThreadgroup:threadGroupCounts];
[commandEncoder endEncoding];
//
// synchronize texture from gpu to host mem
//
id<MTLBlitCommandEncoder> blitEncoder = [commandBuffer blitCommandEncoder];
[blitEncoder synchronizeTexture:texture slice:0 level:0];
[blitEncoder endEncoding];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];
This was tested on a mid-2012 MacBook Pro with the same GPU you have, and on a mid-2015 one with an AMD Radeon R9 M370X 2048 MB.
Sometimes I use the following trick to get the texture data without synchronization:
id<MTLComputeCommandEncoder> commandEncoder = [commandBuffer computeCommandEncoder];
[commandEncoder setComputePipelineState:pipeline];
[commandEncoder setTexture:texture atIndex:0];
[commandEncoder dispatchThreadgroups:threadGroups threadsPerThreadgroup:threadGroupCounts];
[commandEncoder endEncoding];
//
// one trick: copy texture from GPU mem to shared
//
// texturebuffer is assumed to be an MTLBuffer created in advance with
// shared storage, large enough to hold width * height * 4 bytes
id<MTLBlitCommandEncoder> blitEncoder = [commandBuffer blitCommandEncoder];
[blitEncoder copyFromTexture:texture
                 sourceSlice:0
                 sourceLevel:0
                sourceOrigin:MTLOriginMake(0, 0, 0)
                  sourceSize:MTLSizeMake([texture width], [texture height], 1)
                    toBuffer:texturebuffer
           destinationOffset:0
      destinationBytesPerRow:[texture width] * 4
    destinationBytesPerImage:0];
[blitEncoder endEncoding];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];
// Get the data back
uint8_t* imageBytes = [texturebuffer contents];
for (int i = 0; i < 16; ++i)
{
    NSLog(@"Pix = %d %d %d %d",
          *((uint8_t*)imageBytes + 4 * i),
          *((uint8_t*)imageBytes + 4 * i + 1),
          *((uint8_t*)imageBytes + 4 * i + 2),
          *((uint8_t*)imageBytes + 4 * i + 3));
}
Both approaches work fine.