Chunk Rendering in Metal

I'm trying to create a procedural game using Metal, and I'm using an octree-based chunk approach to implement level of detail.

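The approach I'm using has the CPU create the octree nodes for the terrain, and a compute shader then builds the mesh for each node on the GPU. That mesh is stored in a vertex buffer and an index buffer in the chunk object used for rendering.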

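All of this seems to work well, but I hit performance problems very early when rendering the chunks. Currently I gather an array of chunks to draw and submit it to my renderer, which creates an MTLParallelRenderCommandEncoder, then creates an MTLRenderCommandEncoder for each chunk, and then submits the result to the GPU.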

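By the looks of it, around 50% of the CPU time is spent creating the MTLRenderCommandEncoder for each chunk. Currently I'm just creating a simple 8-vertex cube mesh for each chunk, and I have a 4x4x4 array of chunks, yet I'm already dropping to around 50 fps at this early stage. (In fact, it seems there can only be up to 63 MTLRenderCommandEncoders in each MTLParallelRenderCommandEncoder, so it isn't the full 4x4x4.)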

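I've read that the point of the MTLParallelRenderCommandEncoder is to create each MTLRenderCommandEncoder in a separate thread, but I haven't had much luck getting that to work. Multithreading also wouldn't get around the cap of 63 chunks rendered at most.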

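I feel that somehow consolidating the vertex and index buffers of every chunk into one or two larger buffers for submission would help, but I'm not sure how to do that without a lot of memcpy() calls, or whether it would even improve efficiency.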

Here is my code, which takes an array of nodes and draws them:

func drawNodes(nodes: [OctreeNode], inView view: AHMetalView){
  // For control of several rotating buffers
  dispatch_semaphore_wait(displaySemaphore, DISPATCH_TIME_FOREVER)

  makeDepthTexture()

  updateUniformsForView(view, duration: view.frameDuration)
  let commandBuffer = commandQueue.commandBuffer()


  let optDrawable = layer.nextDrawable()

  guard let drawable = optDrawable else{
    return
  }

  let passDescriptor = MTLRenderPassDescriptor()

  passDescriptor.colorAttachments[0].texture = drawable.texture
  passDescriptor.colorAttachments[0].clearColor = MTLClearColorMake(0.2, 0.2, 0.2, 1)
  passDescriptor.colorAttachments[0].storeAction = .Store
  passDescriptor.colorAttachments[0].loadAction = .Clear

  passDescriptor.depthAttachment.texture = depthTexture
  passDescriptor.depthAttachment.clearDepth = 1
  passDescriptor.depthAttachment.loadAction = .Clear
  passDescriptor.depthAttachment.storeAction = .Store

  let parallelRenderPass = commandBuffer.parallelRenderCommandEncoderWithDescriptor(passDescriptor)

  // Currently 63 nodes as a maximum
  for node in nodes{
    // This line is taking up around 50% of the CPU time
    let renderPass = parallelRenderPass.renderCommandEncoder()

    renderPass.setRenderPipelineState(renderPipelineState)
    renderPass.setDepthStencilState(depthStencilState)
    renderPass.setFrontFacingWinding(.CounterClockwise)
    renderPass.setCullMode(.Back)

    let uniformBufferOffset = sizeof(AHUniforms) * uniformBufferIndex

    renderPass.setVertexBuffer(node.vertexBuffer, offset: 0, atIndex: 0)
    renderPass.setVertexBuffer(uniformBuffer, offset: uniformBufferOffset, atIndex: 1)

    renderPass.setTriangleFillMode(.Lines)

    renderPass.drawIndexedPrimitives(.Triangle, indexCount: AHMaxIndicesPerChunk, indexType: AHIndexType, indexBuffer: node.indexBuffer, indexBufferOffset: 0)

    renderPass.endEncoding()
  }
  parallelRenderPass.endEncoding()

  commandBuffer.presentDrawable(drawable)

  commandBuffer.addCompletedHandler { (commandBuffer) -> Void in
    self.uniformBufferIndex = (self.uniformBufferIndex + 1) % AHInFlightBufferCount
    dispatch_semaphore_signal(self.displaySemaphore)
  }

  commandBuffer.commit()
}

You noted:

I've read that the point of the MTLParallelRenderCommandEncoder is to create each MTLRenderCommandEncoder in a separate thread...

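You're correct. What you're doing is sequentially creating, encoding with, and ending command encoders. Nothing here runs in parallel, so MTLParallelRenderCommandEncoder is doing nothing for you. If you eliminated the parallel encoder and just created encoders with renderCommandEncoderWithDescriptor(_:) on each pass through your for loop, you'd get roughly the same performance... which is to say, you'd still have the same performance problem because of the overhead of creating all those encoders.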
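For contrast, here is a rough sketch (untested against a real renderer) of what actually parallel encoding could look like. It uses current Swift syntax rather than the older style of the code in the question. The idea is to create only a handful of subordinate encoders, one per worker, and give each a contiguous batch of chunks, which also keeps the encoder count far below the ceiling of 63 observed in the question. `batchRanges`, `ChunkMesh`, and `encodeInParallel` are hypothetical names introduced for illustration; `ChunkMesh` merely stands in for the question's `OctreeNode`. The Metal part is wrapped in `#if canImport(Metal)` so the batching helper can be tried on its own.

```swift
import Foundation
#if canImport(Metal)
import Metal
#endif

/// Splits `count` items into at most `maxBatches` contiguous ranges of near-equal size.
func batchRanges(count: Int, maxBatches: Int) -> [Range<Int>] {
    guard count > 0, maxBatches > 0 else { return [] }
    let batches = min(count, maxBatches)
    let base = count / batches
    let extra = count % batches   // the first `extra` batches get one more item
    var ranges: [Range<Int>] = []
    var start = 0
    for i in 0..<batches {
        let size = base + (i < extra ? 1 : 0)
        ranges.append(start..<(start + size))
        start += size
    }
    return ranges
}

#if canImport(Metal)
/// Hypothetical stand-in for the question's OctreeNode.
struct ChunkMesh {
    let vertexBuffer: MTLBuffer
    let indexBuffer: MTLBuffer
    let indexCount: Int
    let indexType: MTLIndexType
}

func encodeInParallel(chunks: [ChunkMesh],
                      commandBuffer: MTLCommandBuffer,
                      passDescriptor: MTLRenderPassDescriptor,
                      pipelineState: MTLRenderPipelineState,
                      depthState: MTLDepthStencilState,
                      uniformBuffer: MTLBuffer,
                      uniformOffset: Int,
                      workerCount: Int) {
    guard let parallel = commandBuffer.makeParallelRenderCommandEncoder(descriptor: passDescriptor)
        else { return }
    let ranges = batchRanges(count: chunks.count, maxBatches: workerCount)

    // Create the subordinate encoders up front on this thread: their
    // creation order determines the order in which the GPU runs them.
    var encoders: [MTLRenderCommandEncoder] = []
    for _ in ranges {
        // If creation fails, the remaining batches are simply not drawn.
        guard let encoder = parallel.makeRenderCommandEncoder() else { break }
        encoders.append(encoder)
    }

    // Each iteration encodes one batch, potentially on a different thread.
    DispatchQueue.concurrentPerform(iterations: encoders.count) { i in
        let encoder = encoders[i]
        encoder.setRenderPipelineState(pipelineState)
        encoder.setDepthStencilState(depthState)
        encoder.setVertexBuffer(uniformBuffer, offset: uniformOffset, index: 1)
        for chunk in chunks[ranges[i]] {
            encoder.setVertexBuffer(chunk.vertexBuffer, offset: 0, index: 0)
            encoder.drawIndexedPrimitives(type: .triangle,
                                          indexCount: chunk.indexCount,
                                          indexType: chunk.indexType,
                                          indexBuffer: chunk.indexBuffer,
                                          indexBufferOffset: 0)
        }
        encoder.endEncoding()
    }

    // Every subordinate encoder has ended by the time concurrentPerform returns.
    parallel.endEncoding()
}
#endif
```

With 64 chunks and `workerCount` set to 4, this creates four encoders of 16 chunks each instead of 64 encoders. For draw calls as light as the ones in the question, the threading overhead may well outweigh any gain, so it is worth measuring against the single-encoder version.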

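So, if you're going to encode sequentially, just reuse the same encoder. You should also reuse as much of your other shared state as possible. Here's a quick pass at a possible refactoring (untested):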

let passDescriptor = MTLRenderPassDescriptor()

// call this once before your render loop
func setup() {
    makeDepthTexture()

    passDescriptor.colorAttachments[0].clearColor = MTLClearColorMake(0.2, 0.2, 0.2, 1)
    passDescriptor.colorAttachments[0].storeAction = .Store
    passDescriptor.colorAttachments[0].loadAction = .Clear

    passDescriptor.depthAttachment.texture = depthTexture
    passDescriptor.depthAttachment.clearDepth = 1
    passDescriptor.depthAttachment.loadAction = .Clear
    passDescriptor.depthAttachment.storeAction = .Store

    // set up render pipeline state and depthStencil state
}

func drawNodes(nodes: [OctreeNode], inView view: AHMetalView) {

    updateUniformsForView(view, duration: view.frameDuration)

    // Set up completed handler ahead of time
    let commandBuffer = commandQueue.commandBuffer()
    commandBuffer.addCompletedHandler { _ in // unused parameter
        self.uniformBufferIndex = (self.uniformBufferIndex + 1) % AHInFlightBufferCount
        dispatch_semaphore_signal(self.displaySemaphore)
    }

    // Semaphore should be tied to drawable acquisition
    dispatch_semaphore_wait(displaySemaphore, DISPATCH_TIME_FOREVER)
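    guard let drawable = layer.nextDrawable() else {
        // No drawable this frame: the command buffer is never committed, so its
        // completed handler will not fire. Release the semaphore slot here instead.
        dispatch_semaphore_signal(displaySemaphore)
        return
    }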

    // Set up the one part of the pass descriptor that changes per-frame
    passDescriptor.colorAttachments[0].texture = drawable.texture

    // Create one render command encoder and reuse it for every node
    let renderPass = commandBuffer.renderCommandEncoderWithDescriptor(passDescriptor)
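    // State shared by every node is set once, outside the loop
    renderPass.setTriangleFillMode(.Lines)
    renderPass.setRenderPipelineState(renderPipelineState)
    renderPass.setDepthStencilState(depthStencilState)
    // Carried over from the original code so culling behaves the same
    renderPass.setFrontFacingWinding(.CounterClockwise)
    renderPass.setCullMode(.Back)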

    for node in nodes {
        // Update offsets and draw
        let uniformBufferOffset = sizeof(AHUniforms) * uniformBufferIndex
        renderPass.setVertexBuffer(node.vertexBuffer, offset: 0, atIndex: 0)
        renderPass.setVertexBuffer(uniformBuffer, offset: uniformBufferOffset, atIndex: 1)
        renderPass.drawIndexedPrimitives(.Triangle, indexCount: AHMaxIndicesPerChunk, indexType: AHIndexType, indexBuffer: node.indexBuffer, indexBufferOffset: 0)

    }
    renderPass.endEncoding()

    commandBuffer.presentDrawable(drawable)
    commandBuffer.commit()
}

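Then, profile with Instruments to see what further performance issues, if any, you might be having. There's a great WWDC 2015 session that shows several common "gotchas", how to diagnose them in profiling, and how to fix them.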
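On the idea in the question of packing every chunk's mesh into one or two larger buffers: since the meshes are generated by a compute shader, one option that avoids memcpy() altogether is to give each chunk a fixed-size slot in a shared vertex buffer and a shared index buffer and have the compute shader write straight into that slot. The sketch below covers only the offset bookkeeping, in plain current Swift. `ChunkSlot` and `SharedMeshLayout` are hypothetical names, and the fixed per-chunk maximums are an assumption suggested by the question's `AHMaxIndicesPerChunk`.

```swift
/// Byte offsets of one chunk's slot inside the shared vertex and index buffers.
struct ChunkSlot: Equatable {
    let vertexByteOffset: Int
    let indexByteOffset: Int
}

/// Fixed-size slot layout: every chunk reserves room for its maximum mesh size.
struct SharedMeshLayout {
    let vertexStride: Int          // bytes per vertex
    let indexStride: Int           // bytes per index (2 for UInt16, 4 for UInt32)
    let maxVerticesPerChunk: Int
    let maxIndicesPerChunk: Int

    var vertexBytesPerSlot: Int { vertexStride * maxVerticesPerChunk }
    var indexBytesPerSlot: Int { indexStride * maxIndicesPerChunk }

    /// Offsets at which chunk number `index` starts in each shared buffer.
    func slot(_ index: Int) -> ChunkSlot {
        ChunkSlot(vertexByteOffset: index * vertexBytesPerSlot,
                  indexByteOffset: index * indexBytesPerSlot)
    }

    /// Total sizes to allocate for `chunkCount` chunks.
    func totalVertexBytes(chunkCount: Int) -> Int { chunkCount * vertexBytesPerSlot }
    func totalIndexBytes(chunkCount: Int) -> Int { chunkCount * indexBytesPerSlot }
}

// Example: 8-vertex cubes with 32-byte vertices and 36 UInt32 indices per chunk.
let layout = SharedMeshLayout(vertexStride: 32, indexStride: 4,
                              maxVerticesPerChunk: 8, maxIndicesPerChunk: 36)
let third = layout.slot(3)
print(third.vertexByteOffset, third.indexByteOffset)
```

In the draw loop, the shared vertex buffer would then be bound once, and each chunk would only need a call to setVertexBufferOffset(_:index:) plus its `indexByteOffset` passed as the index buffer offset of the draw call. Because the vertex offset moves per chunk, the indices can stay relative to the start of the chunk. Metal places alignment requirements on these offsets, so the slot sizes should be checked against the documentation for the target device, and whether this is measurably faster than separate buffers is something to confirm in Instruments.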