Multi-threaded rendering (command buffer generation) in Vulkan is slower than single-threaded

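I'm trying to implement multithreaded command buffer generation (using a per-thread command pool and secondary command buffers), but there is almost no performance gain from using multiple threads.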

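First, I thought my thread pool code was written incorrectly, but I tried Sascha Willems's thread pool implementation and nothing changed (so I don't think that is the problem).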

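Second, I searched for multithreading performance issues and found that accessing the same variables/resources from different threads causes a performance drop, but I still can't figure out what the problem is.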
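One form of the shared-variable slowdown mentioned above is false sharing, where threads write to different variables that happen to sit on the same cache line. The following is a minimal sketch of the usual layout fix, not something taken from my project: `PaddedCounter`, `perThreadCounters`, and `onSeparateCacheLines` are hypothetical names, and the 64-byte cache line size is an assumption that holds on many x86-64 CPUs but is not guaranteed.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Assumed cache line size; 64 bytes is common on x86-64 but not guaranteed.
constexpr std::size_t kCacheLineSize = 64;

// Hypothetical per-thread slot. alignas pads each slot to a full cache line,
// so two threads writing to neighbouring slots never touch the same line.
struct alignas(kCacheLineSize) PaddedCounter {
    std::uint64_t value = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLineSize,
              "each slot should occupy exactly one cache line");

// Returns true when the two addresses fall on different cache lines.
inline bool onSeparateCacheLines(const void* a, const void* b) {
    const auto lineA = reinterpret_cast<std::uintptr_t>(a) / kCacheLineSize;
    const auto lineB = reinterpret_cast<std::uintptr_t>(b) / kCacheLineSize;
    return lineA != lineB;
}

// One slot per worker thread; thread t would only ever write counters[t].
inline std::array<PaddedCounter, 4>& perThreadCounters() {
    static std::array<PaddedCounter, 4> counters{};
    return counters;
}
```

Whether this kind of contention is actually present in a given renderer has to be confirmed with a profiler; the sketch only shows what the per-thread layout would look like.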

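I also downloaded Sascha Willems's multithreading code and ran it, and it works fine. I changed the number of worker threads, and the performance gain from multithreading is clearly visible.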

Here are some FPS results for rendering 600 objects (the same model). You can see what my issue is:

Core count   Sascha Willems's result   My result    My result, validation layers disabled
             (avg. FPS)                (avg. FPS)   (avg. FPS)

1            45                        30           55
2            83                        33           72
4            110                       40           84
6            155                       42           103
8            162                       42           104
10           173                       40           111
12           175                       40           119
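To make the scaling in the table easier to compare, here is a small sketch that turns two FPS figures into a speedup and a parallel efficiency. The helper names `speedup` and `efficiency` are hypothetical and only illustrate the arithmetic; they are not part of my project.

```cpp
#include <cassert>

// Speedup of a multithreaded run relative to the single-threaded baseline.
inline double speedup(double baselineFps, double threadedFps) {
    assert(baselineFps > 0.0);
    return threadedFps / baselineFps;
}

// Parallel efficiency: speedup divided by thread count (1.0 = perfect scaling).
inline double efficiency(double baselineFps, double threadedFps, int threads) {
    assert(threads > 0);
    return speedup(baselineFps, threadedFps) / threads;
}
```

Using the 1-row and 12-row figures from the table: Sascha Willems's sample gives 175 / 45, about 3.9x; my build with validation layers gives 40 / 30, about 1.3x; my build without validation layers gives 119 / 55, about 2.2x.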

This is where I prepare the thread data:

void prepareThreadData()
{
primaryCommandPool = m_device.createCommandPool (
    vk::CommandPoolCreateInfo (
        vk::CommandPoolCreateFlags(vk::CommandPoolCreateFlagBits::eResetCommandBuffer),
        graphicsQueueIdx
    )
);

primaryCommandBuffer = m_device.allocateCommandBuffers (
    vk::CommandBufferAllocateInfo (
        primaryCommandPool,
        vk::CommandBufferLevel::ePrimary,
        1
    )
)[0];

threadData.resize(numberOfThreads);

for (int i = 0; i < numberOfThreads; ++i)
{
    threadData[i].commandPool = m_device.createCommandPool (
        vk::CommandPoolCreateInfo (
            vk::CommandPoolCreateFlags(vk::CommandPoolCreateFlagBits::eResetCommandBuffer),
            graphicsQueueIdx
        )
    );

    threadData[i].commandBuffer = m_device.allocateCommandBuffers (
        vk::CommandBufferAllocateInfo (
            threadData[i].commandPool,
            vk::CommandBufferLevel::eSecondary,
            numberOfObjectsPerThread
        )
    );

    for (int j = 0; j < numberOfObjectsPerThread; ++j)
    {
        VertexPushConstant pushConstant = { someRandomPosition()};
        threadData[i].pushConstBlock.push_back(pushConstant);
    }
}
}

Here is my render loop code, where I assign jobs to each thread:

while (!display.IsWindowClosed())
{
display.PollEvents();

m_device.acquireNextImageKHR(m_swapChain, std::numeric_limits<uint64_t>::max(), presentCompleteSemaphore, nullptr, &currentBuffer);

primaryCommandBuffer.begin(vk::CommandBufferBeginInfo());
primaryCommandBuffer.beginRenderPass(
    vk::RenderPassBeginInfo(m_renderPass, m_swapChainBuffers[currentBuffer].frameBuffer, m_renderArea, clearValues.size(), clearValues.data()),
    vk::SubpassContents::eSecondaryCommandBuffers);

vk::CommandBufferInheritanceInfo inheritanceInfo = {};
inheritanceInfo.renderPass = m_renderPass;
inheritanceInfo.framebuffer = m_swapChainBuffers[currentBuffer].frameBuffer;

for (int t = 0; t < numberOfThreads; ++t)
{
    for (int i = 0; i < numberOfObjectsPerThread; ++i)
    {
        threadPool.threads[t]->addJob([=]
        {
            std::array<vk::DeviceSize, 1> offsets = { 0 };
            vk::Viewport viewport = vk::Viewport(0.0f, 0.0f, WIDTH, HEIGHT, 0.0f, 1.0f);
            vk::Rect2D renderArea = vk::Rect2D(vk::Offset2D(), vk::Extent2D(WIDTH, HEIGHT));

            threadData[t].commandBuffer[i].begin(vk::CommandBufferBeginInfo(vk::CommandBufferUsageFlagBits::eRenderPassContinue, &inheritanceInfo));
            threadData[t].commandBuffer[i].setViewport(0, viewport);
            threadData[t].commandBuffer[i].setScissor(0, renderArea);
            threadData[t].commandBuffer[i].bindPipeline(vk::PipelineBindPoint::eGraphics, m_graphicsPipeline);
            threadData[t].commandBuffer[i].bindVertexBuffers(VERTEX_BUFFER_BIND, 1, &model.vertexBuffer, offsets.data());
            threadData[t].commandBuffer[i].bindIndexBuffer(model.indexBuffer, 0, vk::IndexType::eUint32);
            threadData[t].commandBuffer[i].pushConstants(pipelineLayout, vk::ShaderStageFlagBits::eVertex, 0, sizeof(VertexPushConstant), &threadData[t].pushConstBlock[i]);
            threadData[t].commandBuffer[i].drawIndexed(model.indexCount, 1, 0, 0, 0);
            threadData[t].commandBuffer[i].end();
        });
    }
}

threadPool.wait();

std::vector<vk::CommandBuffer> commandBuffers;
for (int t = 0; t < numberOfThreads; ++t)
{
    for (int i = 0; i < numberOfObjectsPerThread; ++i)
    {
        commandBuffers.push_back(threadData[t].commandBuffer[i]);
    }
}

primaryCommandBuffer.executeCommands(commandBuffers.size(), commandBuffers.data());
primaryCommandBuffer.endRenderPass();
primaryCommandBuffer.end();

submitQueue(presentCompleteSemaphore, primaryCommandBuffer);
}

If you have any idea what I'm missing or doing wrong, please let me know.

Here is the complete VS 2017 project, if anyone wants to play with it :D

I know it's messy, but I'm just learning Vulkan.

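I seem to have found the problem: I had the validation layers enabled. I disabled them and performance improved a lot; I added the new results to the table in the question as a fourth column for comparison. Who knew the validation layers would take up so much run time. If anyone wants to measure Vulkan performance, don't forget to disable them!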