Multi-threaded rendering (command buffer generation) in Vulkan is slower than single-threaded
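I am trying to implement multi-threaded command buffer generation (using a per-thread command pool and secondary command buffers), but I see almost no performance gain from using multiple threads.
First, I thought my thread pool code was wrong, but I tried Sascha Willems's thread pool implementation and nothing changed (so I don't think that is the problem).
Second, I searched for multi-threading performance issues and found that accessing the same variables/resources from different threads can degrade performance, but I still can't figure out what the problem is in my case.
I also downloaded Sascha Willems's multithreading example and ran it, and it works just fine. When I change the number of worker threads, the performance gain from multi-threading is clearly visible.
Here are some FPS results for rendering 600 objects (the same model). They show what my problem is: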
Core count   Sascha Willems's result   My result    My result, validation layer
             (avg. FPS)                (avg. FPS)   disabled (avg. FPS)
1            45                        30           55
2            83                        33           72
4            110                       40           84
6            155                       42           103
8            162                       42           104
10           173                       40           111
12           175                       40           119
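The numbers above are average FPS values. A minimal sketch of how such an average could be collected is shown here; `FpsAverager` and its interface are hypothetical names introduced for illustration and are not part of the project or of Sascha Willems's examples.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Accumulates per-frame durations and reports the average frames per second.
// Hypothetical helper: the name and interface are illustrative only.
class FpsAverager {
public:
    using Seconds = std::chrono::duration<double>;

    // Record how long one frame took.
    void addFrame(Seconds frameTime) {
        assert(frameTime.count() >= 0.0);
        total_ += frameTime;
        ++frames_;
    }

    // Average FPS over all recorded frames; 0 if nothing was recorded.
    double averageFps() const {
        if (frames_ == 0 || total_.count() <= 0.0) return 0.0;
        return static_cast<double>(frames_) / total_.count();
    }

    std::size_t frameCount() const { return frames_; }

private:
    Seconds total_{0.0};
    std::size_t frames_ = 0;
};
```

In a render loop, one would take `std::chrono::steady_clock::now()` at the start and end of each iteration and pass the difference to `addFrame`, then read `averageFps()` after a fixed number of frames.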
This is where I prepare the thread data:
void prepareThreadData()
{
    // Command pool and primary command buffer used by the main thread.
    primaryCommandPool = m_device.createCommandPool (
        vk::CommandPoolCreateInfo (
            vk::CommandPoolCreateFlags(vk::CommandPoolCreateFlagBits::eResetCommandBuffer),
            graphicsQueueIdx
        )
    );
    primaryCommandBuffer = m_device.allocateCommandBuffers (
        vk::CommandBufferAllocateInfo (
            primaryCommandPool,
            vk::CommandBufferLevel::ePrimary,
            1
        )
    )[0];
    threadData.resize(numberOfThreads);
    for (int i = 0; i < numberOfThreads; ++i)
    {
        // One command pool per worker thread.
        threadData[i].commandPool = m_device.createCommandPool (
            vk::CommandPoolCreateInfo (
                vk::CommandPoolCreateFlags(vk::CommandPoolCreateFlagBits::eResetCommandBuffer),
                graphicsQueueIdx
            )
        );
        // One secondary command buffer per object handled by this thread.
        threadData[i].commandBuffer = m_device.allocateCommandBuffers (
            vk::CommandBufferAllocateInfo (
                threadData[i].commandPool,
                vk::CommandBufferLevel::eSecondary,
                numberOfObjectsPerThread
            )
        );
        for (int j = 0; j < numberOfObjectsPerThread; ++j)
        {
            VertexPushConstant pushConstant = { someRandomPosition() };
            threadData[i].pushConstBlock.push_back(pushConstant);
        }
    }
}
This is my render loop, where I assign the jobs to each thread:
while (!display.IsWindowClosed())
{
    display.PollEvents();
    m_device.acquireNextImageKHR(m_swapChain, std::numeric_limits<uint64_t>::max(), presentCompleteSemaphore, nullptr, &currentBuffer);
    primaryCommandBuffer.begin(vk::CommandBufferBeginInfo());
    primaryCommandBuffer.beginRenderPass(
        vk::RenderPassBeginInfo(m_renderPass, m_swapChainBuffers[currentBuffer].frameBuffer, m_renderArea, clearValues.size(), clearValues.data()),
        vk::SubpassContents::eSecondaryCommandBuffers);
    // Secondary command buffers inherit the render pass and framebuffer.
    vk::CommandBufferInheritanceInfo inheritanceInfo = {};
    inheritanceInfo.renderPass = m_renderPass;
    inheritanceInfo.framebuffer = m_swapChainBuffers[currentBuffer].frameBuffer;
    for (int t = 0; t < numberOfThreads; ++t)
    {
        for (int i = 0; i < numberOfObjectsPerThread; ++i)
        {
            // One job per object: record that object's secondary command buffer.
            threadPool.threads[t]->addJob([=]
            {
                std::array<vk::DeviceSize, 1> offsets = { 0 };
                vk::Viewport viewport = vk::Viewport(0.0f, 0.0f, WIDTH, HEIGHT, 0.0f, 1.0f);
                vk::Rect2D renderArea = vk::Rect2D(vk::Offset2D(), vk::Extent2D(WIDTH, HEIGHT));
                threadData[t].commandBuffer[i].begin(vk::CommandBufferBeginInfo(vk::CommandBufferUsageFlagBits::eRenderPassContinue, &inheritanceInfo));
                threadData[t].commandBuffer[i].setViewport(0, viewport);
                threadData[t].commandBuffer[i].setScissor(0, renderArea);
                threadData[t].commandBuffer[i].bindPipeline(vk::PipelineBindPoint::eGraphics, m_graphicsPipeline);
                threadData[t].commandBuffer[i].bindVertexBuffers(VERTEX_BUFFER_BIND, 1, &model.vertexBuffer, offsets.data());
                threadData[t].commandBuffer[i].bindIndexBuffer(model.indexBuffer, 0, vk::IndexType::eUint32);
                threadData[t].commandBuffer[i].pushConstants(pipelineLayout, vk::ShaderStageFlagBits::eVertex, 0, sizeof(VertexPushConstant), &threadData[t].pushConstBlock[i]);
                threadData[t].commandBuffer[i].drawIndexed(model.indexCount, 1, 0, 0, 0);
                threadData[t].commandBuffer[i].end();
            });
        }
    }
    // Wait until every worker has finished recording.
    threadPool.wait();
    // Gather all secondary command buffers and execute them from the primary one.
    std::vector<vk::CommandBuffer> commandBuffers;
    for (int t = 0; t < numberOfThreads; ++t)
    {
        for (int i = 0; i < numberOfObjectsPerThread; ++i)
        {
            commandBuffers.push_back(threadData[t].commandBuffer[i]);
        }
    }
    primaryCommandBuffer.executeCommands(commandBuffers.size(), commandBuffers.data());
    primaryCommandBuffer.endRenderPass();
    primaryCommandBuffer.end();
    submitQueue(presentCompleteSemaphore, primaryCommandBuffer);
}
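The loop above spreads `numberOfThreads * numberOfObjectsPerThread` recordings over the workers and then waits for all of them. The same partitioning pattern can be sketched without any Vulkan calls, which may help when checking whether the threading itself scales. This is only a sketch under assumptions: `recordInParallel` and `RecordedObject` are hypothetical names, plain `std::thread` stands in for the thread pool, and the actual command recording is replaced by a placeholder write.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for one recorded secondary command buffer (hypothetical placeholder;
// no Vulkan calls are made here).
struct RecordedObject {
    std::size_t objectIndex = 0;
    bool recorded = false;
};

// Splits threadCount * objectsPerThread objects across worker threads.
// Each thread writes only to its own row of the result, so no locking is needed.
std::vector<std::vector<RecordedObject>> recordInParallel(std::size_t threadCount,
                                                          std::size_t objectsPerThread) {
    assert(threadCount > 0);
    std::vector<std::vector<RecordedObject>> perThread(
        threadCount, std::vector<RecordedObject>(objectsPerThread));

    std::vector<std::thread> workers;
    workers.reserve(threadCount);
    for (std::size_t t = 0; t < threadCount; ++t) {
        workers.emplace_back([&perThread, t, objectsPerThread] {
            for (std::size_t i = 0; i < objectsPerThread; ++i) {
                perThread[t][i].objectIndex = t * objectsPerThread + i;
                perThread[t][i].recorded = true;  // real code would record draw commands here
            }
        });
    }
    // Counterpart of threadPool.wait(): block until every worker is done.
    for (std::thread& worker : workers) worker.join();
    return perThread;
}
```

Calling `recordInParallel(4, 150)` would cover the 600 objects from the table with four workers; each row of the result corresponds to one entry of `threadData`.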
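If you have any idea what I am missing or doing wrong, please let me know.
Here is the complete VS 2017 project, if anyone wants to play with it :D
I know it is messy, but I am just learning Vulkan.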
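I seem to have found the problem: I had the validation layers enabled. I disabled them and performance improved a lot; I added a fourth column to the table in the question for comparison. Who knew the validation layers would take up so much run time.
If anyone wants to measure Vulkan performance, don't forget to disable them!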
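One common way to avoid benchmarking with validation enabled by accident is to request the validation layer only in debug builds. The sketch below shows the idea without depending on the Vulkan headers. `selectInstanceLayers`, `isLayerRequested` and `kDefaultValidation` are hypothetical names, and the layer name is an assumption: current SDKs use `VK_LAYER_KHRONOS_validation`, while older SDKs shipped `VK_LAYER_LUNARG_standard_validation`.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Returns the instance layer names to request. In a real application the result
// would be passed through VkInstanceCreateInfo::ppEnabledLayerNames and
// enabledLayerCount when the instance is created.
// Assumption: "VK_LAYER_KHRONOS_validation" is the name used by current SDKs.
std::vector<const char*> selectInstanceLayers(bool enableValidation) {
    std::vector<const char*> layers;
    if (enableValidation) {
        layers.push_back("VK_LAYER_KHRONOS_validation");
    }
    return layers;
}

// Checks whether a given layer name is in the requested list.
bool isLayerRequested(const std::vector<const char*>& layers, const char* name) {
    assert(name != nullptr);
    for (const char* layer : layers) {
        if (std::strcmp(layer, name) == 0) return true;
    }
    return false;
}

// Validation on in debug builds, off when NDEBUG is defined (release builds).
constexpr bool kDefaultValidation =
#ifdef NDEBUG
    false;
#else
    true;
#endif
```

With this, `selectInstanceLayers(kDefaultValidation)` yields an empty list in release builds, so FPS measurements taken from a release build are not affected by the validation layers.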