Increment shared loop counter in OpenMP for progress reporting
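I want to keep track of the total pixels and rays processed by a long-running raytracing process. If I update the shared variables on every iteration, the process slows down noticeably because of synchronization. I'd like to keep track of the progress and still get accurate count results at the end. Is there a way to do this with OpenMP for loops?
Here's some code of the loop in question: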
void Raytracer::trace(RenderTarget& renderTarget, const Scene& scene, std::atomic<int>& sharedPixelCount, std::atomic<int>& sharedRayCount)
{
    int width = renderTarget.getWidth();
    int height = renderTarget.getHeight();
    int totalPixelCount = width * height;

    #pragma omp parallel for schedule(dynamic, 4096)
    for (int i = 0; i < totalPixelCount; ++i)
    {
        int x = i % width;
        int y = i / width;

        Ray rayToScene = scene.camera.getRay(x, y);
        shootRay(rayToScene, scene, sharedRayCount); // will increment sharedRayCount
        renderTarget.setPixel(x, y, rayToScene.color.clamped());

        ++sharedPixelCount;
    }
}
Here's an example of how to do it:
void Raytracer::trace(RenderTarget& renderTarget, const Scene& scene, std::atomic<int>& sharedPixelCount, std::atomic<int>& sharedRayCount)
{
    int width = renderTarget.getWidth();
    int height = renderTarget.getHeight();
    int totalPixelCount = width * height;
    int rayCount = 0;
    int previousRayCount = 0;

    #pragma omp parallel for schedule(dynamic, 1000) reduction(+:rayCount) firstprivate(previousRayCount)
    for (int i = 0; i < totalPixelCount; ++i)
    {
        int x = i % width;
        int y = i / width;

        Ray rayToScene = scene.camera.getRay(x, y);
        shootRay(rayToScene, scene, rayCount); // increments this thread's private copy of rayCount
        renderTarget.setPixel(x, y, rayToScene.color.clamped());

        // Publish progress to the shared atomics only once every 100 pixels.
        if ((i + 1) % 100 == 0)
        {
            sharedPixelCount += 100;
            sharedRayCount += (rayCount - previousRayCount);
            previousRayCount = rayCount;
        }
    }

    // After the loop the reduction has produced the exact totals.
    sharedPixelCount = totalPixelCount;
    sharedRayCount = rayCount;
}
The counts won't be 100% accurate while the loop is running, but the error is negligible. At the end, the exact values are reported.
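A variation on the same idea, in case the running totals should also end up exact without the final overwrite: each thread counts in its own local variables and adds them to the shared atomics once per batch, then publishes its leftover partial batch when it runs out of work. This is only a minimal sketch. `raysForPixelStub` is a hypothetical stand-in for `shootRay`, and the function name and the `batch` parameter are assumptions made for illustration, not part of the original code.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for shootRay(): reports how many rays pixel i cost.
inline int raysForPixelStub(int i) { return 1 + (i % 3); }

// Each thread accumulates into private counters and touches the shared
// atomics only once every `batch` pixels, plus once more for its remainder.
void traceWithBatchedProgress(int totalPixelCount, int batch,
                              std::atomic<long long>& sharedPixelCount,
                              std::atomic<long long>& sharedRayCount)
{
    assert(batch > 0);

    #pragma omp parallel
    {
        // Declared inside the parallel region, so every thread has its own copy.
        long long localPixels = 0;
        long long localRays = 0;

        #pragma omp for schedule(dynamic, 1000) nowait
        for (int i = 0; i < totalPixelCount; ++i)
        {
            localRays += raysForPixelStub(i);

            if (++localPixels == batch)
            {
                sharedPixelCount += localPixels;
                sharedRayCount += localRays;
                localPixels = 0;
                localRays = 0;
            }
        }

        // Publish this thread's partial batch before leaving the region.
        sharedPixelCount += localPixels;
        sharedRayCount += localRays;
    }
}
```

A separate reporting thread could then read `sharedPixelCount` at any time and be off by at most `batch` pixels per worker thread. Compiled without OpenMP support the pragmas are ignored and the function simply runs serially.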
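Since you have a chunk size of 4096 for your dynamically scheduled parallel for loop, why not use that as the granularity for amortizing the counter updates?
For example, something like the following might work. I didn't test this code, and you probably need to add some bookkeeping for totalPixelCount%4096!=0.
Unlike the previous answer, this does not add a branch to your loop, other than the one implied by the loop itself, for which many processors have optimized instructions. It also doesn't require any extra variables or arithmetic.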
void Raytracer::trace(RenderTarget& renderTarget, const Scene& scene, std::atomic<int>& sharedPixelCount, std::atomic<int>& sharedRayCount)
{
    int width = renderTarget.getWidth();
    int height = renderTarget.getHeight();
    int totalPixelCount = width * height;

    // One outer iteration per 4096-pixel chunk, handed out dynamically.
    #pragma omp parallel for schedule(dynamic, 1)
    for (int j = 0; j < totalPixelCount; j += 4096)
    {
        // Assumes totalPixelCount % 4096 == 0; otherwise the last chunk
        // runs past the end of the image and the bound must be clamped.
        for (int i = j; i < (j + 4096); ++i)
        {
            int x = i % width;
            int y = i / width;

            Ray rayToScene = scene.camera.getRay(x, y);
            shootRay(rayToScene, scene, sharedRayCount);
            renderTarget.setPixel(x, y, rayToScene.color.clamped());
        }

        // One atomic update per chunk instead of one per pixel.
        sharedPixelCount += 4096;
    }
}
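The bookkeeping for a pixel count that is not a multiple of the chunk size could look roughly like the sketch below: the inner bound is clamped to the image size and the counter is advanced by the number of pixels actually processed. It does give up the "no extra variables" property mentioned above. The function name, the `chunkSize` parameter and the `pixels` vector (standing in for the render target) are assumptions made for illustration.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <vector>

// Processes the image in chunks of `chunkSize` pixels and updates the shared
// counter once per chunk. `pixels` is a hypothetical stand-in for the render
// target and must hold at least totalPixelCount elements.
void traceInChunks(int totalPixelCount, int chunkSize,
                   std::vector<int>& pixels,
                   std::atomic<int>& sharedPixelCount)
{
    assert(chunkSize > 0);
    assert(static_cast<int>(pixels.size()) >= totalPixelCount);

    #pragma omp parallel for schedule(dynamic, 1)
    for (int j = 0; j < totalPixelCount; j += chunkSize)
    {
        // Clamp the last chunk so it never runs past the end of the image.
        const int end = std::min(j + chunkSize, totalPixelCount);

        for (int i = j; i < end; ++i)
        {
            pixels[i] = 1; // stand-in for shootRay() + setPixel()
        }

        // Count only the pixels this chunk really processed.
        sharedPixelCount += end - j;
    }
}
```

With this version the counter ends at exactly `totalPixelCount`, and the only atomic operation per chunk is the single `+=` after the inner loop.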
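It's not really clear why sharedPixelCount needs to be updated inside this loop, since it's not referenced in the loop body. If this is correct, I suggest the following instead.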
void Raytracer::trace(RenderTarget& renderTarget, const Scene& scene, std::atomic<int>& sharedPixelCount, std::atomic<int>& sharedRayCount)
{
    int width = renderTarget.getWidth();
    int height = renderTarget.getHeight();
    int totalPixelCount = width * height;
    int reducePixelCount = 0;

    #pragma omp parallel for schedule(dynamic, 4096) \
                             reduction(+:reducePixelCount)
    for (int i = 0; i < totalPixelCount; ++i)
    {
        int x = i % width;
        int y = i / width;

        Ray rayToScene = scene.camera.getRay(x, y);
        shootRay(rayToScene, scene, sharedRayCount);
        renderTarget.setPixel(x, y, rayToScene.color.clamped());

        ++reducePixelCount; /* thread-local operation, not atomic */
    }

    /* The interoperability of C++11 atomics and OpenMP is not defined yet,
     * so this should just be avoided until OpenMP 5 at the earliest.
     * It is sufficient to reduce over a non-atomic type and
     * do the assignment here. */
    sharedPixelCount = reducePixelCount;
}
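The same reduction approach can cover the ray count as well, provided the per-pixel work can report its ray count to a plain integer rather than to the atomic. The following is only a sketch under that assumption: `raysForPixelStub` is a hypothetical replacement for `shootRay`, and the function name is made up for illustration.

```cpp
#include <atomic>

// Hypothetical stand-in for shootRay(): reports how many rays pixel i cost.
inline int raysForPixelStub(int i) { return 1 + (i % 3); }

// Reduces both counters over plain integers and stores the totals into the
// atomics once, after the parallel loop has finished.
void traceWithReduction(int totalPixelCount,
                        std::atomic<long long>& sharedPixelCount,
                        std::atomic<long long>& sharedRayCount)
{
    long long pixelCount = 0;
    long long rayCount = 0;

    #pragma omp parallel for schedule(dynamic, 4096) \
                             reduction(+:pixelCount, rayCount)
    for (int i = 0; i < totalPixelCount; ++i)
    {
        rayCount += raysForPixelStub(i); // thread-local, not atomic
        ++pixelCount;                    // thread-local, not atomic
    }

    // Single-threaded again here, so plain assignments are enough.
    sharedPixelCount = pixelCount;
    sharedRayCount = rayCount;
}
```

The trade-off is that nothing is published while the loop runs, so this fits the case where only the final totals matter and not the live progress display.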