Severe performance loss alternating number of OpenMP parallel threads
The following code alternates the number of threads used by consecutive OpenMP parallel fors.
#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>
#include <omp.h>

std::vector<float> v;

// Runs two parallel fors back to back: the first with threadsFirst threads,
// the second with threadsSecond threads, summing v after each one.
float foo(const int tasks, const int perTaskComputation, int threadsFirst, int threadsSecond)
{
    float total = 0;
    std::vector<int> nthreads{ threadsFirst, threadsSecond };
    for (int nthread : nthreads) {
        omp_set_num_threads(nthread);
        #pragma omp parallel for
        for (int i = 0; i < tasks; ++i) {
            for (int n = 0; n < perTaskComputation; ++n) {
                if (v[i] > 5) {
                    v[i] * 0.002; // dummy work; the result is intentionally discarded
                }
                v[i] *= 1.1F * (i + 1);
            }
        }
        for (auto a : v) {
            total += a;
        }
    }
    return total;
}

int main()
{
    int tasks = 1000;
    int load = 1000;
    v.resize(tasks, 1);
    // First batch (threadAdd == 0): both parallel fors use the same thread count.
    // Second batch (threadAdd == 1): the second parallel for uses one extra thread.
    for (int threadAdd = 0; threadAdd <= 1; ++threadAdd) {
        std::cout << "Run batch\n";
        for (int j = 1; j <= 16; ++j) {
            float minT = 1e30F;
            float maxT = 0;
            float totalT = 0;
            int samples = 0;
            int iters = 100;
            for (int i = 0; i <= iters; ++i) {
                auto start = std::chrono::steady_clock::now();
                foo(tasks, load, j, j + threadAdd);
                auto end = std::chrono::steady_clock::now();
                float ms = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() * 0.001F;
                if (i > 20) { // skip the first 20 iterations as warm-up
                    minT = std::min(minT, ms);
                    maxT = std::max(maxT, ms);
                    totalT += ms;
                    samples++;
                }
            }
            std::cout << "Run parallel fors with " << j << " and " << j + threadAdd << " threads -- Min: "
                      << minT << "ms Max: " << maxT << "ms Avg: " << totalT / samples << "ms" << std::endl;
        }
    }
}
When compiled in Release mode and run with Visual Studio 2019, this is the output:
Run batch
Run parallel fors with 1 and 1 threads -- Min: 2.065ms Max: 2.47ms Avg: 2.11139ms
Run parallel fors with 2 and 2 threads -- Min: 1.033ms Max: 1.234ms Avg: 1.04876ms
Run parallel fors with 3 and 3 threads -- Min: 0.689ms Max: 0.759ms Avg: 0.69705ms
Run parallel fors with 4 and 4 threads -- Min: 0.516ms Max: 0.578ms Avg: 0.52125ms
Run parallel fors with 5 and 5 threads -- Min: 0.413ms Max: 0.676ms Avg: 0.4519ms
Run parallel fors with 6 and 6 threads -- Min: 0.347ms Max: 0.999ms Avg: 0.404413ms
Run parallel fors with 7 and 7 threads -- Min: 0.299ms Max: 0.786ms Avg: 0.346387ms
Run parallel fors with 8 and 8 threads -- Min: 0.263ms Max: 0.948ms Avg: 0.334ms
Run parallel fors with 9 and 9 threads -- Min: 0.235ms Max: 0.504ms Avg: 0.273937ms
Run parallel fors with 10 and 10 threads -- Min: 0.212ms Max: 0.702ms Avg: 0.287325ms
Run parallel fors with 11 and 11 threads -- Min: 0.195ms Max: 1.104ms Avg: 0.414437ms
Run parallel fors with 12 and 12 threads -- Min: 0.354ms Max: 1.01ms Avg: 0.441238ms
Run parallel fors with 13 and 13 threads -- Min: 0.327ms Max: 3.577ms Avg: 0.462125ms
Run parallel fors with 14 and 14 threads -- Min: 0.33ms Max: 0.792ms Avg: 0.463063ms
Run parallel fors with 15 and 15 threads -- Min: 0.296ms Max: 0.723ms Avg: 0.342562ms
Run parallel fors with 16 and 16 threads -- Min: 0.287ms Max: 0.858ms Avg: 0.372075ms
Run batch
Run parallel fors with 1 and 2 threads -- Min: 2.228ms Max: 3.501ms Avg: 2.63219ms
Run parallel fors with 2 and 3 threads -- Min: 2.64ms Max: 4.809ms Avg: 3.07206ms
Run parallel fors with 3 and 4 threads -- Min: 5.184ms Max: 14.394ms Avg: 8.30909ms
Run parallel fors with 4 and 5 threads -- Min: 5.489ms Max: 8.572ms Avg: 6.45368ms
Run parallel fors with 5 and 6 threads -- Min: 6.084ms Max: 15.739ms Avg: 7.71035ms
Run parallel fors with 6 and 7 threads -- Min: 7.162ms Max: 16.787ms Avg: 7.8438ms
Run parallel fors with 7 and 8 threads -- Min: 8.32ms Max: 39.971ms Avg: 10.0409ms
Run parallel fors with 8 and 9 threads -- Min: 9.575ms Max: 45.473ms Avg: 11.1826ms
Run parallel fors with 9 and 10 threads -- Min: 10.918ms Max: 31.844ms Avg: 14.336ms
Run parallel fors with 10 and 11 threads -- Min: 12.134ms Max: 21.199ms Avg: 14.3733ms
Run parallel fors with 11 and 12 threads -- Min: 13.972ms Max: 21.608ms Avg: 16.3532ms
Run parallel fors with 12 and 13 threads -- Min: 14.605ms Max: 18.779ms Avg: 15.9164ms
Run parallel fors with 13 and 14 threads -- Min: 16.199ms Max: 26.991ms Avg: 19.3464ms
Run parallel fors with 14 and 15 threads -- Min: 17.432ms Max: 27.701ms Avg: 19.4463ms
Run parallel fors with 15 and 16 threads -- Min: 18.142ms Max: 26.351ms Avg: 20.6856ms
Run parallel fors with 16 and 17 threads -- Min: 20.179ms Max: 40.517ms Avg: 22.0216ms
In the first batch, several runs are made with increasing thread counts, alternating parallel fors that use the same number of threads. This batch behaves as expected: performance improves as the number of threads grows.
Then a second batch is run with the same code, but alternating parallel fors where one uses one more thread than the other. This batch suffers a severe performance loss, with computation times increasing by a factor of roughly 50 to 100.
Compiling and running with gcc on Ubuntu gives the expected behavior: both batches perform similarly.
So the question is: what causes this huge performance loss when using Visual Studio?
As for the experiments discussed in the comments on the question, and for lack of a better explanation, this appears to be a bug in the Visual Studio OpenMP runtime.
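If a workaround is needed, here is a minimal sketch, assuming the cost comes from the VS OpenMP runtime rebuilding its worker-thread team every time the requested team size changes: always request the same (maximum) team size and limit how many threads actually receive iterations through static chunking. The helper runWithTeam is illustrative, not part of any API.
#include <vector>
#include <omp.h>

void runWithTeam(std::vector<float>& v, int tasks, int perTaskComputation,
                 int desiredThreads, int maxThreads)
{
    omp_set_num_threads(maxThreads);  // request a constant team size on every call
    const int chunk = (tasks + desiredThreads - 1) / desiredThreads;

    // With schedule(static, chunk), only the first desiredThreads threads are
    // assigned a chunk of iterations; the rest of the team stays idle but alive.
    #pragma omp parallel for schedule(static, chunk)
    for (int i = 0; i < tasks; ++i) {
        for (int n = 0; n < perTaskComputation; ++n) {
            v[i] *= 1.1F * (i + 1);
        }
    }
}
Visual Studio 2019 16.9 and later also ships the experimental -openmp:llvm switch, which swaps the vcomp runtime for the LLVM OpenMP runtime; that may behave differently here, although it has not been verified against this exact test.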