OpenMP slows down unrelated serial loop

Question

I have two unrelated for loops: one runs serially, the other runs with an OpenMP parallel for construct.

The more OpenMP threads I use, the slower the serial code that follows becomes.

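#include <chrono>
#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

#include <omp.h>

class Foo {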
public:
    Foo(size_t size) {
        parallel_vector.resize(size, 0.0);
        serial_vector.resize(size, 0.0);
    }

    void do_serial_work() {
        std::mt19937 random_number_generator;
        std::uniform_real_distribution<double> random_number_distribution{ 0.0, 1.0 };

        for (size_t i = 0; i < serial_vector.size(); i++) {
            serial_vector[i] = random_number_distribution(random_number_generator);
        }
    }

    void do_parallel_work() {
#pragma omp parallel for
        for (auto i = 0; i < parallel_vector.size(); ++i) {
            for (auto integration_steps = 0; integration_steps < 30; integration_steps++) {
                parallel_vector[i] += (0.05 - parallel_vector[i]) / 30.0;
            }
        }
    }

private:
    std::vector<double> parallel_vector;
    std::vector<double> serial_vector;
};

void test_with_size(size_t size, int num_threads) {
    std::cout << "Testing with " << num_threads << " and size: " << size << "\n";
    omp_set_num_threads(num_threads);

    Foo foo{ size };

    long long total_dur_1 = 0;
    long long total_dur_2 = 0;

    for (auto i = 0; i < 500; i++) {
        const auto tp_1 = std::chrono::high_resolution_clock::now();
        foo.do_serial_work();
        
        const auto tp_2 = std::chrono::high_resolution_clock::now();
        foo.do_parallel_work();

        const auto tp_3 = std::chrono::high_resolution_clock::now();
        const auto dur_1 = std::chrono::duration_cast<std::chrono::microseconds>(tp_2 - tp_1).count();
        const auto dur_2 = std::chrono::duration_cast<std::chrono::microseconds>(tp_3 - tp_2).count();

        total_dur_1 += dur_1;
        total_dur_2 += dur_2;
    }

    std::cout << total_dur_1 << "\t" << total_dur_2 << "\n";
}

int main(int argc, char** argv) {
    test_with_size(100000, 1);
    test_with_size(100000, 2);
    test_with_size(100000, 4);
    test_with_size(100000, 8);

    return 0;
}

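The slowdown happens on my local machine, a Win10 notebook with an Intel Core i7-7700 (4 cores with hyper-threading) and 24 GB RAM. The compiler is the latest Visual Studio 2019. Compiled in RelWithDebInfo mode (from CMake, which includes /O2 and /openmp).

It does not happen on a more powerful machine: CentOS 8 with 2x Intel Xeon Platinum 9242, 48 cores each, no hyper-threading, 769 GB RAM. The compiler is gcc/8.3.1, and the code is compiled with g++ --std=c++17 -O3 -fopenmp.

Timings on the Win10 i7-7700 (serial total, then parallel total, in microseconds summed over 500 iterations):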

Testing with 1 and size: 100000
3043846 10536315
Testing with 2 and size: 100000
3276611 5350204
Testing with 4 and size: 100000
3937311 2735655
Testing with 8 and size: 100000
5002727 1598775

On CentOS 8, 2x Xeon Platinum 9242:

Testing with 1 and size: 100000
727756  4111363
Testing with 2 and size: 100000
731649  2069257
Testing with 4 and size: 100000
734019  1056157
Testing with 8 and size: 100000
752584  544373

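So my initial thought was "there is too much pressure on the cache". However, when I removed nearly everything from the parallel section except the loop itself, the slowdown still occurred.

Updated parallel section with the work removed: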

    void do_parallel_work() {
#pragma omp parallel for
        for (auto i = 0; i < 8; ++i) {
            //for (auto integration_steps = 0; integration_steps < 30; integration_steps++) {
            //    parallel_vector[i] += (0.05 - parallel_vector[i]) / 30.0;
            //}
        }
    }

Win10 timings with the updated parallel section:

Testing with 1 and size: 100000
3206293 636
Testing with 2 and size: 100000
3218667 2672
Testing with 4 and size: 100000
3928818 8689
Testing with 8 and size: 100000
5106605 10797

Looking at the OpenMP 2.0 standard (VS only supports 2.0), which can be found at https://www.openmp.org/specifications/, section 2.7.2.5, lines 7-8, says:

In the absence of an explicit default clause, the default behavior is the same as if the default(shared) were specified.

And section 2.7.2.4, line 30:

All threads within the team access the same storage area for shared variables.

To me, this rules out that each OpenMP thread gets its own copy of serial_vector, which was the last explanation I could think of.

I would be glad for any explanation or discussion on this, even if I am just missing something obvious.

Edit:

Out of curiosity, I also tested on my Win10 machine with WSL. Running gcc/9.3.0, the timings are:

Testing with 1 and size: 100000
833678  2752
Testing with 2 and size: 100000
762877  1863
Testing with 4 and size: 100000
816440  1860
Testing with 8 and size: 100000
991184  2350

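Honestly, I am not sure why the Windows executable takes so much longer than the Linux one on the same machine (/O2 is the maximum optimization level for VC++), but interestingly, the same artifact does not show up here.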

Answer 1

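OpenMP on Windows uses a 200 ms spinlock by default. This means that when you leave an omp block, all omp worker threads keep spinning while they wait for new work. That pays off if you have many omp blocks right after each other. In your case, the threads only burn CPU time, presumably competing with the serial loop that runs next, which would explain why it gets slower as the thread count grows.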

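To disable or control the spinlock, you have several options:

Define the environment variable OMP_WAIT_POLICY and set it to PASSIVE to disable the spinlock completely.
Switch to the Intel OMP runtime shipped with oneAPI. You can then control the spinlock time by defining the KMP_BLOCKTIME environment variable.
Install Visual Studio 2019 Preview (this should reach the official release soon) and use the LLVM OMP runtime. There, too, you can control the spinlock time by defining the KMP_BLOCKTIME environment variable.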

