C++11 std::thread summation with atomic is very slow

Question

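I wanted to learn to use C++11 std::thread with VS2012, so I wrote a very simple C++ console program with two threads that just increment a counter. I also want to measure the performance difference when two threads are used. The test program is given below: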

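#include <iostream>
#include <thread>
#include <chrono>   // std::chrono::system_clock
#include <conio.h>  // _getch (Windows only)
#include <tchar.h>  // _tmain, _TCHAR (Windows only)
#include <atomic>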

std::atomic<long long> sum(0);
//long long sum;

using namespace std;

const int RANGE = 100000000;

void test_without_threds()
{
    sum = 0;
    for(unsigned int j = 0; j < 2; j++)
    for(unsigned int k = 0; k < RANGE; k++)
        sum ++ ;
}

void call_from_thread(int tid) 
{
    for(unsigned int k = 0; k < RANGE; k++)
        sum ++ ;
}

void test_with_2_threds()
{
    std::thread t[2];
    sum = 0;
    //Launch a group of threads
    for (int i = 0; i < 2; ++i) {
        t[i] = std::thread(call_from_thread, i);
    }

    //Join the threads with the main thread
    for (int i = 0; i < 2; ++i) {
        t[i].join();
    }
}

int _tmain(int argc, _TCHAR* argv[])
{
    chrono::time_point<chrono::system_clock> start, end;

    cout << "-----------------------------------------\n";
    cout << "test without threds()\n";

    start = chrono::system_clock::now();
    test_without_threds();
    end = chrono::system_clock::now();

    chrono::duration<double> elapsed_seconds = end-start;

    cout << "finished calculation for "
              << chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << "ms.\n";

    cout << "sum:\t" << sum << "\n";

    cout << "-----------------------------------------\n";
    cout << "test with 2_threds\n";

    start = chrono::system_clock::now();
    test_with_2_threds();
    end = chrono::system_clock::now();

    cout << "finished calculation for "
              << chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << "ms.\n";

    cout << "sum:\t" << sum << "\n";

    _getch();
    return 0;
}

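Now, when I use the plain long long variable (the commented-out line) for the counter, I get a value that differs from the correct one: 100000000 instead of 200000000. I am not sure why that is. I suppose the two threads are changing the counter at the same time, but I am not sure how that really happens, because ++ is just a very simple instruction. It seems that the threads cache the sum variable at the beginning. Performance is 110 ms with two threads versus 200 ms with one thread.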

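So the correct way, according to the documentation, is to use std::atomic. However, the performance is now much worse in both cases: about 3300 ms without threads and 15820 ms with threads. What is the correct way to use std::atomic in this case?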

Answer 1

I am not sure why is that and I suppose that the two threads are changing the counter at the same time, but I am not sure how it happens really because ++ is just a very simple instruction.

Each thread pulls the value of sum into a register, increments the register, and only writes it back to memory at the end of the loop.
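To make that concrete, here is a minimal single-threaded sketch that replays one possible interleaving by hand. The function name simulate_lost_update and the variables reg_a and reg_b are made up for illustration. A real data race is not deterministic and is formally undefined behavior, so this only models the load, increment, store sequence described above.

```cpp
// Hypothetical helper: models two threads that each keep the counter in a
// private "register" for the whole loop and store it back only at the end.
long long simulate_lost_update(long long range)
{
    long long sum = 0;        // the shared variable in memory

    long long reg_a = sum;    // thread A loads sum into a register (0)
    long long reg_b = sum;    // thread B loads sum before A has stored (0)

    for (long long k = 0; k < range; ++k)
        ++reg_a;              // A counts in its private copy
    for (long long k = 0; k < range; ++k)
        ++reg_b;              // B counts in its private copy

    sum = reg_a;              // A writes back `range`
    sum = reg_b;              // B writes back `range`, overwriting A's result

    return sum;               // `range`, not 2 * range
}
```

Calling simulate_lost_update(100000000) returns 100000000, which matches the wrong total reported in the question.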

So the correct way according to documentation is to use std::atomic. However now the performance is much worse for both cases as about 3300 ms without threads and 15820 ms with threads. What is the correct way to use std::atomic in this case?

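You are going to pay for the synchronization that std::atomic provides. It will not be nearly as fast as an unsynchronized integer, although you can get a small performance improvement by relaxing the memory order of the add: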

sum.fetch_add(1, std::memory_order_relaxed);
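As a rough sketch of how that call might sit in the question's two-thread test: the helper name count_with_relaxed_atomic and its per_thread parameter are assumptions made for illustration, not part of the original program. Relaxed ordering still makes each increment atomic, so no updates are lost. It only drops ordering guarantees relative to other memory operations, which this counter does not need.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: two threads each add 1 to a shared counter
// `per_thread` times using relaxed atomic read-modify-write operations.
long long count_with_relaxed_atomic(unsigned int per_thread)
{
    std::atomic<long long> sum(0);

    auto worker = [&sum, per_thread] {
        for (unsigned int k = 0; k < per_thread; ++k)
            sum.fetch_add(1, std::memory_order_relaxed);
    };

    std::thread a(worker);
    std::thread b(worker);
    a.join();   // join() synchronizes, so the final load sees every increment
    b.join();

    return sum.load();
}
```

The result is always 2 * per_thread. How much faster this is than the default sequentially consistent ordering depends on the compiler and the hardware.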

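In this particular case, you are compiling for x86 and operating on a 64-bit integer. This means the compiler has to generate code that updates the value in two 32-bit operations. If you change the target platform to x64, the compiler will generate code that performs the increment in a single 64-bit operation.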

As a general rule, the solution to problems like this is to reduce the number of writes to shared data.
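One way to read that advice, sketched under the assumption that only the final total matters: each thread counts in a local variable and touches the shared atomic exactly once. The name count_with_local_subtotals is hypothetical.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: each thread accumulates privately and performs a
// single atomic add at the end, instead of one atomic add per iteration.
long long count_with_local_subtotals(unsigned int per_thread)
{
    std::atomic<long long> sum(0);

    auto worker = [&sum, per_thread] {
        long long local = 0;                  // private to this thread
        for (unsigned int k = 0; k < per_thread; ++k)
            ++local;                          // no shared write inside the loop
        sum.fetch_add(local, std::memory_order_relaxed);  // one shared write
    };

    std::thread a(worker);
    std::thread b(worker);
    a.join();
    b.join();

    return sum.load();
}
```

An optimizer is free to collapse the inner loop into a single addition, so this shows the structure rather than serving as a benchmark.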

Answer 2

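Your code has a couple of problems. First, all the "inputs" involved are compile-time constants, so a good compiler can precompute the value for the single-threaded code. As a result, regardless of the value you give for RANGE, it shows as running in 0 ms.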

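Second, you are sharing a single variable (sum) between all the threads, which forces all of their accesses to be synchronized at that point. Without synchronization, that gives undefined behavior. As you have already found, synchronizing access to that variable is quite expensive, so you generally want to avoid it if at all reasonable.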

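One way to do that is to use a separate subtotal for each thread, so they can all do their additions in parallel without synchronizing, and then add the individual results together at the end.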

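Another point is to guard against false sharing. False sharing arises when two (or more) threads write to data that really is separate but has been allocated in the same cache line. In that case, access to the memory can be serialized even though (as already noted) no data is actually shared between the threads.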
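A common way to keep per-thread counters on separate cache lines is to over-align each one. The sketch below assumes a 64-byte cache line, which is typical on x86 but not guaranteed. The names kAssumedCacheLine, PaddedCounter and count_with_padded_slots are made up for illustration, and alignas requires a newer compiler than VS2012.

```cpp
#include <cstddef>
#include <thread>

// Assumption: 64-byte cache lines; adjust for the target hardware.
constexpr std::size_t kAssumedCacheLine = 64;
constexpr int kThreads = 4;

// Each counter is aligned (and therefore padded) to a full cache line.
struct alignas(kAssumedCacheLine) PaddedCounter {
    long long value;
};

static_assert(sizeof(PaddedCounter) == kAssumedCacheLine,
              "each counter should occupy exactly one assumed cache line");

// Hypothetical helper: kThreads threads each increment their own slot.
long long count_with_padded_slots(unsigned int per_thread)
{
    PaddedCounter slots[kThreads] = {};   // zero-initialized
    std::thread workers[kThreads];

    for (int t = 0; t < kThreads; ++t) {
        PaddedCounter* slot = &slots[t];  // this thread's private slot
        workers[t] = std::thread([slot, per_thread] {
            for (unsigned int k = 0; k < per_thread; ++k)
                ++slot->value;            // no other thread writes this line
        });
    }

    long long total = 0;
    for (int t = 0; t < kThreads; ++t) {
        workers[t].join();                // join before reading the slot
        total += slots[t].value;
    }
    return total;
}
```

No atomics are needed here because every slot has exactly one writer and is only read after the corresponding join().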

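Based on those factors, I have rewritten your code slightly to create a separate sum variable for each thread. Those variables are of a class type that gives (fairly) direct access to the data but stops the optimizer from seeing that it can do the whole computation at compile time, so we end up comparing one thread to 4. (Which reminds me: I increased the number of threads from 2 to 4, since I am using a quad-core machine.) I moved that number into a const variable, though, so it should be easy to test with different numbers of threads.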

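#include <iostream>
#include <thread>
#include <chrono>     // std::chrono::system_clock
#include <conio.h>    // _getch (Windows only)
#include <atomic>
#include <numeric>    // std::accumulate
#include <algorithm>  // std::fill_n
#include <iterator>   // std::begin, std::end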

const int num_threads = 4;

struct val {
    long long sum;
    int pad[2];

    val &operator=(long long i) { sum = i; return *this; }
    operator long long &() { return sum; }
    operator long long() const { return sum; }
};

val sum[num_threads];

using namespace std;

const int RANGE = 100000000;

void test_without_threds()
{
    sum[0] = 0LL;
    for(unsigned int j = 0; j < num_threads; j++)
    for(unsigned int k = 0; k < RANGE; k++)
        sum[0] ++ ;
}

void call_from_thread(int tid) 
{
    for(unsigned int k = 0; k < RANGE; k++)
        sum[tid] ++ ;
}

void test_with_threads()
{
    std::thread t[num_threads];
    std::fill_n(sum, num_threads, 0);
    //Launch a group of threads
    for (int i = 0; i < num_threads; ++i) {
        t[i] = std::thread(call_from_thread, i);
    }

    //Join the threads with the main thread
    for (int i = 0; i < num_threads; ++i) {
        t[i].join();
    }
    long long total = std::accumulate(std::begin(sum), std::end(sum), 0LL);
}

int main()
{
    chrono::time_point<chrono::system_clock> start, end;

    cout << "-----------------------------------------\n";
    cout << "test without threds()\n";

    start = chrono::system_clock::now();
    test_without_threds();
    end = chrono::system_clock::now();

    chrono::duration<double> elapsed_seconds = end-start;

    cout << "finished calculation for "
              << chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << "ms.\n";

    cout << "sum:\t" << sum << "\n";

    cout << "-----------------------------------------\n";
    cout << "test with threads\n";

    start = chrono::system_clock::now();
    test_with_threads();
    end = chrono::system_clock::now();

    cout << "finished calculation for "
              << chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << "ms.\n";

    cout << "sum:\t" << sum << "\n";

    _getch();
    return 0;
}

When I run this, my results are closer to what I would guess you hoped for:

-----------------------------------------
test without threds()
finished calculation for 78ms.
sum:    000000013FCBC370
-----------------------------------------
test with threads
finished calculation for 15ms.
sum:    000000013FCBC370

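...the sums are identical, but N threads increase the speed by a factor of approximately N (up to the number of cores available). (Note that the value printed after "sum:" is the address of the sum array, because the array itself is streamed to cout, which is why it appears as the same hexadecimal number in both runs.)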

Answer 3

Try using prefix increment, which should improve performance. When I tested on my machine, std::memory_order_relaxed did not give any advantage.
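For reference, a small sketch of how the two forms differ on std::atomic: both are atomic read-modify-write operations, and the only semantic difference is the value they return. Whether one is measurably faster depends on the compiler and on whether the result is used. The helper name increment_results is hypothetical.

```cpp
#include <atomic>
#include <utility>

// Hypothetical helper: returns what postfix and prefix increment yield
// when applied in sequence to an atomic counter that starts at 0.
std::pair<long long, long long> increment_results()
{
    std::atomic<long long> a(0);

    long long from_postfix = a++;  // yields the old value 0; a is now 1
    long long from_prefix  = ++a;  // yields the new value 2; a is now 2

    return std::make_pair(from_postfix, from_prefix);
}
```

In the question's loop the result of the increment is discarded, so a compiler may well generate the same code for both forms.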
