Why does memcpy performance deteriorate when used in multiple threads?

Question

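I wrote a short test program on Linux to measure how memcpy performs when it is used from multiple threads. I did not expect the result to be this devastating: execution time went from 3.8 seconds to more than 2 minutes, whereas running two instances of the program at the same time took about 4.7 seconds. Why is that?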

// thread example
#include <cstdlib>    // rand
#include <cstring>    // memcpy
#include <iostream>
#include <thread>
using namespace std;

void foo(/*int a[3],int b[3]*/)
{
  int a[3]={7,8,3};
  int b[3]={9,8,2};

  // The size argument depends on rand() on every iteration.
  for(int i=0;i<100000000;i++){
    memcpy(a,b,12*(rand()&1));
  }
}

int main()
{
#ifdef THREAD
  // Multithreaded variant: four threads each run foo() once.
  thread threads[4];
  for (int t=0; t<4; ++t) {
    threads[t] = thread( foo );
  }

  for (auto& th : threads) th.join();
  cout << "foo and bar completed.\n";
#else
  // Single-threaded variant: the same four calls, run one after another.
  foo();
  foo();
  foo();
  foo();
#endif

  return 0;
}

Answer 1

Your memcpy does nothing, because 12 * rand() & 1 is always 0: it is parsed as (12 * rand()) & 1, and since 12 is even, the lowest bit of the product is always 0.
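The precedence argument can be checked directly. The following is a minimal sketch, not part of the original program: the function names are hypothetical, and it uses unsigned 32-bit values (an assumption made here) so that the multiplication wraps instead of overflowing a signed int.

```cpp
#include <cassert>
#include <cstdint>

// `*` binds tighter than `&`, so this is (12 * x) & 1.
// 12 * x is always even, so its lowest bit is always 0.
constexpr std::uint32_t size_unparenthesized(std::uint32_t x) {
    return 12u * x & 1u;
}

// With explicit parentheses the result is 12 * (x & 1): either 0 or 12.
constexpr std::uint32_t size_parenthesized(std::uint32_t x) {
    return 12u * (x & 1u);
}

// Asserts both properties for every x in [0, limit).
void check_precedence(std::uint32_t limit) {
    for (std::uint32_t x = 0; x < limit; ++x) {
        assert(size_unparenthesized(x) == 0u);
        assert(size_parenthesized(x) == ((x % 2u == 0u) ? 0u : 12u));
    }
}
```

Calling check_precedence(1000) from a test or from main() passes without tripping an assertion, which is consistent with the claim that the unparenthesized form always yields a copy size of 0.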

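So you are only measuring the time taken by rand(). That function uses global state, which may (or may not) be shared by all threads. It looks like in your implementation the state is shared and access to it is synchronized, so the threads contend heavily for it and performance suffers.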

Try using rand_r(), which does not use shared state (or the new and improved C++ random generators):

  unsigned int r = 0;   // state for rand_r, local to each call of foo()
  for(int i=0;i<100000000;i++){
       rand_r(&r);
    }

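On my machine, this reduces the multithreaded run time from 30 seconds to 0.7 seconds (the single-threaded version takes 2.2 seconds). Of course, this experiment says nothing about memcpy(), but it does say something about shared global state...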
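For the C++ random generators mentioned above, one possible shape is to give each thread its own engine so that no generator state is shared. This is only a sketch under assumptions: the function names, the seeds, and the choice of std::mt19937 are illustrative and do not come from the original program, and no timings are claimed for it.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <thread>
#include <vector>

// Draws `iterations` numbers from an engine owned by the calling thread
// and counts how many are odd. No state is shared, so no locking is needed.
std::uint64_t count_odd_draws(unsigned seed, int iterations) {
    std::mt19937 engine(seed);  // private to this call
    std::uint64_t odd = 0;
    for (int i = 0; i < iterations; ++i) {
        odd += engine() & 1u;
    }
    return odd;
}

// Runs count_odd_draws on `num_threads` threads; each thread writes only
// to its own slot of the result vector.
std::vector<std::uint64_t> run_workers(int num_threads, int iterations) {
    assert(num_threads > 0 && iterations >= 0);
    std::vector<std::uint64_t> results(num_threads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&results, t, iterations] {
            results[t] = count_odd_draws(1000u + t, iterations);  // arbitrary per-thread seed
        });
    }
    for (auto& w : workers) w.join();
    return results;
}
```

For example, run_workers(4, 100000000) mirrors the four-thread loop from the question. On Linux with GCC or Clang this typically needs the -pthread flag when compiling.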


c++

multithreading

memcpy