Why are 8 threads slower than 2 threads?

Question

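First of all, I apologize for my poor English. I am learning about hardware transactional memory, and I am using spin_rw_mutex.h from TBB to implement transactional blocks in C++. speculative_spin_rw_mutex is a class in spin_rw_mutex.h; it is a mutex that already implements the RTM interface of Intel TSX.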

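The example I use to test RTM is simple. I created an Account class and randomly transfer money from one account to another. All accounts live in an array of size 100. The random functions come from Boost (I think the STL has equivalent ones). The transfer function is protected by a speculative_spin_rw_mutex. I use tbb::parallel_for and tbb::task_scheduler_init to control concurrency. All transfers are called inside the lambda passed to parallel_for. The total number of transfers is 1 million. The strange thing is that the program is fastest (8 seconds) when task_scheduler_init is set to 2. My CPU is an i7 6700K, which has 8 hardware threads. For values from 8 up to 50,000 the run time hardly changes (11 to 12 seconds). When I increase task_scheduler_init to 100,000, the run time rises to about 18 seconds. I profiled the program and found that the hotspot is the mutex. However, I don't think the transaction abort rate is that high. I don't know why the program is so slow.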

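Someone said that false sharing can degrade performance, so I tried using

std::vector<Account, cache_aligned_allocator<Account>> cache_aligned_accounts(AccountsSIZE, Account(1000));

to replace the original array

Account* accounts[AccountsSIZE];

in order to avoid false sharing. Nothing seems to have changed. Here is my new code: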



#include <tbb/spin_rw_mutex.h>
#include <iostream>
#include "tbb/task_scheduler_init.h"  
#include "tbb/task.h"
#include "boost/random.hpp"
#include <ctime>
#include <tbb/parallel_for.h>
#include <tbb/spin_mutex.h>
#include <tbb/cache_aligned_allocator.h>
#include <stdexcept>
#include <vector>
using namespace tbb;
tbb::speculative_spin_rw_mutex mu;

class Account {
private:
    int balance;
public:
    Account(int ba) {
        balance = ba;
    }
    int getBalance() {
        return balance;
    }
    void setBalance(int ba) {
        balance = ba;
    }
};

//Transfer function. The critical section is guarded by the global speculative_spin_rw_mutex
void transfer(Account &from, Account &to, int amount) {
    speculative_spin_rw_mutex::scoped_lock lock(mu);
    if ((from.getBalance())<amount)
    {
        throw std::invalid_argument("Illegal amount!");
    }
    else {
        from.setBalance((from.getBalance()) - amount);
        to.setBalance((to.getBalance()) + amount);
    }
}

const int AccountsSIZE = 100;

//Random number generator and distributions
boost::random::mt19937 gener(time(0));
boost::random::uniform_int_distribution<> distIndex(0, AccountsSIZE - 1);
boost::random::uniform_int_distribution<> distAmount(1, 1000);
/*
Runs all the money transfers
*/
void all_transfer_task() {
    task_scheduler_init init(10000);//Maximum number of threads the scheduler may use
    /*
    Initialize the accounts, using cache_aligned_allocator to try to avoid false sharing
    */
    std::vector<Account, cache_aligned_allocator<Account>> cache_aligned_accounts(AccountsSIZE,Account(1000));

    const int TransferTIMES = 10000000;
    //All transfer tasks
    parallel_for(0, TransferTIMES, 1, [&](int i) {

        try {
            transfer(cache_aligned_accounts[distIndex(gener)], cache_aligned_accounts[distIndex(gener)], distAmount(gener));
        }
        catch (const std::exception& e)
        {
            //cerr << e.what() << endl;
        }
        //std::cout << distIndex(gener) << std::endl;
    });

    std::cout << cache_aligned_accounts[0].getBalance() << std::endl;

    int total_balance = 0;
    for (size_t i = 0; i < AccountsSIZE; i++)
    {
        total_balance += (cache_aligned_accounts[i].getBalance());
    }
    std::cout << total_balance << std::endl;
}

Answer 1

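Although I cannot reproduce your benchmark, I see two possible causes for this behavior:

"Too many cooks spoil the broth": you use a single spin_rw_mutex that is locked by every transfer on every thread. It looks to me as if your transfers are executed sequentially. That would explain why the profiler sees the hotspot there. The Intel page warns about performance degradation in this situation.
Throughput vs. speed: on an i7, each core runs slightly slower when you use more cores, so the total time for a fixed-size loop gets longer. However, if you measure overall throughput (that is, the total number of transactions completed across all these parallel loops), it is much higher, although not fully proportional to the number of cores.

I lean toward the first explanation, but I would not rule out the second.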
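
To illustrate the first point, here is a minimal sketch of what finer-grained locking might look like. It is not taken from the question: it assumes C++17 and uses standard-library mutexes rather than TBB, and the names LockedAccount, make_locked_accounts and transfer_fine_grained are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <stdexcept>

// Hypothetical variant of the question's Account that carries its own lock.
struct LockedAccount {
    std::mutex guard;  // protects balance
    int balance;
    explicit LockedAccount(int initial) : balance(initial) {}
};

// std::mutex cannot be copied or moved, so the accounts are built in place
// in a std::deque instead of being copied into a std::vector.
std::deque<LockedAccount> make_locked_accounts(std::size_t count, int initial_balance) {
    std::deque<LockedAccount> accounts;
    for (std::size_t i = 0; i < count; ++i) {
        accounts.emplace_back(initial_balance);
    }
    return accounts;
}

// Moves `amount` from one account to another while locking only those two
// accounts, so transfers between unrelated accounts do not wait on each other.
void transfer_fine_grained(LockedAccount& from, LockedAccount& to, int amount) {
    if (&from == &to) {
        // Assumption: a self-transfer is treated as a no-op. Locking the
        // same std::mutex twice would be undefined behaviour.
        return;
    }
    // std::scoped_lock acquires both mutexes with a deadlock-avoidance algorithm.
    std::scoped_lock lock(from.guard, to.guard);
    if (from.balance < amount) {
        throw std::invalid_argument("Illegal amount!");
    }
    from.balance -= amount;
    to.balance += amount;
}
```

For example, after `auto accounts = make_locked_accounts(100, 1000);`, a call such as `transfer_fine_grained(accounts[3], accounts[7], 250);` only contends with transfers that touch account 3 or account 7. Whether this beats a single speculative lock on your machine is something you would have to measure.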

Answer 2

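Since Intel TSX works at cache-line granularity, false sharing is definitely the first thing to look at. Unfortunately, cache_aligned_allocator does not do what you expect: it aligns the std::vector's storage as a whole, whereas you need each individual Account to occupy a full cache line to prevent false sharing.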
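
A minimal sketch of one way to give every account its own cache line, assuming 64-byte cache lines and a C++17 compiler (so that std::allocator honours over-aligned types). The names PaddedAccount, kCacheLineSize, make_accounts and starts_on_cache_line are hypothetical and not part of the question's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: 64-byte cache lines, which is typical for current x86 CPUs.
constexpr std::size_t kCacheLineSize = 64;

// alignas on the type rounds sizeof(PaddedAccount) up to 64 bytes, so
// neighbouring array elements can never share a cache line.
struct alignas(kCacheLineSize) PaddedAccount {
    int balance;
    explicit PaddedAccount(int initial) : balance(initial) {}
};

static_assert(sizeof(PaddedAccount) == kCacheLineSize,
              "each account should fill exactly one cache line");
static_assert(alignof(PaddedAccount) == kCacheLineSize,
              "each account should start on a cache-line boundary");

// Builds the account array. In C++17 the default allocator uses aligned
// operator new for over-aligned types, so element 0 starts on a line boundary.
std::vector<PaddedAccount> make_accounts(std::size_t count, int initial_balance) {
    return std::vector<PaddedAccount>(count, PaddedAccount(initial_balance));
}

// Helper to check at run time that an account begins on a cache-line boundary.
bool starts_on_cache_line(const PaddedAccount& account) {
    return reinterpret_cast<std::uintptr_t>(&account) % kCacheLineSize == 0;
}
```

With this layout, `make_accounts(100, 1000)` uses 6400 bytes instead of 400, which is the price of keeping two accounts from ever landing in the same line.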

c++

multithreading

tbb

intel-tsx