为什么 std::scoped_lock 的不同互斥顺序会影响性能？

Question

注意：这不是一个实际问题（我从来没有用 scoped_lock 锁定超过 2 个互斥量），我很好奇为什么 scoped_lock 的实现在以不同顺序锁定互斥量时显然有相对较大的性能影响。

下面的示例代码，godbolt link。

#include<mutex>
#include<thread>
#include<chrono>
#include<iostream>

std::mutex m1, m2, m3, m4, m5, m6;

int cnt =0;

void f(){
    for (int i=0; i< 500*1000; ++i){
        std::scoped_lock sl{m1, m2, m3, m4, m5, m6};
        cnt++;
    }
}

void f_unord(){
    for (int i=0; i< 500*1000; ++i){
        std::scoped_lock sl{m4, m5, m6, m1, m2, m3};
        cnt++;
    }
}


int main(){
for (int run = 0; run<4; ++run)
{
    {
        const auto start = std::chrono::steady_clock::now();
        std::thread t1(f), t2(f);
        t1.join();
        t2.join();
        const auto end = std::chrono::steady_clock::now();
        std::cout << "same lock order: " << std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count() << std::endl; 
        std::cout << cnt << std::endl;
    }
    {
        const auto start = std::chrono::steady_clock::now();
        std::thread t1(f), t2(f_unord);
        t1.join();
        t2.join();
        const auto end = std::chrono::steady_clock::now();
        std::cout << "different lock order: " << std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count() << std::endl; 
        std::cout << cnt << std::endl;
    }
}
}

请注意为什么这令人惊讶：我希望由于互斥量不可移动，因此实现只能按地址对互斥量进行排序并使用该锁定顺序。

关于 godbolt 基准测试的注意事项：我知道 godbolt 不可靠，我在我的虚拟机机器上得到了类似的结果：

g++ --version; g++ -O2 -std=c++17 scoped_lock.cpp -pthread; ./a.out

g++ (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008 Copyright (C) 2019 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

different lock order: 1074

1000000

same lock order: 602

2000000

different lock order: 987

3000000

same lock order: 612

4000000

different lock order: 1012

5000000

same lock order: 585

6000000

different lock order: 1050

7000000

same lock order: 675

8000000

different lock order: 1107

9000000

same lock order: 609

10000000

Answer 1

当两个线程使用相同的互斥顺序时，不会发生死锁。如果线程 t1 尚未锁定 m1 和 vice-versa，则线程 t2 只能继续锁定 m1。不会出现死锁。

如果您对两个线程使用不同的顺序，则会发生死锁。即线程 t1 已锁定 m1、m2 和 m3，并尝试锁定 m4、m5 和 m6.但是，同时线程 t2 已锁定 m4、m5 和 m6 并尝试锁定 m1、m2 和 m3。两个线程无法进行，需要解决死锁

在这种情况下，任何一个作用域锁都必须释放它已经获得的互斥量以避免死锁。然后另一个线程可以继续，然后该线程必须再次获取所有互斥锁，并且在下一次迭代中再次发生同样的情况。

Answer 2

它与实施有关。我们可以想象 std::scoped_lock 在某些常规实现中使用 std::lock。

当您查看 std::lock doc 时：

The objects are locked by an unspecified series of calls to lock, try_lock, and unlock. If a call to lock or unlock results in an exception, unlock is called for any locked objects before rethrowing

std::lock 的 gcc 实现是：

void
    lock(_L1& __l1, _L2& __l2, _L3&... __l3)
    {
      while (true)
        {
          using __try_locker = __try_lock_impl<0, sizeof...(_L3) != 0>;
          unique_lock<_L1> __first(__l1);
          int __idx;
          auto __locks = std::tie(__l2, __l3...);
          __try_locker::__do_try_lock(__locks, __idx);
          if (__idx == -1)
            {
              __first.release();
              return;
            }
        }
    }

如你所见，如果你有相同的声明顺序，这很简单：

t1 std::lock 取得 m1
t2 std::lock 等待 m1 被释放

在第二种情况下，可能需要一些时间才能稳定（理论上永远不会发生）...

Answer 3

正如其他人所说，它与实施相关。但是这个实现可以比 gcc 的 persistent 一遍又一遍地尝试同样的事情做得更好。

Peristent

Lock the first lock and then try_lock the rest. If any of the try_locks fail, unlock everything and try again.

如果两个线程以相同的顺序列出它们的互斥量，则此算法效果最佳。

为了获得更高性能和更稳健的算法，实施应该使用 this paper 所谓的智能和礼貌。

Smart & Polite

Lock the first lock and then try_lock the rest. If any of the try_locks fail, unlock everything, then yield, then retry except the first lock is done on the one that previously failed the try_lock.

The paper 表明该算法的性能绝不会比其他算法差，而且通常性能要好得多。此分析包括更传统的算法，该算法将可锁定对象排序为全局顺序，然后按该顺序锁定它们（其中标记为 Ordered）。

libc++ and Visual Studio both use Smart & Polite. gcc's libstdc++ 使用 Persistent.

在 non-Apple 平台上使用 clang 时，使用 -stdlib=libc++ 选择 libc++ 而不是 gcc 的 std::lib。

阅读 Dining Philosophers Rebooted 以深入分析 std::lock 的这些算法。

为什么 std::scoped_lock 的不同互斥顺序会影响性能？

Why does the different order of mutexes for std::scoped_lock affects performance?

c++

performance

mutex