Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size

C++17 added std::hardware_destructive_interference_size and std::hardware_constructive_interference_size. At first, I assumed these were simply a portable way to get the size of an L1 cache line, but that is an oversimplification.

Questions:

  • How are these constants related to the L1 cache line size?
  • Is there a good example that demonstrates their use cases?
  • Both are defined static constexpr. Is that not a problem if you build a binary and execute it on other machines with different cache line sizes? How can it protect against false sharing in that scenario when you are not certain on which machine your code will be running?

The intent of these constants is indeed to obtain the cache-line size. The best place to read about their rationale is the proposal itself:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0154r1.html

I'll quote a snippet of the rationale here for ease of reading:

[...] the granularity of memory that does not interfere (to the first-order) [is] commonly referred to as the cache-line size.

Uses of cache-line size fall into two broad categories:

  • Avoiding destructive interference (false-sharing) between objects with temporally disjoint runtime access patterns from different threads.
  • Promoting constructive interference (true-sharing) between objects which have temporally local runtime access patterns.

The most significant issue with this useful implementation quantity is the questionable portability of the methods used in current practice to determine its value, despite their pervasiveness and popularity as a group. [...]

We aim to contribute a modest invention for this cause, abstractions for this quantity that can be conservatively defined for given purposes by implementations:

  • Destructive interference size: a number that’s suitable as an offset between two objects to likely avoid false-sharing due to different runtime access patterns from different threads.
  • Constructive interference size: a number that’s suitable as a limit on two objects’ combined memory footprint size and base alignment to likely promote true-sharing between them.

In both cases these values are provided on a quality of implementation basis, purely as hints that are likely to improve performance. These are ideal portable values to use with the alignas() keyword, for which there currently exists nearly no standard-supported portable uses.
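
For concreteness, here is a minimal sketch of the intended usage in C++17 (both constants live in the <new> header; I can't test this myself since, as noted below, I don't have a C++17 compiler):

#include <atomic>
#include <new>  // C++17: std::hardware_{destructive,constructive}_interference_size

// Give each per-thread counter its own cache line to avoid false-sharing.
struct alignas(std::hardware_destructive_interference_size) padded_counter {
    std::atomic<unsigned> value{ 0 };
};

// Keep values that are always accessed together within one cache line,
// to promote true-sharing.
struct hot_pair {
    int first;
    int second;
};
static_assert(sizeof(hot_pair) <= std::hardware_constructive_interference_size,
    "expected to fit within a single cache line");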


"How are these constants related to the L1 cache line size?"

In theory, quite directly.

Assume the compiler knows exactly what architecture you will be running on - then these will almost certainly give you the precise L1 cache-line size. (As noted later, this is a big assumption.)

For what it's worth, I would almost always expect these values to be the same. I believe the only reason they are declared separately is for completeness. (That said, maybe a compiler would want to estimate the L2 cache-line size instead of the L1 cache-line size for constructive interference; I don't know whether that would actually be useful, though.)


"Is there a good example that demonstrates their use cases?"

At the bottom of this answer, I've attached a long benchmark program that demonstrates both false-sharing and true-sharing.

It demonstrates false-sharing by allocating an array of int wrappers: in one case multiple elements fit in an L1 cache line, in the other a single element takes up the whole L1 cache line. In a tight loop, a single, fixed element is chosen from the array and updated repeatedly.

It demonstrates true-sharing by allocating a pair of ints in a wrapper: in one case, the two ints within the pair do not fit in a single L1 cache line together, in the other they do. In a tight loop, each element of the pair is updated repeatedly.

Note that the code that accesses the object under test never changes; the only difference is the layout and alignment of the objects themselves.

I don't have a C++17 compiler (and assume most people currently don't either), so I've replaced the constants in question with my own values. You will need to update these values to be accurate on your machine. That said, 64 bytes is probably the correct value on typical modern desktop hardware (at the time of writing).

Warning: the test will use all cores on your machine, and allocate ~256MB of memory. Don't forget to compile with optimizations!

On my machine, the output is:

Hardware concurrency: 16
sizeof(naive_int): 4
alignof(naive_int): 4
sizeof(cache_int): 64
alignof(cache_int): 64
sizeof(bad_pair): 72
alignof(bad_pair): 4
sizeof(good_pair): 8
alignof(good_pair): 4
Running naive_int test.
Average time: 0.0873625 seconds, useless result: 3291773
Running cache_int test.
Average time: 0.024724 seconds, useless result: 3286020
Running bad_pair test.
Average time: 0.308667 seconds, useless result: 6396272
Running good_pair test.
Average time: 0.174936 seconds, useless result: 6668457

I get a ~3.5x speedup by avoiding false-sharing, and a ~1.7x speedup by ensuring true-sharing.


"Both are defined static constexpr. Is that not a problem if you build a binary and execute it on other machines with different cache line sizes? How can it protect against false sharing in that scenario when you are not certain on which machine your code will be running?"

This is definitely a problem. These constants aren't guaranteed to map to any particular cache-line size on the target machine, but they are intended to be the best approximation the compiler can muster.

This is noted in the proposal; in an appendix, it gives an example of how some libraries attempt to detect cache-line size at compile time based on various environment hints and macros. You are guaranteed that this value is at least alignof(max_align_t), which is an obvious lower bound.
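
In C++17 terms, that guarantee could be checked directly; a minimal sketch:

#include <cstddef>  // std::max_align_t
#include <new>

static_assert(std::hardware_destructive_interference_size >= alignof(std::max_align_t), "");
static_assert(std::hardware_constructive_interference_size >= alignof(std::max_align_t), "");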

In other words, this value should be used as your fallback case; you are free to define a precise value if you know one, for example:

constexpr std::size_t cache_line_size() {
#ifdef KNOWN_L1_CACHE_LINE_SIZE
  return KNOWN_L1_CACHE_LINE_SIZE;
#else
  return std::hardware_destructive_interference_size;
#endif
}

During compilation, simply define KNOWN_L1_CACHE_LINE_SIZE if you want to assume a particular cache-line size.
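
For example, with GCC or Clang that might look like the following (the file name and the value 64 here are just placeholders for your own):

g++ -O2 -std=c++14 -pthread -DKNOWN_L1_CACHE_LINE_SIZE=64 benchmark.cpp -o benchmark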

Hope this helps!

The benchmark program:

#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <iostream>
#include <mutex>
#include <random>
#include <thread>
#include <tuple>
#include <vector>

// !!! YOU MUST UPDATE THIS TO BE ACCURATE !!!
constexpr std::size_t hardware_destructive_interference_size = 64;

// !!! YOU MUST UPDATE THIS TO BE ACCURATE !!!
constexpr std::size_t hardware_constructive_interference_size = 64;

constexpr unsigned kTimingTrialsToComputeAverage = 100;
constexpr unsigned kInnerLoopTrials = 1000000;

typedef unsigned useless_result_t;
typedef double elapsed_secs_t;

//////// CODE TO BE SAMPLED:

// wraps an int, default alignment allows false-sharing
struct naive_int {
    int value;
};
static_assert(alignof(naive_int) < hardware_destructive_interference_size, "");

// wraps an int, cache alignment prevents false-sharing
struct cache_int {
    alignas(hardware_destructive_interference_size) int value;
};
static_assert(alignof(cache_int) == hardware_destructive_interference_size, "");

// wraps a pair of int, purposefully pushes them too far apart for true-sharing
struct bad_pair {
    int first;
    char padding[hardware_constructive_interference_size];
    int second;
};
static_assert(sizeof(bad_pair) > hardware_constructive_interference_size, "");

// wraps a pair of int, ensures they fit nicely together for true-sharing
struct good_pair {
    int first;
    int second;
};
static_assert(sizeof(good_pair) <= hardware_constructive_interference_size, "");

// accesses a specific array element many times
template <typename T, typename Latch>
useless_result_t sample_array_threadfunc(
    Latch& latch,
    unsigned thread_index,
    T& vec) {
    // prepare for computation
    std::random_device rd;
    std::mt19937 mt{ rd() };
    std::uniform_int_distribution<int> dist{ 0, 4096 };

    auto& element = vec[vec.size() / 2 + thread_index];

    latch.count_down_and_wait();

    // compute
    for (unsigned trial = 0; trial != kInnerLoopTrials; ++trial) {
        element.value = dist(mt);
    }

    return static_cast<useless_result_t>(element.value);
}

// accesses a pair's elements many times
template <typename T, typename Latch>
useless_result_t sample_pair_threadfunc(
    Latch& latch,
    unsigned thread_index,
    T& pair) {
    // prepare for computation
    std::random_device rd;
    std::mt19937 mt{ rd() };
    std::uniform_int_distribution<int> dist{ 0, 4096 };

    latch.count_down_and_wait();

    // compute
    for (unsigned trial = 0; trial != kInnerLoopTrials; ++trial) {
        pair.first = dist(mt);
        pair.second = dist(mt);
    }

    return static_cast<useless_result_t>(pair.first) +
        static_cast<useless_result_t>(pair.second);
}

//////// UTILITIES:

// utility: allow threads to wait until everyone is ready
class threadlatch {
public:
    explicit threadlatch(const std::size_t count) :
        count_{ count }
    {}

    void count_down_and_wait() {
        std::unique_lock<std::mutex> lock{ mutex_ };
        if (--count_ == 0) {
            cv_.notify_all();
        }
        else {
            cv_.wait(lock, [&] { return count_ == 0; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t count_;
};

// utility: runs a given function in N threads
std::tuple<useless_result_t, elapsed_secs_t> run_threads(
    const std::function<useless_result_t(threadlatch&, unsigned)>& func,
    const unsigned num_threads) {
    threadlatch latch{ num_threads + 1 };

    std::vector<std::future<useless_result_t>> futures;
    std::vector<std::thread> threads;
    for (unsigned thread_index = 0; thread_index != num_threads; ++thread_index) {
        std::packaged_task<useless_result_t()> task{
            std::bind(func, std::ref(latch), thread_index)
        };

        futures.push_back(task.get_future());
        threads.push_back(std::thread(std::move(task)));
    }

    const auto starttime = std::chrono::high_resolution_clock::now();

    latch.count_down_and_wait();
    for (auto& thread : threads) {
        thread.join();
    }

    const auto endtime = std::chrono::high_resolution_clock::now();
    const auto elapsed = std::chrono::duration_cast<
        std::chrono::duration<double>>(
            endtime - starttime
            ).count();

    useless_result_t result = 0;
    for (auto& future : futures) {
        result += future.get();
    }

    return std::make_tuple(result, elapsed);
}

// utility: sample the time it takes to run func on N threads
void run_tests(
    const std::function<useless_result_t(threadlatch&, unsigned)>& func,
    const unsigned num_threads) {
    useless_result_t final_result = 0;
    double avgtime = 0.0;
    for (unsigned trial = 0; trial != kTimingTrialsToComputeAverage; ++trial) {
        const auto result_and_elapsed = run_threads(func, num_threads);
        const auto result = std::get<useless_result_t>(result_and_elapsed);
        const auto elapsed = std::get<elapsed_secs_t>(result_and_elapsed);

        final_result += result;
        avgtime = (avgtime * trial + elapsed) / (trial + 1);
    }

    std::cout
        << "Average time: " << avgtime
        << " seconds, useless result: " << final_result
        << std::endl;
}

int main() {
    const auto cores = std::thread::hardware_concurrency();
    std::cout << "Hardware concurrency: " << cores << std::endl;

    std::cout << "sizeof(naive_int): " << sizeof(naive_int) << std::endl;
    std::cout << "alignof(naive_int): " << alignof(naive_int) << std::endl;
    std::cout << "sizeof(cache_int): " << sizeof(cache_int) << std::endl;
    std::cout << "alignof(cache_int): " << alignof(cache_int) << std::endl;
    std::cout << "sizeof(bad_pair): " << sizeof(bad_pair) << std::endl;
    std::cout << "alignof(bad_pair): " << alignof(bad_pair) << std::endl;
    std::cout << "sizeof(good_pair): " << sizeof(good_pair) << std::endl;
    std::cout << "alignof(good_pair): " << alignof(good_pair) << std::endl;

    {
        std::cout << "Running naive_int test." << std::endl;

        std::vector<naive_int> vec;
        vec.resize((1u << 28) / sizeof(naive_int));  // allocate 256 mebibytes

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_array_threadfunc(latch, thread_index, vec);
        }, cores);
    }
    {
        std::cout << "Running cache_int test." << std::endl;

        std::vector<cache_int> vec;
        vec.resize((1u << 28) / sizeof(cache_int));  // allocate 256 mebibytes

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_array_threadfunc(latch, thread_index, vec);
        }, cores);
    }
    {
        std::cout << "Running bad_pair test." << std::endl;

        bad_pair p;

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_pair_threadfunc(latch, thread_index, p);
        }, cores);
    }
    {
        std::cout << "Running good_pair test." << std::endl;

        good_pair p;

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_pair_threadfunc(latch, thread_index, p);
        }, cores);
    }
}

"I would almost always expect these values to be the same."

Regarding the above, I'd like to make a small contribution to the accepted answer. A while ago, I saw a very good use case where these two should be defined separately, in the folly library. Please see the caveat regarding the Intel Sandy Bridge processor.

https://github.com/facebook/folly/blob/3af92dbe6849c4892a1fe1f9366306a2f5cbe6a0/folly/lang/Align.h

//  Memory locations within the same cache line are subject to destructive
//  interference, also known as false sharing, which is when concurrent
//  accesses to these different memory locations from different cores, where at
//  least one of the concurrent accesses is or involves a store operation,
//  induce contention and harm performance.
//
//  Microbenchmarks indicate that pairs of cache lines also see destructive
//  interference under heavy use of atomic operations, as observed for atomic
//  increment on Sandy Bridge.
//
//  We assume a cache line size of 64, so we use a cache line pair size of 128
//  to avoid destructive interference.
//
//  mimic: std::hardware_destructive_interference_size, C++17
constexpr std::size_t hardware_destructive_interference_size =
    kIsArchArm ? 64 : 128;
static_assert(hardware_destructive_interference_size >= max_align_v, "math?");

//  Memory locations within the same cache line are subject to constructive
//  interference, also known as true sharing, which is when accesses to some
//  memory locations induce all memory locations within the same cache line to
//  be cached, benefiting subsequent accesses to different memory locations
//  within the same cache line and helping performance.
//
//  mimic: std::hardware_constructive_interference_size, C++17
constexpr std::size_t hardware_constructive_interference_size = 64;
static_assert(hardware_constructive_interference_size >= max_align_v, "math?");
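
In other words, with folly's x86 values the two constants genuinely differ (128 for destructive, 64 for constructive), so a data structure can use each for its own job. A hypothetical sketch (the type and field names are invented purely for illustration):

#include <atomic>
#include <cstddef>

// Assumes folly-style x86 values: destructive = 128, constructive = 64.
struct queue_indices {  // hypothetical type, for illustration only
    // Writer-owned and reader-owned indices are pushed a full 128 bytes
    // apart, so Sandy-Bridge-style interference across a pair of adjacent
    // cache lines cannot cause false-sharing between them.
    alignas(128) std::atomic<std::size_t> head{ 0 };
    alignas(128) std::atomic<std::size_t> tail{ 0 };
};

// Fields meant to be fetched together should still be packed within the
// 64-byte constructive size, so one cache-line fill brings in both.
struct hot_fields {  // hypothetical
    std::size_t begin;
    std::size_t end;
};
static_assert(sizeof(hot_fields) <= 64, "should fit in one cache line");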

I've tested the code above, but I think there is a minor mistake that prevents us from seeing the underlying behavior: a single cache line should not be shared between two distinct atomics if we want to prevent false sharing. I have changed the definition of those structs.

struct naive_int {
    alignas(sizeof(int)) std::atomic<int> value;
};

struct cache_int {
    alignas(hardware_constructive_interference_size) std::atomic<int> value;
};

struct bad_pair {
    // two atomics sharing a single 64-byte cache line
    alignas(hardware_constructive_interference_size) std::atomic<int> first;
    std::atomic<int> second;
};

struct good_pair {
    // the first cache line begins here
    alignas(hardware_constructive_interference_size) std::atomic<int> first;
    // this one is still in the first cache line
    std::atomic<int> first_s;
    // the second cache line starts here
    alignas(hardware_constructive_interference_size) std::atomic<int> second;
    // this one is still in the second cache line
    std::atomic<int> second_s;
};

The resulting run:

Hardware concurrency := 40
sizeof(naive_int)    := 4
alignof(naive_int)   := 4
sizeof(cache_int)    := 64
alignof(cache_int)   := 64
sizeof(bad_pair)     := 64
alignof(bad_pair)    := 64
sizeof(good_pair)    := 128
alignof(good_pair)   := 64
Running naive_int test.
Average time: 0.060303 seconds, useless result: 8212147
Running cache_int test.
Average time: 0.0109432 seconds, useless result: 8113799
Running bad_pair test.
Average time: 0.162636 seconds, useless result: 16289887
Running good_pair test.
Average time: 0.129472 seconds, useless result: 16420417

I experienced a lot of variance in the final results, but I never dedicated specific cores to this particular problem. In any case, this ran on two Xeon 2690v2 CPUs, and across various runs using either 64 or 128 for hardware_constructive_interference_size, I found that 64 was more than enough while 128 made very poor use of the available cache.

It suddenly strikes me that your question helped me understand what Jeff Preshing was talking about - it's all about the payload!