How does memory usage of thread_local scale with number of threads?

Question

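I presume the C/C++ standards say nothing about the complexity here, so I'm curious about specific implementations (I assume they all behave the same way).

Suppose I have the following C++ function.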

void fn() {
    thread_local char arr[1024*1024]{};
    // do something with arr
}

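My program has 80 threads, 47 of which run fn() at least once.

Does the memory usage of my program grow by 47 times some constant, by 80 times some constant, or is there some other formula?

Note: there is a Java question about this that was closed for some reason, but I don't know whether Java uses the same primitives as C/C++.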

Answer 1

According to the C++11 standard:

3.7.2 Thread storage duration [ basic.stc.thread ]

1 All variables declared with the thread_local keyword have thread storage duration. The storage for these entities shall last for the duration of the thread in which they are created. There is a distinct object or reference per thread, and use of the declared name refers to the entity associated with the current thread.

2 A variable with thread storage duration shall be initialized before its first odr-use (3.2) and, if constructed, shall be destroyed on thread exit.

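It says, "The storage for these entities shall last for the duration of the thread in which they are created." So, by my reading, the memory has to be allocated for every thread.

However, the objects are only initialized and destroyed if they are actually used: "A variable with thread storage duration shall be initialized before its first odr-use (3.2) and, if constructed, shall be destroyed on thread exit."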
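As a rough illustration of that second point, here is a minimal sketch that counts how many per-thread objects actually get constructed when only some of the threads call the function. The names Tracked, g_constructed and count_constructions are made up for this example. It only observes initialization; whether the underlying storage is reserved in every thread is still up to the implementation.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical counter: number of Tracked objects constructed so far.
std::atomic<int> g_constructed{0};

struct Tracked {
    // Non-trivial constructor, so initialization is dynamic and observable.
    Tracked() { g_constructed.fetch_add(1); }
    char data[1024]{};
};

void fn() {
    // Constructed the first time the calling thread passes through this line.
    thread_local Tracked t;
    t.data[0] = 1;
}

// Starts `total` threads; only the first `callers` of them run fn().
// Returns how many Tracked objects were constructed.
int count_constructions(int total, int callers) {
    assert(callers <= total);
    g_constructed.store(0);
    std::vector<std::thread> threads;
    for (int i = 0; i < total; ++i) {
        threads.emplace_back([i, callers] {
            if (i < callers) fn();
        });
    }
    for (auto& th : threads) th.join();
    return g_constructed.load();
}
```

With the numbers from the question, count_constructions(80, 47) returns 47: a block-scope thread_local is initialized the first time control passes through its declaration in a given thread, so the 33 threads that never call fn() never run the constructor.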

Answer 2

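This is likely to be largely implementation dependent, though you can verify the behaviour of your own implementation fairly easily. For example, run the following program on Windows (using a Visual Studio debug build, so that the optimizer does not remove the unused code):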

#include <array>
#include <chrono>   // std::chrono::seconds
#include <cstddef>  // std::size_t
#include <iostream>
#include <thread>

struct Foo
{
    std::array<char, 1'000'000'000> data;
};

void bar()
{
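    // Per-thread instance holding a 1 GB array.
    thread_local Foo foo;
    // Touch every byte so the pages are really in use.
    for (std::size_t i = 0; i < foo.data.size(); i++)
    {
        foo.data[i] = static_cast<char>(i);
    }
    // Keep the thread alive so memory usage can be inspected.
    std::this_thread::sleep_for(std::chrono::seconds(1000));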
}

int main()
{
    std::thread thread1([]
    {
        bar();
    });

    std::thread thread2([]
    {
        std::this_thread::sleep_for(std::chrono::seconds(1000));
    });

    thread1.join();
    thread2.join();
}

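It uses 3GB of memory (1GB for each of the two threads, plus 1GB for the main thread). Removing thread2 brings memory usage down to 2GB. On Linux this behaviour may be different, because Linux overcommits and untouched memory pages are not actually allocated until they are used.

You can avoid this by using a smart pointer, so that the memory is only allocated when it is actually used. For example, change bar to: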

void bar()
{
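    // Needs #include <memory>. The Foo is heap-allocated the first time
    // the calling thread reaches this declaration.
    thread_local std::unique_ptr<Foo> foo = std::make_unique<Foo>();
    for (std::size_t i = 0; i < foo->data.size(); i++)
    {
        foo->data[i] = static_cast<char>(i);
    }
    std::this_thread::sleep_for(std::chrono::seconds(1000));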
}

This reduces memory usage to 1GB, because only thread1 actually allocates the large array; thread2 and the main thread only need storage for the unique_ptr itself.
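To check the same idea without needing gigabytes of RAM, a smaller sketch along these lines counts the heap allocations instead of watching process memory. Buffer, g_allocated, g_live, AllocStats and run_threads are hypothetical names for this example, and the 1 MB size is an arbitrary stand-in for the large array.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>
#include <vector>

std::atomic<int> g_allocated{0};  // hypothetical: Buffers ever created
std::atomic<int> g_live{0};       // hypothetical: Buffers currently alive

struct Buffer {
    Buffer() { g_allocated.fetch_add(1); g_live.fetch_add(1); }
    ~Buffer() { g_live.fetch_sub(1); }
    char data[1024 * 1024]{};  // stand-in for the large array
};

void bar() {
    // Only the pointer lives in thread-local storage; the Buffer is
    // heap-allocated the first time the calling thread gets here.
    thread_local std::unique_ptr<Buffer> buf = std::make_unique<Buffer>();
    buf->data[0] = 1;
}

struct AllocStats {
    int allocated;        // Buffers created during the run
    int live_after_join;  // Buffers still alive after all threads joined
};

// Starts `total` threads; only the first `callers` of them run bar().
AllocStats run_threads(int total, int callers) {
    assert(callers <= total);
    g_allocated.store(0);
    g_live.store(0);
    std::vector<std::thread> threads;
    for (int i = 0; i < total; ++i) {
        threads.emplace_back([i, callers] {
            if (i < callers) bar();
        });
    }
    for (auto& th : threads) th.join();
    return AllocStats{g_allocated.load(), g_live.load()};
}
```

Calling run_threads(80, 47) should report 47 allocations, one per thread that called bar(), and no live buffers once the threads have been joined, since each unique_ptr is destroyed when its thread exits.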

c

c++

thread-local-storage