thread_local at block scope

Question

What is the use of a thread_local variable at block scope?

If a compilable sample helps to illustrate the question, here it is:

#include <thread>
#include <iostream>

namespace My {
    void f(int *const p) {++*p;}
}

int main()
{
    thread_local int n {42};
    std::thread t(My::f, &n);
    t.join();
    std::cout << n << "\n";
    return 0;
}

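Output: 43

In the sample, the new thread gets its own n but (as far as I know) cannot do anything interesting with it, so why bother? Does the new thread's own n have any use? And if it has no use, what is the point?

Naturally, I assume that there is a point. I just do not know what it might be. That is why I am asking.

If the new thread's own n requires (as I suppose) special handling by the CPU at runtime, perhaps because at the machine-code level one cannot access one's own n in the normal way, via a precalculated offset from the base pointer of the new thread's stack, then are we not merely wasting machine cycles and electricity for no gain? And yet even if special handling were not required, there is still no gain! Not that I can see.

So why have thread_local at block scope?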

Answer 1

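I have found thread_local to be useful in only three cases:

1. You need each thread to have its own unique resource, so that threads do not have to share it, lock a mutex, and so on in order to use it. Even then, this is only useful if the resource is large and/or expensive to create, or needs to persist across function calls (i.e. a local variable inside the function will not suffice).
2. An offshoot of (1): you may need special logic to run when a calling thread eventually terminates. For this, you can use the destructor of a thread_local object created in the function. The destructor of such an object is called once (at the end of the thread's lifetime) for every thread that entered the code block containing the thread_local declaration.
3. You may need some other logic to be performed once, and only once, for each unique thread that calls the function. For example, you could write a function that registers each unique thread that calls it. This may sound odd, but I have found it useful for managing garbage-collected resources in a library I am developing. This usage is closely related to (1), except that the object is not used after its construction. It is effectively a sentry object for the entire lifetime of the thread.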
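Cases (2) and (3) can be sketched roughly as follows. This is only a minimal illustration: the names ThreadSentry, library_call, registered and unregistered are made up for the example and do not come from any library.

```cpp
#include <atomic>
#include <thread>

// Hypothetical counters, present only to make the effect observable.
std::atomic<int> registered{0};
std::atomic<int> unregistered{0};

// Hypothetical sentry type: the constructor runs once per thread,
// the destructor runs when that thread exits.
struct ThreadSentry {
    ThreadSentry()  { ++registered; }
    ~ThreadSentry() { ++unregistered; }
};

void library_call() {
    // Constructed the first time each thread reaches this line and
    // destroyed at the end of that thread's lifetime. Later calls on
    // the same thread skip the construction.
    thread_local ThreadSentry sentry;
    (void)sentry;  // the object is never used after construction
}
```

Calling library_call() several times from one std::thread and then joining it should leave each counter incremented exactly once for that thread.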

Answer 2

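Aside from the great examples already given by Cruz Jean (to which I don't think I can add anything), consider also the following: there is no reason to forbid it. I don't think you doubt the usefulness of thread_local or question why it should be in the language at all. A thread_local block-scope variable has a well-defined meaning simply because of how storage classes and scope work in C++. The fact that one cannot think of something "interesting" to do with every possible combination of language features does not mean that every combination without at least one known "interesting" application must be explicitly forbidden. By that logic, we would also have to go ahead and forbid classes without private members from having friends, and so on.

To me, at least, C++ in particular seems to follow the philosophy of "if there's no specific technical reason why feature X cannot work in situation Y, then there's no reason to forbid it", which I consider a very healthy approach. Forbidding things for no reason means adding complexity for no reason, and I believe everyone would agree that there is already enough complexity in C++. It also leaves room for lucky accidents, such as when, only after many years, a language feature is suddenly discovered to have applications nobody had thought of before. The most prominent example is probably templates, which (at least as far as I know) were not originally conceived for the purpose of metaprogramming; it only turned out later that they can be used for that as well.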

Answer 3

First note that a block-local thread-local variable is implicitly static thread_local. In other words, your example code is equivalent to:

int main()
{
    static thread_local int n {42};
    std::thread t(My::f, &n);
    t.join();
    std::cout << n << "\n"; // prints 43
    return 0;
}

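Variables declared with thread_local inside a function are not very different from globally defined thread_locals. In both cases you create an object that is unique per thread and whose lifetime is bound to the lifetime of the thread.

The difference is mainly when initialization happens. A globally defined thread_local is initialized when the new thread starts, before any thread-specific function is entered (for dynamic initialization, an implementation is allowed to defer this until the variable is first used in that thread). In contrast, a block-local thread-local variable is initialized the first time control passes through its declaration.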
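The block-scope initialization rule can be observed with a small sketch like the one below. The names Tracked, maybe_use, constructions and constructed_by_new_thread are hypothetical helpers invented for this illustration.

```cpp
#include <atomic>
#include <thread>

// Hypothetical counter, used only to observe when construction happens.
std::atomic<int> constructions{0};

struct Tracked {
    Tracked() { ++constructions; }
};

void maybe_use(bool use) {
    if (use) {
        // Initialized only if (and when) a thread actually reaches this
        // declaration, and at most once per thread.
        thread_local Tracked t;
        (void)t;
    }
}

// Runs maybe_use(use) twice on a fresh thread and returns how many
// Tracked objects that thread constructed.
int constructed_by_new_thread(bool use) {
    const int before = constructions.load();
    std::thread th([use] { maybe_use(use); maybe_use(use); });
    th.join();
    return constructions.load() - before;
}
```

A thread that never passes through the declaration constructs nothing, while a thread that passes through it twice constructs the object only once.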

One use case is speeding up a function by defining a local cache that is reused for the lifetime of the thread:

void foo() {
  static thread_local MyCache cache;
  // ...
}

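(I used static thread_local here to make it explicit that the cache will be reused if the function is executed multiple times in the same thread, but this is a matter of taste. If you drop static, it makes no difference.)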
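To make the cache idea a bit more concrete, here is a hedged sketch of a memoizing function. slow_square stands in for some expensive computation, and the optional hit parameter exists only so the behavior can be observed; both are assumptions made for this example, not part of the answer above.

```cpp
#include <cstdint>
#include <thread>
#include <unordered_map>

// Hypothetical stand-in for an expensive computation.
std::uint64_t slow_square(std::uint64_t x) { return x * x; }

// Each thread gets its own cache, so no mutex is needed.
// If `hit` is non-null, it reports whether the result came from the cache.
std::uint64_t cached_square(std::uint64_t x, bool* hit = nullptr) {
    thread_local std::unordered_map<std::uint64_t, std::uint64_t> cache;
    const auto it = cache.find(x);
    if (it != cache.end()) {
        if (hit) *hit = true;
        return it->second;
    }
    if (hit) *hit = false;
    const std::uint64_t value = slow_square(x);
    cache.emplace(x, value);
    return value;
}

// Returns whether a brand-new thread finds x already cached.
bool fresh_thread_hits(std::uint64_t x) {
    bool hit = false;
    std::thread t([&] { cached_square(x, &hit); });
    t.join();
    return hit;
}
```

The second call with the same argument on one thread is served from that thread's cache, whereas a newly started thread begins with an empty cache of its own.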

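A remark about your example code. Maybe this was intentional, but the thread is not really accessing its own thread_local n. Instead it operates on a copy of a pointer that was created by the thread running main, so both threads refer to the same memory.

In other words, a more verbose way of writing it would be: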

int main()
{
    thread_local int n {42};
    int* n_ = &n;
    std::thread t(My::f, n_);
    t.join();
    std::cout << n << "\n"; // prints 43
    return 0;
}

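If you change the code so that the thread accesses n by name, it operates on its own copy, and the n belonging to the main thread is not modified: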

int main()
{
    thread_local int n {42};
    std::thread t([&] { My::f(&n); });
    t.join();
    std::cout << n << "\n"; // prints 42 (not 43)
    return 0;
}

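Here is a more complicated example. It calls the function twice to show that state is preserved between calls. Its output also shows that the threads operate on their own state: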

#include <iostream>
#include <thread>

void foo() {
  thread_local int n = 1;
  std::cout << "n=" << n << " (main)" << std::endl;
  n = 100;
  std::cout << "n=" << n << " (main)" << std::endl;
  int& n_ = n;
  std::thread t([&] {
          std::cout << "t executing...\n";
          std::cout << "n=" << n << " (thread 1)\n";
          std::cout << "n_=" << n_ << " (thread 1)\n";
          n += 1;
          std::cout << "n=" << n << " (thread 1)\n";
          std::cout << "n_=" << n_ << " (thread 1)\n";
          std::cout << "t executing...DONE" << std::endl;
        });
  t.join();
  std::cout << "n=" << n << " (main, after t.join())\n";
  n = 200;
  std::cout << "n=" << n << " (main)" << std::endl;

  std::thread t2([&] {
          std::cout << "t2 executing...\n";
          std::cout << "n=" << n << " (thread 2)\n";
          std::cout << "n_=" << n_ << " (thread 2)\n";
          n += 1;
          std::cout << "n=" << n << " (thread 2)\n";
          std::cout << "n_=" << n_ << " (thread 2)\n";
          std::cout << "t2 executing...DONE" << std::endl;
        });
  t2.join();
  std::cout << "n=" << n << " (main, after t2.join())" << std::endl;
}

int main() {
  foo();
  std::cout << "---\n";
  foo();
  return 0;
}

Output:

n=1 (main)
n=100 (main)
t executing...
n=1 (thread 1)      # the thread used the "n = 1" init code
n_=100 (thread 1)   # the passed reference, not the thread_local
n=2 (thread 1)      # write to the thread_local
n_=100 (thread 1)   # did not change the passed reference
t executing...DONE
n=100 (main, after t.join())
n=200 (main)
t2 executing...
n=1 (thread 2)
n_=200 (thread 2)
n=2 (thread 2)
n_=200 (thread 2)
t2 executing...DONE
n=200 (main, after t2.join())
---
n=200 (main)        # second execution: old state is reused
n=100 (main)
t executing...
n=1 (thread 1)
n_=100 (thread 1)
n=2 (thread 1)
n_=100 (thread 1)
t executing...DONE
n=100 (main, after t.join())
n=200 (main)
t2 executing...
n=1 (thread 2)
n_=200 (thread 2)
n=2 (thread 2)
n_=200 (thread 2)
t2 executing...DONE
n=200 (main, after t2.join())

Answer 4

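static thread_local and thread_local are equivalent at block scope. thread_local has thread storage duration, which is neither static nor automatic, so adding static has no effect on the storage duration; the presence of thread_local alone implies thread storage duration. (auto is no longer a storage-class specifier since C++11, so only static and extern can be combined with thread_local.) static does not change linkage at block scope either, because a non-extern block-scope variable never has linkage, so there it has no meaning other than modifying storage duration. extern thread_local is also possible at block scope.

At file scope, static thread_local gives the thread_local variable internal linkage, which means there will be one copy per translation unit in the TLS (each translation unit resolves to its own variable at the TLS index of the .exe, because the assembler inserts the variable into the rdata$t section of the .o file and marks it as a local symbol in the symbol table, since nothing declares the symbol global). extern thread_local is legal at file scope just as it is at block scope, and uses the thread_local copy defined in another translation unit. thread_local at file scope is not implicitly static, because it can provide a global symbol definition for another translation unit, which a block-scope variable cannot do.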

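For ELF, the compiler stores all initialized thread_local variables (including block-scope ones) in .tdata and uninitialized ones in .tbss; for the PE format, everything goes into .tls. I assume that, when a thread is created, the thread library accesses the .tls segment and performs the Windows API calls (TlsAlloc and TlsSetValue), which allocate the variables on the heap for each .exe and .dll, place a pointer in the TLS array of the thread's TEB in the GS segment, and return the allocated index, as well as calling the DLL_THREAD_ATTACH routine of dynamic libraries. Presumably, a pointer to a value in the space delimited by _tls_start and _tls_end is what gets passed to TlsSetValue as the value pointer.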

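The difference between file-scope static/extern thread_local and block-scope (extern) thread_local is the same as the general difference between file-scope static/extern and block-scope static/extern: the name of a block-scope thread_local variable goes out of scope at the end of the function in which it is defined, but because of its thread storage duration the object can still be returned by address and accessed.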
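The last point, that the object outlives the scope of its name, can be illustrated with a short sketch. The function names per_thread_counter, bump and bump_on_new_thread are hypothetical and exist only for this example.

```cpp
#include <thread>

// Returns a reference to the calling thread's counter. The name `counter`
// is visible only inside this function, but the object itself lives until
// the thread exits, so handing out a reference is safe on that thread.
int& per_thread_counter() {
    thread_local int counter = 0;
    return counter;
}

// Increments the calling thread's counter and returns the new value.
int bump() { return ++per_thread_counter(); }

// Bumps once on a new thread and returns the value that thread observed.
int bump_on_new_thread() {
    int seen = 0;
    std::thread t([&] { seen = bump(); });
    t.join();
    return seen;
}
```

Repeated calls to bump() on one thread keep counting upward, while a new thread starts again from its own freshly initialized counter.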

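The compiler knows the offset of the data within the .tls segment, so it can emit the accesses through the GS segment directly, as can be seen on godbolt.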

MSVC

thread_local int a = 5;

int square(int num) {
    thread_local int i = 5;
    return a * i;
}

_TLS    SEGMENT
int a DD        05H                           ; a
_TLS    ENDS
_TLS    SEGMENT
int `int square(int)'::`2'::i DD 05H                        ; `square'::`2'::i
_TLS    ENDS

num$ = 8
int square(int) PROC                                    ; square
        mov     DWORD PTR [rsp+8], ecx
        mov     eax, OFFSET FLAT:int a      ; a
        mov     eax, eax
        mov     ecx, DWORD PTR _tls_index
        mov     rdx, QWORD PTR gs:88
        mov     rcx, QWORD PTR [rdx+rcx*8]
        mov     edx, OFFSET FLAT:int `int square(int)'::`2'::i
        mov     edx, edx
        mov     r8d, DWORD PTR _tls_index
        mov     r9, QWORD PTR gs:88
        mov     r8, QWORD PTR [r9+r8*8]
        mov     eax, DWORD PTR [rcx+rax]
        imul    eax, DWORD PTR [r8+rdx]
        ret     0
int square(int) ENDP                                    ; square

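This loads a 64-bit pointer from gs:88 (gs:[0x58], the linear address of the thread-local storage array), then loads another 64-bit pointer from [rdx+rcx*8], i.e. TLS array pointer + _tls_index * 8 (which locates the array slot as index * pointer size). int a is then loaded from that pointer plus its offset into the .tls segment. Given that both variables use the same _tls_index, this suggests that there is one index per .exe, i.e. per .tls section; indeed, there is one _tls_index per TLS directory in .rdata, and the variables are packed together at the address the TLS array slot points to. static thread_local variables from different translation units are merged into .tls and all packed at the same index.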
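The two-step lookup described above can be modeled in portable C++ as follows. This is a deliberately simplified model and not the real Windows data structures: ModuleTlsBlock, ThreadModel and load_tls_int are invented names, and the byte offsets stand in for the OFFSET FLAT values in the listing.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical model of one module's per-thread copy of its .tls data.
struct ModuleTlsBlock {
    std::vector<unsigned char> bytes;
};

// Hypothetical model of a thread: stands in for the per-thread TLS array
// reached through gs:[0x58], with one block pointer per module.
struct ThreadModel {
    std::vector<ModuleTlsBlock*> tls_array;
};

// Step 1 mirrors `mov rcx, [rdx+rcx*8]`: pick the module's block by index.
// Step 2 mirrors `mov eax, [rcx+rax]`: read the variable at its offset.
int load_tls_int(const ThreadModel& thread, std::size_t tls_index,
                 std::size_t offset) {
    const ModuleTlsBlock* block = thread.tls_array[tls_index];
    int value = 0;
    std::memcpy(&value, block->bytes.data() + offset, sizeof value);
    return value;
}
```

In this model, both a and i would be read through the same tls_index and differ only in their offset within the block, which matches what the assembly suggests.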

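I believe that mainCRTStartup, which the linker always includes in the final executable and makes the entry point if the program is linked as a console application, references the _tls_used variable (because each .exe needs its own index), and that whichever object file in libcmt.lib defines it places it in fragment T of .rdata by means of a pragma (because mainCRTStartup references it, the linker includes it in the final executable). If the linker finds a reference to the _tls_used variable, it makes sure to include it and makes sure that the TLS directory entry in the PE header points to it.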

#pragma section(".rdata$T", long, read)    //creates a read only section called `.rdata` if not created and a fragment T in the section
#define _CRTALLOC(x) __declspec(allocate(x))
#pragma data_seg()   //set the compilers current default data section to `.data`

_CRTALLOC(".rdata$T")  //place in the section .rdata, fragment T
const IMAGE_TLS_DIRECTORY _tls_used =
{
 (ULONG)(ULONG_PTR) &_tls_start, // start of tls data in the tls section
 (ULONG)(ULONG_PTR) &_tls_end,   // end of tls data
 (ULONG)(ULONG_PTR) &_tls_index, // address of tls_index
 (ULONG)(ULONG_PTR) (&__xl_a+1), // pointer to callbacks
 (ULONG) 0,                      // size of tls zero fill
 (ULONG) 0                       // characteristics
};

http://www.nynaeve.net/?p=183

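_tls_used is a variable of the structure type IMAGE_TLS_DIRECTORY, initialized as shown above, and is actually defined in tlssup.c. Before that, the file defines _tls_index, _tls_start and _tls_end, placing _tls_start at the beginning of the .tls section and _tls_end at the end, by putting the latter in the section fragment ZZZ so that it sorts alphabetically to the end of the section: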

#pragma data_seg(".tls") //set the compilers current default data section to `.tls`

#if defined (_M_IA64) || defined (_M_AMD64)
_CRTALLOC(".tls")   //place the following in the section named `.tls`
#endif
char _tls_start = 0;   //if not defined, place in the current default data section, which is also `.tls`

#pragma data_seg(".tls$ZZZ")

#if defined (_M_IA64) || defined (_M_AMD64)
_CRTALLOC(".tls$ZZZ")
#endif
char _tls_end = 0;

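These addresses are then used as the markers in the _tls_used TLS directory. The linker resolves the addresses only once the .tls section is complete and has a fixed relative (lea) location.

GCC (TLS sits directly before the FS base; raw data rather than pointers)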

 mov    edx,DWORD PTR fs:0xfffffffffffffff8 //access thread_local int1 inside function
 mov    eax,DWORD PTR fs:0xfffffffffffffffc //access thread_local int2 inside function

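Making one, both, or neither of the variables block-local produces the same code.

When a thread finishes executing, the thread library on Windows releases the storage with a TlsFree() call (it also has to free the heap memory that the stored TLS pointer refers to).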
