在两个逻辑 CPU 之间共享一个 TLB 条目（Intel）

Question

我想知道当属于同一程序且具有相同 PCID 的两个线程被调度到同一物理 CPU 上的运行时是否可以共享 TLB 条目？

我已经研究过 SDM (https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html)；第 3115 页（TLB 和 HT）没有提到任何共享机制。但文档的另一部分指出，在访问TLB条目之前，检查PCID值，如果相等，则使用该值。但是，PCID 标识符旁边还有一个用于当前线程集的位。

我的问题：使用的 PCID 值是否优先于 CPU-thread 位，或者两个值是否必须匹配？

Answer 1

根据我的观察，这是不可能的（至少对于 dTLB），即使它会带来性能优势。

我是怎么得出这个结论的

按照 Peter 的建议，我编写了一个小程序，其中包含两个反复访问同一个堆区域的工作线程。

使用 -O0 编译以防止优化。

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <inttypes.h>
#include <err.h>
#include <sched.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

int repetitions = 1ll << 20;
uint64_t ptrsize = 1ll<<18;
uint64_t main_cpu, co_cpu ;

void pin_task_to(int pid, int cpu)
{
    cpu_set_t cset;
    CPU_ZERO(&cset);
    CPU_SET(cpu, &cset);
    if (sched_setaffinity(pid, sizeof(cpu_set_t), &cset))
        err(1, "affinity");
}
void pin_to(int cpu) { pin_task_to(0, cpu); }


void *foo(void *p)
{
    pin_to(main_cpu);

    int value;
    uint8_t *ptr = (uint8_t *)p;
    printf("Running on CPU: %d\n", sched_getcpu());
    for (size_t j = 0; j < repetitions; j++)
    {
        for (size_t i = 0; i < ptrsize; i += PAGE_SIZE)
        {
            value += ptr[i];
        }
    }
    volatile int dummy = value;
    pthread_exit(NULL);
}

void *boo(void *p)
{
    pin_to(co_cpu);

    int value;
    uint8_t *ptr = (uint8_t *)p;
    printf("Running on CPU: %d\n", sched_getcpu());
    for (size_t j = 0; j < repetitions; j++)
    {
        for (size_t i = 0; i < ptrsize; i+=PAGE_SIZE)
        {
            value += ptr[i];
        }
    }
    volatile int dummy = value;
    pthread_exit(NULL);
}

int main(int argc, char **argv)
{
    if (argc < 3){
        exit(-1);
    }
    main_cpu = strtoul(argv[1], NULL, 16);
    co_cpu = strtoul(argv[2], NULL, 16);
    pthread_t id[2];
    void *mptr = malloc(ptrsize);

    pthread_create(&id[0], NULL, foo, mptr);
    pthread_create(&id[1], NULL, boo, mptr);

    pthread_join(id[0], NULL);
    pthread_join(id[1], NULL);
}

我决定对内存区域中的所有值求和（显然，value 会溢出）以防止 CPU 进行微架构优化。

[另一个想法是简单地逐字节取消引用内存区域并将值加载到 RAX]

我们遍历内存区域 repetitions 次以减少一个运行内的噪音，这是由于线程和其他进程的启动时间略有不同以及系统中断引起的。

结果

我的机器有四个物理内核和八个逻辑内核。逻辑核心 x 和 x+4 位于同一个物理核心 (lstopo)。

运行在同一个逻辑核心上

由于内核使用 PCID 来识别 TLB 条目，因此切换到另一个线程的上下文不应使 TLB 无效。

> $ perf stat -e dtlb_load_misses.stlb_hit,dtlb_load_misses.miss_causes_a_walk,cycles,task-clock ./main 1 1
Running on CPU: 1
Running on CPU: 1

 Performance counter stats for './main 1 1':

        12,621,724      dtlb_load_misses.stlb_hit:u #   49.035 M/sec
             1,152      dtlb_load_misses.miss_causes_a_walk:u #    4.475 K/sec
       834,363,092      cycles:u                  #    3.241 GHz
            257.40 msec task-clock:u              #    0.997 CPUs utilized

       0.258177969 seconds time elapsed

       0.258253000 seconds user
       0.000000000 seconds sys

运行在两个不同的物理内核上

没有任何 TLB 共享或干扰。

> $ perf stat -e dtlb_load_misses.stlb_hit,dtlb_load_misses.miss_causes_a_walk,cycles,task-clock ./main 1 2
Running on CPU: 1
Running on CPU: 2

 Performance counter stats for './main 1 2':

        11,740,758      dtlb_load_misses.stlb_hit:u #   45.962 M/sec
             1,647      dtlb_load_misses.miss_causes_a_walk:u #    6.448 K/sec
       834,021,644      cycles:u                  #    3.265 GHz
            255.44 msec task-clock:u              #    1.991 CPUs utilized

       0.128304564 seconds time elapsed

       0.255768000 seconds user
       0.000000000 seconds sys

运行在同一个物理内核上

如果 TLB 共享是可能的，我希望这里有最低的 sTLB 点击率和较少的 dTLB 页面浏览。但是相反，我们在这两种情况下的数量最多。

> $ perf stat -e dtlb_load_misses.stlb_hit,dtlb_load_misses.miss_causes_a_walk,cycles,task-clock ./main 1 5
Running on CPU: 1
Running on CPU: 5

 Performance counter stats for './main 1 5':

       140,040,429      dtlb_load_misses.stlb_hit:u #  291.368 M/sec
           198,827      dtlb_load_misses.miss_causes_a_walk:u #  413.680 K/sec
     1,596,298,827      cycles:u                  #    3.321 GHz
            480.63 msec task-clock:u              #    1.990 CPUs utilized

       0.241509701 seconds time elapsed

       0.480996000 seconds user
       0.000000000 seconds sys

结论

如您所见，当运行在同一物理内核上运行时，我们有最多 sTLB 次点击和 dTLB 次页面浏览。因此，我会从中得出结论，同一物理内核上的同一 PCID 没有共享机制。运行同一逻辑核心和两个不同物理核心上的进程导致对 sTLB 的 misses/hits 数量大致相同。这进一步支持了在同一逻辑内核上共享但在物理内核上不共享的论点。

更新

正如 Peter 所建议的那样，还使用 linked-list 方法来防止 THP 和预取。修改后的数据如下图

用-O0编译以防止优化

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <inttypes.h>
#include <err.h>
#include <sched.h>
#include <time.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

const int repetitions = 1ll << 20;
const uint64_t ptrsize = 1ll<< 5;
uint64_t main_cpu, co_cpu ;

void pin_task_to(int pid, int cpu)
{
    cpu_set_t cset;
    CPU_ZERO(&cset);
    CPU_SET(cpu, &cset);
    if (sched_setaffinity(pid, sizeof(cpu_set_t), &cset))
        err(1, "affinity");
}
void pin_to(int cpu) { pin_task_to(0, cpu); }


void *foo(void *p)
{
    pin_to(main_cpu);

    uint64_t *value;
    uint64_t *ptr = (uint64_t *)p;
    printf("Running on CPU: %d\n", sched_getcpu());
    for (size_t j = 0; j < repetitions; j++)
    {
        value = ptr;
        for (size_t i = 0; i < ptrsize; i++)
        {
            value = (uint64_t *)*value;
        }
    }
    volatile uint64_t *dummy = value;
    pthread_exit(NULL);
}

void *boo(void *p)
{
    pin_to(co_cpu);

    uint64_t *value;
    uint64_t *ptr = (uint64_t *)p;
    printf("Running on CPU: %d\n", sched_getcpu());
    for (size_t j = 0; j < repetitions; j++)
    {
        value = ptr;
        for (size_t i = 0; i < ptrsize; i++)
        {
            value = (uint64_t *)*value;
        }
    }
    volatile uint64_t *dummy = value;
    pthread_exit(NULL);
}

int main(int argc, char **argv)
{
    if (argc < 3){
        exit(-1);
    }
    srand(time(NULL));

    uint64_t *head,*tail,*tmp_ptr;
    int r;
    head = mmap(NULL,PAGE_SIZE,PROT_READ|PROT_WRITE,MAP_PRIVATE | MAP_ANONYMOUS,0,0);
    tail = head;
    for (size_t i = 0; i < ptrsize; i++)
    {
        r = (rand() & 0xF) +1;
        // try to use differents offset to the next page to prevent microarch prefetching
        tmp_ptr = mmap(tail-r*PAGE_SIZE, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
        *tail = (uint64_t)tmp_ptr;
        tail = tmp_ptr;
    }

    printf("%Lx, %lx\n", head, *head);
    main_cpu = strtoul(argv[1], NULL, 16);
    co_cpu = strtoul(argv[2], NULL, 16);
    pthread_t id[2];

    pthread_create(&id[0], NULL, foo, head);
    pthread_create(&id[1], NULL, boo, head);

    pthread_join(id[0], NULL);
    pthread_join(id[1], NULL);
}

相同的逻辑核心

> $ perf stat -e dtlb_load_misses.stlb_hit,dtlb_load_misses.miss_causes_a_walk,cycles,task-clock ./main 1 1                                 
7feac4d90000, 7feac4d5b000
Running on CPU: 1
Running on CPU: 1

 Performance counter stats for './main 1 1':

             3,696      dtlb_load_misses.stlb_hit:u #   11.679 K/sec
               743      dtlb_load_misses.miss_causes_a_walk:u #    2.348 K/sec
       762,856,367      cycles:u                  #    2.410 GHz
            316.48 msec task-clock:u              #    0.998 CPUs utilized

       0.317105072 seconds time elapsed

       0.316859000 seconds user
       0.000000000 seconds sys

不同的物理内核

> $ perf stat -e dtlb_load_misses.stlb_hit,dtlb_load_misses.miss_causes_a_walk,cycles,task-clock ./main 1 2                                 
7f59bb395000, 7f59bb34d000
Running on CPU: 1
Running on CPU: 2

 Performance counter stats for './main 1 2':

            15,144      dtlb_load_misses.stlb_hit:u #   49.480 K/sec
               756      dtlb_load_misses.miss_causes_a_walk:u #    2.470 K/sec
       770,800,780      cycles:u                  #    2.518 GHz
            306.06 msec task-clock:u              #    1.982 CPUs utilized

       0.154410840 seconds time elapsed

       0.306345000 seconds user
       0.000000000 seconds sys

相同的物理内核/不同的逻辑内核

> $ perf stat -e dtlb_load_misses.stlb_hit,dtlb_load_misses.miss_causes_a_walk,cycles,task-clock ./main 1 5                                 
7f7d69e8b000, 7f7d69e56000
Running on CPU: 5
Running on CPU: 1

 Performance counter stats for './main 1 5':

         9,237,992      dtlb_load_misses.stlb_hit:u #   20.554 M/sec
               789      dtlb_load_misses.miss_causes_a_walk:u #    1.755 K/sec
     1,007,185,858      cycles:u                  #    2.241 GHz
            449.45 msec task-clock:u              #    1.989 CPUs utilized

       0.225947522 seconds time elapsed

       0.449813000 seconds user
       0.000000000 seconds sys

在两个逻辑 CPU 之间共享一个 TLB 条目（Intel）

Sharing a TLB entry between two logical CPUs (Intel)

x86

intel

cpu-architecture

hyperthreading

tlb

我是怎么得出这个结论的

结果

结论

更新