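Using 1GB pages degrades performance
I have an application that needs about 850 MB of contiguous memory and accesses it in a random manner. It was suggested that I allocate one 1 GB huge page so that it would always stay in the TLB. I wrote a demo with sequential and random access to measure the performance of small pages (4 KB in my case) versus a huge page (1 GB):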
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT) // Aren't used in this example.
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
#define MESSINESS_LEVEL 512 // Poisons caches if LRU policy is used.
#define RUN_TESTS 25

void print_usage() {
  printf("Usage: ./program small|huge1gb sequential|random\n");
}

int main(int argc, char *argv[]) {
  if (argc != 3 && argc != 4) {
    print_usage();
    return -1;
  }
  uint64_t size = 1UL * 1024 * 1024 * 1024; // 1GB
  uint32_t *ptr;
  if (strcmp(argv[1], "small") == 0) {
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, // basically malloc(size);
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ptr == MAP_FAILED) {
      perror("mmap small");
      exit(1);
    }
  } else if (strcmp(argv[1], "huge1gb") == 0) {
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB, -1, 0);
    if (ptr == MAP_FAILED) {
      perror("mmap huge1gb");
      exit(1);
    }
  } else {
    print_usage();
    return -1;
  }

  clock_t start_time, end_time;
  start_time = clock();
  if (strcmp(argv[2], "sequential") == 0) {
    for (int iter = 0; iter < RUN_TESTS; iter++) {
      for (uint64_t i = 0; i < size / sizeof(*ptr); i++)
        ptr[i] = i * 5;
    }
  } else if (strcmp(argv[2], "random") == 0) {
    // pseudorandom access pattern, defeats caches.
    uint64_t index;
    for (int iter = 0; iter < RUN_TESTS; iter++) {
      for (uint64_t i = 0; i < size / MESSINESS_LEVEL / sizeof(*ptr); i++) {
        for (uint64_t j = 0; j < MESSINESS_LEVEL; j++) {
          index = i + j * size / MESSINESS_LEVEL / sizeof(*ptr);
          ptr[index] = index * 5;
        }
      }
    }
  } else {
    print_usage();
    return -1;
  }
  end_time = clock();

  long double duration = (long double)(end_time - start_time) / CLOCKS_PER_SEC;
  printf("Avr. Duration per test: %Lf\n", duration / RUN_TESTS);
  // write(1, ptr, size); // Dumps memory content (1GB to stdout).
}
On my machine (more on that below) the results are:
Sequential:
$ ./test small sequential
Avr. Duration per test: 0.562386
$ ./test huge1gb sequential <--- slightly better
Avr. Duration per test: 0.543532
Random:
$ ./test small random <--- better
Avr. Duration per test: 2.911480
$ ./test huge1gb random
Avr. Duration per test: 6.461034
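I am bothered by the random test: the 1 GB page appears to be about 2 times slower!
I tried using madvise with MADV_SEQUENTIAL / MADV_RANDOM for the corresponding tests, but it did not help.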
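Why does using one huge page degrade performance in the case of random access? What are the use cases for huge pages (2 MB and 1 GB) in general?
I did not test this code with 2 MB pages; I think they would probably do better. I also suspect that, since a 1 GB page is stored in one memory bank, it may have something to do with multi-channel memory. But I would like to hear from you folks. Thanks.
Note: to run the test you must first enable 1 GB pages in your kernel. You can do that by passing the kernel parameters hugepagesz=1G hugepages=1 default_hugepagesz=1G. More: https://wiki.archlinux.org/index.php/Kernel_parameters. If enabled, you should see something like: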
$ cat /proc/meminfo | grep Huge
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 1
HugePages_Free: 1
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
Hugetlb: 1048576 kB
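The same check can be done from inside the program before attempting the huge page mmap. The following is a sketch under the assumption that /proc/meminfo has the "Key: value" layout shown above; the helper names meminfo_field and free_hugepages_available are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: find "key:" at the start of a line in meminfo-style
 * text and parse the number after it. Returns 0 on success, -1 otherwise. */
int meminfo_field(const char *text, const char *key, long *out) {
    size_t klen = strlen(key);
    const char *line = text;
    while (line && *line) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':')
            return sscanf(line + klen + 1, "%ld", out) == 1 ? 0 : -1;
        line = strchr(line, '\n');
        if (line) line++;            /* step past the newline */
    }
    return -1;
}

/* Returns 1 if at least one huge page is free, 0 if none, -1 on error. */
int free_hugepages_available(void) {
    char buf[8192];
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) return -1;
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    long free_pages = 0;
    if (meminfo_field(buf, "HugePages_Free", &free_pages) != 0) return -1;
    return free_pages > 0;
}
```

Checking this first gives a clearer error than a bare "mmap huge1gb" failure when no 1 GB page has been reserved.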
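EDIT1: My machine has a Core i5 8600 and 4 memory banks of 4 GB each. The CPU natively supports both 2 MB and 1 GB pages (it has the pse and pdpe1gb flags, see: https://wiki.debian.org/Hugepages#x86_64). I was measuring machine time, not CPU time; I updated the code, and the results are now the average of 25 tests.
I was also told that this test does better on 2 MB pages than on normal 4 KB pages.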
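Not an answer, but more details on this perplexing issue.
Performance counters show a roughly similar number of instructions, but about twice as many cycles spent when huge pages are used:
- 4KiB pages IPC 0.29,
- 1GiB pages IPC 0.10.
These IPC numbers show that the code is bottlenecked on memory access (CPU-bound IPC on Skylake is 3 and above). Huge pages bottleneck harder.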
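The IPC figures are simply the instructions counter divided by the cycles counter from the perf stat output in this answer. As a small sketch (the function name ipc is chosen here for illustration):

```c
#include <stdint.h>

/* Instructions per cycle from two raw perf counter values. */
double ipc(uint64_t instructions, uint64_t cycles) {
    return cycles ? (double)instructions / (double)cycles : 0.0;
}
```

Feeding it the counters from the two Skylake runs, ipc(3268573978, 11448252551) and ipc(2175972910, 21911541450), gives values that round to the 0.29 and 0.10 that perf prints.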
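I modified your benchmark to use MAP_POPULATE | MAP_LOCKED | MAP_FIXED with the fixed address 0x600000000000 in both cases, to eliminate the timing variation associated with page faults and random mapping addresses. On my Skylake system, 2MiB and 1GiB pages are more than 2 times slower than 4KiB pages.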
Compiled with g++-8.4.0 -std=gnu++14 -pthread -m{arch,tune}=skylake -O3 -DNDEBUG:
[max@supernova:~/src/test] $ sudo hugeadm --pool-pages-min 2MB:64 --pool-pages-max 2MB:64
[max@supernova:~/src/test] $ sudo hugeadm --pool-pages-min 1GB:1 --pool-pages-max 1GB:1
[max@supernova:~/src/test] $ for s in small huge; do sudo chrt -f 40 taskset -c 7 perf stat -dd ./release/gcc/test $s random; done
Duration: 2156150
Performance counter stats for './release/gcc/test small random':
2291.190394 task-clock (msec) # 1.000 CPUs utilized
1 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
53 page-faults # 0.023 K/sec
11,448,252,551 cycles # 4.997 GHz (30.83%)
3,268,573,978 instructions # 0.29 insn per cycle (38.55%)
430,248,155 branches # 187.784 M/sec (38.55%)
758,917 branch-misses # 0.18% of all branches (38.55%)
224,593,751 L1-dcache-loads # 98.025 M/sec (38.55%)
561,979,341 L1-dcache-load-misses # 250.22% of all L1-dcache hits (38.44%)
271,067,656 LLC-loads # 118.309 M/sec (30.73%)
668,118 LLC-load-misses # 0.25% of all LL-cache hits (30.73%)
<not supported> L1-icache-loads
220,251 L1-icache-load-misses (30.73%)
286,864,314 dTLB-loads # 125.203 M/sec (30.73%)
6,314 dTLB-load-misses # 0.00% of all dTLB cache hits (30.73%)
29 iTLB-loads # 0.013 K/sec (30.73%)
6,366 iTLB-load-misses # 21951.72% of all iTLB cache hits (30.73%)
2.291300162 seconds time elapsed
Duration: 4349681
Performance counter stats for './release/gcc/test huge random':
4385.282466 task-clock (msec) # 1.000 CPUs utilized
1 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
53 page-faults # 0.012 K/sec
21,911,541,450 cycles # 4.997 GHz (30.70%)
2,175,972,910 instructions # 0.10 insn per cycle (38.45%)
274,356,392 branches # 62.563 M/sec (38.54%)
560,941 branch-misses # 0.20% of all branches (38.63%)
7,966,853 L1-dcache-loads # 1.817 M/sec (38.70%)
292,131,592 L1-dcache-load-misses # 3666.84% of all L1-dcache hits (38.65%)
27,531 LLC-loads # 0.006 M/sec (30.81%)
12,413 LLC-load-misses # 45.09% of all LL-cache hits (30.72%)
<not supported> L1-icache-loads
353,438 L1-icache-load-misses (30.65%)
7,252,590 dTLB-loads # 1.654 M/sec (30.65%)
440 dTLB-load-misses # 0.01% of all dTLB cache hits (30.65%)
274 iTLB-loads # 0.062 K/sec (30.65%)
9,577 iTLB-load-misses # 3495.26% of all iTLB cache hits (30.65%)
4.385392278 seconds time elapsed
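Run on Ubuntu 18.04.5 LTS with an Intel i9-9900KS (which is not NUMA), 4x8GiB 4GHz CL17 RAM in all 4 slots, with the performance governor for no CPU frequency scaling, liquid cooling fans on max for no thermal throttling, FIFO 40 priority for no preemption, pinned to one specific CPU core for no CPU migration, over multiple runs. The results are similar with the clang++-8.0.0 compiler.
It feels like something is fishy in the hardware, like a store buffer per page frame, so that 4KiB pages allow for about 2 times more stores per unit of time.
It would be interesting to see results for AMD Ryzen 3 CPUs.
On an AMD Ryzen 9 5950X, the huge pages version is only up to 10% slower: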
Duration: 1578723
Performance counter stats for './release/gcc/test small random':
1,726.89 msec task-clock # 1.000 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
1,947 page-faults # 0.001 M/sec
8,189,576,204 cycles # 4.742 GHz (33.02%)
3,174,036 stalled-cycles-frontend # 0.04% frontend cycles idle (33.14%)
95,950 stalled-cycles-backend # 0.00% backend cycles idle (33.25%)
3,301,760,473 instructions # 0.40 insn per cycle
# 0.00 stalled cycles per insn (33.37%)
480,276,481 branches # 278.116 M/sec (33.49%)
864,075 branch-misses # 0.18% of all branches (33.59%)
709,483,403 L1-dcache-loads # 410.844 M/sec (33.59%)
1,608,181,551 L1-dcache-load-misses # 226.67% of all L1-dcache accesses (33.59%)
<not supported> LLC-loads
<not supported> LLC-load-misses
78,963,441 L1-icache-loads # 45.726 M/sec (33.59%)
46,639 L1-icache-load-misses # 0.06% of all L1-icache accesses (33.51%)
301,463,437 dTLB-loads # 174.570 M/sec (33.39%)
301,698,272 dTLB-load-misses # 100.08% of all dTLB cache accesses (33.28%)
54 iTLB-loads # 0.031 K/sec (33.16%)
2,774 iTLB-load-misses # 5137.04% of all iTLB cache accesses (33.05%)
243,732,886 L1-dcache-prefetches # 141.140 M/sec (33.01%)
<not supported> L1-dcache-prefetch-misses
1.727052901 seconds time elapsed
1.579089000 seconds user
0.147914000 seconds sys
Duration: 1628512
Performance counter stats for './release/gcc/test huge random':
1,680.06 msec task-clock # 1.000 CPUs utilized
1 context-switches # 0.001 K/sec
1 cpu-migrations # 0.001 K/sec
1,947 page-faults # 0.001 M/sec
8,037,708,678 cycles # 4.784 GHz (33.34%)
4,684,831 stalled-cycles-frontend # 0.06% frontend cycles idle (33.34%)
2,445,415 stalled-cycles-backend # 0.03% backend cycles idle (33.34%)
2,217,699,442 instructions # 0.28 insn per cycle
# 0.00 stalled cycles per insn (33.34%)
281,522,918 branches # 167.567 M/sec (33.34%)
549,427 branch-misses # 0.20% of all branches (33.33%)
312,930,677 L1-dcache-loads # 186.261 M/sec (33.33%)
1,614,505,314 L1-dcache-load-misses # 515.93% of all L1-dcache accesses (33.33%)
<not supported> LLC-loads
<not supported> LLC-load-misses
888,872 L1-icache-loads # 0.529 M/sec (33.33%)
13,140 L1-icache-load-misses # 1.48% of all L1-icache accesses (33.33%)
9,168 dTLB-loads # 0.005 M/sec (33.33%)
870 dTLB-load-misses # 9.49% of all dTLB cache accesses (33.33%)
1,173 iTLB-loads # 0.698 K/sec (33.33%)
1,914 iTLB-load-misses # 163.17% of all iTLB cache accesses (33.33%)
253,307,275 L1-dcache-prefetches # 150.772 M/sec (33.33%)
<not supported> L1-dcache-prefetch-misses
1.680230802 seconds time elapsed
1.628170000 seconds user
0.052005000 seconds sys
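Intel was kind enough to reply to this issue. See their answer below.
This issue is due to how physical pages are actually committed. In the case of 1GB pages, the memory is contiguous. So as soon as you write to any one byte within the 1GB page, the entire 1GB page is assigned. With 4KB pages, however, the physical pages are allocated as and when you touch each of the 4KB pages for the first time.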
for (uint64_t i = 0; i < size / MESSINESS_LEVEL / sizeof(*ptr); i++) {
  for (uint64_t j = 0; j < MESSINESS_LEVEL; j++) {
    index = i + j * size / MESSINESS_LEVEL / sizeof(*ptr);
    ptr[index] = index * 5;
  }
}
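In the innermost loop, the index changes at a stride of 512KB. So consecutive references map at 512KB offsets. Typically caches have 2048 sets (which is 2^11), so bits 6:16 select the set. But if you stride at 512KB offsets, bits 6:16 stay the same, so every access ends up selecting the same set and spatial locality is lost.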
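The set-index argument can be checked with a few lines of arithmetic. The sketch below assumes the simple model from the reply (64-byte lines, 2048 sets, index taken from address bits 6:16); real caches, in particular a sliced last-level cache, may index differently. Note that with 4-byte elements the index stride of 2^19 elements corresponds to 2 MiB in bytes; either way it is a multiple of 2^17 bytes, so bits 6:16 do not change. The function names are made up for illustration.

```c
#include <stdint.h>

enum { LINE_BITS = 6, SET_BITS = 11 };  /* 64-byte lines, 2048 sets (assumed) */

/* Set index under the simple model in the text: address bits 6..16. */
unsigned set_index(uint64_t addr) {
    return (unsigned)((addr >> LINE_BITS) & ((1u << SET_BITS) - 1));
}

/* Counts distinct set indices hit by `count` accesses `stride` bytes apart. */
unsigned distinct_sets(uint64_t base, uint64_t stride, unsigned count) {
    static unsigned char seen[1u << SET_BITS];
    unsigned distinct = 0;
    for (unsigned i = 0; i < (1u << SET_BITS); i++) seen[i] = 0;
    for (unsigned j = 0; j < count; j++) {
        unsigned s = set_index(base + (uint64_t)j * stride);
        if (!seen[s]) { seen[s] = 1; distinct++; }
    }
    return distinct;
}
```

Under this model, the 512 accesses of one inner loop all fall into a single set when the addresses are physically contiguous, as they are inside a 1 GB page. With 4 KB pages the virtual-to-physical mapping typically scatters bits 12 and above, which spreads the same accesses over more sets.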
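We recommend that you initialize the entire 1GB buffer sequentially (in the small page test), as shown below, before starting the clock:
for (uint64_t i = 0; i < size / sizeof(*ptr); i++)
  ptr[i] = i * 5;
Basically, the issue is set conflicts resulting in cache misses with huge pages, compared to small pages, due to the very large constant offsets. When you use constant offsets, the test is really not random.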
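If a genuinely non-constant access pattern is wanted, one possibility is to derive each index from a small pseudo-random generator instead of a fixed stride. The sketch below uses a xorshift64 generator as an assumption of mine; it is not something the reply prescribes, and random_writes is a hypothetical name.

```c
#include <stdint.h>

/* One xorshift64 step (Marsaglia); the state must be non-zero. */
uint64_t xorshift64(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Writes to `accesses` pseudo-random slots of buf; n must be a power of two.
 * Returns the last index written so the caller can sanity-check it. */
uint64_t random_writes(uint32_t *buf, uint64_t n, uint64_t accesses,
                       uint64_t seed) {
    uint64_t state = seed ? seed : 1;
    uint64_t index = 0;
    for (uint64_t k = 0; k < accesses; k++) {
        index = xorshift64(&state) & (n - 1);
        buf[index] = (uint32_t)(index * 5);
    }
    return index;
}
```

The generator adds a few instructions per access, so timings taken this way are not directly comparable with the strided loop, but the same seed can be used for both the small page and the huge page run.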