Cache line alignment optimization not reducing cache misses
I got this code, which demonstrates how cache line alignment optimization works by reducing 'false sharing', from http://blog.kongfy.com/2016/10/cache-coherence-sequential-consistency-and-memory-barrier/
The code:
/*
 * Demo program for showing the drawback of "false sharing"
 *
 * Use it with perf!
 *
 * Compile: g++ -O2 -o false_share false_share.cpp -lpthread
 * Usage: perf stat -e cache-misses ./false_share <loopcount> <is_aligned>
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>

#define CACHE_ALIGN_SIZE 64
#define CACHE_ALIGNED __attribute__((aligned(CACHE_ALIGN_SIZE)))

int gLoopCount;

inline int64_t current_time()
{
    struct timeval t;
    if (gettimeofday(&t, NULL) < 0) {
        /* ignore errors */
    }
    return (static_cast<int64_t>(t.tv_sec) * static_cast<int64_t>(1000000) + static_cast<int64_t>(t.tv_usec));
}

/* Two adjacent 8-byte counters: both fall into the same 64-byte cache line. */
struct value {
    int64_t val;
};
value data[2] CACHE_ALIGNED;

/* Each counter padded out to its own 64-byte cache line. */
struct aligned_value {
    int64_t val;
} CACHE_ALIGNED;
aligned_value aligned_data[2] CACHE_ALIGNED;

void* worker1(int64_t *val)
{
    printf("worker1 start...\n");
    volatile int64_t &v = *val;
    for (int i = 0; i < gLoopCount; ++i) {
        v += 1;
    }
    printf("worker1 exit...\n");
    return NULL;
}

// duplicate worker function for perf report
void* worker2(int64_t *val)
{
    printf("worker2 start...\n");
    volatile int64_t &v = *val;
    for (int i = 0; i < gLoopCount; ++i) {
        v += 1;
    }
    printf("worker2 exit...\n");
    return NULL;
}

int main(int argc, char *argv[])
{
    pthread_t race_thread_1;
    pthread_t race_thread_2;
    bool is_aligned;

    /* Check arguments to program */
    if (argc != 3) {
        fprintf(stderr, "USAGE: %s <loopcount> <is_aligned>\n", argv[0]);
        exit(1);
    }

    /* Parse arguments */
    gLoopCount = atoi(argv[1]); /* Don't bother with format checking */
    is_aligned = atoi(argv[2]); /* Don't bother with format checking */

    printf("size of unaligned data : %zu\n", sizeof(data));
    printf("size of aligned data : %zu\n", sizeof(aligned_data));

    void *val_0, *val_1;
    if (is_aligned) {
        val_0 = (void *)&aligned_data[0].val;
        val_1 = (void *)&aligned_data[1].val;
    } else {
        val_0 = (void *)&data[0].val;
        val_1 = (void *)&data[1].val;
    }

    int64_t start_time = current_time();

    /* Start the threads (the casts force worker(int64_t*) into pthread's void* (*)(void*) signature) */
    pthread_create(&race_thread_1, NULL, (void* (*)(void*))worker1, val_0);
    pthread_create(&race_thread_2, NULL, (void* (*)(void*))worker2, val_1);

    /* Wait for the threads to end */
    pthread_join(race_thread_1, NULL);
    pthread_join(race_thread_2, NULL);

    int64_t end_time = current_time();
    printf("time : %ld us\n", (long)(end_time - start_time));
    return 0;
}
Expected performance results:
[jingyan.kfy@OceanBase224006 work]$ perf stat -e cache-misses ./false_share 100000000 0
size of unaligned data : 16
size of aligned data : 128
worker2 start...
worker1 start...
worker1 exit...
worker2 exit...
time : 452451 us
Performance counter stats for './false_share 100000000 0':
3,105,245 cache-misses
0.455033803 seconds time elapsed
[jingyan.kfy@OceanBase224006 work]$ perf stat -e cache-misses ./false_share 100000000 1
size of unaligned data : 16
size of aligned data : 128
worker1 start...
worker2 start...
worker1 exit...
worker2 exit...
time : 326994 us
Performance counter stats for './false_share 100000000 1':
27,735 cache-misses
0.329737667 seconds time elapsed
However, when I compile and run the code myself, I get very similar run times for the two cases, and the cache-miss count is even lower when NOT aligned:
My results:
$ perf stat -e cache-misses ./false_share 100000000 0
size of unaligned data : 16
size of aligned data : 128
worker1 start...
worker2 start...
worker2 exit...
worker1 exit...
time : 169465 us
Performance counter stats for './false_share 100000000 0':
37,698 cache-misses:u
0.171625603 seconds time elapsed
0.334919000 seconds user
0.001988000 seconds sys
$ perf stat -e cache-misses ./false_share 100000000 1
size of unaligned data : 16
size of aligned data : 128
worker2 start...
worker1 start...
worker2 exit...
worker1 exit...
time : 118798 us
Performance counter stats for './false_share 100000000 1':
38,375 cache-misses:u
0.121072715 seconds time elapsed
0.230043000 seconds user
0.001973000 seconds sys
How should I understand this inconsistency?
Since the blog you cite is in Chinese, it is hard to help with it directly. However, I noticed that its first figure seems to show a multi-socket architecture, so I ran a few experiments of my own.
a) My PC, Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz, single socket, two cores, two threads per core:
0:
time : 195389 us
Performance counter stats for './a.out 100000000 0':
8 980 cache-misses:u
0,198584628 seconds time elapsed
0,391694000 seconds user
0,000000000 seconds sys
and 1:
time : 191413 us
Performance counter stats for './a.out 100000000 1':
9 020 cache-misses:u
0,192953853 seconds time elapsed
0,378434000 seconds user
0,000000000 seconds sys
Not much of a difference.
b) Now a 2-socket workstation:
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Model name: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
0:
time : 454679 us
Performance counter stats for './a.out 100000000 0':
5,644,133 cache-misses
0.456665966 seconds time elapsed
0.738173000 seconds user
1:
time : 346871 us
Performance counter stats for './a.out 100000000 1':
42,217 cache-misses
0.348814583 seconds time elapsed
0.539676000 seconds user
0.000000000 seconds sys
A big difference.
One last point. You write:
the cache miss count is even lower when NOT ALIGNED
No, it is not. Your processor is running all kinds of tasks besides your program. Also, you are running 2 threads, which may access the cache in a different time order from run to run. All of this can affect cache utilization. You need to repeat the measurement many times and compare the results. Personally, when I see less than a 10% difference in any performance results, I consider them indistinguishable.
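(A side note not in the original exchange: perf stat can do the repetition for you, e.g. perf stat -r 10 -e cache-misses ./false_share 100000000 1 runs the program 10 times and prints the averaged counts together with their spread; -r/--repeat is a standard perf stat option, and the count of 10 is arbitrary.)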
UPDATE
I also experimented with your code extended to 3 threads, so that at least some of them must run on different cores and therefore share only the L3 cache.
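The exact three-thread modification is not reproduced here; below is only a minimal sketch of one way to do it, reusing the same CACHE_ALIGNED structs and giving each thread its own counter slot (the file and symbol names are mine):

/*
 * three_threads.cpp -- illustrative sketch only, not the exact code used for the
 * measurements below.
 * Compile: g++ -O2 -o three_threads three_threads.cpp -lpthread
 * Usage:   perf stat -e cache-misses ./three_threads <loopcount> <is_aligned>
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define CACHE_ALIGN_SIZE 64
#define CACHE_ALIGNED __attribute__((aligned(CACHE_ALIGN_SIZE)))

static int gLoopCount;

struct value         { int64_t val; };               /* packed: 3 counters share one line */
struct aligned_value { int64_t val; } CACHE_ALIGNED; /* padded: one line per counter      */

static value         data[3]         CACHE_ALIGNED;
static aligned_value aligned_data[3] CACHE_ALIGNED;

static void* worker(void *arg)
{
    volatile int64_t &v = *static_cast<int64_t *>(arg);
    for (int i = 0; i < gLoopCount; ++i) {
        v += 1;                                      /* each thread hammers its own counter */
    }
    return NULL;
}

int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "USAGE: %s <loopcount> <is_aligned>\n", argv[0]);
        return 1;
    }
    gLoopCount      = atoi(argv[1]);
    bool is_aligned = atoi(argv[2]);

    pthread_t threads[3];
    for (int i = 0; i < 3; ++i) {
        void *slot = is_aligned ? (void *)&aligned_data[i].val
                                : (void *)&data[i].val;
        pthread_create(&threads[i], NULL, worker, slot);
    }
    for (int i = 0; i < 3; ++i) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}

With the packed value structs, all three 8-byte counters still fit in one 64-byte line; with aligned_value, each counter occupies a line of its own.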
I looked at How to catch the L3-cache hits and misses by perf tool in Linux and arrived at this command:
perf stat -e cache-misses,cache-references,LLC-loads,LLC-stores,L1-dcache-load-misses,L1-dcache-prefetch-misses,L1-dcache-store-misses ./a.out 100000000 0
0:
time : 214253 us
Performance counter stats for './a.out 100000000 0':
4 765 cache-misses:u # 0,018 % of all cache refs (57,39%)
25 992 887 cache-references:u (57,56%)
17 430 736 LLC-loads:u (57,56%)
8 591 378 LLC-stores:u (57,56%)
28 110 342 L1-dcache-load-misses:u (57,40%)
14 661 378 L1-dcache-prefetch-misses:u (57,80%)
32 269 L1-dcache-store-misses:u (57,49%)
0,215484922 seconds time elapsed
0,627426000 seconds user
0,006635000 seconds sys
1:
time : 194253 us
Performance counter stats for './a.out 100000000 1':
4 509 cache-misses:u # 30,715 % of all cache refs (57,15%)
14 680 cache-references:u (57,45%)
7 954 LLC-loads:u (57,49%)
1 565 LLC-stores:u (57,92%)
4 442 L1-dcache-load-misses:u (57,91%)
836 L1-dcache-prefetch-misses:u (57,02%)
984 L1-dcache-store-misses:u (56,85%)
0,195145645 seconds time elapsed
0,569986000 seconds user
0,000000000 seconds sys
So:
- The aligned (3-thread) version is systematically (somewhat) faster than the unaligned one (I repeated the test several times), even on the single-socket machine.
- It is not entirely clear what the "cache-misses" event actually reports.
- "False sharing" of data carries a huge (numerical) penalty in L1 misses, LLC traffic, and the number of cache references (the small address check after this list shows why the two layouts behave differently).
- Keep in mind that these are hardware-based statistics: if other processes are running, they add their own contribution to these results.
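As a closing illustration (a check that is not part of the measurements above, and that assumes a 64-byte cache line), you can print which cache line each counter falls into. With the structs from the question, data[0].val and data[1].val land in the same line, while aligned_data[0].val and aligned_data[1].val do not:

/*
 * line_check.cpp -- prints the cache-line index of each counter (sketch; assumes
 * CACHE_ALIGN_SIZE matches the real cache line size).
 * Compile: g++ -O2 -o line_check line_check.cpp
 */
#include <stdio.h>
#include <stdint.h>

#define CACHE_ALIGN_SIZE 64
#define CACHE_ALIGNED __attribute__((aligned(CACHE_ALIGN_SIZE)))

struct value         { int64_t val; };
struct aligned_value { int64_t val; } CACHE_ALIGNED;

static value         data[2]         CACHE_ALIGNED;
static aligned_value aligned_data[2] CACHE_ALIGNED;

static unsigned long line_of(const void *p)
{
    /* Index of the 64-byte line this address falls into */
    return (unsigned long)((uintptr_t)p / CACHE_ALIGN_SIZE);
}

int main()
{
    printf("data[0].val         -> line %lu\n", line_of(&data[0].val));
    printf("data[1].val         -> line %lu\n", line_of(&data[1].val));
    printf("aligned_data[0].val -> line %lu\n", line_of(&aligned_data[0].val));
    printf("aligned_data[1].val -> line %lu\n", line_of(&aligned_data[1].val));
    return 0;
}

If the unaligned pair prints the same line number and the aligned pair prints different ones, the memory layout is doing what the demo intends; any remaining lack of a visible penalty then has to come from the hardware (for example, a single socket keeping coherence traffic cheap) or from what the chosen perf event actually counts, as discussed above.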