Counting L1 cache misses with PAPI_read_counters gives unexpected results
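I am trying to use the PAPI library to count cache misses. A cache-hit performance counter is not available on my hardware, so I am trying to infer cache hits from the absence of cache misses. I have been experimenting with a few things. The first version of my code looks like this: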
int numEvents = 2;
long long values[2];
int events[2] = {PAPI_L1_DCM, PAPI_L2_TCM};

if (PAPI_start_counters(events, numEvents) != PAPI_OK)
    printf("PAPI error: %d\n", 1);

for (int i = 0; i < arr_size; i++) {
    array[i].value = 1;
}
_mm_mfence();

if ((ret1 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
    fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret1));
    exit(1);
}
miss1 = values[0];
_mm_mfence();

for (int i = 0; i < arr_size; i++) {
    array[i].value = array[i].value + 9; // (int) sum
}
_mm_mfence();

if ((ret2 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
    fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret2));
    exit(1);
}
miss2 = values[0];
printf("before flush miss_1 %lli, miss_2 %lli \n", miss1, miss2);
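The problem is that the second loop should hit in the cache, so its L1 miss count should be very low. Instead I get an unexpectedly high miss_2: with an array size of 200, miss_2 is close to 100. With that many reported misses, the result tells me nothing reliable about whether the accesses actually hit.
I also tried rewriting it like this: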
if (PAPI_start_counters(events, numEvents) != PAPI_OK)
    printf("PAPI error: %d\n", 1);

for (int i = 0; i < arr_size; i++) {
    array[i].value = array[i].value + 9; // (int) sum
}

if (PAPI_stop_counters(values, numEvents) != PAPI_OK)
    printf("PAPI error: 2\n");
printf("before flush miss %lli\n", values[0]);
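But this gives even worse results: the reported miss count exceeds 200. Am I doing something wrong? It should give more precise results, yet it does worse. Or am I missing something?
I also tried it without the fences; I believe they at least do no harm. Any suggestions would be appreciated.
The drawback of PAPI_read_counters is its overhead, so it does not perform well, but right now I don't care about performance. I just want to identify cache hits correctly.
I am also considering using RDPMC, but I haven't found an example that uses it without writing my own _asm wrapper function. Is that really the only way to use rdpmc? Is there no predefined function that I don't have to write myself?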
Edit: adding the compiler-generated code (disassembly) for the PAPI_read version:
./prog6: file format elf64-x86-64
Disassembly of section .init:
00000000000009c0 <_init>:
9c0: 48 83 ec 08 sub $0x8,%rsp
9c4: 48 8b 05 1d 16 20 00 mov 0x20161d(%rip),%rax # 201fe8 <__gmon_start__>
9cb: 48 85 c0 test %rax,%rax
9ce: 74 02 je 9d2 <_init+0x12>
9d0: ff d0 callq *%rax
9d2: 48 83 c4 08 add $0x8,%rsp
9d6: c3 retq
Disassembly of section .plt:
00000000000009e0 <.plt>:
9e0: ff 35 6a 15 20 00 pushq 0x20156a(%rip) # 201f50 <_GLOBAL_OFFSET_TABLE_+0x8>
9e6: ff 25 6c 15 20 00 jmpq *0x20156c(%rip) # 201f58 <_GLOBAL_OFFSET_TABLE_+0x10>
9ec: 0f 1f 40 00 nopl 0x0(%rax)
00000000000009f0 <puts@plt>:
9f0: ff 25 6a 15 20 00 jmpq *0x20156a(%rip) # 201f60 <puts@GLIBC_2.2.5>
9f6: 68 00 00 00 00 pushq $0x0
9fb: e9 e0 ff ff ff jmpq 9e0 <.plt>
0000000000000a00 <clock_gettime@plt>:
a00: ff 25 62 15 20 00 jmpq *0x201562(%rip) # 201f68 <clock_gettime@GLIBC_2.17>
a06: 68 01 00 00 00 pushq $0x1
a0b: e9 d0 ff ff ff jmpq 9e0 <.plt>
0000000000000a10 <getpid@plt>:
a10: ff 25 5a 15 20 00 jmpq *0x20155a(%rip) # 201f70 <getpid@GLIBC_2.2.5>
a16: 68 02 00 00 00 pushq $0x2
a1b: e9 c0 ff ff ff jmpq 9e0 <.plt>
0000000000000a20 <__stack_chk_fail@plt>:
a20: ff 25 52 15 20 00 jmpq *0x201552(%rip) # 201f78 <__stack_chk_fail@GLIBC_2.4>
a26: 68 03 00 00 00 pushq $0x3
a2b: e9 b0 ff ff ff jmpq 9e0 <.plt>
0000000000000a30 <PAPI_read_counters@plt>:
a30: ff 25 4a 15 20 00 jmpq *0x20154a(%rip) # 201f80 <PAPI_read_counters>
a36: 68 04 00 00 00 pushq $0x4
a3b: e9 a0 ff ff ff jmpq 9e0 <.plt>
0000000000000a40 <sched_setaffinity@plt>:
a40: ff 25 42 15 20 00 jmpq *0x201542(%rip) # 201f88 <sched_setaffinity@GLIBC_2.3.4>
a46: 68 05 00 00 00 pushq $0x5
a4b: e9 90 ff ff ff jmpq 9e0 <.plt>
0000000000000a50 <PAPI_start_counters@plt>:
a50: ff 25 3a 15 20 00 jmpq *0x20153a(%rip) # 201f90 <PAPI_start_counters>
a56: 68 06 00 00 00 pushq $0x6
a5b: e9 80 ff ff ff jmpq 9e0 <.plt>
0000000000000a60 <PAPI_stop_counters@plt>:
a60: ff 25 32 15 20 00 jmpq *0x201532(%rip) # 201f98 <PAPI_stop_counters>
a66: 68 07 00 00 00 pushq $0x7
a6b: e9 70 ff ff ff jmpq 9e0 <.plt>
0000000000000a70 <malloc@plt>:
a70: ff 25 2a 15 20 00 jmpq *0x20152a(%rip) # 201fa0 <malloc@GLIBC_2.2.5>
a76: 68 08 00 00 00 pushq $0x8
a7b: e9 60 ff ff ff jmpq 9e0 <.plt>
0000000000000a80 <PAPI_strerror@plt>:
a80: ff 25 22 15 20 00 jmpq *0x201522(%rip) # 201fa8 <PAPI_strerror>
a86: 68 09 00 00 00 pushq $0x9
a8b: e9 50 ff ff ff jmpq 9e0 <.plt>
0000000000000a90 <__printf_chk@plt>:
a90: ff 25 1a 15 20 00 jmpq *0x20151a(%rip) # 201fb0 <__printf_chk@GLIBC_2.3.4>
a96: 68 0a 00 00 00 pushq $0xa
a9b: e9 40 ff ff ff jmpq 9e0 <.plt>
0000000000000aa0 <getrusage@plt>:
aa0: ff 25 12 15 20 00 jmpq *0x201512(%rip) # 201fb8 <getrusage@GLIBC_2.2.5>
aa6: 68 0b 00 00 00 pushq $0xb
aab: e9 30 ff ff ff jmpq 9e0 <.plt>
0000000000000ab0 <exit@plt>:
ab0: ff 25 0a 15 20 00 jmpq *0x20150a(%rip) # 201fc0 <exit@GLIBC_2.2.5>
ab6: 68 0c 00 00 00 pushq $0xc
abb: e9 20 ff ff ff jmpq 9e0 <.plt>
0000000000000ac0 <fwrite@plt>:
ac0: ff 25 02 15 20 00 jmpq *0x201502(%rip) # 201fc8 <fwrite@GLIBC_2.2.5>
ac6: 68 0d 00 00 00 pushq $0xd
acb: e9 10 ff ff ff jmpq 9e0 <.plt>
0000000000000ad0 <__fprintf_chk@plt>:
ad0: ff 25 fa 14 20 00 jmpq *0x2014fa(%rip) # 201fd0 <__fprintf_chk@GLIBC_2.3.4>
ad6: 68 0e 00 00 00 pushq $0xe
adb: e9 00 ff ff ff jmpq 9e0 <.plt>
Disassembly of section .plt.got:
0000000000000ae0 <__cxa_finalize@plt>:
ae0: ff 25 12 15 20 00 jmpq *0x201512(%rip) # 201ff8 <__cxa_finalize@GLIBC_2.2.5>
ae6: 66 90 xchg %ax,%ax
Disassembly of section .text:
0000000000000af0 <main>:
af0: 41 57 push %r15
af2: b9 0f 00 00 00 mov $0xf,%ecx
af7: 41 56 push %r14
af9: 41 55 push %r13
afb: 41 54 push %r12
afd: 55 push %rbp
afe: 53 push %rbx
aff: 48 81 ec 78 01 00 00 sub $0x178,%rsp
b06: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
b0d: 00 00
b0f: 48 89 84 24 68 01 00 mov %rax,0x168(%rsp)
b16: 00
b17: 31 c0 xor %eax,%eax
b19: 48 8d 9c 24 e0 00 00 lea 0xe0(%rsp),%rbx
b20: 00
b21: 48 b8 00 00 00 80 07 movabs $0x8000000780000000,%rax
b28: 00 00 80
b2b: 48 c7 84 24 e0 00 00 movq $0x1,0xe0(%rsp)
b32: 00 01 00 00 00
b37: 48 8d 53 08 lea 0x8(%rbx),%rdx
b3b: 48 89 84 24 c8 00 00 mov %rax,0xc8(%rsp)
b42: 00
b43: 31 c0 xor %eax,%eax
b45: 48 89 d7 mov %rdx,%rdi
b48: f3 48 ab rep stos %rax,%es:(%rdi)
b4b: e8 c0 fe ff ff callq a10 <getpid@plt>
b50: 48 89 da mov %rbx,%rdx
b53: be 80 00 00 00 mov $0x80,%esi
b58: 89 c7 mov %eax,%edi
b5a: e8 e1 fe ff ff callq a40 <sched_setaffinity@plt>
b5f: 85 c0 test %eax,%eax
b61: 0f 85 17 03 00 00 jne e7e <main+0x38e>
b67: 0f ae f0 mfence
b6a: 48 8d 74 24 10 lea 0x10(%rsp),%rsi
b6f: bf 02 00 00 00 mov $0x2,%edi
b74: 0f ae f0 mfence
b77: e8 84 fe ff ff callq a00 <clock_gettime@plt>
b7c: 0f 31 rdtsc
b7e: bf 00 fa 00 00 mov $0xfa00,%edi
b83: 0f ae f0 mfence
b86: 48 c1 e2 20 shl $0x20,%rdx
b8a: 49 89 c6 mov %rax,%r14
b8d: 49 09 d6 or %rdx,%r14
b90: e8 db fe ff ff callq a70 <malloc@plt>
b95: 48 8d bc 24 c8 00 00 lea 0xc8(%rsp),%rdi
b9c: 00
b9d: be 02 00 00 00 mov $0x2,%esi
ba2: 49 89 c4 mov %rax,%r12
ba5: e8 a6 fe ff ff callq a50 <PAPI_start_counters@plt>
baa: 85 c0 test %eax,%eax
bac: 0f 85 88 02 00 00 jne e3a <main+0x34a>
bb2: 4d 89 e7 mov %r12,%r15
bb5: 49 8d 84 24 00 fa 00 lea 0xfa00(%r12),%rax
bbc: 00
bbd: 4c 89 e5 mov %r12,%rbp
bc0: c7 45 00 01 00 00 00 movl $0x1,0x0(%rbp)
bc7: 48 83 c5 40 add $0x40,%rbp
bcb: 48 39 e8 cmp %rbp,%rax
bce: 75 f0 jne bc0 <main+0xd0>
bd0: 4c 8d ac 24 d0 00 00 lea 0xd0(%rsp),%r13
bd7: 00
bd8: be 02 00 00 00 mov $0x2,%esi
bdd: 4c 89 ef mov %r13,%rdi
be0: e8 4b fe ff ff callq a30 <PAPI_read_counters@plt>
be5: 85 c0 test %eax,%eax
be7: 0f 85 b8 02 00 00 jne ea5 <main+0x3b5>
bed: 48 8b 84 24 d0 00 00 mov 0xd0(%rsp),%rax
bf4: 00
bf5: 4c 89 e3 mov %r12,%rbx
bf8: 48 89 44 24 08 mov %rax,0x8(%rsp)
bfd: 0f 1f 00 nopl (%rax)
c00: 83 03 09 addl $0x9,(%rbx)
c03: 48 83 c3 40 add $0x40,%rbx
c07: 48 39 dd cmp %rbx,%rbp
c0a: 75 f4 jne c00 <main+0x110>
c0c: 31 d2 xor %edx,%edx
c0e: 48 8d 35 88 04 00 00 lea 0x488(%rip),%rsi # 109d <_IO_stdin_used+0x2d>
c15: bf 01 00 00 00 mov $0x1,%edi
c1a: 31 c0 xor %eax,%eax
c1c: e8 6f fe ff ff callq a90 <__printf_chk@plt>
c21: be 02 00 00 00 mov $0x2,%esi
c26: 4c 89 ef mov %r13,%rdi
c29: e8 02 fe ff ff callq a30 <PAPI_read_counters@plt>
c2e: 85 c0 test %eax,%eax
c30: 0f 85 6f 02 00 00 jne ea5 <main+0x3b5>
c36: 48 8b 8c 24 d0 00 00 mov 0xd0(%rsp),%rcx
c3d: 00
c3e: 48 8b 54 24 08 mov 0x8(%rsp),%rdx
c43: 48 8d 35 e6 04 00 00 lea 0x4e6(%rip),%rsi # 1130 <_IO_stdin_used+0xc0>
c4a: 31 c0 xor %eax,%eax
c4c: bf 01 00 00 00 mov $0x1,%edi
c51: e8 3a fe ff ff callq a90 <__printf_chk@plt>
c56: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
c5d: 00 00 00
c60: 41 0f ae 3c 24 clflush (%r12)
c65: 49 83 c4 40 add $0x40,%r12
c69: 49 39 dc cmp %rbx,%r12
c6c: 75 f2 jne c60 <main+0x170>
c6e: be 02 00 00 00 mov $0x2,%esi
c73: 4c 89 ef mov %r13,%rdi
c76: e8 b5 fd ff ff callq a30 <PAPI_read_counters@plt>
c7b: 85 c0 test %eax,%eax
c7d: 0f 85 22 02 00 00 jne ea5 <main+0x3b5>
c83: 48 8b ac 24 d0 00 00 mov 0xd0(%rsp),%rbp
c8a: 00
c8b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
c90: 41 83 07 09 addl $0x9,(%r15)
c94: 49 83 c7 40 add $0x40,%r15
c98: 49 39 df cmp %rbx,%r15
c9b: 75 f3 jne c90 <main+0x1a0>
c9d: be 02 00 00 00 mov $0x2,%esi
ca2: 4c 89 ef mov %r13,%rdi
ca5: e8 86 fd ff ff callq a30 <PAPI_read_counters@plt>
caa: 85 c0 test %eax,%eax
cac: 0f 85 f3 01 00 00 jne ea5 <main+0x3b5>
cb2: 48 8b 8c 24 d0 00 00 mov 0xd0(%rsp),%rcx
cb9: 00
cba: 48 8d 35 97 04 00 00 lea 0x497(%rip),%rsi # 1158 <_IO_stdin_used+0xe8>
cc1: bf 01 00 00 00 mov $0x1,%edi
cc6: 31 c0 xor %eax,%eax
cc8: 48 89 ea mov %rbp,%rdx
ccb: e8 c0 fd ff ff callq a90 <__printf_chk@plt>
cd0: be 02 00 00 00 mov $0x2,%esi
cd5: 4c 89 ef mov %r13,%rdi
cd8: e8 83 fd ff ff callq a60 <PAPI_stop_counters@plt>
cdd: 85 c0 test %eax,%eax
cdf: 0f 85 72 01 00 00 jne e57 <main+0x367>
ce5: 0f ae f0 mfence
ce8: 0f 31 rdtsc
cea: bf 02 00 00 00 mov $0x2,%edi
cef: 48 c1 e2 20 shl $0x20,%rdx
cf3: 48 89 c3 mov %rax,%rbx
cf6: 48 8d 74 24 20 lea 0x20(%rsp),%rsi
cfb: 48 09 d3 or %rdx,%rbx
cfe: e8 fd fc ff ff callq a00 <clock_gettime@plt>
d03: bf 01 00 00 00 mov $0x1,%edi
d08: 48 be db 34 b6 d7 82 movabs $0x431bde82d7b634db,%rsi
d0f: de 1b 43
d12: 0f ae f0 mfence
d15: 48 8b 4c 24 20 mov 0x20(%rsp),%rcx
d1a: 48 2b 4c 24 10 sub 0x10(%rsp),%rcx
d1f: 48 69 c9 00 ca 9a 3b imul $0x3b9aca00,%rcx,%rcx
d26: 48 03 4c 24 28 add 0x28(%rsp),%rcx
d2b: 48 2b 4c 24 18 sub 0x18(%rsp),%rcx
d30: 48 89 c8 mov %rcx,%rax
d33: 48 c1 f9 3f sar $0x3f,%rcx
d37: 48 f7 ee imul %rsi
d3a: 48 8d 35 3f 04 00 00 lea 0x43f(%rip),%rsi # 1180 <_IO_stdin_used+0x110>
d41: 31 c0 xor %eax,%eax
d43: 48 c1 fa 12 sar $0x12,%rdx
d47: 48 29 ca sub %rcx,%rdx
d4a: e8 41 fd ff ff callq a90 <__printf_chk@plt>
d4f: 48 89 da mov %rbx,%rdx
d52: bf 01 00 00 00 mov $0x1,%edi
d57: 31 c0 xor %eax,%eax
d59: 4c 29 f2 sub %r14,%rdx
d5c: 48 8d 35 53 03 00 00 lea 0x353(%rip),%rsi # 10b6 <_IO_stdin_used+0x46>
d63: e8 28 fd ff ff callq a90 <__printf_chk@plt>
d68: 31 d2 xor %edx,%edx
d6a: 48 8d 35 56 03 00 00 lea 0x356(%rip),%rsi # 10c7 <_IO_stdin_used+0x57>
d71: 31 c0 xor %eax,%eax
d73: bf 01 00 00 00 mov $0x1,%edi
d78: e8 13 fd ff ff callq a90 <__printf_chk@plt>
d7d: 31 ff xor %edi,%edi
d7f: 48 8d 74 24 30 lea 0x30(%rsp),%rsi
d84: e8 17 fd ff ff callq aa0 <getrusage@plt>
d89: 83 f8 ff cmp $0xffffffff,%eax
d8c: 0f 84 d6 00 00 00 je e68 <main+0x378>
d92: 48 8b 8c 24 b8 00 00 mov 0xb8(%rsp),%rcx
d99: 00
d9a: 48 8b 94 24 b0 00 00 mov 0xb0(%rsp),%rdx
da1: 00
da2: 48 8d 35 3e 03 00 00 lea 0x33e(%rip),%rsi # 10e7 <_IO_stdin_used+0x77>
da9: 31 c0 xor %eax,%eax
dab: bf 01 00 00 00 mov $0x1,%edi
db0: e8 db fc ff ff callq a90 <__printf_chk@plt>
db5: c5 f9 57 c0 vxorpd %xmm0,%xmm0,%xmm0
db9: bf 01 00 00 00 mov $0x1,%edi
dbe: c5 fb 10 0d 12 04 00 vmovsd 0x412(%rip),%xmm1 # 11d8 <_IO_stdin_used+0x168>
dc5: 00
dc6: 48 69 44 24 30 40 42 imul $0xf4240,0x30(%rsp),%rax
dcd: 0f 00
dcf: 48 03 44 24 38 add 0x38(%rsp),%rax
dd4: 48 8d 35 d5 03 00 00 lea 0x3d5(%rip),%rsi # 11b0 <_IO_stdin_used+0x140>
ddb: c4 e1 fb 2a c0 vcvtsi2sd %rax,%xmm0,%xmm0
de0: 48 69 54 24 40 40 42 imul $0xf4240,0x40(%rsp),%rdx
de7: 0f 00
de9: 48 03 54 24 48 add 0x48(%rsp),%rdx
dee: c5 fb 59 c1 vmulsd %xmm1,%xmm0,%xmm0
df2: c4 e1 fb 2c c0 vcvttsd2si %xmm0,%rax
df7: c5 f9 57 c0 vxorpd %xmm0,%xmm0,%xmm0
dfb: c4 e1 fb 2a c2 vcvtsi2sd %rdx,%xmm0,%xmm0
e00: c5 fb 59 c1 vmulsd %xmm1,%xmm0,%xmm0
e04: c4 e1 fb 2c d0 vcvttsd2si %xmm0,%rdx
e09: 48 01 c2 add %rax,%rdx
e0c: 31 c0 xor %eax,%eax
e0e: e8 7d fc ff ff callq a90 <__printf_chk@plt>
e13: 31 c0 xor %eax,%eax
e15: 48 8b 8c 24 68 01 00 mov 0x168(%rsp),%rcx
e1c: 00
e1d: 64 48 33 0c 25 28 00 xor %fs:0x28,%rcx
e24: 00 00
e26: 75 51 jne e79 <main+0x389>
e28: 48 81 c4 78 01 00 00 add $0x178,%rsp
e2f: 5b pop %rbx
e30: 5d pop %rbp
e31: 41 5c pop %r12
e33: 41 5d pop %r13
e35: 41 5e pop %r14
e37: 41 5f pop %r15
e39: c3 retq
e3a: ba 01 00 00 00 mov $0x1,%edx
e3f: 48 8d 35 47 02 00 00 lea 0x247(%rip),%rsi # 108d <_IO_stdin_used+0x1d>
e46: bf 01 00 00 00 mov $0x1,%edi
e4b: 31 c0 xor %eax,%eax
e4d: e8 3e fc ff ff callq a90 <__printf_chk@plt>
e52: e9 5b fd ff ff jmpq bb2 <main+0xc2>
e57: 48 8d 3d 4a 02 00 00 lea 0x24a(%rip),%rdi # 10a8 <_IO_stdin_used+0x38>
e5e: e8 8d fb ff ff callq 9f0 <puts@plt>
e63: e9 7d fe ff ff jmpq ce5 <main+0x1f5>
e68: 48 8d 3d 62 02 00 00 lea 0x262(%rip),%rdi # 10d1 <_IO_stdin_used+0x61>
e6f: e8 7c fb ff ff callq 9f0 <puts@plt>
e74: e9 19 ff ff ff jmpq d92 <main+0x2a2>
e79: e8 a2 fb ff ff callq a20 <__stack_chk_fail@plt>
e7e: 48 8b 0d 9b 11 20 00 mov 0x20119b(%rip),%rcx # 202020 <stderr@@GLIBC_2.2.5>
e85: ba 18 00 00 00 mov $0x18,%edx
e8a: be 01 00 00 00 mov $0x1,%esi
e8f: 48 8d 3d de 01 00 00 lea 0x1de(%rip),%rdi # 1074 <_IO_stdin_used+0x4>
e96: e8 25 fc ff ff callq ac0 <fwrite@plt>
e9b: bf 01 00 00 00 mov $0x1,%edi
ea0: e8 0b fc ff ff callq ab0 <exit@plt>
ea5: 89 c7 mov %eax,%edi
ea7: e8 d4 fb ff ff callq a80 <PAPI_strerror@plt>
eac: 48 8b 3d 6d 11 20 00 mov 0x20116d(%rip),%rdi # 202020 <stderr@@GLIBC_2.2.5>
eb3: be 01 00 00 00 mov $0x1,%esi
eb8: 48 8d 15 49 02 00 00 lea 0x249(%rip),%rdx # 1108 <_IO_stdin_used+0x98>
ebf: 48 89 c1 mov %rax,%rcx
ec2: 31 c0 xor %eax,%eax
ec4: e8 07 fc ff ff callq ad0 <__fprintf_chk@plt>
ec9: bf 01 00 00 00 mov $0x1,%edi
ece: e8 dd fb ff ff callq ab0 <exit@plt>
ed3: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
eda: 00 00 00
edd: 0f 1f 00 nopl (%rax)
0000000000000ee0 <_start>:
ee0: 31 ed xor %ebp,%ebp
ee2: 49 89 d1 mov %rdx,%r9
ee5: 5e pop %rsi
ee6: 48 89 e2 mov %rsp,%rdx
ee9: 48 83 e4 f0 and $0xfffffffffffffff0,%rsp
eed: 50 push %rax
eee: 54 push %rsp
eef: 4c 8d 05 6a 01 00 00 lea 0x16a(%rip),%r8 # 1060 <__libc_csu_fini>
ef6: 48 8d 0d f3 00 00 00 lea 0xf3(%rip),%rcx # ff0 <__libc_csu_init>
efd: 48 8d 3d ec fb ff ff lea -0x414(%rip),%rdi # af0 <main>
f04: ff 15 d6 10 20 00 callq *0x2010d6(%rip) # 201fe0 <__libc_start_main@GLIBC_2.2.5>
f0a: f4 hlt
f0b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
0000000000000f10 <deregister_tm_clones>:
f10: 48 8d 3d f9 10 20 00 lea 0x2010f9(%rip),%rdi # 202010 <__TMC_END__>
f17: 55 push %rbp
f18: 48 8d 05 f1 10 20 00 lea 0x2010f1(%rip),%rax # 202010 <__TMC_END__>
f1f: 48 39 f8 cmp %rdi,%rax
f22: 48 89 e5 mov %rsp,%rbp
f25: 74 19 je f40 <deregister_tm_clones+0x30>
f27: 48 8b 05 aa 10 20 00 mov 0x2010aa(%rip),%rax # 201fd8 <_ITM_deregisterTMCloneTable>
f2e: 48 85 c0 test %rax,%rax
f31: 74 0d je f40 <deregister_tm_clones+0x30>
f33: 5d pop %rbp
f34: ff e0 jmpq *%rax
f36: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
f3d: 00 00 00
f40: 5d pop %rbp
f41: c3 retq
f42: 0f 1f 40 00 nopl 0x0(%rax)
f46: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
f4d: 00 00 00
0000000000000f50 <register_tm_clones>:
f50: 48 8d 3d b9 10 20 00 lea 0x2010b9(%rip),%rdi # 202010 <__TMC_END__>
f57: 48 8d 35 b2 10 20 00 lea 0x2010b2(%rip),%rsi # 202010 <__TMC_END__>
f5e: 55 push %rbp
f5f: 48 29 fe sub %rdi,%rsi
f62: 48 89 e5 mov %rsp,%rbp
f65: 48 c1 fe 03 sar $0x3,%rsi
f69: 48 89 f0 mov %rsi,%rax
f6c: 48 c1 e8 3f shr $0x3f,%rax
f70: 48 01 c6 add %rax,%rsi
f73: 48 d1 fe sar %rsi
f76: 74 18 je f90 <register_tm_clones+0x40>
f78: 48 8b 05 71 10 20 00 mov 0x201071(%rip),%rax # 201ff0 <_ITM_registerTMCloneTable>
f7f: 48 85 c0 test %rax,%rax
f82: 74 0c je f90 <register_tm_clones+0x40>
f84: 5d pop %rbp
f85: ff e0 jmpq *%rax
f87: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
f8e: 00 00
f90: 5d pop %rbp
f91: c3 retq
f92: 0f 1f 40 00 nopl 0x0(%rax)
f96: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
f9d: 00 00 00
0000000000000fa0 <__do_global_dtors_aux>:
fa0: 80 3d 81 10 20 00 00 cmpb $0x0,0x201081(%rip) # 202028 <completed.7696>
fa7: 75 2f jne fd8 <__do_global_dtors_aux+0x38>
fa9: 48 83 3d 47 10 20 00 cmpq $0x0,0x201047(%rip) # 201ff8 <__cxa_finalize@GLIBC_2.2.5>
fb0: 00
fb1: 55 push %rbp
fb2: 48 89 e5 mov %rsp,%rbp
fb5: 74 0c je fc3 <__do_global_dtors_aux+0x23>
fb7: 48 8b 3d 4a 10 20 00 mov 0x20104a(%rip),%rdi # 202008 <__dso_handle>
fbe: e8 1d fb ff ff callq ae0 <__cxa_finalize@plt>
fc3: e8 48 ff ff ff callq f10 <deregister_tm_clones>
fc8: c6 05 59 10 20 00 01 movb $0x1,0x201059(%rip) # 202028 <completed.7696>
fcf: 5d pop %rbp
fd0: c3 retq
fd1: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
fd8: f3 c3 repz retq
fda: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
0000000000000fe0 <frame_dummy>:
fe0: 55 push %rbp
fe1: 48 89 e5 mov %rsp,%rbp
fe4: 5d pop %rbp
fe5: e9 66 ff ff ff jmpq f50 <register_tm_clones>
fea: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
0000000000000ff0 <__libc_csu_init>:
ff0: 41 57 push %r15
ff2: 41 56 push %r14
ff4: 49 89 d7 mov %rdx,%r15
ff7: 41 55 push %r13
ff9: 41 54 push %r12
ffb: 4c 8d 25 36 0d 20 00 lea 0x200d36(%rip),%r12 # 201d38 <__frame_dummy_init_array_entry>
1002: 55 push %rbp
1003: 48 8d 2d 36 0d 20 00 lea 0x200d36(%rip),%rbp # 201d40 <__init_array_end>
100a: 53 push %rbx
100b: 41 89 fd mov %edi,%r13d
100e: 49 89 f6 mov %rsi,%r14
1011: 4c 29 e5 sub %r12,%rbp
1014: 48 83 ec 08 sub $0x8,%rsp
1018: 48 c1 fd 03 sar $0x3,%rbp
101c: e8 9f f9 ff ff callq 9c0 <_init>
1021: 48 85 ed test %rbp,%rbp
1024: 74 20 je 1046 <__libc_csu_init+0x56>
1026: 31 db xor %ebx,%ebx
1028: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
102f: 00
1030: 4c 89 fa mov %r15,%rdx
1033: 4c 89 f6 mov %r14,%rsi
1036: 44 89 ef mov %r13d,%edi
1039: 41 ff 14 dc callq *(%r12,%rbx,8)
103d: 48 83 c3 01 add $0x1,%rbx
1041: 48 39 dd cmp %rbx,%rbp
1044: 75 ea jne 1030 <__libc_csu_init+0x40>
1046: 48 83 c4 08 add $0x8,%rsp
104a: 5b pop %rbx
104b: 5d pop %rbp
104c: 41 5c pop %r12
104e: 41 5d pop %r13
1050: 41 5e pop %r14
1052: 41 5f pop %r15
1054: c3 retq
1055: 90 nop
1056: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
105d: 00 00 00
0000000000001060 <__libc_csu_fini>:
1060: f3 c3 repz retq
Disassembly of section .fini:
0000000000001064 <_fini>:
1064: 48 83 ec 08 sub $0x8,%rsp
1068: 48 83 c4 08 add $0x8,%rsp
106c: c3 retq
My object size is 64 bytes, and I have also added the initialization:
typedef struct _object {
    int value;
    int pad_0;
    int *pad_2;
    int *pad_3;
    int *pad_4;
    int *pad_5;
    int *pad_6;
    int *pad_7;
    int *pad_8;
} object;

object *array;
int arr_size = 1000;
array = (object *) malloc(arr_size * sizeof(object));
for (int i = 0; i < arr_size; i++) {
    array[i].value = 1;
}
I did some experiments on Haswell using LIKWID, which is similar to PAPI. I found that calls to the functions that initialize and read the performance counters can cause more than 600 replacements in the L1 cache. Since the L1 cache has only 512 lines, this means these functions may evict many of the lines you expect to remain in the L1. Looking at the fairly large source code of PAPI_start_counters and _internal_hl_read_cnts, it seems to me that these functions may evict many lines from the L1, so the array elements don't survive in the L1 across these calls. I've verified this by using loads instead of stores and counting hits and misses using MEM_LOAD_RETIRED.*. I think the solution would be to use the RDPMC instruction. I have not used this instruction directly before. The code snippets here look useful.
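On the question of whether a predefined function exists: GCC and Clang expose the instruction as the `__rdpmc` intrinsic in `<x86intrin.h>`, so no hand-written asm wrapper is needed. The following is a minimal sketch, not a complete measurement setup. The helper names `read_pmc` and `pmc_delta` are hypothetical, and the counter width passed to `pmc_delta` is an assumption you would need to check for your CPU. Executing RDPMC from user space faults unless the OS has enabled it, and which counter index holds which event depends on how the counters were programmed; neither step is shown here.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   /* declares __rdpmc() on GCC and Clang */

/* Hypothetical wrapper: read the performance counter selected by `index`.
 * Calling this faults unless user-space RDPMC has been enabled by the OS. */
uint64_t read_pmc(int index)
{
    return (uint64_t)__rdpmc(index);
}
#endif

/* Difference between two raw counter readings, tolerating one wraparound.
 * `width_bits` is the counter width (an assumption supplied by the caller). */
uint64_t pmc_delta(uint64_t before, uint64_t after, unsigned width_bits)
{
    assert(width_bits >= 1 && width_bits <= 64);
    uint64_t mask = (width_bits == 64) ? UINT64_MAX
                                       : ((UINT64_C(1) << width_bits) - 1);
    return (after - before) & mask;
}
```

In use, you would call `read_pmc` immediately before and after the loop and pass both readings to `pmc_delta`, which keeps the measured region free of library calls.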
Alternatively, you can place two copies of the loop after PAPI_start_counters/PAPI_read_counters and then subtract the count of one loop copy from the result. This approach works well.
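One way to read this suggestion, sketched under stated assumptions: measure once with a single copy of the loop between the counter reads and once with two back-to-back copies, then subtract. The first copy absorbs the misses on lines that the PAPI calls evicted, so the difference approximates the cost of the second, warm copy. The helper name is hypothetical, and the two inputs are assumed to come from `values[0]` in the question's code.

```c
#include <assert.h>

/* Hypothetical helper: given the miss count measured with one copy of the
 * loop and the count measured with two back-to-back copies, return the
 * misses attributable to the second (warm) copy alone. */
long long second_copy_misses(long long one_copy, long long two_copies)
{
    assert(one_copy >= 0 && two_copies >= 0);
    long long diff = two_copies - one_copy;
    /* Run-to-run noise can make the difference slightly negative; clamp. */
    return diff < 0 ? 0 : diff;
}
```

Because the two measurements come from separate runs, repeating each several times and comparing typical values is likely to be more robust than a single subtraction.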
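By the way, the L1D.REPLACEMENT counter seems fairly accurate on Haswell when the number of accessed cache lines is larger than about 10. Counting with RDPMC would probably be more accurate still.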
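Judging from your earlier questions, you seem to be using Skylake. According to the PAPI event mapping, PAPI_L1_DCM and PAPI_L2_TCM map to the L1D.REPLACEMENT and LONGEST_LAT_CACHE.REFERENCE performance monitoring events on Intel processors. These are defined in the Intel manual as follows: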
L1D.REPLACEMENT: Counts L1D data line replacements including opportunistic
replacements, and replacements that require stall-for-replace or
block-for-replace.
LONGEST_LAT_CACHE.REFERENCE: This event counts core-originated cacheable demand requests that refer
to the last level cache (LLC). Demand requests include loads, RFOs,
and hardware prefetches from L1D, and instruction fetches from IFU.
Without going into exactly when these events occur, here are three points that are relevant to your question:
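- Both events count at cache-line granularity, not at the granularity of x86 instructions or load uops.
- These events can occur because of the L1D hardware prefetchers. This can affect miss2.
- With these events (or any other set of events on SnB-based microarchitectures), there is no way to count L1D hits at cache-line granularity for a specific physical or logical core.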
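Since both events count cache lines, a rough expectation for a cold pass over the array is one event per line the array spans. The sketch below computes that number. The function name is hypothetical, and it assumes the array starts on a cache-line boundary; malloc does not guarantee this, in which case the array can span one extra line.

```c
#include <assert.h>
#include <stddef.h>

/* Number of cache lines covered by n_elems elements of elem_size bytes,
 * assuming the first element starts on a line boundary. */
size_t cache_lines_touched(size_t n_elems, size_t elem_size, size_t line_size)
{
    assert(line_size > 0);
    size_t bytes = n_elems * elem_size;
    return (bytes + line_size - 1) / line_size;   /* round up */
}
```

With the 64-byte object from the question and 64-byte lines, 200 elements cover 200 lines, while arr_size = 1000 covers 1000 lines, which is more than the 512 lines of L1 mentioned above. An array of that size cannot stay resident in the L1 between the two loops, whatever the PAPI calls do.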
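On Skylake, there are other native events that can count L1D misses and hits per load instruction. You can use MEM_LOAD_RETIRED.L1_HIT to count the number of retired load instructions that hit in the L1D. You can use MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETIRED.L1_HIT to count the number of retired load instructions that miss in the L1D. There do not seem to be PAPI preset events for them. According to the documentation, you can pass native event codes to PAPIF_start_counters.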
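The subtraction above is simple arithmetic once both counts are available. A small sketch follows. The function names are hypothetical, and the inputs are assumed to be the raw MEM_INST_RETIRED.ALL_LOADS and MEM_LOAD_RETIRED.L1_HIT counts collected over the same region; obtaining them through PAPI native events is not shown.

```c
#include <assert.h>

/* Retired loads that missed the L1D: all retired loads minus L1D hits.
 * Clamped at zero in case the two counts were not sampled consistently. */
long long l1d_load_misses(long long all_loads, long long l1_hits)
{
    assert(all_loads >= 0 && l1_hits >= 0);
    long long misses = all_loads - l1_hits;
    return misses < 0 ? 0 : misses;
}

/* Fraction of retired loads that hit the L1D; 0.0 when no loads retired. */
double l1d_hit_ratio(long long all_loads, long long l1_hits)
{
    return all_loads > 0 ? (double)l1_hits / (double)all_loads : 0.0;
}
```

Note that these events count load instructions only, so the store-only first loop in the question would not contribute to them.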
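Another issue is that it is not clear to me whether PAPIF_start_counters by default counts only user-mode events or both kernel-mode and user-mode events. It seems that you can use PAPI_create_eventset to control the counting domain.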
Calls to the PAPI API themselves also affect the event counts. You can try to measure this with an empty block, like this:
if ((ret1 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
    fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret1));
    exit(1);
}
// Nothing.
if ((ret2 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
    fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret2));
    exit(1);
}
This measurement gives you an estimate of the error that may be caused by PAPI itself.
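A single empty-block reading can be noisy, so one option is to repeat it several times, take the median as the baseline, and subtract that from the real measurement. This is a sketch of that post-processing only. The helper names are hypothetical, and the sample values would come from repeated runs of the empty block above.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparison callback for qsort over long long values. */
static int cmp_ll(const void *a, const void *b)
{
    long long x = *(const long long *)a;
    long long y = *(const long long *)b;
    return (x > y) - (x < y);
}

/* Median of n baseline samples; sorts the array in place.
 * For even n this returns the upper of the two middle values. */
long long median_count(long long *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    qsort(samples, n, sizeof samples[0], cmp_ll);
    return samples[n / 2];
}

/* Measured count with the PAPI baseline removed, clamped at zero. */
long long subtract_baseline(long long measured, long long baseline)
{
    return measured > baseline ? measured - baseline : 0;
}
```

The median is used rather than the mean so that an occasional outlier, for example from an interrupt during one run, does not skew the baseline.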
Also, I don't think you need to use _mm_mfence.
我正在尝试使用 PAPI 库来计算缓存未命中数。缓存命中性能计数器在我的硬件上不可用,这就是为什么我试图确定没有缓存未命中的缓存命中。我正在尝试一些事情。我的代码的第一个版本是这样的:
int numEvents = 2;
long long values[2];
int events[2] = {PAPI_L1_DCM, PAPI_L2_TCM};
if (PAPI_start_counters(events, numEvents) != PAPI_OK ) // !=PAPI_OK
printf("PAPI error: %d\n", 1);
for(int i=0; i < arr_size; i++)
{
array[i].value = 1;
}
_mm_mfence();
if ((ret1 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret1));
exit(1);
}
miss1 = values[0];
_mm_mfence();
for(int i=0; i < arr_size; i++){
array[i].value = array[i].value + 9; // (int) sum
}
_mm_mfence();
if ((ret2 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret2));
exit(1);
}
miss2 = values[0];
printf("before flush miss_1 %lli, miss_2 %lli \n", miss1, miss2);
问题是这段代码应该给我缓存命中率,所以 L1 缓存未命中率应该极低。但是 miss_2 我得到了意想不到的高结果。数组大小为 200 时,miss_2 接近 100。它没有给出任何有效结果来判断它是否真的被命中,因为缓存未命中的次数很多。
我也试过这样重写:
if (PAPI_start_counters(events, numEvents) != PAPI_OK ) // !=PAPI_OK
printf("PAPI error: %d\n", 1);
for(int i=0; i < arr_size; i++){
array[i].value = array[i].value + 9; // (int) sum
}
if ( PAPI_stop_counters(values, numEvents) != PAPI_OK)
printf("PAPI error: 2\n");
printf("before flush miss %lli\n", values[0]);
但这给出了更糟糕的结果,miss_2 超过 200。我有什么地方做得不对吗?它应该给出更精确的结果,但它现在做得很糟糕。或者我遗漏了什么。
我试过没有围栏,我相信至少它们不会造成任何伤害。如果有任何建议,我将不胜感激。
PAPI_read_counters 的缺点是它的开销,而且性能不是很好,但现在我不关心性能,我想正确地确定缓存命中。
虽然我也在考虑使用 RDMPC,但我还没有找到在没有 _asm 函数覆盖的情况下使用它的示例。这真的是使用 rdpmc 的唯一方法吗?不存在我不必覆盖的已经定义的函数?
编辑: 为 PAPI_read
添加编译器代码 ./prog6: file format elf64-x86-64
Disassembly of section .init:
00000000000009c0 <_init>:
9c0: 48 83 ec 08 sub [=13=]x8,%rsp
9c4: 48 8b 05 1d 16 20 00 mov 0x20161d(%rip),%rax # 201fe8 <__gmon_start__>
9cb: 48 85 c0 test %rax,%rax
9ce: 74 02 je 9d2 <_init+0x12>
9d0: ff d0 callq *%rax
9d2: 48 83 c4 08 add [=13=]x8,%rsp
9d6: c3 retq
Disassembly of section .plt:
00000000000009e0 <.plt>:
9e0: ff 35 6a 15 20 00 pushq 0x20156a(%rip) # 201f50 <_GLOBAL_OFFSET_TABLE_+0x8>
9e6: ff 25 6c 15 20 00 jmpq *0x20156c(%rip) # 201f58 <_GLOBAL_OFFSET_TABLE_+0x10>
9ec: 0f 1f 40 00 nopl 0x0(%rax)
00000000000009f0 <puts@plt>:
9f0: ff 25 6a 15 20 00 jmpq *0x20156a(%rip) # 201f60 <puts@GLIBC_2.2.5>
9f6: 68 00 00 00 00 pushq [=13=]x0
9fb: e9 e0 ff ff ff jmpq 9e0 <.plt>
0000000000000a00 <clock_gettime@plt>:
a00: ff 25 62 15 20 00 jmpq *0x201562(%rip) # 201f68 <clock_gettime@GLIBC_2.17>
a06: 68 01 00 00 00 pushq [=13=]x1
a0b: e9 d0 ff ff ff jmpq 9e0 <.plt>
0000000000000a10 <getpid@plt>:
a10: ff 25 5a 15 20 00 jmpq *0x20155a(%rip) # 201f70 <getpid@GLIBC_2.2.5>
a16: 68 02 00 00 00 pushq [=13=]x2
a1b: e9 c0 ff ff ff jmpq 9e0 <.plt>
0000000000000a20 <__stack_chk_fail@plt>:
a20: ff 25 52 15 20 00 jmpq *0x201552(%rip) # 201f78 <__stack_chk_fail@GLIBC_2.4>
a26: 68 03 00 00 00 pushq [=13=]x3
a2b: e9 b0 ff ff ff jmpq 9e0 <.plt>
0000000000000a30 <PAPI_read_counters@plt>:
a30: ff 25 4a 15 20 00 jmpq *0x20154a(%rip) # 201f80 <PAPI_read_counters>
a36: 68 04 00 00 00 pushq [=13=]x4
a3b: e9 a0 ff ff ff jmpq 9e0 <.plt>
0000000000000a40 <sched_setaffinity@plt>:
a40: ff 25 42 15 20 00 jmpq *0x201542(%rip) # 201f88 <sched_setaffinity@GLIBC_2.3.4>
a46: 68 05 00 00 00 pushq [=13=]x5
a4b: e9 90 ff ff ff jmpq 9e0 <.plt>
0000000000000a50 <PAPI_start_counters@plt>:
a50: ff 25 3a 15 20 00 jmpq *0x20153a(%rip) # 201f90 <PAPI_start_counters>
a56: 68 06 00 00 00 pushq [=13=]x6
a5b: e9 80 ff ff ff jmpq 9e0 <.plt>
0000000000000a60 <PAPI_stop_counters@plt>:
a60: ff 25 32 15 20 00 jmpq *0x201532(%rip) # 201f98 <PAPI_stop_counters>
a66: 68 07 00 00 00 pushq [=13=]x7
a6b: e9 70 ff ff ff jmpq 9e0 <.plt>
0000000000000a70 <malloc@plt>:
a70: ff 25 2a 15 20 00 jmpq *0x20152a(%rip) # 201fa0 <malloc@GLIBC_2.2.5>
a76: 68 08 00 00 00 pushq [=13=]x8
a7b: e9 60 ff ff ff jmpq 9e0 <.plt>
0000000000000a80 <PAPI_strerror@plt>:
a80: ff 25 22 15 20 00 jmpq *0x201522(%rip) # 201fa8 <PAPI_strerror>
a86: 68 09 00 00 00 pushq [=13=]x9
a8b: e9 50 ff ff ff jmpq 9e0 <.plt>
0000000000000a90 <__printf_chk@plt>:
a90: ff 25 1a 15 20 00 jmpq *0x20151a(%rip) # 201fb0 <__printf_chk@GLIBC_2.3.4>
a96: 68 0a 00 00 00 pushq [=13=]xa
a9b: e9 40 ff ff ff jmpq 9e0 <.plt>
0000000000000aa0 <getrusage@plt>:
aa0: ff 25 12 15 20 00 jmpq *0x201512(%rip) # 201fb8 <getrusage@GLIBC_2.2.5>
aa6: 68 0b 00 00 00 pushq [=13=]xb
aab: e9 30 ff ff ff jmpq 9e0 <.plt>
0000000000000ab0 <exit@plt>:
ab0: ff 25 0a 15 20 00 jmpq *0x20150a(%rip) # 201fc0 <exit@GLIBC_2.2.5>
ab6: 68 0c 00 00 00 pushq [=13=]xc
abb: e9 20 ff ff ff jmpq 9e0 <.plt>
0000000000000ac0 <fwrite@plt>:
ac0: ff 25 02 15 20 00 jmpq *0x201502(%rip) # 201fc8 <fwrite@GLIBC_2.2.5>
ac6: 68 0d 00 00 00 pushq [=13=]xd
acb: e9 10 ff ff ff jmpq 9e0 <.plt>
0000000000000ad0 <__fprintf_chk@plt>:
ad0: ff 25 fa 14 20 00 jmpq *0x2014fa(%rip) # 201fd0 <__fprintf_chk@GLIBC_2.3.4>
ad6: 68 0e 00 00 00 pushq [=13=]xe
adb: e9 00 ff ff ff jmpq 9e0 <.plt>
Disassembly of section .plt.got:
0000000000000ae0 <__cxa_finalize@plt>:
ae0: ff 25 12 15 20 00 jmpq *0x201512(%rip) # 201ff8 <__cxa_finalize@GLIBC_2.2.5>
ae6: 66 90 xchg %ax,%ax
Disassembly of section .text:
0000000000000af0 <main>:
af0: 41 57 push %r15
af2: b9 0f 00 00 00 mov [=13=]xf,%ecx
af7: 41 56 push %r14
af9: 41 55 push %r13
afb: 41 54 push %r12
afd: 55 push %rbp
afe: 53 push %rbx
aff: 48 81 ec 78 01 00 00 sub [=13=]x178,%rsp
b06: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
b0d: 00 00
b0f: 48 89 84 24 68 01 00 mov %rax,0x168(%rsp)
b16: 00
b17: 31 c0 xor %eax,%eax
b19: 48 8d 9c 24 e0 00 00 lea 0xe0(%rsp),%rbx
b20: 00
b21: 48 b8 00 00 00 80 07 movabs [=13=]x8000000780000000,%rax
b28: 00 00 80
b2b: 48 c7 84 24 e0 00 00 movq [=13=]x1,0xe0(%rsp)
b32: 00 01 00 00 00
b37: 48 8d 53 08 lea 0x8(%rbx),%rdx
b3b: 48 89 84 24 c8 00 00 mov %rax,0xc8(%rsp)
b42: 00
b43: 31 c0 xor %eax,%eax
b45: 48 89 d7 mov %rdx,%rdi
b48: f3 48 ab rep stos %rax,%es:(%rdi)
b4b: e8 c0 fe ff ff callq a10 <getpid@plt>
b50: 48 89 da mov %rbx,%rdx
b53: be 80 00 00 00 mov [=13=]x80,%esi
b58: 89 c7 mov %eax,%edi
b5a: e8 e1 fe ff ff callq a40 <sched_setaffinity@plt>
b5f: 85 c0 test %eax,%eax
b61: 0f 85 17 03 00 00 jne e7e <main+0x38e>
b67: 0f ae f0 mfence
b6a: 48 8d 74 24 10 lea 0x10(%rsp),%rsi
b6f: bf 02 00 00 00 mov [=13=]x2,%edi
b74: 0f ae f0 mfence
b77: e8 84 fe ff ff callq a00 <clock_gettime@plt>
b7c: 0f 31 rdtsc
b7e: bf 00 fa 00 00 mov [=13=]xfa00,%edi
b83: 0f ae f0 mfence
b86: 48 c1 e2 20 shl [=13=]x20,%rdx
b8a: 49 89 c6 mov %rax,%r14
b8d: 49 09 d6 or %rdx,%r14
b90: e8 db fe ff ff callq a70 <malloc@plt>
b95: 48 8d bc 24 c8 00 00 lea 0xc8(%rsp),%rdi
b9c: 00
b9d: be 02 00 00 00 mov [=13=]x2,%esi
ba2: 49 89 c4 mov %rax,%r12
ba5: e8 a6 fe ff ff callq a50 <PAPI_start_counters@plt>
baa: 85 c0 test %eax,%eax
bac: 0f 85 88 02 00 00 jne e3a <main+0x34a>
bb2: 4d 89 e7 mov %r12,%r15
bb5: 49 8d 84 24 00 fa 00 lea 0xfa00(%r12),%rax
bbc: 00
bbd: 4c 89 e5 mov %r12,%rbp
bc0: c7 45 00 01 00 00 00 movl [=13=]x1,0x0(%rbp)
bc7: 48 83 c5 40 add [=13=]x40,%rbp
bcb: 48 39 e8 cmp %rbp,%rax
bce: 75 f0 jne bc0 <main+0xd0>
bd0: 4c 8d ac 24 d0 00 00 lea 0xd0(%rsp),%r13
bd7: 00
bd8: be 02 00 00 00 mov [=13=]x2,%esi
bdd: 4c 89 ef mov %r13,%rdi
be0: e8 4b fe ff ff callq a30 <PAPI_read_counters@plt>
be5: 85 c0 test %eax,%eax
be7: 0f 85 b8 02 00 00 jne ea5 <main+0x3b5>
bed: 48 8b 84 24 d0 00 00 mov 0xd0(%rsp),%rax
bf4: 00
bf5: 4c 89 e3 mov %r12,%rbx
bf8: 48 89 44 24 08 mov %rax,0x8(%rsp)
bfd: 0f 1f 00 nopl (%rax)
c00: 83 03 09 addl [=13=]x9,(%rbx)
c03: 48 83 c3 40 add [=13=]x40,%rbx
c07: 48 39 dd cmp %rbx,%rbp
c0a: 75 f4 jne c00 <main+0x110>
c0c: 31 d2 xor %edx,%edx
c0e: 48 8d 35 88 04 00 00 lea 0x488(%rip),%rsi # 109d <_IO_stdin_used+0x2d>
c15: bf 01 00 00 00 mov [=13=]x1,%edi
c1a: 31 c0 xor %eax,%eax
c1c: e8 6f fe ff ff callq a90 <__printf_chk@plt>
c21: be 02 00 00 00 mov [=13=]x2,%esi
c26: 4c 89 ef mov %r13,%rdi
c29: e8 02 fe ff ff callq a30 <PAPI_read_counters@plt>
c2e: 85 c0 test %eax,%eax
c30: 0f 85 6f 02 00 00 jne ea5 <main+0x3b5>
c36: 48 8b 8c 24 d0 00 00 mov 0xd0(%rsp),%rcx
c3d: 00
c3e: 48 8b 54 24 08 mov 0x8(%rsp),%rdx
c43: 48 8d 35 e6 04 00 00 lea 0x4e6(%rip),%rsi # 1130 <_IO_stdin_used+0xc0>
c4a: 31 c0 xor %eax,%eax
c4c: bf 01 00 00 00 mov [=13=]x1,%edi
c51: e8 3a fe ff ff callq a90 <__printf_chk@plt>
c56: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
c5d: 00 00 00
c60: 41 0f ae 3c 24 clflush (%r12)
c65: 49 83 c4 40 add [=13=]x40,%r12
c69: 49 39 dc cmp %rbx,%r12
c6c: 75 f2 jne c60 <main+0x170>
c6e: be 02 00 00 00 mov [=13=]x2,%esi
c73: 4c 89 ef mov %r13,%rdi
c76: e8 b5 fd ff ff callq a30 <PAPI_read_counters@plt>
c7b: 85 c0 test %eax,%eax
c7d: 0f 85 22 02 00 00 jne ea5 <main+0x3b5>
c83: 48 8b ac 24 d0 00 00 mov 0xd0(%rsp),%rbp
c8a: 00
c8b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
c90: 41 83 07 09 addl [=13=]x9,(%r15)
c94: 49 83 c7 40 add [=13=]x40,%r15
c98: 49 39 df cmp %rbx,%r15
c9b: 75 f3 jne c90 <main+0x1a0>
c9d: be 02 00 00 00 mov [=13=]x2,%esi
ca2: 4c 89 ef mov %r13,%rdi
ca5: e8 86 fd ff ff callq a30 <PAPI_read_counters@plt>
caa: 85 c0 test %eax,%eax
cac: 0f 85 f3 01 00 00 jne ea5 <main+0x3b5>
cb2: 48 8b 8c 24 d0 00 00 mov 0xd0(%rsp),%rcx
cb9: 00
cba: 48 8d 35 97 04 00 00 lea 0x497(%rip),%rsi # 1158 <_IO_stdin_used+0xe8>
cc1: bf 01 00 00 00 mov [=13=]x1,%edi
cc6: 31 c0 xor %eax,%eax
cc8: 48 89 ea mov %rbp,%rdx
ccb: e8 c0 fd ff ff callq a90 <__printf_chk@plt>
cd0: be 02 00 00 00 mov [=13=]x2,%esi
cd5: 4c 89 ef mov %r13,%rdi
cd8: e8 83 fd ff ff callq a60 <PAPI_stop_counters@plt>
cdd: 85 c0 test %eax,%eax
cdf: 0f 85 72 01 00 00 jne e57 <main+0x367>
ce5: 0f ae f0 mfence
ce8: 0f 31 rdtsc
cea: bf 02 00 00 00 mov [=13=]x2,%edi
cef: 48 c1 e2 20 shl [=13=]x20,%rdx
cf3: 48 89 c3 mov %rax,%rbx
cf6: 48 8d 74 24 20 lea 0x20(%rsp),%rsi
cfb: 48 09 d3 or %rdx,%rbx
cfe: e8 fd fc ff ff callq a00 <clock_gettime@plt>
d03: bf 01 00 00 00 mov [=13=]x1,%edi
d08: 48 be db 34 b6 d7 82 movabs [=13=]x431bde82d7b634db,%rsi
d0f: de 1b 43
d12: 0f ae f0 mfence
d15: 48 8b 4c 24 20 mov 0x20(%rsp),%rcx
d1a: 48 2b 4c 24 10 sub 0x10(%rsp),%rcx
d1f: 48 69 c9 00 ca 9a 3b imul [=13=]x3b9aca00,%rcx,%rcx
d26: 48 03 4c 24 28 add 0x28(%rsp),%rcx
d2b: 48 2b 4c 24 18 sub 0x18(%rsp),%rcx
d30: 48 89 c8 mov %rcx,%rax
d33: 48 c1 f9 3f sar [=13=]x3f,%rcx
d37: 48 f7 ee imul %rsi
d3a: 48 8d 35 3f 04 00 00 lea 0x43f(%rip),%rsi # 1180 <_IO_stdin_used+0x110>
d41: 31 c0 xor %eax,%eax
d43: 48 c1 fa 12 sar [=13=]x12,%rdx
d47: 48 29 ca sub %rcx,%rdx
d4a: e8 41 fd ff ff callq a90 <__printf_chk@plt>
d4f: 48 89 da mov %rbx,%rdx
d52: bf 01 00 00 00 mov [=13=]x1,%edi
d57: 31 c0 xor %eax,%eax
d59: 4c 29 f2 sub %r14,%rdx
d5c: 48 8d 35 53 03 00 00 lea 0x353(%rip),%rsi # 10b6 <_IO_stdin_used+0x46>
d63: e8 28 fd ff ff callq a90 <__printf_chk@plt>
d68: 31 d2 xor %edx,%edx
d6a: 48 8d 35 56 03 00 00 lea 0x356(%rip),%rsi # 10c7 <_IO_stdin_used+0x57>
d71: 31 c0 xor %eax,%eax
d73: bf 01 00 00 00 mov $0x1,%edi
d78: e8 13 fd ff ff callq a90 <__printf_chk@plt>
d7d: 31 ff xor %edi,%edi
d7f: 48 8d 74 24 30 lea 0x30(%rsp),%rsi
d84: e8 17 fd ff ff callq aa0 <getrusage@plt>
d89: 83 f8 ff cmp $0xffffffff,%eax
d8c: 0f 84 d6 00 00 00 je e68 <main+0x378>
d92: 48 8b 8c 24 b8 00 00 mov 0xb8(%rsp),%rcx
d99: 00
d9a: 48 8b 94 24 b0 00 00 mov 0xb0(%rsp),%rdx
da1: 00
da2: 48 8d 35 3e 03 00 00 lea 0x33e(%rip),%rsi # 10e7 <_IO_stdin_used+0x77>
da9: 31 c0 xor %eax,%eax
dab: bf 01 00 00 00 mov $0x1,%edi
db0: e8 db fc ff ff callq a90 <__printf_chk@plt>
db5: c5 f9 57 c0 vxorpd %xmm0,%xmm0,%xmm0
db9: bf 01 00 00 00 mov $0x1,%edi
dbe: c5 fb 10 0d 12 04 00 vmovsd 0x412(%rip),%xmm1 # 11d8 <_IO_stdin_used+0x168>
dc5: 00
dc6: 48 69 44 24 30 40 42 imul $0xf4240,0x30(%rsp),%rax
dcd: 0f 00
dcf: 48 03 44 24 38 add 0x38(%rsp),%rax
dd4: 48 8d 35 d5 03 00 00 lea 0x3d5(%rip),%rsi # 11b0 <_IO_stdin_used+0x140>
ddb: c4 e1 fb 2a c0 vcvtsi2sd %rax,%xmm0,%xmm0
de0: 48 69 54 24 40 40 42 imul $0xf4240,0x40(%rsp),%rdx
de7: 0f 00
de9: 48 03 54 24 48 add 0x48(%rsp),%rdx
dee: c5 fb 59 c1 vmulsd %xmm1,%xmm0,%xmm0
df2: c4 e1 fb 2c c0 vcvttsd2si %xmm0,%rax
df7: c5 f9 57 c0 vxorpd %xmm0,%xmm0,%xmm0
dfb: c4 e1 fb 2a c2 vcvtsi2sd %rdx,%xmm0,%xmm0
e00: c5 fb 59 c1 vmulsd %xmm1,%xmm0,%xmm0
e04: c4 e1 fb 2c d0 vcvttsd2si %xmm0,%rdx
e09: 48 01 c2 add %rax,%rdx
e0c: 31 c0 xor %eax,%eax
e0e: e8 7d fc ff ff callq a90 <__printf_chk@plt>
e13: 31 c0 xor %eax,%eax
e15: 48 8b 8c 24 68 01 00 mov 0x168(%rsp),%rcx
e1c: 00
e1d: 64 48 33 0c 25 28 00 xor %fs:0x28,%rcx
e24: 00 00
e26: 75 51 jne e79 <main+0x389>
e28: 48 81 c4 78 01 00 00 add $0x178,%rsp
e2f: 5b pop %rbx
e30: 5d pop %rbp
e31: 41 5c pop %r12
e33: 41 5d pop %r13
e35: 41 5e pop %r14
e37: 41 5f pop %r15
e39: c3 retq
e3a: ba 01 00 00 00 mov $0x1,%edx
e3f: 48 8d 35 47 02 00 00 lea 0x247(%rip),%rsi # 108d <_IO_stdin_used+0x1d>
e46: bf 01 00 00 00 mov $0x1,%edi
e4b: 31 c0 xor %eax,%eax
e4d: e8 3e fc ff ff callq a90 <__printf_chk@plt>
e52: e9 5b fd ff ff jmpq bb2 <main+0xc2>
e57: 48 8d 3d 4a 02 00 00 lea 0x24a(%rip),%rdi # 10a8 <_IO_stdin_used+0x38>
e5e: e8 8d fb ff ff callq 9f0 <puts@plt>
e63: e9 7d fe ff ff jmpq ce5 <main+0x1f5>
e68: 48 8d 3d 62 02 00 00 lea 0x262(%rip),%rdi # 10d1 <_IO_stdin_used+0x61>
e6f: e8 7c fb ff ff callq 9f0 <puts@plt>
e74: e9 19 ff ff ff jmpq d92 <main+0x2a2>
e79: e8 a2 fb ff ff callq a20 <__stack_chk_fail@plt>
e7e: 48 8b 0d 9b 11 20 00 mov 0x20119b(%rip),%rcx # 202020 <stderr@@GLIBC_2.2.5>
e85: ba 18 00 00 00 mov $0x18,%edx
e8a: be 01 00 00 00 mov $0x1,%esi
e8f: 48 8d 3d de 01 00 00 lea 0x1de(%rip),%rdi # 1074 <_IO_stdin_used+0x4>
e96: e8 25 fc ff ff callq ac0 <fwrite@plt>
e9b: bf 01 00 00 00 mov $0x1,%edi
ea0: e8 0b fc ff ff callq ab0 <exit@plt>
ea5: 89 c7 mov %eax,%edi
ea7: e8 d4 fb ff ff callq a80 <PAPI_strerror@plt>
eac: 48 8b 3d 6d 11 20 00 mov 0x20116d(%rip),%rdi # 202020 <stderr@@GLIBC_2.2.5>
eb3: be 01 00 00 00 mov $0x1,%esi
eb8: 48 8d 15 49 02 00 00 lea 0x249(%rip),%rdx # 1108 <_IO_stdin_used+0x98>
ebf: 48 89 c1 mov %rax,%rcx
ec2: 31 c0 xor %eax,%eax
ec4: e8 07 fc ff ff callq ad0 <__fprintf_chk@plt>
ec9: bf 01 00 00 00 mov $0x1,%edi
ece: e8 dd fb ff ff callq ab0 <exit@plt>
ed3: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
eda: 00 00 00
edd: 0f 1f 00 nopl (%rax)
0000000000000ee0 <_start>:
ee0: 31 ed xor %ebp,%ebp
ee2: 49 89 d1 mov %rdx,%r9
ee5: 5e pop %rsi
ee6: 48 89 e2 mov %rsp,%rdx
ee9: 48 83 e4 f0 and $0xfffffffffffffff0,%rsp
eed: 50 push %rax
eee: 54 push %rsp
eef: 4c 8d 05 6a 01 00 00 lea 0x16a(%rip),%r8 # 1060 <__libc_csu_fini>
ef6: 48 8d 0d f3 00 00 00 lea 0xf3(%rip),%rcx # ff0 <__libc_csu_init>
efd: 48 8d 3d ec fb ff ff lea -0x414(%rip),%rdi # af0 <main>
f04: ff 15 d6 10 20 00 callq *0x2010d6(%rip) # 201fe0 <__libc_start_main@GLIBC_2.2.5>
f0a: f4 hlt
f0b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
0000000000000f10 <deregister_tm_clones>:
f10: 48 8d 3d f9 10 20 00 lea 0x2010f9(%rip),%rdi # 202010 <__TMC_END__>
f17: 55 push %rbp
f18: 48 8d 05 f1 10 20 00 lea 0x2010f1(%rip),%rax # 202010 <__TMC_END__>
f1f: 48 39 f8 cmp %rdi,%rax
f22: 48 89 e5 mov %rsp,%rbp
f25: 74 19 je f40 <deregister_tm_clones+0x30>
f27: 48 8b 05 aa 10 20 00 mov 0x2010aa(%rip),%rax # 201fd8 <_ITM_deregisterTMCloneTable>
f2e: 48 85 c0 test %rax,%rax
f31: 74 0d je f40 <deregister_tm_clones+0x30>
f33: 5d pop %rbp
f34: ff e0 jmpq *%rax
f36: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
f3d: 00 00 00
f40: 5d pop %rbp
f41: c3 retq
f42: 0f 1f 40 00 nopl 0x0(%rax)
f46: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
f4d: 00 00 00
0000000000000f50 <register_tm_clones>:
f50: 48 8d 3d b9 10 20 00 lea 0x2010b9(%rip),%rdi # 202010 <__TMC_END__>
f57: 48 8d 35 b2 10 20 00 lea 0x2010b2(%rip),%rsi # 202010 <__TMC_END__>
f5e: 55 push %rbp
f5f: 48 29 fe sub %rdi,%rsi
f62: 48 89 e5 mov %rsp,%rbp
f65: 48 c1 fe 03 sar $0x3,%rsi
f69: 48 89 f0 mov %rsi,%rax
f6c: 48 c1 e8 3f shr $0x3f,%rax
f70: 48 01 c6 add %rax,%rsi
f73: 48 d1 fe sar %rsi
f76: 74 18 je f90 <register_tm_clones+0x40>
f78: 48 8b 05 71 10 20 00 mov 0x201071(%rip),%rax # 201ff0 <_ITM_registerTMCloneTable>
f7f: 48 85 c0 test %rax,%rax
f82: 74 0c je f90 <register_tm_clones+0x40>
f84: 5d pop %rbp
f85: ff e0 jmpq *%rax
f87: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
f8e: 00 00
f90: 5d pop %rbp
f91: c3 retq
f92: 0f 1f 40 00 nopl 0x0(%rax)
f96: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
f9d: 00 00 00
0000000000000fa0 <__do_global_dtors_aux>:
fa0: 80 3d 81 10 20 00 00 cmpb $0x0,0x201081(%rip) # 202028 <completed.7696>
fa7: 75 2f jne fd8 <__do_global_dtors_aux+0x38>
fa9: 48 83 3d 47 10 20 00 cmpq $0x0,0x201047(%rip) # 201ff8 <__cxa_finalize@GLIBC_2.2.5>
fb0: 00
fb1: 55 push %rbp
fb2: 48 89 e5 mov %rsp,%rbp
fb5: 74 0c je fc3 <__do_global_dtors_aux+0x23>
fb7: 48 8b 3d 4a 10 20 00 mov 0x20104a(%rip),%rdi # 202008 <__dso_handle>
fbe: e8 1d fb ff ff callq ae0 <__cxa_finalize@plt>
fc3: e8 48 ff ff ff callq f10 <deregister_tm_clones>
fc8: c6 05 59 10 20 00 01 movb $0x1,0x201059(%rip) # 202028 <completed.7696>
fcf: 5d pop %rbp
fd0: c3 retq
fd1: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
fd8: f3 c3 repz retq
fda: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
0000000000000fe0 <frame_dummy>:
fe0: 55 push %rbp
fe1: 48 89 e5 mov %rsp,%rbp
fe4: 5d pop %rbp
fe5: e9 66 ff ff ff jmpq f50 <register_tm_clones>
fea: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
0000000000000ff0 <__libc_csu_init>:
ff0: 41 57 push %r15
ff2: 41 56 push %r14
ff4: 49 89 d7 mov %rdx,%r15
ff7: 41 55 push %r13
ff9: 41 54 push %r12
ffb: 4c 8d 25 36 0d 20 00 lea 0x200d36(%rip),%r12 # 201d38 <__frame_dummy_init_array_entry>
1002: 55 push %rbp
1003: 48 8d 2d 36 0d 20 00 lea 0x200d36(%rip),%rbp # 201d40 <__init_array_end>
100a: 53 push %rbx
100b: 41 89 fd mov %edi,%r13d
100e: 49 89 f6 mov %rsi,%r14
1011: 4c 29 e5 sub %r12,%rbp
1014: 48 83 ec 08 sub $0x8,%rsp
1018: 48 c1 fd 03 sar $0x3,%rbp
101c: e8 9f f9 ff ff callq 9c0 <_init>
1021: 48 85 ed test %rbp,%rbp
1024: 74 20 je 1046 <__libc_csu_init+0x56>
1026: 31 db xor %ebx,%ebx
1028: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
102f: 00
1030: 4c 89 fa mov %r15,%rdx
1033: 4c 89 f6 mov %r14,%rsi
1036: 44 89 ef mov %r13d,%edi
1039: 41 ff 14 dc callq *(%r12,%rbx,8)
103d: 48 83 c3 01 add $0x1,%rbx
1041: 48 39 dd cmp %rbx,%rbp
1044: 75 ea jne 1030 <__libc_csu_init+0x40>
1046: 48 83 c4 08 add $0x8,%rsp
104a: 5b pop %rbx
104b: 5d pop %rbp
104c: 41 5c pop %r12
104e: 41 5d pop %r13
1050: 41 5e pop %r14
1052: 41 5f pop %r15
1054: c3 retq
1055: 90 nop
1056: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
105d: 00 00 00
0000000000001060 <__libc_csu_fini>:
1060: f3 c3 repz retq
Disassembly of section .fini:
0000000000001064 <_fini>:
1064: 48 83 ec 08 sub $0x8,%rsp
1068: 48 83 c4 08 add $0x8,%rsp
106c: c3 retq
My objects are 64 bytes each, and I have also added the initialization:
typedef struct _object{
int value;
int pad_0;
int * pad_2;
int * pad_3;
int * pad_4;
int * pad_5;
int * pad_6;
int * pad_7;
int * pad_8;
} object;
object * array;
int arr_size = 1000;
array = (object *) malloc(arr_size * sizeof(object));
for(int i=0; i < arr_size; i++){
array[i].value = 1;
}
I did some experiments on Haswell using LIKWID, which is similar to PAPI. I found that the calls to the functions that initialize and read the performance counters can cause more than 600 replacements in the L1 cache. Since the L1 cache has only 512 lines, these functions may evict many of the lines you want to keep in the L1. Looking at the fairly large source code of PAPI_start_counters and _internal_hl_read_cnts, it seems to me that these functions may evict many lines from the L1, so the array elements don't survive in the L1 across these calls. I've verified this by using loads instead of stores and counting hits and misses using MEM_LOAD_RETIRED.*. I think the solution would be to use the RDPMC instruction. I have not used this instruction directly before. The code snippets here look useful.
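To put the 512-line figure in context: it follows from a 32 KiB L1D with 64-byte lines, which is the geometry assumed here for Haswell and Skylake. The sketch below computes how many cache lines an array of 64-byte objects touches. The helper names are made up for illustration, and the calculation assumes the array starts on a cache-line boundary, which malloc does not guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed L1D geometry (Haswell/Skylake): 32 KiB, 64-byte lines -> 512 lines. */
enum {
    L1D_BYTES  = 32 * 1024,
    LINE_BYTES = 64,
    L1D_LINES  = L1D_BYTES / LINE_BYTES
};

/* Number of cache lines touched by n objects of obj_size bytes,
   assuming the array starts on a cache-line boundary. */
static inline size_t lines_touched(size_t n, size_t obj_size)
{
    assert(obj_size > 0);
    return (n * obj_size + LINE_BYTES - 1) / LINE_BYTES;
}

/* Nonzero if the whole array could be resident in the L1D at once.
   This ignores set conflicts and any other data competing for the cache. */
static inline int fits_in_l1d(size_t n, size_t obj_size)
{
    return lines_touched(n, obj_size) <= (size_t)L1D_LINES;
}
```

With 64-byte objects, lines_touched(200, 64) is 200, so a 200-element array could fit, whereas the arr_size = 1000 shown in the initialization needs 1000 lines, which is more than the whole L1D before PAPI touches anything.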
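Alternatively, you can put two copies of the loop after PAPI_start_counters/PAPI_read_counters and then subtract the counts for one copy of the loop from the result. This method works well.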
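A rough sketch of that subtraction follows. It deliberately does not call PAPI: read_fn is a hypothetical stand-in for whatever reads the counter, and sim_read/sim_loop form a toy model (the 600 and 200 figures are illustrative, taken from the numbers discussed in this thread) that only exercises the arithmetic. In a real program the loop would be literally duplicated in the source rather than called through a function pointer.

```c
#include <assert.h>

/* Hypothetical counter interface: returns the current event count.
   In a real program this would wrap a PAPI or RDPMC read. */
typedef long long (*read_fn)(void);
typedef void (*body_fn)(void);

/* Count events across `copies` back-to-back executions of `body`. */
static inline long long measure(read_fn read_counter, body_fn body, int copies)
{
    assert(copies > 0);
    long long start = read_counter();
    for (int i = 0; i < copies; i++)
        body();
    return read_counter() - start;
}

/* Events attributable to the second copy:
   count(two copies) - count(one copy). */
static inline long long second_copy_count(read_fn read_counter, body_fn body)
{
    long long one = measure(read_counter, body, 1);
    long long two = measure(read_counter, body, 2);
    return two - one;
}

/* ---- Toy model, not real hardware ---- */
static long long sim_events = 0; /* simulated L1D replacement counter */
static int sim_warm = 0;         /* is the array in the simulated L1D? */

static inline long long sim_read(void)
{
    long long sample = sim_events;
    sim_events += 600;           /* pollution caused by the read call itself */
    sim_warm = 0;                /* the call evicts the array */
    return sample;
}

static inline void sim_loop(void)
{
    if (!sim_warm)
        sim_events += 200;       /* cold pass: one miss per line, 200 lines */
    sim_warm = 1;
}
```

In this model measure(sim_read, sim_loop, 1) reports 800 events, dominated by the read call and the cold pass, while second_copy_count(sim_read, sim_loop) reports 0, which is the count for a loop that actually runs with the array in the L1.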
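By the way, the L1D.REPLACEMENT counter seems to be fairly accurate on Haswell when the number of cache lines accessed is larger than about 10. Counting with RDPMC might be more accurate.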
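From your previous question, it seems that you are using Skylake. According to the PAPI event mapping, PAPI_L1_DCM and PAPI_L2_TCM are mapped to the L1D.REPLACEMENT and LONGEST_LAT_CACHE.REFERENCE performance monitoring events on Intel processors. These are defined in the Intel manual as follows: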
L1D.REPLACEMENT: Counts L1D data line replacements including opportunistic replacements, and replacements that require stall-for-replace or block-for-replace.
LONGEST_LAT_CACHE.REFERENCE: This event counts core-originated cacheable demand requests that refer to the last level cache (LLC). Demand requests include loads, RFOs, and hardware prefetches from L1D, and instruction fetches from IFU.
Without getting into the details of exactly when these events occur, there are three points relevant to your question:
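- Both events count at cache-line granularity, not at the granularity of x86 instructions or load uops.
- These events may occur because of the L1D hardware prefetchers. This affects miss2.
- There is no way to count L1D hits at cache-line granularity for a specific physical or logical core using these events, or any other set of events on SnB-based microarchitectures.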
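On Skylake, there are other native events that you can use to count L1D misses and hits per load instruction. You can use MEM_LOAD_RETIRED.L1_HIT to count the number of retired load instructions that hit in the L1D. You can use MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETIRED.L1_HIT to count the number of retired load instructions that miss in the L1D. There don't seem to be PAPI events for them. However, according to the documentation, you can pass native event codes to PAPIF_start_counters.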
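The subtraction above is simple enough to wrap in two small helpers. This is only a sketch of the arithmetic: the function names are invented for illustration, and the two arguments are assumed to be readings of MEM_INST_RETIRED.ALL_LOADS and MEM_LOAD_RETIRED.L1_HIT taken over the same code region.

```c
#include <assert.h>

/* Retired loads that missed the L1D:
   MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETIRED.L1_HIT. */
static inline long long l1d_load_misses(long long all_loads, long long l1_hits)
{
    assert(all_loads >= 0 && l1_hits >= 0);
    return all_loads - l1_hits;
}

/* Fraction of retired loads that hit the L1D; 0.0 when no loads retired. */
static inline double l1d_hit_ratio(long long all_loads, long long l1_hits)
{
    assert(all_loads >= 0 && l1_hits >= 0);
    if (all_loads == 0)
        return 0.0;
    return (double)l1_hits / (double)all_loads;
}
```

Note that these are per-instruction counts, so they are not directly comparable to the cache-line-granularity numbers from L1D.REPLACEMENT.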
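Another issue is that it's not clear to me whether PAPIF_start_counters by default counts only user-mode events or both kernel-mode and user-mode events. It seems that you can use PAPI_create_eventset to control the counting domain.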
Calls to the PAPI API may also affect the event counts. You can try to measure that using an empty block, as follows:
if ((ret1 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret1));
exit(1);
}
// Nothing.
if ((ret2 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret2));
exit(1);
}
This measurement will give you an estimate of the error that may be introduced by PAPI itself.
Also, I don't think you need to use _mm_mfence.