Difference Between mem_load_uops_retired.l3_miss and offcore_response.demand_data_rd.l3_miss.local_dram Events
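I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell) processor. AFAIK, mem_load_uops_retired.l3_miss counts the number of demand (i.e., non-prefetch) data read accesses to DRAM. offcore_response.demand_data_rd.l3_miss.local_dram, as its name suggests, counts the number of demand data reads targeted at DRAM. Therefore, these two events seem to be equivalent (or at least almost the same). But according to the following benchmarks, the former is much less frequent than the latter: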
1) Initializing a 1000-element global array in a loop in C:
Performance counter stats for '/home/ahmad/Simple Progs/loop':
1,363 mem_load_uops_retired.l3_miss
1,543 offcore_response.demand_data_rd.l3_miss.local_dram
0.000749574 seconds time elapsed
0.000778000 seconds user
0.000000000 seconds sys
2) Opening a PDF document in Evince:
Performance counter stats for '/opt/evince-3.28.4/bin/evince':
936,152 mem_load_uops_retired.l3_miss
1,853,998 offcore_response.demand_data_rd.l3_miss.local_dram
4.346408203 seconds time elapsed
1.644826000 seconds user
0.103411000 seconds sys
3) Running Wireshark for 5 seconds:
Performance counter stats for 'wireshark':
5,161,671 mem_load_uops_retired.l3_miss
8,126,526 offcore_response.demand_data_rd.l3_miss.local_dram
15.713828395 seconds time elapsed
0.904280000 seconds user
0.693906000 seconds sys
4) Running the blur filter on an image in Inkscape:
Performance counter stats for 'inkscape':
13,852,121 mem_load_uops_retired.l3_miss
23,475,970 offcore_response.demand_data_rd.l3_miss.local_dram
25.355643897 seconds time elapsed
7.244404000 seconds user
1.019895000 seconds sys
In all four benchmarks, offcore_response.demand_data_rd.l3_miss.local_dram is nearly twice as frequent as mem_load_uops_retired.l3_miss. Is this reasonable? Why? Please tell me if the benchmarks are too complex and coarse-grained!
As far as I (currently) know, the following table shows the differences between these two events on Haswell:
| | mem_load_uops_retired.l3_miss | offcore_response.demand_data_rd.l3_miss.local_dram |
|---|---|---|
| Cacheable Retired Load Uops | Per uop per line | Y |
| Cacheable Non-Retired Load Uops | N | Y |
| Uncacheable WC Retired Load Uops | One event per line | N |
| Uncacheable UC Retired Load Uops | May occur | N |
| Uncacheable WC or UC Non-Retired Load Uops | N | N |
| Locked Loads of any type to any memory type | May occur | I don't know |
| Legacy IO requests | May occur | N |
| L1D Prefetches | N | Y |
| L2 Prefetches into L2 or L3 | N | N |
| Software prefetches with no intention for write | N | Y |
| Page Walk Loads | N | Y |
| Servicing Unit | Any | Local DRAM |
| Reliability | May not be reliable | Reliable |
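It should be clear to you now that these events are not equivalent in general. Comparing the counts of the two events to infer something meaningful is also not an easy task.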
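In all the examples you provided, the offcore_response.demand_data_rd.l3_miss.local_dram event count is larger than the mem_load_uops_retired.l3_miss event count. However, it is not hard to come up with realistic examples where the latter is larger than the former.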
> In all four benchmarks, offcore_response.demand_data_rd.l3_miss.local_dram is nearly twice as frequent as mem_load_uops_retired.l3_miss. Is this reasonable?
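I think the description "nearly twice" really only applies to the second example, not the others. Without seeing the exact code and information about the execution environment, I can't comment on the numbers you have shown.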
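To put numbers on the "nearly twice" point, here is a small Python sketch that divides the two counts from each perf stat output in the question. The counts are copied from the question; the dictionary keys and the helper name `offcore_to_retired_ratio` are labels chosen here for illustration only.

```python
# Counts copied from the four perf stat outputs quoted in the question:
# (mem_load_uops_retired.l3_miss,
#  offcore_response.demand_data_rd.l3_miss.local_dram)
counts = {
    "loop": (1_363, 1_543),
    "evince": (936_152, 1_853_998),
    "wireshark": (5_161_671, 8_126_526),
    "inkscape": (13_852_121, 23_475_970),
}


def offcore_to_retired_ratio(retired, offcore):
    """Hypothetical helper: how many offcore events per retired-load event."""
    return offcore / retired


ratios = {
    name: offcore_to_retired_ratio(retired, offcore)
    for name, (retired, offcore) in counts.items()
}

for name, ratio in ratios.items():
    print(f"{name:10s} {ratio:.2f}")
```

The ratios come out at roughly 1.13, 1.98, 1.57, and 1.69, so only the Evince run is close to a factor of two.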