Can we measure successful store-forwarding with Intel's performance counters?
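Is it possible to measure the number of successful store-forwarding operations using the performance counters on recent Intel x86 chips?
I see events such as ld_blocks.store_forward that measure failed store-forwarding, but it isn't clear to me whether the successful case can be measured.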
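I don't see anything beyond what you found for SKL, but earlier uarches may have more detail:
For Core2 (which Intel confusingly calls the Core microarchitecture), the optimization manual documents the following (in B.7, Event Ratios for Intel Core Microarchitecture):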
B.7.5.2 4K Aliasing and Store Forwarding Block Detection
- Loads Blocked by Overlapping Store Rate:
LOAD_BLOCK.OVERLAP_STORE/CPU_CLK_UNHALTED.CORE
4K aliasing and store forwarding block are two different scenarios in which loads are blocked by preceding stores due to different reasons. Both scenarios are detected by the same event: LOAD_BLOCK.OVERLAP_STORE. A high value for “Loads Blocked by Overlapping Store Rate” indicates that either 4K aliasing or store forwarding block may affect performance.
This may count both stalled and successful store-forwarding. (It also counts 4K aliasing, so you'd need to avoid that or subtract it out.)
B.7.5.3 Load Block by Preceding Stores
- Loads Blocked by Unknown Store Address Rate:
LOAD_BLOCK.STA / CPU_CLK_UNHALTED.CORE
A high value for “Loads Blocked by Unknown Store Address Rate” indicates that loads are frequently blocked by preceding stores with unknown address and implies performance penalty.
- Loads Blocked by Unknown Store Data Rate:
LOAD_BLOCK.STD / CPU_CLK_UNHALTED.CORE
A high value for “Loads Blocked by Unknown Store Data Rate” indicates that loads are frequently blocked by preceding stores with unknown data and implies performance penalty.
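These last two counters appear to count successful store-forwarding, but only in cases where the load actually had to wait after the (possible) overlap was detected.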
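To illustrate the event ratios quoted above: each rate is just the raw event count divided by CPU_CLK_UNHALTED.CORE over the same interval. The sketch below assumes you've already collected the counts somehow. The function name and all the numbers are made up to show the arithmetic; they are not measurements.

```python
def event_rate(event_count: int, core_cycles: int) -> float:
    """Return event_count / core_cycles, the form used by the B.7 event ratios."""
    if core_cycles <= 0:
        raise ValueError("core_cycles must be positive")
    return event_count / core_cycles


# Hypothetical counts, chosen only to show the arithmetic.
counts = {
    "LOAD_BLOCK.OVERLAP_STORE": 1_200,
    "LOAD_BLOCK.STA": 300,
    "LOAD_BLOCK.STD": 4_500,
}
core_cycles = 1_000_000  # hypothetical CPU_CLK_UNHALTED.CORE value

rates = {name: event_rate(n, core_cycles) for name, n in counts.items()}
for name, rate in rates.items():
    print(f"{name} / CPU_CLK_UNHALTED.CORE = {rate:.6f}")
```

Keep in mind that LOAD_BLOCK.OVERLAP_STORE also includes 4K aliasing, so its rate is an upper bound on store-forwarding blocks unless aliasing is avoided in the test.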
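There are no documented events that count the number of successful store-forwarding operations. However, I have experimentally determined a set of undocumented events on Haswell and Broadwell for that purpose. In particular, any event with event code 0x2 and an odd umask (any odd number, such as 1) seems to count successful store forwards very accurately: the counts are as expected and the standard deviation is practically zero. I think you can use the same events on later (or even earlier) microarchitectures. Again, none of these events are documented.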
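If you want to try this on Linux, one option is perf's raw-event syntax, rUUEE (the umask byte followed by the event-select byte, in hex). Here is a minimal sketch that builds those specifiers for event 0x02 with a few odd umasks. The helper names and ./your_benchmark are placeholders I made up, and because the events are undocumented, you would have to verify what they actually count on your CPU.

```python
def raw_event(event: int, umask: int) -> str:
    """Encode an Intel event-select/umask pair as a perf raw event, rUUEE."""
    if not (0 <= event <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event and umask must each fit in one byte")
    return f"r{umask:02x}{event:02x}"


def perf_stat_command(events, program):
    """Build a `perf stat` argument list that counts the given raw events."""
    return ["perf", "stat", "-e", ",".join(events), "--", program]


# Event 0x02 with a few odd umasks, as described above (undocumented events).
specs = [raw_event(0x02, umask) for umask in (0x01, 0x03, 0x05)]
command = perf_stat_command(specs, "./your_benchmark")  # placeholder program
print(" ".join(command))
```

A test loop with a known number of store/reload pairs gives you an expected count to compare against, which is how you can check that a given umask behaves as described.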