"throughput" 是英特尔按线程还是按内核列出的？

Is the "throughput" listed by Intel per thread or per core?

Is the throughput listed in the Intel intrinsics guide per thread or per core?

Per physical core.

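SMT (hyperthreading) only helps overall throughput when you bottleneck on something other than back-end execution ports. For example, if a thread sometimes stalls on cache misses or branch mispredicts, SMT can come closer to keeping the execution units fed with a new uop to start every clock cycle, and so closer to the listed throughput limit. Having two instruction streams available to the out-of-order scheduler avoids starvation (stalls) even when the thread on one logical core is stuck waiting for something.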

Note that you can get more detailed information on instruction timings from https://uops.info/, and about what the numbers mean from https://agner.org/ and/or Intel's optimization manual.

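The "throughput" for a single instruction doesn't tell you whether it competes with other instructions. For example, on Intel CPUs like Haswell and Skylake, an FMA with 0.5c throughput runs on different ports (p0 and p1) than a shuffle with 1c throughput (p5). (The same holds on Ice Lake, if we're talking about a shuffle that can't also run on the secondary shuffle unit.) That's why looking at back-end uops is more useful: how many uops, and for which ports.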
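To make the port argument concrete, here is a minimal sketch of a back-end port-pressure estimate. It is a simplified model, not how any real tool computes its numbers: the function names `port_bound` and `naive_sum` are made up for this illustration, every instruction is assumed to be a single uop, and the model ignores front-end width, latency chains, and imperfect scheduling. The port assignments are the ones from the paragraph above (FMA on p0/p1, shuffle on p5).

```python
from fractions import Fraction
from itertools import combinations

def port_bound(uops):
    """Lower bound on cycles per iteration from execution-port pressure alone.

    uops is a list of (count, allowed_ports). For every subset of ports, the
    uops that can run only on ports in that subset need at least
    count / len(subset) cycles; the bound is the worst such subset.
    """
    ports = sorted({p for _, allowed in uops for p in allowed})
    best = Fraction(0)
    for k in range(1, len(ports) + 1):
        for subset in combinations(ports, k):
            s = set(subset)
            confined = sum(n for n, allowed in uops if set(allowed) <= s)
            best = max(best, Fraction(confined, k))
    return best

def naive_sum(uops):
    """What you get by adding up per-instruction reciprocal throughputs."""
    return sum(Fraction(n, len(allowed)) for n, allowed in uops)

# Hypothetical loop body: 2 FMAs (p0 or p1) and 1 shuffle (p5 only).
loop = [(2, ("p0", "p1")), (1, ("p5",))]

print(port_bound(loop))  # 1: the FMAs and the shuffle do not compete
print(naive_sum(loop))   # 2: adding listed throughputs overestimates the cost
```

With two shuffles instead of one, `port_bound` returns 2 because p5 becomes the bottleneck, which is the kind of conclusion the per-instruction throughput numbers alone don't give you.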

See also:

latency vs throughput in intel intrinsics
How many CPU cycles are needed for each assembly instruction?

"throughput" 是英特尔按线程还是按内核列出的？

Is the "throughput" listed by Intel per thread or per core?

x86

assembly

sse

simd

intrinsics