确定代码块需要多少个时钟周期

Determining how many clock cycles a codeblock need

是否有工具或方法可以告诉我一个代码块使用了多少时钟周期? 手动调试和计数对于较大的代码块来说是一种痛苦。

在 x86 上,Intel's IACA (Intel Architecture Code Analyzer 是我所知道的唯一静态分析器。它假设缓存未命中为零,并进行了各种其他简化,但有点用处。

我认为它还假设除最后一个分支外的所有分支都未被采用,因此它可能对具有已采用分支的循环体没有用。

IACA 的数据也有一些错误,例如它认为 shld 在 Sandybridge 上运行缓慢。它确实知道一些不明显的事情,比如 SnB-family CPUs can't micro-fuse 2-register addressing modes.

自 Haswell 更新以来,它基本上已被放弃。与 Haswell 相比,Skylake 可以 运行 在更多执行端口上执行一些指令(请参阅 Agner Fog's instruction tables), but the pipeline is similar enough that the results should be fairly useful. See also other links at the 标记 wiki,包括英特尔的优化手册,以帮助您理解输出。


我喜欢使用这个 iaca.sh 包装器脚本来使 -64 成为默认值(我可以用 -32 覆盖它)。我忘记了我写了多少(可能只是最后的 if (($# >= 1)) 位)以及 LD_LIBRARY_PATH 部分来自哪里。

iaca.sh:

#!/bin/bash
myname=$(realpath "[=10=]")
mypath=$(dirname "$myname")
ld_lib="$LD_LIBRARY_PATH"
app_loc="../lib"

if [ "$LD_LIBRARY_PATH" = "" ]
then
export LD_LIBRARY_PATH="$mypath/$app_loc"
else
export LD_LIBRARY_PATH="$mypath/$app_loc:$LD_LIBRARY_PATH"
fi

if (($# >= 1));then
    exec "$mypath/iaca" -64 "$@"
else
    exec "$mypath/iaca"  # there is no -help, just run with no args for help output
fi

示例:就地前缀和,来自 SIMD prefix sum on Intel cpu:

#include <immintrin.h>

#ifdef IACA_MARKS_OFF
  #define IACA_START
  #define IACA_END
#else
  #include <iacaMarks.h>
#endif

// In-place rewrite an array of values into an array of prefix sums.
// This makes the code simpler, and minimizes cache effects.
int prefix_sum_sse(int data[], int n)
{

//    const int elemsz = sizeof(data[0]);
#define elemsz sizeof(data[0])   // clang-3.5 doesn't allow const int foo = ... as an imm8 arg to intrinsics

    __m128i *datavec = (__m128i*)data;
    const int vec_elems = sizeof(*datavec)/elemsz;
    // to use this for int8/16_t, you still need to change the add_epi32, and the shuffle

    const __m128i *endp = (__m128i*) (data + n - 2*vec_elems);  // pointer to last full vector we can load
    __m128i carry = _mm_setzero_si128();
    for(; datavec <= endp ; datavec += 2) {
        IACA_START
        __m128i x0 = _mm_load_si128(datavec + 0);
        __m128i x1 = _mm_load_si128(datavec + 1); // unroll / pipeline by 1
//      __m128i x2 = _mm_load_si128(datavec + 2);
//      __m128i x3;

        x0 = _mm_add_epi32(x0, _mm_slli_si128(x0, elemsz));
        x1 = _mm_add_epi32(x1, _mm_slli_si128(x1, elemsz));

        x0 = _mm_add_epi32(x0, _mm_slli_si128(x0, 2*elemsz));
        x1 = _mm_add_epi32(x1, _mm_slli_si128(x1, 2*elemsz));

        // more shifting if vec_elems is larger

        x0 = _mm_add_epi32(x0, carry);  // this has to go after the byte-shifts, to avoid double-counting the carry.
        _mm_store_si128(datavec +0, x0); // store first to allow destructive shuffle (e.g. non-avx shufps for FP or pshufb for narrow integers)

        x1 = _mm_add_epi32(_mm_shuffle_epi32(x0, _MM_SHUFFLE(3,3,3,3)), x1);
        _mm_store_si128(datavec +1, x1);

        carry = _mm_shuffle_epi32(x1, _MM_SHUFFLE(3,3,3,3)); // broadcast the high element for next vector
    }
    // FIXME: scalar loop to handle the last few elements
    IACA_END
    return data[n-1];
    #undef elemsz
}

$ gcc -I/opt/iaca-2.1/include -Wall -O3 -c prefix-sum.c -march=nehalem -mtune=haswell
$ iaca.sh prefix-sum.o
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - prefix-sum.o
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 6.40 Cycles       Throughput Bottleneck: Port5

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.0    0.0  | 5.7  | 1.4    1.0  | 1.4    1.0  | 2.0  | 6.3  | 1.0  | 1.3  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | movdqa xmm3, xmmword ptr [rax]
|   1    | 1.0       |     |           |           |     |     |     |     |    | add rax, 0x20
|   1    |           |     |           | 1.0   1.0 |     |     |     |     |    | movdqa xmm0, xmmword ptr [rax-0x10]
|   0*   |           |     |           |           |     |     |     |     |    | movdqa xmm1, xmm3
|   1    |           |     |           |           |     | 1.0 |     |     | CP | pslldq xmm1, 0x4
|   1    |           | 1.0 |           |           |     |     |     |     |    | paddd xmm1, xmm3
|   0*   |           |     |           |           |     |     |     |     |    | movdqa xmm3, xmm0
|   1    |           |     |           |           |     | 1.0 |     |     | CP | pslldq xmm3, 0x4
|   0*   |           |     |           |           |     |     |     |     |    | movdqa xmm4, xmm1
|   1    |           | 1.0 |           |           |     |     |     |     |    | paddd xmm3, xmm0
|   1    |           |     |           |           |     | 1.0 |     |     | CP | pslldq xmm4, 0x8
|   0*   |           |     |           |           |     |     |     |     |    | movdqa xmm0, xmm3
|   1    |           | 1.0 |           |           |     |     |     |     |    | paddd xmm1, xmm4
|   1    |           |     |           |           |     | 1.0 |     |     | CP | pslldq xmm0, 0x8
|   1    |           | 1.0 |           |           |     |     |     |     |    | paddd xmm1, xmm2
|   1    |           | 0.8 |           |           |     | 0.2 |     |     | CP | paddd xmm0, xmm3
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | movaps xmmword ptr [rax-0x20], xmm1
|   1    |           |     |           |           |     | 1.0 |     |     | CP | pshufd xmm1, xmm1, 0xff
|   1    |           | 0.9 |           |           |     | 0.1 |     |     | CP | paddd xmm0, xmm1
|   2^   |           |     | 0.3       | 0.3       | 1.0 |     |     | 0.3 |    | movaps xmmword ptr [rax-0x10], xmm0
|   1    |           |     |           |           |     | 1.0 |     |     | CP | pshufd xmm1, xmm0, 0xff
|   0*   |           |     |           |           |     |     |     |     |    | movdqa xmm2, xmm1
|   1    |           |     |           |           |     |     | 1.0 |     |    | cmp rdx, rax
|   0F   |           |     |           |           |     |     |     |     |    | jnb 0xffffffffffffff94
Total Num Of Uops: 20

请注意,uop 总数是 而不是 对前端、ROB 和 4 宽 issue/retire 宽度很重要的融合域 uops。它计算未融合域微指令,这对执行单元(和调度程序)很重要。但这有点愚蠢,因为在未融合的领域中,uop 需要哪个端口最重要,而不是有多少。

这不是最好的例子,因为它在 Haswell 的 shuffle 端口上很容易成为瓶颈。不过,它确实展示了 IACA 如何显示移动消除、微融合存储和宏融合比较和分支。

当有选择时,微指令在端口之间的分配是相当随意的。不要指望它能与真正的硬件相匹配。我认为 IACA 根本没有模仿 ROB/scheduler,真的。此限制和其他限制已在之前的 SO 问题中讨论过。尝试在 IACA 上搜索,因为它是一个相当独特的字符串。