计算 [1..N] 中前导 1 下方有 K 个零位的整数？（没有 HW POPCNT 的连续范围的 popcount）

Question

我有以下任务：计算 1 和 N 之间有多少个数字恰好有 K 个零非前导位。（例如 7₁₀=111₂ 将有 0 个，4 将有 2 个）

N和K满足条件0≤K，N≤1000000000

这个版本使用 POPCNT，在我的机器上速度足够快：

%include "io.inc"

section .bss
    n resd 1
    k resd 1
    ans resd 1
section .text
global CMAIN
CMAIN:
    GET_DEC 4,n
    GET_DEC 4,k
    mov ecx,1
    mov edx,0
    ;ecx is counter from 1 to n

loop_:
    mov eax, ecx
    popcnt eax,eax;in eax now amount of bits set
    mov edx, 32
    sub edx, eax;in edx now 32-bits set=bits not set
    
    mov eax, ecx;count leading bits
    bsr eax, eax;
    xor eax, 0x1f;
    sub edx, eax
    mov eax, edx
    ; all this lines something like (gcc):
    ; eax=32-__builtin_clz(x)-_mm_popcnt_u32(x);

    cmp eax,[k];is there k non-leading bits in ecx?
    jnz notk
    ;if so, then increment ans
    
    mov edx,[ans]
    add edx,1
    mov [ans],edx
notk:
    ;increment counter, compare to n and loop
    inc ecx
    cmp ecx,dword[n]
    jna loop_
    
    ;print ans
    PRINT_DEC 4,ans
    xor  eax, eax
    ret

在速度方面应该没问题（~0.8 秒），但它没有被接受，因为（我猜）在测试服务器上使用的 CPU 太旧了，所以它表明发生了运行时错误。

我尝试使用 64K * 4 字节查找的预计数技巧 table，但速度不够快：

%include "io.inc"
section .bss
    n resd 1
    k resd 1
    ans resd 1
    wordbits resd 65536; bits set in numbers from 0 to 65536
section .text
global CMAIN
CMAIN:
    mov ebp, esp; for correct debugging
    mov ecx,0
    ;mov eax, ecx
    ;fill in wordbits, ecx is wordbits array index
precount_:
    mov eax,ecx
    xor ebx,ebx
    ;c is ebx, v is eax
    ;for (c = 0; v; c++){
    ;    v &= v - 1; // clear the least significant bit set
    ;}
lloop_:
    mov edx,eax
    dec edx
    and eax,edx
    inc ebx
    test eax,eax
    jnz lloop_
    
    ;computed bits set
    mov dword[wordbits+4*ecx],ebx
    
    inc ecx
    cmp ecx,65536
    jna precount_
    
    ;0'th element should be 0
    mov dword[wordbits],0
    
    GET_DEC 4,edi;n
    GET_DEC 4,esi;k
    
    mov ecx,1
    xor edx,edx
    xor ebp,ebp
    
loop_:
    mov eax, ecx
    ;popcnt eax,eax
    mov edx,ecx
    and eax,0xFFFF 
    shr edx,16
    mov eax,dword[wordbits+4*eax]
    add eax,dword[wordbits+4*edx]
    ;previous lines are to implement absent instruction popcnt.
    ; they simply do eax=wordbits[x & 0xFFFF] + wordbits[x >> 16]
    mov edx, 32
    sub edx, eax
    ;and the same as before: 
    ;non-leading zero bits=32-bits set-__builtin_clz(x)
    mov eax, ecx
    bsr eax, eax
    xor eax, 0x1f
    sub edx, eax
    mov eax, edx

    ;compare to k again to see if this number has exactly k 
    ;non-leading zero bits

    cmp edx, esi
    jnz notk

    ;increment ebp (answer) if so
    mov edx, ebp
    add edx, 1
    mov ebp, edx
    ;and (or) go to then next iteration 
notk:
    inc ecx
    cmp ecx, edi
    jna loop_
    
    ;print answer what is in ebp
    PRINT_DEC 4, ebp
    xor  eax, eax
    ret

（>1 秒）

我应该加快第二个程序的速度（如果是这样，那怎么办？）或者以某种方式用其他（哪个？）指令替换 POPCNT（我想 SSE2 和更早版本应该可用）？

Answer 1

首先，服务器太旧而无法 popcnt 在其他方面会明显变慢，并且有不同的瓶颈。鉴于它有 pshufb 但没有 popcnt，它首先是 Core 2 或 second-gen（Conroe 或 Penryn）。请参阅 Agner Fog 的微架构 PDF（在 https://agner.org/optimize/ 上）。还降低了时钟速度，因此您可以在 CPU 上做的最好的事情可能不足以让 brute-force 工作。

可能存在可以节省大量时间的算法改进，例如注意每 4 个增量通过 00、01、10、11 模式循环低 2 位：2 个零每四个增量出现一次，1 个零发生两次，没有零发生一次。对于每个 >= 4 的数字，这 2 位都低于前导位，因此是计数的一部分。对于 1 和 log2(N) 之间的每个 MSB-position，将其概括为组合公式可能是一种大大减少工作量的方法。处理 2^M 和 N 之间的数字不太明显。

这里的版本：

清理 popcnt 版本，在 i7-6700k @ 3.9GHz 上为 536ms，没有跨迭代的算法优化。对于 k=8，N=1000000000
原始 LUT 版本（每次迭代 2 次加载，无 inter-iteration 优化）：好的运行约 595 毫秒，k=8 时约 610 毫秒，N=1000000000。 Core2Duo (Conroe) @ 2.4GHz：1.69 秒。（在编辑历史中有几个更糟糕的版本，第一个 partial-register 在 Core 2 上停滞。）
（未完成，未编写清理代码）优化的 LUT 版本（展开，并且 high-half/MSB BSR 工作已提升，仅留下 1 次查找（cmp/jcc）迭代），Skylake 上为 210 毫秒，Core 2 @ 2.4GHz 上为 0.58 秒。时间要现实；我们所有的工作，只是错过了最后 2^16 次迭代，其中 MSB 在低 16 位。在外循环中处理任何必要的极端情况，并进行清理，对速度的影响不应超过 1%。
（甚至更未完成）：使用 pcmpeqb / psubb 向量化优化的 LUT 版本（在外循环中使用 psadbw，如 How to count character occurrences using SIMD 显示 - 内部循环减少到计算 fixed-size 数组中与外部循环中计算的值相匹配的字节元素。 就像标量版本一样）。在 Skylake 上为 18 毫秒，在 Core 2 上为 ~0.036 秒。这些时间现在可能包括相当大的启动开销。但是作为 expected/hoped，两者都快了大约 16 倍。
直方图 wordbits table 一次（可能在您生成它时）。无需搜索 64kiB 来查找匹配字节，只需查找每个 outer-loop 迭代的答案！对于大 N，这应该让你快数千倍。（尽管当 N 不是 64K 的倍数时，你仍然需要处理低 1..64K 和部分运行ge。）

为了有效地测量更快的版本，您可以在整个过程中重复循环，这样整个过程仍然需要一些可测量的时间，比如半秒。（因为它是 asm，没有编译器会优化重复执行相同的 N,k 的工作。）或者你可以在程序内部进行计时，如果你知道 TSC 频率，则使用 rdtsc。但是能够在整个过程中轻松使用 perf stat 很好，所以我会继续这样做（取出 printf 并制作一个静态 executable 以进一步减少启动开销）。

您似乎在询问 micro-optimizing brute-force 方法仍然分别检查每个数字。（尽管可以对 32 - clz - popcnt == k 的实现方式进行重大优化。）

还有其他通常更快的 popcnt 方法，例如类似于 How to count the number of set bits in a 32-bit integer? 中的 bithacks。但是当你有一个 lot 的 popcounting 要做在一个紧密的循环中（足以在缓存中保持查找 table 热）时，LUT 可能很好。

如果您有快速的 SSSE3 pshufb，可能值得使用它在 XMM 寄存器（auto-vectorizing 循环）中并行地为四个双字执行 SIMD popcount，甚至更好带有 AVX2 的 YMM 寄存器。（First-gen Core2 有 pshufb 但在第二代 Core2 之前它不是单一 uop。仍然可能值得。）

或者更好的是，对于给定的 high-half 个数字，使用 SIMD 计算与我们正在寻找的内容匹配的 LUT 元素。

强力检查数字的连续运行ges 开启了 LUT 策略的主要优化：数字的高 n 位每 2^n 增量仅更改一次。 所以你可以从 inner-most 循环中提升这些位的计数。这也值得使用较小的 table（适合 L1d 缓存）。

说起来，你的64k * 4 table是256KiB，你的二级缓存的大小。这意味着它可能必须在每次循环时从 L3 进入。您的桌面 CPU 应该有足够的 L3 带宽（并且访问模式由于增量而连续），现代服务器具有更大的 L2，但是几乎没有理由不使用字节 LUT（popcnt（-1 ) 只有 32)。，并且 movzx 字节加载与 mov 双字加载一样便宜

; General LUT lookup with two 16-bit halves
    movzx  edx, cx            ; low 16 bits
    mov    eax, ecx
    shr    eax, 16            ; high 16 bits
    movzx  edx, byte [wordbits + edx]
    add     dl,      [wordbits + eax]
      ; no partial-reg stall for reading EDX after this, on Intel Sandybridge and later
      ; on Core 2, set up so you can cmp al,dl later to avoid it

在 Intel CPU 太旧以至于不支持 popcnt 上，这将导致 partial-register 停止。用 cmp al, dl 代替下一个比较。（在 bsr 结果上使用 lea 或 add 或 sub，而不是 popcount LUT加载，这样你就可以避免 partial-register 停顿。）

通常您希望使用较小的 LUT，例如每步可能 11 位，因此 3 步处理整个 32 位数字（2^11 = 2048 字节，32k L1d 的一小部分）。但是使用这种顺序访问模式，硬件预取可以处理它并完全隐藏延迟，尤其是当 L1d 预取将命中 L2 时。同样，这很好，因为此循环不涉及此查找 table 以外的内存。在正常情况下，查找 table 会更糟糕，因为每个 popcount 之间会发生大量其他工作，或者您在缓存中有任何其他您不想驱逐的有价值数据。

针对 Skylake (i7-6700k) 进行了优化：即使每次迭代访问 2 个 LUT：3.9GHz 时为 0.600 秒

对比0.536 秒与 popcnt。 提升 high-half LUT 查找（可能还有 32 常量）甚至可能让该版本更快。

注意：一个CPU太老以至于没有popcnt会和Skylake明显不同。为Skylake优化这个有点愚蠢的，除非你更进一步并最终击败 Skylake 上的 popcnt 版本，如果我们可以通过嵌套循环提升 BSR 工作，并且内部循环对整个运行ge 来自 2^m .. 2^(m+1)-1 的数字（固定为 64k 运行ge，因此您还可以提升高半 popcnt LUT 查找）。 popcnt_low(i) == 根据 k、popcnt_high(i) 和 clz(i).

计算的一些常量

3 大事情对 Skylake 来说非常重要（其中一些与旧的 CPU 相关，包括避免因 front-end 原因采取 b运行ches）：

避免 cmp/jcc 在 Intel Skylake-derived CPU 上使用 up-to-date 微码 CPU 触及 32 字节边界，因为英特尔通过禁用此类行的 uop 缓存来减轻 JCC 错误：32-byte aligned routine does not fit the uops cache

这查看反汇编并决定是否使指令更长（例如使用 lea eax, [dword -1 + edx] 强制 4 字节位移而不是较小的 disp8。）以及是否使用 align 32 在循环的顶部。
No-increment比increment更常见，Intel CPUs只能运行采取b运行ches在 1/clock。（但由于 Haswell 在另一个端口上有第二个执行单元，可以运行 predicted-not-taken b运行ches。）将 jne notk 更改为 je yesk 到下面的块跳回的功能。 dec ecx / jnz .loop 的 Tail-duplication / else 下降到 jmp print_and_exit 帮助了 微小的 数量而不是跳回到之后je yesk.

它很少被采用（并且具有足够一致的模式），所以它不会经常错误预测，所以 setnz al / add ebx, eax 可能会更糟。
优化 32 - clz - popcnt == k 检查，利用 bsr 给你的事实 31-clz。所以31-clz - (popcnt-1) = 32-clz-popcnt.
由于我们比较的是 == k，因此可以进一步运行到 popcnt-1 + k == 31-clz。
当我们使用 LUT 进行 popcount 时，而不是必须在端口 1 上运行的 popcnt 指令，我们可以使用像 lea edx, [edx + esi - 1] 这样的 3 分量（慢速）LEA做 popcnt-1+k。由于它有 3 个组件（2 个寄存器和一个位移，在寻址模式下有 2 个 + 符号），它只能在端口 1 上运行（具有 3 个周期延迟），与 bsr 竞争（和 popcnt，如果我们正在使用它）。

利用 lea 一般情况下保存的指令，即使在 popcnt 版本中也是如此。使用 macro-fused 1-uop dec/jnz 而不是 inc + cmp/jne，将循环向下计数到 0 也是如此。（我还没有尝试计算 L1d HW 预取是否在那个方向上工作得更好；popcnt 版本不关心但 LUT 版本可能。）

无需 io.inc 即可移植，只需使用 hard-coded N 和 k 以及 printf 进行输出。这是不是“干净”的代码，例如像 %define wordbits edi 这样令人讨厌的 hack，我通过使用索引寻址模式而不是 [reg + disp32] 来测试改变 b运行ches 的对齐方式来访问数组。这恰好起到了作用，让几乎所有的微指令都来自 DSB（微指令缓存）而不是 MITE，即避免了 JCC 勘误减速。另一种方法是使指令更长，将 cmp/je 和 dec/jnz 推过 32 字节边界。（或者更改循环开始的对齐方式。）Uop-cache 提取发生在 up-to-6 微指令行中，如果您最终得到只有几个微指令的行，则可能成为瓶颈。（Skylake 的 loop-buffer aka LSD 也被微代码禁用，以修复早期的错误；Intel 在 Skylake 上的大错误比大多数设计都多。）

%use SMARTALIGN
alignmode p6, 64

section .bss
 wordbits: resb 65536
;    n resd 1
;    k resd 1
    ans resd 1
section .rodata
  n: dd 1000000000
  k: dd 8
  print_fmt: db `ans: %d\n`, 0

section .text

global main
main:            ; no popcnt version
    push  ebp
    push  edi    ; save some call-preserved registers
    push  esi
    push  ebx

    mov   edi, wordbits
%define wordbits edi             ; dirty hack, use indexed addressing modes instead of reg+disp32.
                                 ; Avoids Skylake JCC erratum problems, and is is slightly better on Core2 with good instruction scheduling
    ;fill in wordbits, ecx is wordbits array index
    mov   ecx, 1     ; leave wordbits[0] = 0
.init_loop:
    mov   eax,ecx
    xor   ebx,ebx
  .popc_loop:
      lea   edx, [eax-1]
      inc   ebx
      and   eax,edx         ; v &= v - 1; // blsr
      jnz  .popc_loop

    ;computed bits set
    mov [wordbits + ecx], bl

    inc ecx
    cmp ecx,65536
    jb .init_loop       ; bugfix: array out of bounds with jna: stores to wordbits[65536]


;    GET_DEC 4,n
;    GET_DEC 4,k
    mov   ecx, [n]      ; ecx counts from n down to 1
;    mov   esi, [k]
    xor   ebx, ebx      ; ebx = ans

    mov   esi, 1
    sub   esi, [k]      ; 1-k
align 32
.loop:
    ;popcnt eax, ecx
    movzx  eax, cx
    mov    ebp, ecx         ; using an extra register (EBP) to schedule instructions better(?) for Core2 decode
    movzx  edx, byte [wordbits + eax]
    shr    ebp, 16
;    xor eax, eax        ; break false dependency, or just let OoO exec hide it after breaking once per iter
    bsr    eax, ecx         ; eax = 31-lzcnt for non-zero ecx
;    sub    edx, esi         ; sub now avoids partial-reg stuff.  Could have just used EBX to allow BL.
    add eax, esi           ; Add to BSR result seems slightly better on Core2 than sub from popcnt
    add     dl,      [wordbits + ebp]   ; we don't read EDX, no partial-register stall even on P6-family

                        ;; want: k == 32-__builtin_clz(x)-_mm_popcnt_u32(x)
    cmp  al, dl         ; 31-clz+(1-k)  == popcount.  or  31-clz == popcnt - (1-k)
    je .yesk          ; not-taken is the more common fast path
 .done_inc:
    dec ecx
    jnz .loop         ; }while(--n >= 0U)

.print_and_exit:
    ;print ans
;    PRINT_DEC 4,ans
    push  ebx
    push  print_fmt
extern printf
    call  printf
    add   esp, 8

    pop  ebx
    pop  esi
    pop  edi
    pop  ebp
    xor  eax, eax
    ret

align 8
.yesk:
   inc  ebx
;   jmp  .done_inc           ; tail duplication is a *tiny* bit faster
   dec  ecx
   jnz  .loop
   jmp  .print_and_exit

这是第 3 版，已更新以避免 partial-register 对 Core 2 (Conroe) 的惩罚。磨合~~1.78s~~ 1.69s 与 3.18s。有时在 Skylake 上仍然一样快，但更常见的是 610 毫秒而不是 594 毫秒。我的 Core 2 上没有性能计数器访问权限； perf 太旧了，无法完全支持，而且我没有针对最后启动的内核的 perf。

（Godbolt 版本 1 的反汇编和性能结果：https://godbolt.org/z/ox7e8G）

在我的 Linux 台式机上，i7-6700k，3.9GHz。（EPP = balance_performance，不是完整的性能，所以它显然不想加速到 4.2GHz。）我不需要 sudo 来使用 perf 因为我设置了 /proc/sys/kernel/perf_event_paranoid = 0 . 我使用 taskset -c 3 只是为了避免 CPU 迁移 single-threaded 工作负载。

# Results from version 1, not the Core2-friendly version.
# Version 3 sometimes runs this fast, but more often ~610ms
# Event counts are near identical for both, except cycles, but uops_issue and executed are mysteriously lower, like 9,090,858,203 executed.
$ nasm -felf32 foo.asm -l/dev/stdout &&
    gcc -m32 -no-pie -fno-pie -fno-plt foo.o 
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread -r 2 ./a.out 
ans: 12509316
ans: 12509316

 Performance counter stats for './a.out' (2 runs):

            597.78 msec task-clock                #    0.999 CPUs utilized            ( +-  0.12% )
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                62      page-faults               #    0.103 K/sec                    ( +-  0.81% )
     2,328,038,096      cycles                    #    3.894 GHz                      ( +-  0.12% )
     2,000,637,322      branches                  # 3346.789 M/sec                    ( +-  0.00% )
         1,719,724      branch-misses             #    0.09% of all branches          ( +-  0.02% )
    11,015,217,584      instructions              #    4.73  insn per cycle           ( +-  0.00% )
     9,148,164,159      uops_issued.any           # 15303.609 M/sec                   ( +-  0.00% )
     9,102,818,982      uops_executed.thread      # 15227.753 M/sec                   ( +-  0.00% )

   (from a separate run):
      9,204,430,548      idq.dsb_uops              # 15513.249 M/sec                   ( +-  0.00% )
         1,008,922      idq.mite_uops             #    1.700 M/sec                    ( +- 20.51% )


          0.598156 +- 0.000760 seconds time elapsed  ( +-  0.13% )

这大约是 3.93 fused-domain (front-end) uops/clock。所以我们非常接近 4/clock front-end 宽度。

使用 popcnt:

您的原始文件（GET_DEC 被加载常量替换）运行在我的桌面上用了 1.3 秒，k=8 N=1000000000。这个版本用了大约 0.54 秒运行s。我的原始版本甚至不是 b运行ches 对齐的最坏情况（另一个版本大约是 1.6 秒），尽管因为我确实必须改变一些东西它可能与你的机器不同。

我主要使用与上面相同的优化来保存 uops 并帮助循环内的 front-end。（但我先做了这个，所以它缺少一些优化。）

align 32
.loop:
    mov    eax, ecx
    popcnt eax,eax
    lea    edx, [dword eax - 32 + 31]  ; popcnt - 32  =  -(bits not set)
                   ; dword displacement pads the cmp/jnz location to avoid the JCC erratum penalty on Intel

;    xor eax, eax         ; break false dependency, or just let OoO exec hide it after breaking once per iter
    bsr eax, ecx         ; eax = 31-lzcnt
;    xor eax, 0x1f        ; eax = lzcnt (for non-zero x)
    ; want:  32-__builtin_clz(x)-_mm_popcnt_u32(x)  = (31-clz) + 1-popcnt = (31-clz) - (popcnt-1)
    sub eax, edx

    cmp eax, esi  ;is there k non-leading bits in ecx?
%if 0
    jnz .notk
    inc ebx       ;if so, then increment ans
.notk:
%else
    jz .yesk      ; not-taken is the more common fast path
 .done_inc:
%endif
    dec ecx
    jnz .loop   ; }while(--n >= 0U)
    
    ;print ans
;    PRINT_DEC 4,ans
    push  ebx
    push  print_fmt
extern printf
    call  printf
    add   esp, 8

    pop  ebx
    pop  esi
    xor  eax, eax
    ret

.yesk:
   inc  ebx
   jmp  .done_inc         ;; TODO: tail duplication

（未完成）具有不变 clz(x) 和 high-half popcount

的内循环

这个版本运行在我的 2.4GHz Core 2 Duo E6600 (Conroe) 上仅需 0.58 秒，微架构与您的 Xeon 3050 2.13GHz 相同。
在我的 Skylake 上用了 210 毫秒。

它完成了 大部分 的工作，只缺少对 N < 65536（或较大 N 的低 65536，其中 MSB 位于低半部分）的清理，并且可能缺少处理外循环中的其他几个极端情况。但是内部循环完全支配了 run-time，而且它不必运行更多所以这些时间应该是现实的。

它仍然brute-force检查每一个数字，但是大部分per-number依赖于高半部分的工作是循环不变的并且被提升了。假设 non-zero 高半，但只有 2^16 个数字的 MSB 在低 16 位。缩小到低 12 位或 14 位意味着更少的清理，以及更小部分的 LUT 循环可以在 L1d 保持热度。

%use SMARTALIGN
alignmode p6, 64

section .bss
align 4096
 wordbits: resb 65536
;    n resd 1
;    k resd 1
;    ans resd 1
section .rodata
  ;n: dd 0x40000000        ; low half zero, maybe useful to test correctness for a version that doesn't handle that.
  n:  dd 1000000000 ; = 0x3b9aca00
  k: dd 8
  print_fmt: db `ans: %d\n`, 0

section .text
global main

align 16
main:
main_1lookup:
    push  ebp
    push  edi    ; save some call-preserved registers
    push  esi
    push  ebx

    mov   edi, wordbits
;%define wordbits edi             ; dirty hack, use indexed addressing modes instead of reg+disp32.
                                 ; actually slightly worse on Skylake: causes un-lamination of cmp bl, [reg+reg],
                                 ; although the front-end isn't much of a bottleneck anymore
                                 ; also seems pretty much neutral to use disp32+reg on Core 2, maybe reg-read stalls or just not a front-end bottleneck
    ;fill in wordbits, ecx is wordbits array index
    mov   ecx, 1     ; leave wordbits[0] = 0
.init_loop:
    mov   eax,ecx
    xor   ebx,ebx
  .popc_loop:
      lea   edx, [eax-1]
      inc   ebx
      and   eax,edx         ; v &= v - 1; // blsr
      jnz  .popc_loop

    ;computed bits set
    mov [wordbits + ecx], bl

    inc ecx
    cmp ecx,65536
    jb .init_loop


;    GET_DEC 4,n
;    GET_DEC 4,k
    mov   ecx, [n]      ; ecx counts from n down to 1
;    mov   esi, [k]
    xor   esi, esi      ; ans

    mov   ebp, 1
    sub   ebp, [k]      ; 1-k
align 32
.outer:
    mov    eax, ecx         ; using an extra register (EBP) to schedule instructions better(?) for Core2 decode
    shr    eax, 16
;    xor eax, eax        ; break false dependency, or just let OoO exec hide it after breaking once per iter
    bsr    ebx, ecx         ; eax = 31-lzcnt for non-zero ecx
         ;; want: k == 32-__builtin_clz(x)-_mm_popcnt_u32(x)
         ; 31-clz+(1-k)  == popcount.  or  31-clz == popcnt - (1-k)
         ; 31-cls+(1-k) - popcount(hi(x)) == popcount(lo(x))
    add    ebx, ebp
    sub     bl, byte [wordbits + eax]

    ;movzx  edx, cx
    lea    edx, [ecx - 4]   ; TODO: handle cx < 4 making this wrap
    movzx  edx, dx
    and    ecx, -65536      ; clear low 16 bits, which we're processing with the inner loop.
align 16
  .low16:
    cmp   bl, [wordbits + edx + 0]
    je    .yesk0
  .done_inc0:
    cmp   bl, [wordbits + edx + 1]
    je    .yesk1
  .done_inc1:
    cmp   bl, [wordbits + edx + 2]
    je    .yesk2
  .done_inc2:
    cmp   bl, [wordbits + edx + 3]
    je    .yesk3
  .done_inc3:

; TODO: vectorize with pcmpeqb / psubb / psadbw!!
; perhaps over fewer low bits to only use 16kiB of L1d cache
    
    sub  edx, 4
    jae  .low16        ; }while(lowhalf-=4 doesn't wrap)

   sub   ecx, 65536
   ja   .outer
; TODO: handle ECX < 65536 initially or after handling leading bits.  Probably with BSR in the inner loop


.print_and_exit:
    ;print ans
;    PRINT_DEC 4,ans
    push  esi
    push  print_fmt
extern printf
    call  printf
    add   esp, 8

    pop  ebx
    pop  esi
    pop  edi
    pop  ebp
    xor  eax, eax
    ret

align 16
%assign i 0
%rep 4
;align 4
.yesk%+i:
   inc  esi
   jmp  .done_inc%+i
%assign i  i+1
%endrep
  ; could use a similar %rep block for the inner loop

     ; attempt tail duplication?
     ; TODO: skip the next cmp/jcc when jumping back.
     ; Two in a row will never both be equal

;   dec  ecx
;   jnz  .loop
;   jmp  .print_and_exit

Skylake 性能结果：

(update after outer-loop over-count on first iter bugfix, ans: 12497876)

ans: 12498239        # This is too low by a bit vs. 12509316
                     # looks reasonable given skipping cleanup

            209.46 msec task-clock                #    0.992 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                62      page-faults               #    0.296 K/sec                  
       813,311,333      cycles                    #    3.883 GHz                    
     1,263,086,089      branches                  # 6030.123 M/sec                  
           824,103      branch-misses             #    0.07% of all branches        
     2,527,743,287      instructions              #    3.11  insn per cycle         
     1,300,567,770      uops_issued.any           # 6209.065 M/sec                  
     2,299,321,355      uops_executed.thread      # 10977.234 M/sec                 

(from another run)
        37,150,918      idq.dsb_uops              #  174.330 M/sec                  
     1,266,487,977      idq.mite_uops             # 5942.976 M/sec

       0.211235157 seconds time elapsed

       0.209838000 seconds user
       0.000000000 seconds sys

请注意 uops_issued.any 与 idq.DSB_uops + idq.MITE_uops 大致相同 - 如果我们使用 EDI 作为指针来保存 code-size、uops_issued.any 会高得多，因为从 micro + macro-fused cmp+jcc.

中分离了索引寻址模式

同样有趣的是 b运行ch 未命中率更低；也许展开有助于在 IT-TAGE 预测器 table.

中更好地分布历史

SSE2 SIMD

也未完成，没有处理极端情况或清理，但我认为做的工作量大约是正确的。

与 How to count character occurrences using SIMD 不同，我们匹配的数组对匹配发生的频率有已知的限制，所以它恰好（主要是？）不做嵌套循环是安全的，只是一个 2^ 14 (16384) 次迭代循环展开 2，然后将字节计数器扩展为双字。至少对于 k=8.

总计数为 12507677，略低于 12509316（正确的 N=1000000000，k=8），但我还没有检查这是否都是因为没有执行 1..16384，或者如果我在任何地方都失去了任何计数。

您可以展开外部循环迭代，以便为每个负载使用每个 XMM 向量两次或 4 次。（通过顺序访问 L1d 缓存中的数组，通过每次加载执行更多 ALU 工作可能会让我们稍微快一些，但不会快很多。）通过设置 2 或 4 个向量来匹配 2 或 4 个不同的高半部分，您可以在内循环中花费更长的时间。可能我们可以比每个时钟 1 compare/accumulate 快一点。但这可能运行成为 Core 2 上的（冷）寄存器读取瓶颈。

以下版本只是进行标准展开。

;;;;;  Just the loop from main_SSE2, same init stuff and print as main_1lookup
align 32
.outer:
    mov    eax, ecx         ; using an extra register (EBP) to schedule instructions better(?) for Core2 decode
    shr    eax, 16-2

;    xor eax, eax        ; break false dependency, or just let OoO exec hide it after breaking once per iter
    bsr    ebx, ecx         ; eax = 31-lzcnt for non-zero ecx
         ;; want: k == 32-__builtin_clz(x)-_mm_popcnt_u32(x)
         ; 31-clz+(1-k)  == popcount.  or  31-clz == popcnt - (1-k)
         ; 31-cls+(1-k) - popcount(hi(x)) == popcount(lo(x))
    add    ebx, ebp
    movzx  edx, al
;    movzx  edx, byte [wordbits + edx]
    sub    bl, byte [wordbits + edx]
    shr    eax, 8            ; high part is more than 16 bits if low is 14, needs to be broken up
    sub    bl, byte [wordbits + eax]
;    movzx  eax, byte [wordbits + eax]
;    add    eax, edx
;    sub    ebx, eax

    movzx  eax,  bl
    movd   xmm7, eax
    pxor   xmm0, xmm0
    pxor   xmm1, xmm1    ; 2 accumulators
    pshufb xmm7, xmm0    ; broadcast byte to search for.
      ;;  Actually SSSE3, but it only takes a few more insns to broadcast a byte with just SSE2.  
      ;; e.g. imul eax, 0x01010101 / movd / pshufd

    ;movzx  edx, cx
;    lea    edx, [ecx - 4]   ; TODO: handle cx < 4 making this wrap
;    movzx  edx, dx
    and    ecx, -16384      ; clear low bits, which we're processing with the inner loop.

    mov    edx, wordbits     ; quick and dirty, just loop forward over the array
     ;; FIXME: handle non-zero CX on first outer loop iteration, maybe loop backwards so we can go downwards toward 0,
     ;; or calculate an end-pointer if we can use that without register-read stalls on Core 2.
     ;; Also need to handle the leftover part not being a multiple of 32 in size
     ;; So maybe just make a more-flexible copy of this loop and peel the first outer iteration (containing that inner loop)
     ;;  if the cleanup for that slows down the common case of doing exactly 16K 
align 16
  .low14:
    movdqa  xmm2, [edx]
    movdqa  xmm3, [edx + 16]
 ds   pcmpeqb xmm2, xmm7           ; extra prefixes for padding for Skylake JCC erratum: 18ms vs. 25ms
 ds   psubb   xmm0, xmm2
    ds add     edx, 32
 cs   pcmpeqb xmm3, xmm7
 cs   psubb   xmm1, xmm3

  ; hits are rare enough to not wrap counters?
  ; TODO: may need an inner loop to accumulate after 256 steps if every other 32nd element is a match overflowing some SIMD element
    cmp    edx, wordbits + 16384
    jb   .low14

   pxor   xmm7, xmm7
   psadbw xmm0, xmm7
   psadbw xmm1, xmm7       ; byte -> qword horizontal sum
   paddd  xmm0, xmm1       ; reduce to 1 vector
   movhlps xmm1, xmm0
   paddd  xmm0, xmm1       ; hsum the low/high counts
   movd   eax, xmm0
   add    esi, eax         ; sum in scalar (could sink this out)

   sub   ecx, 16384
   ja   .outer
; TODO: handle ECX < 65536 initially or after handling leading bits.  Probably with BSR in the inner loop

可能可以只将 PADDD 放入向量累加器，并且只将 hsum 设为循环外的标量，但我们可能需要更多的自由向量 regs？

Answer 2

这是算法优化的尝试。

我。 [0; 范围内所需整数的数量； 2 ** 楼层（log2（N）））

所有这些整数都小于 N，因此我们只需要检查其中有多少整数在前导一位之后恰好有 K 个零位。

对于位长 n 的整数，有 n - 1 个可能的位置来放置我们的零（前导一位下方的位）。因此，位长 n 所需整数的数量是从 n - 1 个位置中选择 k 个零的方法数（无重复，无序）。我们可以使用 binomial coefficient 公式计算：

n! / (k! * (n - k)!)

如果我们使用 32 位整数，那么 n 的最大可能值为 31（k 也是如此）。 31 的阶乘仍然很大，即使是 64 位数字也放不下，因此我们必须执行重复除法（可以 constexpr 在编译时预先计算）。

为了得到整数的总数，我们计算 n 从 1 到 floor(log2(N)) 的二项式系数并将它们相加。

二． [2 ** floor(log2(N)); 范围内所需整数的数量； N]

从前导一位之后的位开始。并应用以下算法：

如果当前位为零，那么我们无法对该位做任何事情（它必须为零，如果我们将其更改为一，则整数值变得大于N), 所以我们简单地减少我们的零位预算 K 并移动到下一位。
如果当前位是1，那么我们可以假装它是0。现在，剩余的低位有效位的任意组合都将适合 N 以下的范围。获取二项式系数值，以计算出有多少种方法可以从剩余位数中选择剩余的零数并添加到总数中。

一旦我们运行位或 K 变为零，算法就会停止。在这一点上，如果 K 等于剩余的位数，这意味着我们可以将它们清零以获得所需的整数，因此我们将总计数加一（将 N 本身计数到总数）。或者如果 K 为零而所有剩余位均为一，那么我们也可以将 N 计入总数。

代码：

#include <stdio.h>
#include <chrono>

template<typename T>
struct Coefficients {
  static constexpr unsigned size_v = sizeof(T) * 8;

  // Zero-initialize.
  // Indexed by [number_of_zeros][number_of_bits]
  T value[size_v][size_v] = {};

  constexpr Coefficients() {
    // How many different ways we can choose k items from n items
    // without order and without repetition.
    //
    // n! / k! (n - k)!

    value[0][0] = 1;
    value[0][1] = 1;
    value[1][1] = 1;

    for(unsigned i = 2; i < size_v; ++i) {
      value[0][i] = 1;
      value[1][i] = i;

      T r = i;

      for(unsigned j = 2; j < i; ++j) {
        r = (r * (i - j + 1)) / j;
        value[j][i] = r;
      }

      value[i][i] = 1;
    }
  }
};


template<typename T>
__attribute__((noinline)) // To make it easier to benchmark
T count_combinations(T max_value, T zero_bits) {
  if( max_value == 0 )
    return 0;

  constexpr int size = sizeof(T) * 8;
  constexpr Coefficients<T> coefs;
  // assert(zeros_bits < size)

  int bits = size - __builtin_clz(max_value);

  T total = 0;

  // Count all-ones count.
#pragma clang loop vectorize(disable)
  for(int i = 0; i < bits - 1; ++i) {
    total += coefs.value[zero_bits][i];
  }

  // Count interval [2**bits, max_value]
  bits -= 1;
  T mask = T(1) << bits;
  max_value &= ~mask;      // Remove leading bit
  mask = mask >> 1;

#pragma clang loop vectorize(disable)
  while( zero_bits && zero_bits < bits ) {
    if( max_value & mask ) {
      // If current bit is one, then we can pretend that it is zero
      // (which would only make the value smaller, which means that
      // it would still be < max_value) and grab all combinations of
      // zeros within the remaining bits.
      total += coefs.value[zero_bits - 1][bits - 1];

      // And then stop pretending it's zero and continue as normal.
    } else {
      // If current bit is zero, we can't do anything about it, just
      // have to spend a zero from our budget.

      zero_bits--;
    }

    max_value &= ~mask;
    mask = mask >> 1;
    bits--;
  }

  // At this point we don't have any more zero bits, or we don't
  // have any more bits at all.

  if( (zero_bits == bits) ||
      (zero_bits == 0 && max_value == ((mask << 1) - 1)) ) {
    total++;
  }

  return total;
}

int main() {
  using namespace std::chrono;

  unsigned count = 0;
  time_point t0 = high_resolution_clock::now();

  for(int i = 0; i < 1000; ++i) {
    count |= count_combinations<unsigned>(1'000'000'000, 8);
  }
  time_point t1 = high_resolution_clock::now();

  auto duration = duration_cast<nanoseconds>(t1 - t0).count();

  printf("result = %u, time = %lld ns\n", count, duration / 1000);

  return 0;
}

结果（对于 N=1'000'000'000，K=8，运行ning on i7-9750H）：

result = 12509316, time = 35 ns

如果在运行时计算二项式系数，则需要 ~3.2 µs。

计算 [1..N] 中前导 1 下方有 K 个零位的整数？（没有 HW POPCNT 的连续范围的 popcount）

Count integers in [1..N] with K zero bits below the leading 1? (popcount for a contiguous range without HW POPCNT)

optimization

x86

nasm

hammingweight

micro-optimization

针对 Skylake (i7-6700k) 进行了优化：即使每次迭代访问 2 个 LUT：3.9GHz 时为 0.600 秒

使用 popcnt:

（未完成）具有不变 clz(x) 和 high-half popcount

SSE2 SIMD

计算 [1..N] 中前导 1 下方有 K 个零位的整数？ （没有 HW POPCNT 的连续范围的 popcount）

Count integers in [1..N] with K zero bits below the leading 1? (popcount for a contiguous range without HW POPCNT)

optimization

x86

nasm

hammingweight

micro-optimization

针对 Skylake (i7-6700k) 进行了优化：即使每次迭代访问 2 个 LUT：3.9GHz 时为 0.600 秒

使用 popcnt:

（未完成）具有不变 clz(x) 和 high-half popcount

SSE2 SIMD

计算 [1..N] 中前导 1 下方有 K 个零位的整数？（没有 HW POPCNT 的连续范围的 popcount）