Efficient way to OR adjacent bits in 64-bit integer
What I want to do is take a 64-bit unsigned integer consisting of pairs of bits, and create from it a 32-bit integer containing 0 if both bits in the corresponding pair are 0, and 1 otherwise. In other words, convert something that looks like this:
01 00 10 11
into something that looks like this:
1 0 1 1
The two obvious solutions are a brute-force loop, or a lookup table for each byte followed by eight lookups combined into the final result with OR and bit-shifting, but I'm sure there must be an efficient bit-twiddling method for this. I will be doing this in C++ for 64-bit integers, but if anyone knows an efficient way to do it for shorter integers, I'm sure I can figure out how to scale it up.
Okay, let's make this more hacky then (probably buggy):
uint64_t x;
uint64_t even_bits = x & 0xAAAAAAAAAAAAAAAAull;
uint64_t odd_bits = x & 0x5555555555555555ull;
Now, my original solution was to do this:
// wrong
even_bits >> 1;
unsigned int solution = even_bits | odd_bits;
However, as JackAidley pointed out, while this brings the bits together, it doesn't remove the gaps in between!
Thankfully, we can use the very helpful _pext instruction from the BMI2 instruction set.
u64 _pext_u64(u64 a, u64 m)
- Extract bits from a at the corresponding bit locations specified by mask m to contiguous low bits in dst; the remaining upper bits in dst are set to zero.
solution = _pext_u64(solution, odd_bits);
Alternatively, instead of separating the bits out with & and >>, you could just use _pext twice on the original number with the provided masks (which splits it up into two contiguous 32-bit numbers), and then simply or the results together.
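Putting the corrected pieces together, a minimal self-contained sketch might look like this (the function name is mine, and I'm using the constant mask for the extraction; compile with -mbmi2):

#include <cstdint>
#include <immintrin.h> // _pext_u64, requires BMI2

// Hypothetical helper: pair up the bits, then squeeze out the gaps.
uint32_t or_pairs(uint64_t x)
{
    uint64_t even_bits = x & 0xAAAAAAAAAAAAAAAAull;
    uint64_t odd_bits  = x & 0x5555555555555555ull;
    uint64_t merged    = (even_bits >> 1) | odd_bits; // pair-ORs sit at the odd-mask positions
    return (uint32_t)_pext_u64(merged, 0x5555555555555555ull);
}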
If you don't have access to BMI2, though, I'm pretty sure the gap removal would still involve a loop; a bit simpler than your initial idea, perhaps.
Probably the fastest solution for the x86 architecture with the BMI2 instruction set:
#include <stdint.h>
#include <x86intrin.h>
uint32_t calc (uint64_t a)
{
return _pext_u64(a, 0x5555555555555555ull) |
_pext_u64(a, 0xaaaaaaaaaaaaaaaaull);
}
This compiles to 5 instructions total.
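As a quick sanity check of calc (a test I added, using the example from the question; it assumes the calc above is in scope and needs a BMI2-capable CPU at runtime):

#include <cassert>

int main()
{
    // pairs 01 00 10 11 (MSB first) -> bits 1 0 1 1
    assert(calc(0b01001011u) == 0b1011u);
}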
If you don't have pext and you still want to do better than the trivial way, then this extraction can be expressed as a logarithmic number of bit moves (if you generalize it in terms of the length):
// OR adjacent bits, destroys the odd bits but it doesn't matter
x = (x | (x >> 1)) & rep8(0x55);
// gather the even bits with delta swaps
x = bitmove(x, rep8(0x44), 1); // make pairs
x = bitmove(x, rep8(0x30), 2); // make nibbles
x = bitmove(x, rep4(0x0F00), 4); // make bytes
x = bitmove(x, rep2(0x00FF0000), 8); // make words
res = (uint32_t)(x | (x >> 16)); // final step is simpler
With:
uint64_t bitmove(uint64_t x, uint64_t mask, unsigned step) {
    // clear the masked bits from their old position, then OR them in `step` places to the right
    return (x & ~mask) | ((x & mask) >> step);
}
repk is just so that I can write the constants shorter. rep8(0x44) = 0x4444444444444444, etc.
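For reference, here is the whole thing assembled into a self-contained function (a sketch under my own helper definitions; only the rep8/rep4/rep2 shorthand and bitmove mirror the answer above, the function name is mine):

#include <cstdint>

static inline uint64_t rep8(uint64_t b) { return b * 0x0101010101010101ull; } // repeat a byte
static inline uint64_t rep4(uint64_t b) { return b * 0x0001000100010001ull; } // repeat a 16-bit group
static inline uint64_t rep2(uint64_t b) { return b * 0x0000000100000001ull; } // repeat a 32-bit group

static inline uint64_t bitmove(uint64_t x, uint64_t mask, unsigned step)
{
    return (x & ~mask) | ((x & mask) >> step);
}

uint32_t or_pairs_nopext(uint64_t x)
{
    x = (x | (x >> 1)) & rep8(0x55);     // OR adjacent bits
    x = bitmove(x, rep8(0x44), 1);       // make pairs
    x = bitmove(x, rep8(0x30), 2);       // make nibbles
    x = bitmove(x, rep4(0x0F00), 4);     // make bytes
    x = bitmove(x, rep2(0x00FF0000), 8); // make words
    return (uint32_t)(x | (x >> 16));    // final step is simpler
}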
Also, if you do have pext, you can use just one of them, which is probably faster and at least shorter:
_pext_u64(x | (x >> 1), rep8(0x55));
Here's a portable C++ implementation. It seems to work during my brief testing. The deinterleave code is based on this SO question.
uint64_t calc(uint64_t n)
{
// (odd | even)
uint64_t x = (n & 0x5555555555555555ull) | ((n & 0xAAAAAAAAAAAAAAAAull) >> 1);
// deinterleave
x = (x | (x >> 1)) & 0x3333333333333333ull;
x = (x | (x >> 2)) & 0x0F0F0F0F0F0F0F0Full;
x = (x | (x >> 4)) & 0x00FF00FF00FF00FFull;
x = (x | (x >> 8)) & 0x0000FFFF0000FFFFull;
x = (x | (x >> 16)) & 0x00000000FFFFFFFFull;
return x;
}
gcc, clang, and msvc all compile this to about 30 instructions.
Based on the comments, there is a modification that can be made: change the first line to select only the "odd" bits. The possibly(?) improved code is:
uint64_t calc(uint64_t n)
{
// (odd | even)
uint64_t x = (n | (n >> 1)) & 0x5555555555555555ull; // single bits
// ... the rest is the same deinterleave
x = (x | (x >> 1)) & 0x3333333333333333ull; // bit pairs
x = (x | (x >> 2)) & 0x0F0F0F0F0F0F0F0Full; // nibbles
x = (x | (x >> 4)) & 0x00FF00FF00FF00FFull; // octets
x = (x | (x >> 8)) & 0x0000FFFF0000FFFFull; // halfwords
x = (x | (x >> 16)) & 0x00000000FFFFFFFFull; // words
return x;
}
A slight improvement over the LUT approach (4 lookups instead of 8):
Compute the bitwise OR and clear every other bit. Then intertwine the bits of pairs of bytes to yield four bytes. Finally, reorder the bits in the four bytes (mapped onto the quadword) via a 256-entry lookup table:
Q= (Q | (Q << 1)) & 0xAAAAAAAAAAAAAAAAull; // OR in pairs
Q|= Q >> 9; // Intertwine 4 words into 4 bytes
B0= LUT[B0]; B1= LUT[B2]; B2= LUT[B4]; B3= LUT[B6]; // Rearrange bits in bytes
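A runnable sketch of this idea (my own code, not from the answer: the table construction, the names, and the full-width pair mask are assumptions). After the two steps above, each even-indexed byte of Q holds four result bits in its odd positions (r0..r3) and the next four in its even positions (r4..r7); the LUT just undoes that interleave:

#include <cstdint>

static uint8_t LUT[256]; // call init_lut() once before use

static void init_lut()
{
    for (int b = 0; b < 256; ++b) {
        int out = 0;
        for (int k = 0; k < 4; ++k) {
            out |= ((b >> (2 * k + 1)) & 1) << k;   // odd source bits -> low nibble
            out |= ((b >> (2 * k)) & 1) << (k + 4); // even source bits -> high nibble
        }
        LUT[b] = (uint8_t)out;
    }
}

uint32_t or_pairs_4lut(uint64_t Q)
{
    Q = (Q | (Q << 1)) & 0xAAAAAAAAAAAAAAAAull; // OR in pairs
    Q |= Q >> 9;                                // intertwine pairs of bytes
    uint32_t B0 = LUT[(Q >>  0) & 0xFF];
    uint32_t B1 = LUT[(Q >> 16) & 0xFF];
    uint32_t B2 = LUT[(Q >> 32) & 0xFF];
    uint32_t B3 = LUT[(Q >> 48) & 0xFF];
    return B0 | (B1 << 8) | (B2 << 16) | (B3 << 24);
}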
The hard part seems to be packing the bits after the oring. The oring is done by:
ored = (x | (x>>1)) & 0x5555555555555555;
(assuming int is large enough so that we don't have to use suffixes). Then we can pack step by step, first two by two, then four by four, and so on:
pack2  = ((ored   *     3) >>  1) & 0x3333333333333333;
pack4  = ((pack2  *     5) >>  2) & 0x0F0F0F0F0F0F0F0F;
pack8  = ((pack4  *    17) >>  4) & 0x00FF00FF00FF00FF;
pack16 = ((pack8  *   257) >>  8) & 0x0000FFFF0000FFFF;
pack32 = ((pack16 * 65537) >> 16) & 0xFFFFFFFF;
// (or cast to uint32_t instead of the final & 0xFFF...)
What happens in the packing is that by multiplying we combine the data with the shifted data. In your example, for the first multiplication we have (I denote the masked-out zeros from ored by o; the other 0s come from the original data):
  o1o0o1o1
x       11
----------
  o1o0o1o1
 o1o0o1o1
----------
 o11001111
   ^^  ^^
 oo10oo11o  <- these are the bits we want to keep
We could also have done it just by oring:
ored = (ored | (ored>>1)) & 0x3333333333333333;
ored = (ored | (ored>>2)) & 0x0F0F0F0F0F0F0F0F;
ored = (ored | (ored>>4)) & 0x00FF00FF00FF00FF;
ored = (ored | (ored>>8)) & 0x0000FFFF0000FFFF;
ored = (ored | (ored>>16)) & 0xFFFFFFFF;
// ored = ((uint32_t)ored | (uint32_t)(ored>>16)); // helps some compilers make better code, esp. on x86
When this question was new, I made some vectorized versions (godbolt link still with some big design-notes comments) and did some benchmarking. I meant to spend more time on it, but never got back to it. Posting what I have so I can close this browser tab. >.< Improvements welcome.
I don't have a Haswell I could test on, so I couldn't benchmark the pextr version against these. I'm sure it's faster, though, since it's only 4 fast instructions.
*** Sandybridge (i5-2500k, so no hyperthreading)
*** 64bit, gcc 5.2 with -O3 -fno-tree-vectorize results:
TODO: update benchmarks for latest code changes
total cycles, and insn/clock, for the test-loop
This measures only throughput, not latency,
and a bottleneck on one execution port might make a function look worse in a microbench
than it will do when mixed with other code that can keep the other ports busy.
Lower numbers in the first column are better:
these are total cycle counts in Megacycles, and correspond to execution time
but they take frequency scaling / turbo out of the mix.
(We're not cache / memory bound at all, so low core clock = fewer cycles for cache miss doesn't matter).
AVX no AVX
887.519Mc 2.70Ipc 887.758Mc 2.70Ipc use_orbits_shift_right
1140.68Mc 2.45Ipc 1140.47Mc 2.46Ipc use_orbits_mul (old version that right-shifted after each)
718.038Mc 2.79Ipc 716.452Mc 2.79Ipc use_orbits_x86_lea
767.836Mc 2.74Ipc 1027.96Mc 2.53Ipc use_orbits_sse2_shift
619.466Mc 2.90Ipc 816.698Mc 2.69Ipc use_orbits_ssse3_shift
845.988Mc 2.72Ipc 845.537Mc 2.72Ipc use_orbits_ssse3_shift_scalar_mmx (gimped by stupid compiler)
583.239Mc 2.92Ipc 686.792Mc 2.91Ipc use_orbits_ssse3_interleave_scalar
547.386Mc 2.92Ipc 730.259Mc 2.88Ipc use_orbits_ssse3_interleave
The fastest (for throughput in a loop) with AVX is orbits_ssse3_interleave
The fastest (for throughput in a loop) without AVX is orbits_ssse3_interleave_scalar
but obits_x86_lea comes very close.
AVX for non-destructive 3-operand vector insns helps a lot
Maybe a bit less important on IvB and later, where mov-elimination handles mov uops at register-rename time
// Tables generated with the following commands:
// for i in avx.perf{{2..4},{6..10}};do awk '/cycles / {c=$1; gsub(",", "", c); } /insns per cy/ {print c / 1000000 "Mc " $4 "Ipc"}' *"$i"*;done | column -c 50 -x
// Include 0 and 1 for hosts with pextr
// 5 is omitted because it's not written
The almost-certainly-best version (with BMI2) is:
#include <stdint.h>
#define LOBITS64 0x5555555555555555ull
#define HIBITS64 0xaaaaaaaaaaaaaaaaull
uint32_t orbits_1pext (uint64_t a) {
// a|a<<1 compiles more efficiently on x86 than a|a>>1, because of LEA for non-destructive left-shift
return _pext_u64( a | a<<1, HIBITS64);
}
This compiles to:
lea rax, [rdi+rdi]
or rdi, rax
movabs rax, -6148914691236517206
pext rax, rdi, rax
ret
So it's only 4 uops, with a critical-path latency of 5c = 3 (pext) + 1 (or) + 1 (lea). (Intel Haswell). Throughput should be one result per cycle (with no loop overhead or loading/storing). The mov imm for the constant can be hoisted out of a loop, though, since it isn't destroyed. That means throughput-wise we only need 3 fused-domain uops per result.
The mov r, imm64 isn't ideal. (A 1-uop broadcast-immediate 32-bit or 8-bit to a 64-bit reg would be ideal, but there's no such instruction.) Having the constant in data memory is an option, but inlining it in the instruction stream is nice. A 64b constant takes a lot of uop-cache space, which makes the version that uses pext with two different masks even worse. Generating one mask from the other with a not could help with that, though: movabs / pext / not / pext / or, but that's still 5 instructions compared to the 4 enabled by the lea trick.
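Written out in source, that alternative might look like this (a sketch; the name is mine, and it relies on the compiler materializing ~hi with a not):

#include <stdint.h>
#include <immintrin.h>

// movabs / pext / not / pext / or: still 5 instructions
uint32_t orbits_2pext(uint64_t a) {
    uint64_t hi = 0xaaaaaaaaaaaaaaaaull;
    return _pext_u64(a, hi) | _pext_u64(a, ~hi);
}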
The best version (with AVX) is:
#include <immintrin.h>
/* Yves Daoust's idea, operating on nibbles instead of bytes:
original:
Q= (Q | (Q << 1)) & 0xAAAAAAAAAAAAL // OR in pairs
Q|= Q >> 9; // Intertwine 4 words into 4 bytes
B0= LUT[B0]; B1= LUT[B2]; B2= LUT[B4]; B3= LUT[B6]; // Rearrange bits in bytes
To operate on nibbles,
Q= (Q | (Q << 1)) & 0xAAAAAAAAAAAAL // OR in pairs, same as before
Q|= Q>>5 // Intertwine 8 nibbles into 8 bytes
// pshufb as a LUT to re-order the bits within each nibble (to undo the interleave)
// right-shift and OR to combine nibbles
// pshufb as a byte-shuffle to put the 4 bytes we want into the low 4
*/
uint32_t orbits_ssse3_interleave(uint64_t scalar_a)
{
// do some of this in GP regs if not doing two 64b elements in parallel.
// esp. beneficial for AMD Bulldozer-family, where integer and vector ops don't share execution ports
// but VEX-encoded SSE saves mov instructions
__m128i a = _mm_cvtsi64_si128(scalar_a);
// element size doesn't matter, any bits shifted out of element boundaries would have been masked off anyway.
__m128i lshift = _mm_slli_epi64(a, 1);
lshift = _mm_or_si128(lshift, a);
lshift = _mm_and_si128(lshift, _mm_set1_epi32(0xaaaaaaaaUL));
// a = bits: h g f e d c b a (same thing in other bytes)
// lshift = hg 0 fe 0 dc 0 ba 0
// lshift = s 0 r 0 q 0 p 0
__m128i rshift = _mm_srli_epi64(lshift, 5); // again, element size doesn't matter, we're keeping only the low nibbles
// rshift = s 0 r 0 q 0 p 0 (the last zero ORs with the top bit of the low nibble in the next byte over)
__m128i nibbles = _mm_or_si128(rshift, lshift);
nibbles = _mm_and_si128(nibbles, _mm_set1_epi8(0x0f) ); // have to zero the high nibbles: the sign bit affects pshufb
// nibbles = 0 0 0 0 q s p r
// pshufb -> 0 0 0 0 s r q p
const __m128i BITORDER_NIBBLE_LUT = _mm_setr_epi8( // setr: first arg goes in the low byte, indexed by 0b0000
0b0000,
0b0100,
0b0001,
0b0101,
0b1000,
0b1100,
0b1001,
0b1101,
0b0010,
0b0110,
0b0011,
0b0111,
0b1010,
0b1110,
0b1011,
0b1111 );
__m128i ord_nibbles = _mm_shuffle_epi8(BITORDER_NIBBLE_LUT, nibbles);
// want 00 00 00 00 AB CD EF GH from:
// ord_nibbles = 0A0B0C0D0E0F0G0H
// 0A0B0C0D0E0F0G0 H(shifted out)
__m128i merged_nibbles = _mm_or_si128(ord_nibbles, _mm_srli_epi64(ord_nibbles, 4));
// merged_nibbles= 0A AB BC CD DE EF FG GH. We want every other byte of this.
// 7 6 5 4 3 2 1 0
// pshufb is the most efficient way. Mask and then packuswb would work, but uses the shuffle port just like pshufb
__m128i ord_bytes = _mm_shuffle_epi8(merged_nibbles, _mm_set_epi8(-1,-1,-1,-1, 14,12,10,8,
-1,-1,-1,-1, 6, 4, 2,0) );
return _mm_cvtsi128_si32(ord_bytes); // movd the low32 of the vector
// _mm_extract_epi32(ord_bytes, 2); // If operating on two inputs in parallel: SSE4.1 PEXTRD the result from the upper half of the reg.
}
The best version without AVX is a slight modification that only works on one input at a time, using SIMD only for the shuffling. In theory, using MMX instead of SSE would make more sense, esp. if we care about first-gen Core2, where 64b pshufb is fast but 128b pshufb isn't single-cycle. Anyway, compilers did a bad job with MMX intrinsics. Also, EMMS is slow.
// same as orbits_ssse3_interleave, but doing some of the math in integer regs. (non-vectorized)
// esp. beneficial for AMD Bulldozer-family, where integer and vector ops don't share execution ports
// VEX-encoded SSE saves mov instructions, so full vector is preferable if building with VEX-encoding
// Use MMX for Silvermont/Atom/Merom(Core2): pshufb is slow for xmm, but fast for MMX. Only 64b shuffle unit?
uint32_t orbits_ssse3_interleave_scalar(uint64_t scalar_a)
{
uint64_t lshift = (scalar_a | scalar_a << 1);
lshift &= HIBITS64;
uint64_t rshift = lshift >> 5;
// rshift = s 0 r 0 q 0 p 0 (the last zero ORs with the top bit of the low nibble in the next byte over)
uint64_t nibbles_scalar = (rshift | lshift) & 0x0f0f0f0f0f0f0f0fULL;
// have to zero the high nibbles: the sign bit affects pshufb
__m128i nibbles = _mm_cvtsi64_si128(nibbles_scalar);
// nibbles = 0 0 0 0 q s p r
// pshufb -> 0 0 0 0 s r q p
const __m128i BITORDER_NIBBLE_LUT = _mm_setr_epi8( // setr: first arg goes in the low byte, indexed by 0b0000
0b0000,
0b0100,
0b0001,
0b0101,
0b1000,
0b1100,
0b1001,
0b1101,
0b0010,
0b0110,
0b0011,
0b0111,
0b1010,
0b1110,
0b1011,
0b1111 );
__m128i ord_nibbles = _mm_shuffle_epi8(BITORDER_NIBBLE_LUT, nibbles);
// want 00 00 00 00 AB CD EF GH from:
// ord_nibbles = 0A0B0C0D0E0F0G0H
// 0A0B0C0D0E0F0G0 H(shifted out)
__m128i merged_nibbles = _mm_or_si128(ord_nibbles, _mm_srli_epi64(ord_nibbles, 4));
// merged_nibbles= 0A AB BC CD DE EF FG GH. We want every other byte of this.
// 7 6 5 4 3 2 1 0
// pshufb is the most efficient way. Mask and then packuswb would work, but uses the shuffle port just like pshufb
__m128i ord_bytes = _mm_shuffle_epi8(merged_nibbles, _mm_set_epi8(0,0,0,0, 0,0,0,0, 0,0,0,0, 6,4,2,0));
return _mm_cvtsi128_si32(ord_bytes); // movd the low32 of the vector
}
Sorry for the mostly code-dump answer. At this point I didn't feel it was worth spending a huge amount of time discussing things more than the comments already do. See http://agner.org/optimize/ for guides to optimizing for specific microarchitectures. Also the x86 wiki.