Fastest way to move higher or lower 64 bits in integer SSE register

Question

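What is the fastest way to move only the higher or lower 64 bits from one integer SSE register to another? With SSE 4.1, it can be done with a single pblendw instruction (_mm_blend_epi16). But what about older SSE versions? Shift and unpack? AND and OR? movsd, despite the bypass delay?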

Closely related question: Best way to shuffle 64-bit portions of two __m128i's

Answer 1

I don't know about fastest, but these are perhaps the simplest (results are written as [low qword, high qword]):

_mm_unpacklo_epi64(_mm_setzero_si128(), x)

[0, x0]

_mm_unpackhi_epi64(_mm_setzero_si128(), x)

[0, x1]

_mm_move_epi64(x)

[x0, 0]

_mm_unpackhi_epi64(x, _mm_setzero_si128())

[x1, 0]
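As a rough sketch, the four expressions above can be wrapped in small helpers and checked by storing the result to memory. The helper names (store_qwords, low_to_high, keep_high, keep_low, high_to_low) are made up for illustration and are not part of any library; only the _mm_* intrinsics are real SSE2 intrinsics.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: copy a vector to memory as {low qword, high qword}. */
static void store_qwords(__m128i v, uint64_t out[2]) {
    _mm_storeu_si128((__m128i *)out, v);
}

/* [0, x0]: low qword of x moved to the high half, low half zeroed. */
static __m128i low_to_high(__m128i x) {
    return _mm_unpacklo_epi64(_mm_setzero_si128(), x);
}

/* [0, x1]: high qword of x kept in place, low half zeroed. */
static __m128i keep_high(__m128i x) {
    return _mm_unpackhi_epi64(_mm_setzero_si128(), x);
}

/* [x0, 0]: low qword of x kept in place, high half zeroed. */
static __m128i keep_low(__m128i x) {
    return _mm_move_epi64(x);
}

/* [x1, 0]: high qword of x moved to the low half, high half zeroed. */
static __m128i high_to_low(__m128i x) {
    return _mm_unpackhi_epi64(x, _mm_setzero_si128());
}
```

Note that _mm_set_epi64x takes the high qword first, so _mm_set_epi64x(x1, x0) builds the vector written as [x0, x1] in the notation used here.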

Answer 2

To move the low 64 bits from src to dst, keeping the high 64 bits of dst:

movsd dst, src

To move the high 64 bits from src to dst, keeping the low 64 bits of dst:

shufps dst, src, E4h

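Bypass delays typically only add latency; they don't consume dispatch, execution, or retirement resources. So they are usually only a concern when comparing otherwise equivalent sequences (i.e., if there is a single-instruction equivalent that stays in the integer domain, you would prefer to use it for integer operations).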
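From C, these two merges can be requested through the floating-point intrinsics plus casts, roughly as sketched below. The function names merge_low_from_src and merge_high_from_src are hypothetical. The casts are free (they only reinterpret the register), but a compiler is not obliged to emit exactly movsd or shufps for these intrinsics, so check the generated assembly if the specific instruction matters.

```c
#include <emmintrin.h>  /* SSE2; also pulls in the SSE1 _mm_shuffle_ps */
#include <stdint.h>

/* Like "movsd dst, src": low qword from src, high qword from dst. */
static __m128i merge_low_from_src(__m128i dst, __m128i src) {
    return _mm_castpd_si128(
        _mm_move_sd(_mm_castsi128_pd(dst), _mm_castsi128_pd(src)));
}

/* Like "shufps dst, src, E4h": low qword from dst, high qword from src.
 * 0xE4 = 11 10 01 00b selects dwords 0,1 of dst and dwords 2,3 of src. */
static __m128i merge_high_from_src(__m128i dst, __m128i src) {
    return _mm_castps_si128(
        _mm_shuffle_ps(_mm_castsi128_ps(dst), _mm_castsi128_ps(src), 0xE4));
}
```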

Answer 3

Agner Fog's Optimizing Assembly guide has a good set of tables of instructions to use for various kinds of data movement (section 13.3).

To merge data from two regs into one, your options include:

MOVLHPS   # SSE. Low qword unchanged, high qword from low of source
MOVHLPS   # SSE. Low qword from high of source, high qword unchanged
MOVSD     # SSE2. Low qword from source (register only), high qword unchanged
# memory-source-only insns:
 MOVLPS/D  # SSE1/2.  Low qword from memory, high qword unchanged
 MOVHPS/D  # SSE1/2. High qword from memory, low qword unchanged
SHUFPD    # SSE2. Low qword from any position of destination. high qword from any position of source
PUNPCKLQDQ # SSE2. Low qword unchanged, high qword from low of source
PUNPCKHQDQ # SSE2. Low qword from high of destination, high qword from high of source
MOVQ       # SSE2. Low qword from source, high qword set to zero
PBLENDW    # SSE4.1
PINSRQ     # SSE4.1 (only takes the low64 of src)

The descriptions are copy/pasted from Agner Fog's table; he holds the copyright.

So shufpd looks like the best choice for inserting the high64 from another reg. The other options would require the data to be in the low64 of src (for punpcklqdq or movlhps).
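A minimal sketch of that comparison with SSE2 intrinsics, assuming you are fine with casting through __m128d for shufpd. The names insert_high_from_high and insert_high_from_low are invented for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* shufpd with imm8 = 2: bit 0 = 0 picks the low qword of dst,
 * bit 1 = 1 picks the high qword of src. Result: [dst0, src1]. */
static __m128i insert_high_from_high(__m128i dst, __m128i src) {
    return _mm_castpd_si128(
        _mm_shuffle_pd(_mm_castsi128_pd(dst), _mm_castsi128_pd(src), 2));
}

/* punpcklqdq stays in the integer domain, but the new high qword has to
 * come from the LOW qword of src. Result: [dst0, src0]. */
static __m128i insert_high_from_low(__m128i dst, __m128i src) {
    return _mm_unpacklo_epi64(dst, src);
}
```

If the value to insert already sits in the low half of src, the punpcklqdq form avoids the domain crossing; otherwise shufpd does the job in one instruction.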


sse

simd

cpu-registers