AArch64: compare 256-bit unsigned integers
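While learning the Arm NEON instruction set, I tried to implement a comparison of 256-bit numbers (A <= B). Below is the implementation I ended up with, but I doubt that my approach is a good one. Perhaps there is a smarter, more optimized way to compare big numbers? Any advice and comments would be appreciated.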
// v1:v0 = number A
// v11:v10 = number B
// out: x0 = 0 if A <= B, all ones if A > B
// Check A >= B (element by element) - Result1
cmhs v13.4s, v1.4s, v11.4s
cmhs v12.4s, v0.4s, v10.4s
// Check A > B (element by element) - Result2
cmhi v11.4s, v1.4s, v11.4s
cmhi v10.4s, v0.4s, v10.4s
// Narrow down Result2 vector to 64 bits
xtn v10.4h, v10.4s
xtn2 v10.8h, v11.4s
xtn v10.8b, v10.8h
// Narrow down Result1 vector to 64 bits
xtn v11.4h, v12.4s
xtn2 v11.8h, v13.4s
xtn v11.8b, v11.8h
not v11.8b, v11.8b
// Compare elements in Result1 and Result2 as scalars
// if Result2 has all bits set in higher elements than Result1,
// then A > B
cmhi d9, d10, d11
mov x0, v9.d[0]
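To see why the final scalar compare gives the right answer, here is a minimal portable C sketch of the sequence above. The function name `gt256_mask_model` and the limb layout (`a[0]` is the least significant 32-bit limb, matching lane 0 of v0/v10) are assumptions made for illustration; this models the logic only, not the NEON code itself.

```c
#include <stdint.h>

/* Hypothetical portable model of the NEON sequence.
   a[0] / b[0] are the least significant 32-bit limbs (assumed layout). */
static uint64_t gt256_mask_model(const uint32_t a[8], const uint32_t b[8])
{
    uint64_t gt = 0;      /* narrowed Result2: byte i = 0xFF if a[i] > b[i] */
    uint64_t not_ge = 0;  /* narrowed, inverted Result1: byte i = 0xFF if a[i] < b[i] */

    for (int i = 0; i < 8; i++) {
        if (a[i] > b[i])
            gt |= (uint64_t)0xFF << (8 * i);
        if (!(a[i] >= b[i]))
            not_ge |= (uint64_t)0xFF << (8 * i);
    }

    /* Models cmhi d9, d10, d11: the two masks never have the same byte set,
       so the most significant differing limb decides the unsigned compare. */
    return gt > not_ge ? UINT64_MAX : 0;
}
```

The return value is 0 when A <= B and all ones when A > B, matching x0 in the assembly.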
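I inverted the boolean (computing B > A directly instead of negating A >= B), replaced the xtn/xtn2 pair with uzp1, and re-scheduled the instructions.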
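// v1:v0 = number A, v11:v10 = number B
// out: x0 = 0 if A <= B
// Check B > A (element by element)
cmhi v12.4s, v10.4s, v0.4s
cmhi v13.4s, v11.4s, v1.4s
// Check A > B (element by element)
cmhi v10.4s, v0.4s, v10.4s
cmhi v11.4s, v1.4s, v11.4s
// Narrow both masks to 64 bits: keep the low half of each lane, twice
uzp1 v12.8h, v12.8h, v13.8h
uzp1 v10.8h, v10.8h, v11.8h
xtn v12.8b, v12.8h
xtn v10.8b, v10.8h
// A > B if the (A > B) mask exceeds the (B > A) mask as a 64-bit scalar
cmhi d10, d10, d12
mov x0, v10.d[0]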
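As a sanity check on the uzp1 substitution, here is a small C sketch modelling both narrowing steps on plain arrays. The helper names (`as_halfwords`, `model_xtn_pair`, `model_uzp1_8h`) are hypothetical. It assumes the usual register view, where halfword 2*i is bits [15:0] of 32-bit lane i. Under that assumption, uzp1 on .8h lanes picks the low half of every 32-bit lane, which is what the truncating xtn/xtn2 pair produces.

```c
#include <stdint.h>

/* View four 32-bit lanes as eight 16-bit lanes:
   halfword 2*i is bits [15:0] of lane i, halfword 2*i+1 is bits [31:16]. */
static void as_halfwords(const uint32_t s[4], uint16_t h[8])
{
    for (int i = 0; i < 4; i++) {
        h[2 * i]     = (uint16_t)(s[i] & 0xFFFFu);
        h[2 * i + 1] = (uint16_t)(s[i] >> 16);
    }
}

/* Models: xtn out.4h, lo.4s ; xtn2 out.8h, hi.4s (truncate each lane). */
static void model_xtn_pair(const uint32_t lo[4], const uint32_t hi[4],
                           uint16_t out[8])
{
    for (int i = 0; i < 4; i++) {
        out[i]     = (uint16_t)lo[i];
        out[i + 4] = (uint16_t)hi[i];
    }
}

/* Models: uzp1 out.8h, n.8h, m.8h (even-numbered halfwords of n, then m). */
static void model_uzp1_8h(const uint32_t n[4], const uint32_t m[4],
                          uint16_t out[8])
{
    uint16_t nh[8], mh[8];
    as_halfwords(n, nh);
    as_halfwords(m, mh);
    for (int i = 0; i < 4; i++) {
        out[i]     = nh[2 * i];
        out[i + 4] = mh[2 * i];
    }
}
```

Both models produce the same eight halfwords for any input, so one uzp1 can stand in for the two-instruction xtn/xtn2 sequence in this code.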