strlen AVX-512 __builtin_ctz invalid value
I wrote a strlen function using AVX-512 instructions. Here is my source code:
size_t avx512_strlen(const char * s) {
    __m512i vec0, vec1;
    unsigned long long mask;
    const char * ptr = s;
    vec0 = _mm512_setzero_epi32();
    while (1) {
        vec1 = _mm512_loadu_si512(s);
        mask = _mm512_cmpeq_epi8_mask(vec0, vec1);
        if(mask != 0) {
            mask = __builtin_ctz(mask);
            return (s-ptr) + mask;
        }
        s += 64;
    }
    return s-ptr;
}
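The value of '__builtin_ctz(mask)' is wrong, so the return value is incorrect. Specifically, the function fails to compute the position of the null terminator (0x00) within the final 64-byte chunk it checks.
For example, I have this string: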
char str[] = "EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE"
"EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE"
"EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE"
"EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE";
The length of this string is 360, but the function returns 352. The problem comes from the '__builtin_ctz' part. Before '__builtin_ctz' executes, the mask is correct. It is:
0001110100010001000100010000000000000000000000000000000000000000
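In the final iteration, 320 characters have already been checked, so __builtin_ctz should return 40 (as you can see in the mask, there are 40 zeros before the first '1'). The mask is correct, but '__builtin_ctz' computes the wrong count!
What is the problem?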
__builtin_ctz operates on unsigned int, which is likely to be 32 bits on any x86 platform. Meanwhile, unsigned long long is likely to be 64 bits on any x86 platform. So your mask is truncated on this line:
mask = __builtin_ctz(mask);
Since the low 32 bits are all 0, the result is undefined (per GCC):
Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined.
(Although undefined, 352 - 320 = 32 is a reasonable answer for the "number of trailing 0 bits in a 32-bit zero integer.")
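To make the truncation concrete, here is a small sketch that uses the mask from the question written in hex (0x1D11110000000000). The helper names low_32_bits and first_set_bit_64 are made up for illustration, and the code assumes GCC or Clang, where unsigned int is 32 bits on x86. It deliberately never calls __builtin_ctz on the truncated value, because that value is 0 and the result would be undefined.

```c
/* Hypothetical helpers, for illustration only. */

/* The value __builtin_ctz actually receives: its parameter is
   unsigned int, so the 64-bit mask is converted and only the low
   32 bits survive (assuming a 32-bit unsigned int). */
unsigned int low_32_bits(unsigned long long mask) {
    return (unsigned int)mask;
}

/* The 64-bit variant looks at all 64 bits. The mask must be nonzero,
   since the result for 0 is undefined here as well. */
int first_set_bit_64(unsigned long long mask) {
    return __builtin_ctzll(mask);
}
```

With the mask from the question, low_32_bits(0x1D11110000000000ULL) is 0, which is the undefined case quoted above, while first_set_bit_64(0x1D11110000000000ULL) is 40, the count the question expected.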
You probably meant to use __builtin_ctzll(mask) instead. That should give you the correct count.
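For reference, the question's loop with only that change applied might look like the sketch below. The name avx512_strlen_fixed is made up here, and the target attribute is an assumption that lets GCC or Clang compile the function without -mavx512bw on the command line. The CPU still has to support AVX-512BW at run time. As in the original, each load reads a full 64 bytes, so this sketch also assumes the bytes after the terminator are readable.

```c
#include <immintrin.h>
#include <stddef.h>

/* Sketch only: same loop as the question, with __builtin_ctzll.
   Assumes GCC/Clang on x86-64 and an AVX-512BW capable CPU. */
__attribute__((target("avx512f,avx512bw")))
size_t avx512_strlen_fixed(const char *s) {
    const char *start = s;
    const __m512i zero = _mm512_setzero_si512();
    for (;;) {
        /* Unaligned 64-byte load; may read past the terminator. */
        __m512i chunk = _mm512_loadu_si512((const void *)s);
        /* One mask bit per byte; a bit is set where the byte is 0x00. */
        __mmask64 mask = _mm512_cmpeq_epi8_mask(zero, chunk);
        if (mask != 0) {
            /* 64-bit count, so a terminator in bytes 32..63 is found. */
            return (size_t)(s - start) + (size_t)__builtin_ctzll(mask);
        }
        s += 64;
    }
}
```

Calling it on the 360-character string from the question should then return 360, since the terminator sits at offset 40 of the sixth 64-byte chunk.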