strncmp的实现
The implementation of strncmp
为了磨练我的 C 技能,我下载了 eglibc 源代码并遇到了 strncpy。我不明白他为什么要区分 n<=4 的情况并进行 4 次测试。
int
STRNCMP (const char *s1, const char *s2, size_t n)
{
unsigned char c1 = '[=10=]';
unsigned char c2 = '[=10=]';
if (n >= 4)
{
size_t n4 = n >> 2;
do
{
c1 = (unsigned char) *s1++;
c2 = (unsigned char) *s2++;
if (c1 == '[=10=]' || c1 != c2)
return c1 - c2;
c1 = (unsigned char) *s1++;
c2 = (unsigned char) *s2++;
if (c1 == '[=10=]' || c1 != c2)
return c1 - c2;
c1 = (unsigned char) *s1++;
c2 = (unsigned char) *s2++;
if (c1 == '[=10=]' || c1 != c2)
return c1 - c2;
c1 = (unsigned char) *s1++;
c2 = (unsigned char) *s2++;
if (c1 == '[=10=]' || c1 != c2)
return c1 - c2;
} while (--n4 > 0);
n &= 3;
}
while (n > 0)
{
c1 = (unsigned char) *s1++;
c2 = (unsigned char) *s2++;
if (c1 == '[=10=]' || c1 != c2)
return c1 - c2;
n--;
}
return c1 - c2;
}
可能与内存布局有关我不知道,请赐教。
这是一个 unrolled loop。以二进制文件稍大为代价,它通过消除每 4 个要比较的字节的 3 个递减、3 个分支和 3 个条件来加快字符串比较。
甚至可以通过使用与 Duff's device 相同的技术进一步优化,但尚不清楚这实际上会更快。从链接页面,
This automatic handling of the remainder may not be the best solution on all systems and compilers – in some cases two loops may actually be faster (one loop, unrolled, to do the main copy, and a second loop to handle the remainder). The problem appears to come down to the ability of the compiler to correctly optimize the device; it may also interfere with pipelining and branch prediction on some architectures. When numerous instances of Duff's device were removed from the XFree86 Server in version 4.0, there was an improvement in performance and a noticeable reduction in size of the executable. Therefore, when considering using this code, it may be worth running a few benchmarks to verify that it actually is the fastest code on the target architecture, at the target optimization level, with the target compiler.
为了磨练我的 C 技能,我下载了 eglibc 源代码并遇到了 strncpy。我不明白他为什么要区分 n<=4 的情况并进行 4 次测试。
int
STRNCMP (const char *s1, const char *s2, size_t n)
{
unsigned char c1 = '[=10=]';
unsigned char c2 = '[=10=]';
if (n >= 4)
{
size_t n4 = n >> 2;
do
{
c1 = (unsigned char) *s1++;
c2 = (unsigned char) *s2++;
if (c1 == '[=10=]' || c1 != c2)
return c1 - c2;
c1 = (unsigned char) *s1++;
c2 = (unsigned char) *s2++;
if (c1 == '[=10=]' || c1 != c2)
return c1 - c2;
c1 = (unsigned char) *s1++;
c2 = (unsigned char) *s2++;
if (c1 == '[=10=]' || c1 != c2)
return c1 - c2;
c1 = (unsigned char) *s1++;
c2 = (unsigned char) *s2++;
if (c1 == '[=10=]' || c1 != c2)
return c1 - c2;
} while (--n4 > 0);
n &= 3;
}
while (n > 0)
{
c1 = (unsigned char) *s1++;
c2 = (unsigned char) *s2++;
if (c1 == '[=10=]' || c1 != c2)
return c1 - c2;
n--;
}
return c1 - c2;
}
可能与内存布局有关我不知道,请赐教。
这是一个 unrolled loop。以二进制文件稍大为代价,它通过消除每 4 个要比较的字节的 3 个递减、3 个分支和 3 个条件来加快字符串比较。
甚至可以通过使用与 Duff's device 相同的技术进一步优化,但尚不清楚这实际上会更快。从链接页面,
This automatic handling of the remainder may not be the best solution on all systems and compilers – in some cases two loops may actually be faster (one loop, unrolled, to do the main copy, and a second loop to handle the remainder). The problem appears to come down to the ability of the compiler to correctly optimize the device; it may also interfere with pipelining and branch prediction on some architectures. When numerous instances of Duff's device were removed from the XFree86 Server in version 4.0, there was an improvement in performance and a noticeable reduction in size of the executable. Therefore, when considering using this code, it may be worth running a few benchmarks to verify that it actually is the fastest code on the target architecture, at the target optimization level, with the target compiler.