Understanding the strcmp function of GNU libc

Question

This is the strcmp function that I found in glibc:

int
STRCMP (const char *p1, const char *p2)
{
  const unsigned char *s1 = (const unsigned char *) p1;
  const unsigned char *s2 = (const unsigned char *) p2;
  unsigned char c1, c2;

  do
    {
      c1 = (unsigned char) *s1++;
      c2 = (unsigned char) *s2++;
      if (c1 == '\0')
        return c1 - c2;
    }
  while (c1 == c2);

  return c1 - c2;
}

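This is a fairly simple function: the body of the do-while loop loads c1 and c2 from *s1 and *s2, and the loop continues until either c1 is NUL or c1 and c2 differ. It then returns the difference between c1 and c2.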

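What I don't understand is the use of the s1 and s2 variables. Apart from pointing to unsigned char, they are const just like the two parameters p1 and p2, so why not use p1 and p2 directly inside the loop body and cast them there? Does using these two extra variables make the function more optimized in some way? I ask because here is the same function for FreeBSD, which I found on GitHub: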

int
strcmp(const char *s1, const char *s2)
{
    while (*s1 == *s2++)
        if (*s1++ == '\0')
            return (0);
    return (*(const unsigned char *)s1 - *(const unsigned char *)(s2 - 1));
}

Their version doesn't even bother with any extra variables.

Thanks in advance for your answers.

PS: I searched the web for this specific point before asking here, but found nothing.

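I would also like to know whether there is any particular reason why glibc uses these extra variables instead of casting the parameters p1 and p2 directly inside the loop.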

Answer 1

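You are right, of course. One of the casts should be enough. In particular, once the pointers have been cast, casting the retrieved value is a no-op.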

Here is the x86-64 code that gcc -O3 generates for the version with the unnecessary casts:

STRCMP:
.L4:
        addq    $1, %rdi
        movzbl  -1(%rdi), %eax
        addq    $1, %rsi
        movzbl  -1(%rsi), %edx
        testb   %al, %al
        je      .L7
        cmpb    %dl, %al
        je      .L4
        subl    %edx, %eax
        ret
.L7:
        movzbl  %dl, %eax
        negl    %eax
        ret

And here is the one without the unnecessary casts:

STRCMP:
.L4:
        addq    $1, %rdi
        movzbl  -1(%rdi), %eax
        addq    $1, %rsi
        movzbl  -1(%rsi), %edx
        testb   %al, %al
        je      .L7
        cmpb    %dl, %al
        je      .L4
        subl    %edx, %eax
        ret
.L7:
        movzbl  %dl, %eax
        negl    %eax
        ret

They are identical.
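The answer shows only the assembly, not the C source of the second variant, so the following is a guess at what "without the unnecessary casts" might look like. The function name strcmp_pointer_casts_only and the helpers sign_of and check_against_strcmp are made up for illustration. Only the pointer casts are kept, and the values read through them are assigned without a further cast.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical name; same loop as the glibc code quoted in the question,
   minus the casts on the values read through s1 and s2. */
int strcmp_pointer_casts_only(const char *p1, const char *p2)
{
  const unsigned char *s1 = (const unsigned char *) p1;
  const unsigned char *s2 = (const unsigned char *) p2;
  unsigned char c1, c2;

  do
    {
      c1 = *s1++;   /* *s1 is already an unsigned char */
      c2 = *s2++;
      if (c1 == '\0')
        return c1 - c2;
    }
  while (c1 == c2);

  return c1 - c2;
}

/* Reduce a comparison result to -1, 0 or 1; only the sign is specified. */
static int sign_of(int v)
{
  return (v > 0) - (v < 0);
}

/* Compare the sketch against the library strcmp on a few inputs. */
void check_against_strcmp(void)
{
  static const char *const samples[] = { "", "a", "abc", "abd", "ab", "b" };
  const size_t n = sizeof samples / sizeof samples[0];

  for (size_t i = 0; i < n; i++)
    for (size_t j = 0; j < n; j++)
      assert(sign_of(strcmp_pointer_casts_only(samples[i], samples[j]))
             == sign_of(strcmp(samples[i], samples[j])));
}
```

Calling check_against_strcmp() from a main function should pass silently if the sketch agrees with the library on these inputs.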

There is one catch, though, which is mostly of historical interest now. If char is signed and the signed representation is not two's complement,

*(const unsigned char *)p1

and

(unsigned char)*p1

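are not equivalent. The former reinterprets the bit pattern, whereas the latter converts the value using modulo arithmetic. This is of historical interest only, because not even GCC supports any architecture whose signed representation is not two's complement, and GCC is the compiler with the most ports.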
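To make the two expressions concrete, here is a small sketch that checks they agree for every char value. It assumes a two's complement platform, which in practice means anything GCC targets. The helper names via_pointer_cast, via_value_cast and check_casts_agree are made up for illustration.

```c
#include <assert.h>
#include <limits.h>

/* Reinterpret the stored byte: read the object through an
   unsigned char pointer. */
static unsigned char via_pointer_cast(const char *p)
{
  return *(const unsigned char *) p;
}

/* Convert the value: read it as char, then convert to unsigned char,
   which reduces it modulo UCHAR_MAX + 1. */
static unsigned char via_value_cast(const char *p)
{
  return (unsigned char) *p;
}

/* Assumption: two's complement. On such a platform both paths give the
   same result for every possible char value. */
void check_casts_agree(void)
{
  for (int v = CHAR_MIN; v <= CHAR_MAX; v++)
    {
      char c = (char) v;
      assert(via_pointer_cast(&c) == via_value_cast(&c));
    }
}
```

On a sign-magnitude or ones' complement machine, the assertion would be expected to fail for negative values, which is the catch described above.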

Answer 2

What I didn't understand is the use of s1 and s2 variables. I mean other than the fact that they are unsigned char they are also const like the 2 arguments p1 and p2, so why not just use the p1 and p2 inside the body of while and cast them?

For readability: to make the code easier for us humans to maintain.

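If you look at the glibc sources, the code tends to favor readability over terse expressions. That seems to be a good policy, since it has kept the project relevant and alive (actively maintained) for over 30 years.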

Does in this case using those 2 extra variables make the function somehow more optimized?

No, not at all.

I would also like to know if there is any particular reason why glibc used those extra variables instead of casting the parameters p1 and p2 directly inside while.

Only readability.

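The authors know that the C compilers in use should optimize this code just fine. (This is easy to verify by looking at the code the compiler generates. With GCC, you can use the -S option, or you can use objdump -d from binutils to inspect an object file or a binary executable.)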

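Note that the cast to unsigned char is needed for exactly the same reason as with isspace(), isalpha(), and so on: the character codes being compared must be treated as unsigned char to get correct results.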
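A small sketch of why the unsigned interpretation matters. The C standard says the sign of strcmp's result is determined by the first differing pair of characters, both interpreted as unsigned char. The helper names below are made up for illustration, and the signed result in the comment assumes two's complement.

```c
#include <assert.h>

/* Hypothetical helper: difference of the first bytes, read as
   unsigned char, which is the interpretation strcmp must use. */
int first_byte_diff_unsigned(const char *a, const char *b)
{
  return *(const unsigned char *) a - *(const unsigned char *) b;
}

/* Hypothetical helper: the same difference with the bytes read as
   signed char, to show what goes wrong. */
int first_byte_diff_signed(const char *a, const char *b)
{
  return *(const signed char *) a - *(const signed char *) b;
}

void check_high_byte_ordering(void)
{
  const char *high = "\x80";   /* byte value 128 */
  const char *low  = "\x7f";   /* byte value 127 */

  /* Unsigned: 128 - 127 = 1, so "\x80" sorts after "\x7f". */
  assert(first_byte_diff_unsigned(high, low) > 0);

  /* Signed (two's complement): -128 - 127 = -255, the wrong order. */
  assert(first_byte_diff_signed(high, low) < 0);
}
```

Any byte with the high bit set, for example in UTF-8 or Latin-1 text, would sort before plain ASCII if the comparison were done on signed values.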
