Is the strcasecmp algorithm flawed?
I am trying to reimplement the strcasecmp function in C, and I noticed what appears to be an inconsistency in the comparison process.
From man strcmp:
The strcmp() function compares the two strings s1 and s2. The locale is not taken into account (for a locale-aware comparison, see strcoll(3)).
It returns an integer less than, equal to, or greater than zero if s1 is found, respectively, to be less than, to match, or be greater than s2.
From man strcasecmp:
The strcasecmp() function performs a byte-by-byte comparison of
the strings s1 and s2, ignoring the case of the characters. It returns
an integer less than, equal to, or greater than zero if s1 is found,
respectively, to be less than, to match, or be greater than s2.
int strcmp(const char *s1, const char *s2);
int strcasecmp(const char *s1, const char *s2);
Given this information, I don't understand the result of the following code:
#include <stdio.h>
#include <string.h>
#include <strings.h>  /* POSIX header that declares strcasecmp() */

int main(void)
{
    // ASCII values
    // 'A' = 65
    // '_' = 95
    // 'a' = 97
    printf("%i\n", strcmp("A", "_"));
    printf("%i\n", strcmp("a", "_"));
    printf("%i\n", strcasecmp("A", "_"));
    printf("%i\n", strcasecmp("a", "_"));
    return 0;
}
Output:
-1 # "A" is less than "_"
1 # "a" is more than "_"
2 # "A" is more than "_" with strcasecmp ???
2 # "a" is more than "_" with strcasecmp
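It appears that, if the current character in s1 is a letter, it is always converted to lowercase, regardless of whether the current character in s2 is a letter or not.
Can someone explain this behavior? Shouldn't the first and third lines be the same?
Thank you in advance!
PS: I am using gcc 9.2.0 on Manjaro.
Also, when I compile with the -fno-builtin flag, I get this instead: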
-30
2
2
2
I guess that is because the program is no longer using gcc's optimized built-in functions, but the question remains.
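The ASCII decimal code of A is 65, _ is 95, and a is 97, so strcmp() is doing what it is supposed to do. Lexicographically, _ is smaller than a and greater than A.
strcasecmp() treats A as a*, and since a is greater than _, that output is correct as well.
*The POSIX.1-2008 standard says of these functions (strcasecmp() and strncasecmp()):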
When the LC_CTYPE category of the locale being used is from the POSIX locale, these functions shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified.
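Since the question is about reimplementing the function, here is a minimal sketch of the "as if converted to lowercase, then byte comparison" rule quoted above. It assumes the default "C"/POSIX locale, and the name sketch_strcasecmp is made up for this example; it is not how any particular libc implements the function.

```c
#include <ctype.h>

/* Hypothetical helper: lowercase BOTH characters unconditionally,
   then compare them as unsigned bytes. */
int sketch_strcasecmp(const char *s1, const char *s2)
{
    /* tolower() requires a value representable as unsigned char (or EOF) */
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;
    int c1, c2;

    do {
        c1 = tolower(*p1++);
        c2 = tolower(*p2++);
    } while (c1 == c2 && c1 != '\0');

    /* Only the sign of the result is meaningful to callers. */
    return c1 - c2;
}
```

With this sketch, comparing "A" with "_" gives 'a' - '_' = 97 - 95 = 2, which matches the third line of the output in the question.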
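Other links, such as http://man7.org/linux/man-pages/man3/strcasecmp.3p.html for strcasecmp, say that converting to lowercase is the correct behavior (at least in the POSIX locale).
The reason for this behavior is that if you use strcasecmp to sort an array of strings, you need it in order to get sensible results.
Otherwise, if you tried to sort "A", "C", "_", "b" using e.g. qsort, the result would depend on the order in which the comparisons happen to be made.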
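To make the sorting case concrete, here is a small sketch that sorts such an array with qsort and strcasecmp. It assumes a POSIX system where strcasecmp() is declared in <strings.h> and a program running in the default "C" locale; the names compare_casefold and sort_casefold are made up for this example.

```c
#include <stdlib.h>
#include <strings.h>

/* qsort() hands the comparator pointers to the array elements,
   so each argument is really a pointer to a (const char *). */
static int compare_casefold(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcasecmp(*sa, *sb);
}

/* Hypothetical wrapper: sort an array of strings, ignoring case. */
void sort_casefold(const char **items, size_t count)
{
    qsort(items, count, sizeof items[0], compare_casefold);
}
```

Under those assumptions, sorting "A", "C", "_", "b" yields "_", "A", "b", "C", because every string is ordered as if it had been lowercased first and '_' (95) sorts before 'a' (97).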
The behavior is correct.
Per the POSIX str[n]casecmp() specification:
When the LC_CTYPE category of the locale being used is from the POSIX locale, these functions shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified.
That is also part of the NOTES section of the Linux man page:
The POSIX.1-2008 standard says of these functions:
When the LC_CTYPE category of the locale being used is from
the POSIX locale, these functions shall behave as if the
strings had been converted to lowercase and then a byte
comparison performed. Otherwise, the results are unspecified.
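Why?
Doing a case-insensitive comparison only between letters, and letting all other comparisons have their "natural" results as is done in strcmp(), would break sorting.
If 'A' == 'a' (the definition of a case-insensitive comparison), then '_' > 'A' and '_' < 'a' (the "natural" results in the ASCII character set) cannot both be true.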
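The contradiction can be sketched in code. The comparator below is a hypothetical variant (the name letters_only_casecmp is invented here) that folds case only when both characters are letters and otherwise compares raw byte values, which is the behavior the question expected. It assumes ASCII and the default "C" locale.

```c
#include <ctype.h>

/* Hypothetical comparator: fold case only when BOTH characters are
   letters; otherwise compare raw byte values as strcmp() would. */
int letters_only_casecmp(const char *s1, const char *s2)
{
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;

    for (;; p1++, p2++) {
        int c1 = *p1, c2 = *p2;
        if (isalpha(c1) && isalpha(c2)) {
            c1 = tolower(c1);
            c2 = tolower(c2);
        }
        /* Stop at the first difference or at the end of both strings. */
        if (c1 != c2 || c1 == '\0')
            return c1 - c2;
    }
}
```

This comparator reports "A" < "_" and "_" < "a", yet also "A" == "a". No ordering satisfies all three results at once, so a sort routine given this comparator cannot produce a well-defined result.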
It appears that, if the current character in s1 is a letter, it is
always converted to lowercase, regardless of whether the current
character in s2 is a letter or not.
That's correct, and it is what the strcasecmp() function should do! It is a POSIX function, rather than part of the C Standard, but, from "The Open Group Base Specifications, Issue 6":
In the POSIX locale, strcasecmp() and strncasecmp() shall behave as if
the strings had been converted to lowercase and then a byte comparison
performed. The results are unspecified in other locales.
Incidentally, this behavior also applies to the _stricmp() function (as used in Visual Studio/MSVC):
The _stricmp function ordinally compares string1 and string2 after
converting each character to lowercase, and returns a value indicating
their relationship.
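Because the two functions are documented with the same lowercase-then-compare behavior, code that has to build on both platforms often selects one of them at compile time. The following is only a sketch of that idea: the macro name portable_casecmp is made up for this example, and it assumes _WIN32 is defined by the Windows toolchain and that <strings.h> is available elsewhere.

```c
#ifdef _WIN32
#include <string.h>             /* MSVC declares _stricmp() here */
#define portable_casecmp _stricmp
#else
#include <strings.h>            /* POSIX declares strcasecmp() here */
#define portable_casecmp strcasecmp
#endif

/* Hypothetical helper: nonzero if the strings are equal ignoring case. */
int equal_ignoring_case(const char *s1, const char *s2)
{
    return portable_casecmp(s1, s2) == 0;
}
```

As with the original functions, callers should rely only on the sign of the comparison result, not on its exact value.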