Why do functions from wctype.h not work without setlocale()?

Question

My setup: glibc 2.24, gcc 6.2.0, UTF-8 environment.

Consider the following example:

#include <wchar.h>
#include <wctype.h>
#include <locale.h>
int main(void)
{
  setlocale(LC_CTYPE, "en_US.UTF-8");
  wchar_t wc = L'я'; /* 00000100 01001111 */
  if (iswlower(wc)) return 0;
  return 1;
}

Compile and run it:

$ gcc test.c
$ ./a.out; echo $?
0

Now remove the setlocale() call and run it again. The result is different:

$ gcc test.c
$ ./a.out; echo $?
1

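Technically, setlocale() should not be needed here, because the functions from wctype.h operate on wide characters, which have a fixed encoding. (It goes without saying that setlocale() is required if we want the functions from ctype.h to handle non-ASCII characters correctly, or if we use the character conversion functions from wchar.h, which depend on the external encoding.)

Why doesn't this example work without setlocale()?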

Answer 1

The C standard says:

7.25 Wide character classification and mapping utilities <wctype.h>

...

The behavior of these functions is affected by the LC_CTYPE category of the current locale.

Furthermore (5.2.1 Character sets):

Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters.

And then (7.19 Common definitions <stddef.h>):

wchar_t which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales

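So there may be many extended character sets, one per locale. Consequently, the wchar_t encoding may be locale-dependent, because an encoding is a mapping between a set of integer codes and a set of characters, and the latter may depend on the locale.

Given the above, <wctype.h> has to be locale-dependent. Otherwise the standard would have to mandate a single locale-independent extended character set.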
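An implementation can still tell you whether its wchar_t codes are fixed: C99 and later define the macro __STDC_ISO_10646__ when wchar_t values are ISO 10646 code points. A minimal sketch of checking this follows; the helper names wchar_is_iso10646 and ya_code are made up for illustration and are not part of any library.

```c
#include <wchar.h>

/* Hypothetical helper: returns 1 if the implementation promises that
 * wchar_t values are ISO 10646 (Unicode) code points, 0 otherwise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: the integer code the compiler assigns to the
 * wide character constant for U+044F (CYRILLIC SMALL LETTER YA). */
long ya_code(void)
{
    return (long)L'\u044f';
}
```

When wchar_is_iso10646() returns 1, ya_code() is 0x044F no matter which locale is active. Note that the macro only says something about the encoding of wchar_t; it promises nothing about how <wctype.h> classifies those codes in a given locale.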

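In this particular example, the value of the wide character constant L'я' (some integer code) may or may not correspond to any member of the extended character set under the C locale.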

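As for the specific behaviour of gcc and glibc, they always use Unicode/ISO 10646/UCS-4 as the extended character set in any locale, for simplicity. However, they do not classify extended characters under the C locale, because they don't have to, and the standard permits that. (A wild guess follows.) The full Unicode classification tables are large, and programs that only need ASCII should not have to pay for them.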
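To compare the two cases side by side, here is a rough sketch of a helper that classifies one wide character under a named locale and then restores the previous one. The name lower_in_locale is hypothetical, and the sketch assumes the locale name fits in a 256-byte buffer.

```c
#include <locale.h>
#include <string.h>
#include <wctype.h>

/*
 * Hypothetical helper, not a library function: temporarily switch
 * LC_CTYPE to `name`, ask iswlower() about `wc`, then restore the
 * locale that was active before the call.
 * Returns 1 if lowercase, 0 if not, -1 if the locale cannot be selected.
 */
int lower_in_locale(const char *name, wint_t wc)
{
    char saved[256];
    const char *current = setlocale(LC_CTYPE, NULL); /* query, no change */

    if (current == NULL || strlen(current) >= sizeof saved)
        return -1;
    strcpy(saved, current); /* the returned string may be overwritten later */

    if (setlocale(LC_CTYPE, name) == NULL)
        return -1; /* locale not available; nothing was changed */

    int result = iswlower(wc) != 0;
    setlocale(LC_CTYPE, saved); /* restore the caller's locale */
    return result;
}
```

With glibc, lower_in_locale("C", 0x044F) should give 0, matching the question, while lower_in_locale("en_US.UTF-8", 0x044F) should give 1 if that locale is installed. Other C libraries may classify differently under the C locale.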
