Displaying wide chars with printf

Question

I would like to understand how printf handles wide characters (wchar_t).

I wrote the following code examples:

Example 1:

#include <stdio.h>
#include <stdlib.h>

int     main(void)
{
    wchar_t     *s;

    s = (wchar_t *)malloc(sizeof(wchar_t) * 2);
    s[0] = 42;
    s[1] = 0;
    printf("%ls\n", s);
    free(s);
    return (0);
}

Output:

*

Everything works here: my character (*) is displayed correctly.

Example 2:

I wanted to display another kind of character. On my system, wchar_t seems to be 4 bytes wide, so I tried to display the following character: É

#include <stdio.h>
#include <stdlib.h>

int     main(void)
{
    wchar_t     *s;

    s = (wchar_t *)malloc(sizeof(wchar_t) * 2);
    s[0] = 0xC389;
    s[1] = 0;
    printf("%ls\n", s);
    free(s);
    return (0);
}

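But this time there is no output. I tried many values for s[0] taken from the "Encodings" section (see the previous link): 0xC389, 201, 0xC9... but the É character is never displayed. I also tried %S instead of %ls.

If I call printf like this: printf("<%ls>\n", s), the only character printed is '<'; the output is truncated.

Why do I have this problem? What should I do?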

Answer 1

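One problem is that you are trying to store a UTF-8 sequence in a wide character. UTF-8 is a multibyte encoding whose code units are single bytes, so 0xC389 is really the two bytes 0xC3 and 0x89, not one wide-character value. For UTF-8 you use plain char.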

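Also note that because you are packing a UTF-8 sequence into a multi-byte type, you run into endianness (byte order) issues: on a little-endian machine 0xC389 is stored in memory as 0x89 followed by 0xC3, in that order. Sign extension can bite as well if the value first passes through a signed 16-bit type (with sizeof(wchar_t) == 4 you would then see 0xFFFFC389 for s[0] in a debugger); the literal 0xC389 assigned directly to a 4-byte wchar_t stays 0x0000C389.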

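Another possible problem is the terminal or console you are printing to. Maybe it simply doesn't support UTF-8, or the other encodings you tried?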
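To illustrate the point about using plain char for UTF-8, here is a minimal sketch. The helpers utf8_encode and print_code_point are hypothetical names made up for this example, not library functions. The sketch turns a Unicode code point into its UTF-8 bytes and prints them with the byte-oriented %s conversion, assuming the terminal interprets its input as UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not a library function): writes the UTF-8 encoding
 * of code point cp into out, which must hold at least 5 bytes, and
 * NUL-terminates it. Returns the number of bytes written (excluding the
 * terminator), or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(unsigned long cp, char out[5])
{
    assert(out != NULL);
    size_t n = 0;

    if (cp < 0x80) {                              /* 1 byte: plain ASCII */
        out[n++] = (char)cp;
    } else if (cp < 0x800) {                      /* 2 bytes */
        out[n++] = (char)(0xC0 | (cp >> 6));
        out[n++] = (char)(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                    /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF) {       /* surrogates are invalid */
            out[0] = '\0';
            return 0;
        }
        out[n++] = (char)(0xE0 | (cp >> 12));
        out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[n++] = (char)(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {                  /* 4 bytes */
        out[n++] = (char)(0xF0 | (cp >> 18));
        out[n++] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[n++] = (char)(0x80 | (cp & 0x3F));
    } else {                                      /* beyond Unicode range */
        out[0] = '\0';
        return 0;
    }
    out[n] = '\0';
    return n;
}

/* Hypothetical helper: prints one code point as UTF-8 via %s. The bytes
 * are passed through unchanged, so no setlocale call is involved.
 * Returns printf's result, or -1 if cp cannot be encoded. */
int print_code_point(unsigned long cp)
{
    char buf[5];

    if (utf8_encode(cp, buf) == 0)
        return -1;
    return printf("%s\n", buf);
}
```

With this sketch, utf8_encode(0xC9, buf) fills buf with the bytes 0xC3 0x89, which is where the 0xC389 in the question comes from: it is the UTF-8 byte sequence for É, not its wide-character value.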

Answer 2

Why do I have this problem?

Make sure you check errno and the return value of printf!

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    wchar_t *s;
    s = (wchar_t *) malloc(sizeof(wchar_t) * 2);
    s[0] = 0xC389;
    s[1] = 0;

    if (printf("%ls\n", s) < 0) {
        perror("printf");
    }

    free(s);
    return (0);
}

See the output:

$ gcc test.c && ./a.out
printf: Invalid or incomplete multibyte or wide character

How to fix

First of all, the default locale of a C program is C (also known as POSIX), which is ASCII-only. You need to add a call to setlocale, specifically setlocale(LC_ALL, "").

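If your LC_ALL, LC_CTYPE, or LANG environment variables do not select a UTF-8 locale, so that the empty string does not give you one, you will have to select a locale explicitly. setlocale(LC_ALL, "C.UTF-8") works on many systems: the C locale is standard, and a UTF-8 variant of it is commonly provided, though it is not guaranteed to exist.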
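The fallback described above could be sketched roughly as follows. The function name select_utf8_locale is hypothetical, and the sketch assumes a POSIX system, since nl_langinfo(CODESET) comes from POSIX rather than ISO C. It also assumes the platform reports the codeset of a UTF-8 locale as exactly "UTF-8", which is common but not guaranteed.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: tries the locale named by the environment first,
 * then the given fallback (for example "C.UTF-8"). Returns the name of
 * the first locale whose character encoding is reported as "UTF-8", or
 * NULL if neither qualifies. On NULL, whichever locale was successfully
 * set last remains in effect. */
const char *select_utf8_locale(const char *fallback)
{
    assert(fallback != NULL);
    const char *const candidates[] = { "", fallback };

    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        /* setlocale returns NULL when the requested locale is unavailable. */
        const char *name = setlocale(LC_ALL, candidates[i]);

        if (name != NULL && strcmp(nl_langinfo(CODESET), "UTF-8") == 0)
            return name;
    }
    return NULL;
}
```

A program would call select_utf8_locale("C.UTF-8") once at startup and report an error if it returns NULL, instead of silently printing nothing.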

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
    wchar_t *s;
    s = (wchar_t *) malloc(sizeof(wchar_t) * 2);
    s[0] = 0xC389;
    s[1] = 0;

    setlocale(LC_ALL, "");

    if (printf("%ls\n", s) < 0) {
        perror("printf");
    }

    free(s);
    return (0);
}

See the output:

$ gcc test.c && ./a.out
쎉

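The wrong character is printed because wchar_t represents wide characters (such as UTF-32), not multibyte characters (such as UTF-8). Note that wchar_t is always 32 bits wide in the GNU C Library, but the C standard does not require this. If you initialize the character with its UTF-32 value, that is, the Unicode code point U+00C9 (0x000000C9), then it prints correctly: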

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
    wchar_t *s;
    s = (wchar_t *) malloc(sizeof(wchar_t) * 2);
    s[0] = 0xC9;
    s[1] = 0;

    setlocale(LC_ALL, "");

    if (printf("%ls\n", s) < 0) {
        perror("printf");
    }

    free(s);
    return (0);
}

Output:

$ gcc test.c && ./a.out
É

Note that you can also set the LC (locale) environment variables from the command line:

$ export LC_ALL=C.UTF-8
$ ./a.out
É

Answer 3

I found a simple way to print wide characters. The key point is setlocale():

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(int argc, char *argv[])
{
    setlocale(LC_ALL, "");
    // setlocale(LC_ALL, "C.UTF-8"); // this also works

    wchar_t hello_eng[] = L"Hello World!";
    wchar_t hello_china[] = L"世界, 你好!";
    wchar_t *hello_japan = L"こんにちは日本!";
    printf("%ls\n", hello_eng);
    printf("%ls\n", hello_china);
    printf("%ls\n", hello_japan);

    return 0;
}
