How do I print a Unicode character in C encoded with UTF-8?

Question

I am trying to print the magnifying glass (http://www.fileformat.info/info/unicode/char/1f50e/index.htm), but I get this error:

[niko@dev1 ncurses]$ gcc -o utf8 -std=c99 $(ncursesw5-config --cflags --libs) utf8.c 
utf8.c: In function ‘main’:
utf8.c:12:10: error: \ud83d is not a valid universal character
   printw("\ud83ddd0e\n");         // escaped Unicode 
          ^
[niko@dev1 ncurses]$ cat utf8.c
#include <locale.h>
#include <curses.h>
#include <stdlib.h>


int main (int argc, char *argv[])
{
  setlocale(LC_ALL, "");

  initscr();

  printw("\ud83ddd0e\n");         // escaped Unicode 

  getch();
  endwin();

  return EXIT_SUCCESS;
}

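What is the issue here? For example, if I have a decimal code, which for the magnifying glass is 55357, how would I print it to the ncurses screen with printf? (Without using wchar_t, because it wastes a lot of memory.)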

Answer 1

You should not encode the string as UTF-16 (\ud8..\udd..) but as UTF-8. To find the code point behind the surrogate pair, run this command:

perl -e 'print pack("H*","d83ddd0e")' | iconv -f UTF-16BE -t UTF-32BE | hexdump -C

The output shows that your character is U+0001F50E. To insert this character into your C code, use the \U escape sequence, with a capital U.

"\U0001F50E"

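By the way, your number 55357 (0xD83D) is not the magnifying glass (U+1F50E). It is only the first half, the high surrogate, of the UTF-16 encoding of the magnifying glass.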
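The arithmetic behind that conversion can be sketched in C. This is a minimal illustration, assuming the two UTF-16 halves are available as 16-bit values; the function name combine_surrogates is made up for this example and is not part of any library.

```c
#include <stdint.h>

/* Combine a UTF-16 surrogate pair into a single Unicode code point.
   Returns 0 if the arguments are not a valid high/low surrogate pair.
   (Hypothetical helper, named for illustration only.) */
uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
    if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF)
        return 0;

    /* Each half carries 10 bits; the result is offset by 0x10000. */
    return 0x10000u
         + (((uint32_t)high - 0xD800u) << 10)
         + ((uint32_t)low - 0xDC00u);
}
```

Calling combine_surrogates(0xD83D, 0xDD0E), that is 55357 and 56590 in decimal, yields 0x1F50E, the code point to write as \U0001F50E.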

Answer 2

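The information on fileformat.info is wrong for C. The escape shown on the page is \ud83d\udd0e. That is the UTF-16 surrogate pair used in Java, but it does not work in C, because GCC requires each \u escape to denote a Unicode code point on its own, and half of a surrogate pair is not one. The C standard likewise excludes the range D800 through DFFF from universal character names.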

You should instead use \U (uppercase) followed by 8 hexadecimal digits, so U+1F50E becomes \U0001F50E. This escape works with printf.
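To see what the \U escape actually produces, here is a small sketch. It assumes the compiler's execution character set is UTF-8 (the GCC default); the function name magnifier is hypothetical.

```c
#include <string.h>

/* Returns the magnifying glass as a narrow string literal.
   With a UTF-8 execution character set the compiler stores
   the four bytes F0 9F 94 8E followed by a terminating NUL. */
const char *magnifier(void)
{
    return "\U0001F50E";
}

/* Number of bytes (not characters) in the literal. */
size_t magnifier_length(void)
{
    return strlen(magnifier());
}
```

Under that assumption the literal is an ordinary 4-byte char string, so it can be passed to printf or printw like any other string, with no wchar_t involved.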

[Image: correct output]

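P.S.: If you see ~_~T~N instead of the magnifying glass, make sure that you have called setlocale and that you are actually linking against -lncursesw. Failing to do either one means garbage will be printed.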
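One way to check the locale half of that advice is sketched below. It assumes a POSIX system that provides nl_langinfo; the helper names are made up for illustration, and the exact codeset spelling reported by a platform may differ from the two checked here.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Returns 1 if the given codeset name denotes UTF-8, else 0.
   Only two common spellings are recognized in this sketch. */
int codeset_is_utf8(const char *codeset)
{
    return strcmp(codeset, "UTF-8") == 0 || strcmp(codeset, "utf8") == 0;
}

/* Adopt the locale from the environment, as the program in the
   question does, and report whether its codeset is UTF-8. */
int locale_is_utf8(void)
{
    if (setlocale(LC_ALL, "") == NULL)
        return 0;   /* the environment names an unavailable locale */
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

If locale_is_utf8() returns 0 before initscr() is called, the garbage output is probably a locale problem rather than a linking problem.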

Answer 3

You can print a wchar_t with putwchar (see http://www.cplusplus.com/reference/cwchar/putwchar/), but I don't think it works with UTF-16 surrogate pairs.

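In any case, printing Unicode text to a terminal is always undefined behavior. On Unix systems most terminals emulate a VT-100 and are only guaranteed to support 7-bit ASCII text. (That is why the isprint function exists.)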

Your best option is to render the text to a surface or pixmap in a graphical application, using a library such as freetype2 or cairo+pango.

Answer 4

Some clarification is needed, because the OP asked more than one question:

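"What is the issue here?"

Antti Haapala answered the important part, dealing with the incorrectly represented character.

"For example, if I have a decimal code, which for the magnifying glass is 55357, how would I print it to the ncurses screen with printf? (Without using wchar_t, because it wastes a lot of memory.)"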

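No one answered that. The comment about wasting memory overlooks the fact that ncurses (that is, ncursesw) stores all of its information in complex characters, which are larger than wide characters (wchar_t).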

printw is analogous to printf, but not identical. To see this, the printw manual page says

The printw, wprintw, mvprintw and mvwprintw routines are analogous to printf [see printf(3)]. In effect, the string that would be output by printf is output instead as though waddstr were used on the given window.

To understand what "analogous" means, a dictionary might help (part of its meaning is "similar", but those are not synonymous). But following the link to the waddstr manual page:

These functions write the (null-terminated) character string str on the given window. It is similar to calling waddch once for each character in the string.

Again, "similar" is no guarantee of identical behavior. The PORTABILITY section of the waddch manual page gives more information. Among other things, it tells what translations it will do for control and nonprinting characters. Also (the point) is that waddch in ncurses accepts a multibyte (read: "UTF-8") string and will display that if the locale and terminal support that. That's different from X/Open Curses, as discussed in the Character Set subsection.

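Those \u escapes tell gcc to pass a UTF-8 string, which happens to work with ncurses. Standards-minded people will equivocate about whether that is guaranteed to work with printf, but let's not get bogged down in that.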
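For the case where the code point is only known at run time as a number, one option is to build the UTF-8 bytes by hand and pass them as an ordinary char string. The following is a minimal sketch of that idea; utf8_encode is a hypothetical helper, not an ncurses or libc function.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as a NUL-terminated UTF-8 string.
   out must have room for 5 bytes. Returns the number of bytes written
   (excluding the NUL), or 0 for surrogates and out-of-range values. */
size_t utf8_encode(uint32_t cp, char out[5])
{
    size_t n = 0;

    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
        out[0] = '\0';              /* not a valid code point */
        return 0;
    }
    if (cp < 0x80) {
        out[n++] = (char)cp;
    } else if (cp < 0x800) {
        out[n++] = (char)(0xC0 | (cp >> 6));
        out[n++] = (char)(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out[n++] = (char)(0xE0 | (cp >> 12));
        out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[n++] = (char)(0x80 | (cp & 0x3F));
    } else {
        out[n++] = (char)(0xF0 | (cp >> 18));
        out[n++] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[n++] = (char)(0x80 | (cp & 0x3F));
    }
    out[n] = '\0';
    return n;
}
```

After utf8_encode(0x1F50E, buf) the buffer holds the bytes F0 9F 94 8E, so a call such as printw("%s\n", buf) should display the magnifying glass, provided the locale and terminal support UTF-8. Passing 55357 returns 0, because that value is a lone surrogate rather than a code point.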

By the way, there is no equivalent of printw that takes wchar_t arrays.
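A possible workaround, if wide strings are wanted anyway, is to do the formatting with the standard swprintf and hand the result to a wide-string output routine. The sketch below shows only the formatting step; the function name format_found and the message text are invented for illustration, and it assumes a wchar_t wide enough to hold U+1F50E in one unit (as on glibc).

```c
#include <stddef.h>
#include <wchar.h>

/* Format a message into a caller-supplied wide-character buffer.
   len is the buffer size in wide characters, including the NUL.
   Returns the number of wide characters written, or a negative
   value if the buffer is too small or formatting fails. */
int format_found(wchar_t *buf, size_t len, int count)
{
    return swprintf(buf, len, L"\U0001F50E found %d matches", count);
}
```

With ncursesw, the filled buffer could then be passed to addwstr(buf), which writes a wchar_t string to the window without any formatting of its own.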


Tags: c, unicode, ncurses, utf-8