Printing Latin characters in Linux terminal using `std::wstring` and `std::wcout`

Question

I'm writing C++ on Linux (Ubuntu) and trying to print a string that contains some Latin characters.

While trying to debug, I ended up with the following:

std::wstring foo = L"ÆØÅ";
std::wcout << foo;
for(int i = 0; i < foo.length(); ++i) {
    std::wcout << std::hex << (int)foo[i] << " ";
    std::wcout << (char)foo[i];
}

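The output I get:

The first print shows: ???
The loop prints the hex values of the three characters as c6 d8 c5
When foo[i] is cast to char (or wchar_t), nothing is printed

The environment variable $LANG is set to the default, en_US.UTF-8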

Answer 1

In the conclusion of the answer I linked (which I still recommend reading) we can find:

When I should use std::wstring over std::string?

On Linux? Almost never, unless you use a toolkit/framework.

A short explanation why:

First of all, Linux natively uses UTF-8 and is consistent about it (in contrast to e.g. Windows, where files have one encoding and cmd.exe another).

Now let's have a look at this simple program:

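#include <cstddef>
#include <iostream>
#include <string>

int main()
{
    std::string  foo =  "ψA"; // character 'A' is just control sample
    std::wstring bar = L"ψA"; // --

    // Code units of the narrow string (UTF-8 bytes, shown as signed char values)
    for (std::size_t i = 0; i < foo.length(); ++i) {
        std::cout << static_cast<int>(foo[i]) << " ";
    }
    std::cout << std::endl;

    // Code units of the wide string; printed through std::cout as well,
    // because mixing std::cout and std::wcout on stdout is unreliable
    for (std::size_t i = 0; i < bar.length(); ++i) {
        std::cout << static_cast<int>(bar[i]) << " ";
    }
    std::cout << std::endl;

    return 0;
}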

The output is:

-49 -120 65 
968 65

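What does it tell us? 65 is the ASCII code of the character 'A', which means that -49 -120 and 968 both correspond to 'ψ'.

In the case of char, the character 'ψ' actually takes two chars. In the case of wchar_t, it is just one wchar_t.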

Let's also check the sizes of those types:

std::cout << "sizeof(char)    : " << sizeof(char)    << std::endl;
std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << std::endl;

Output:

sizeof(char)    : 1
sizeof(wchar_t) : 4

One byte on my machine is the standard 8 bits. So char is 1 byte (8 bits), while wchar_t is 4 bytes (32 bits).

UTF-8 operates, nomen omen, on code units of 8 bits. There is also a fixed-length encoding, UTF-32, which uses exactly 32 bits (4 bytes) per Unicode code point, but it is UTF-8 that Linux uses.

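Ergo, the terminal expects to receive those two negative values in order to print the character 'ψ', not a single value far above the ASCII table (whose codes are defined only up to 127, half of the possible values of char).

That's why std::cout << char(-49) << char(-120); will also print ψ.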

But it shows the const char[] as printing correctly. But when I typecast to (char), nothing is printed.

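The characters are already encoded differently; the values stored there are different, and a simple cast is not enough to convert them.

And as I showed, the size of char is 1 byte and the size of wchar_t is 4 bytes. You can safely cast up, but not down.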
