
Display large UTF-8-encoded strings on standard output decently, despite Windows or MinGW bugs

2nd Update: I found a very simple solution to this actually not-that-hard problem, only one day after asking. But people seem to be small-minded, as there are already three close votes:

  1. Duplicate of "How to use unicode characters in Windows command line?" (1x):

    Obviously not, as has been clarified in the comments. This is not about the Windows command-line tool, which I do not use.

  2. Unclear what you're asking (1x):

    Then you must suffer from functional analphabetism. I cannot be any more concrete when I ask, for example, "Is there an easy way to determine whether a char in a std::string is a non-ending part of a UTF-8 symbol?" (marked bold for better visibility, indeed) and state that this would be sufficient to answer the question (and even explain why). Seriously, there are even pictures to show the problem. Furthermore, my own existing answer should clarify it even more. Your own deficiencies are not sufficient to declare something too hard to understand.

  3. Too broad (1x) ("Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer [...]"):

    This must be another instance of functional analphabetism. I stated clearly that a single way to solve the problem (which I have already found) is sufficient. You can identify an adequate answer as follows: take a look at the answer of my own. Alternatively, use your brain to interpret my well-defined words, if you are able to, which several people on this platform unfortunately seem not to be.

There is, however, an actual reason to close this question: it has already been solved. But there is no such option for a close vote. So, clearly, Stack Exchange accepts that alternative solutions may still be found. Since I am a curious person, I am also interested in alternative ways to solve this. If your lack of intelligence does not cope well with understanding what the problem is, and that it is quite relevant in certain environments (e.g. ones that use Windows, C++ in Eclipse CDT and UTF-8, but no Visual Studio and no Windows console), then you can just leave without standing in the way of other people satisfying their curiosity. Thanks!

1st Update: I used app.exe > out.txt 2>&1, which generates a file without these formatting issues. So the problem is that std::cout usually does this splitting, but the underlying control (which receives the char sequence) has to handle the reassembling correctly? (Unfortunately, nothing on Windows seems to handle it, except file streams. So I still need to circumvent this, preferably without writing to files first and displaying their content, which of course works.)

On the system I use (Windows 7; MinGW-w64 (GCC 8.1 for Windows)), there is a bug such that std::cout prints UTF-8 encoded strings before they have been reassembled, even though they are disassembled internally by std::cout when one large string is passed to it. The following code demonstrates the buggy behavior. Note, however, that the faulty display appears to be random, i.e. the way std::cout slices (equal) std::string objects is not the same for each execution of the program. But the problems always occur at indices that are multiples of 1024, which is the conclusion I have drawn.

#include <iostream>
#include <sstream>

void myFaultyOutput();
void simulatedFaultyBehavior();

int main()
{
    myFaultyOutput();
    //simulatedFaultyBehavior();
}

void myFaultyOutput() {
    std::stringstream ss; // Note that ss is built correctly (which could be shown by saving ss.str() to a file).
    ss << "...";
    for (int i = 0; i < 20; i++) {
        for (int j = 0; j < 341; j++)
            ss << u8"\u301A";
        ss << "\n..";
    }
    std::cout << ss.str() << std::endl; // Problem occurs here, with cout.
    // Note that converting ss.str() to UTF-16 std::wstring and using std::wcout results in std::wcout not
    // displaying anything, not even ASCII characters in the future (until restarting the application).
}

// To display the problem on well-behaved systems; just imagine the output would not contain the newlines, while the faultily formatted characters remain.
void simulatedFaultyBehavior() {
    std::stringstream ss;
    int amount = 2000;
    for (int j = 0; j < amount; j++)
        ss << u8"\u301A";
    std::string s = ss.str();
    std::cout << "s.length(): " << s.length() << std::endl; // amount * 3
    while (s.length() > 1024) {
        std::cout << s.substr(0, 1024) << std::endl;
        s = s.substr(1024);
    }
    std::cout << s << std::endl;
}

To circumvent this behavior, I would like to split the large strings (which I receive from an API) manually into parts of length smaller than 1024 characters (and then call std::cout on each of them separately). But I do not know which chars are actually just non-ending parts of a UTF-8 symbol, and the built-in Unicode converters do not seem reliable either (possibly also system-dependent?). Is there an easy way to determine whether a char in a std::string is a non-ending part of a UTF-8 symbol? The quote below explains why answering this question would be sufficient.

A UTF-8 character can, for example, consist of three chars. So if one splits a string into two parts, it should keep those three chars together. Otherwise, one has to do what the existing GUI controls are clearly not able to do consistently: reassemble UTF-8 characters that have been split into pieces.
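
For reference, the UTF-8 byte layout itself makes that check easy: every continuation byte (each non-first byte of a multi-byte symbol) has the fixed bit pattern 10xxxxxx, which neither ASCII bytes nor leading bytes ever have. Below is a minimal sketch of a boundary-safe splitter built on that property; the helper names isUtf8ContinuationByte, safeSplitPos and splitUtf8 are illustrative, not from any library.

#include <cstddef>
#include <string>
#include <vector>

// A UTF-8 continuation byte (any non-first byte of a multi-byte character)
// always has the fixed bit pattern 10xxxxxx.
bool isUtf8ContinuationByte(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Largest position <= maxLen at which s can be cut without splitting a
// multi-byte character: step left while the byte at the cut continues one.
std::size_t safeSplitPos(const std::string& s, std::size_t maxLen) {
    std::size_t pos = maxLen < s.length() ? maxLen : s.length();
    while (pos > 0 && pos < s.length() && isUtf8ContinuationByte(s[pos]))
        --pos;
    return pos;
}

// Splits s into chunks shorter than 1024 bytes, each ending on a character boundary.
std::vector<std::string> splitUtf8(const std::string& s) {
    std::vector<std::string> chunks;
    std::size_t start = 0;
    while (start < s.length()) {
        std::size_t len = safeSplitPos(s.substr(start), 1023);
        if (len == 0) // malformed input (continuation bytes at the start): give up on alignment
            len = 1;
        chunks.push_back(s.substr(start, len));
        start += len;
    }
    return chunks;
}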

Better ideas for solving the problem are of course also welcome (except for "Don't use Windows" / "Don't use UTF-8" / "Don't use cout").

Note that this question has nothing to do with the Windows console (I do not use it; the content is displayed in Eclipse and, optionally, in wxWidgets UI elements, both of which display UTF-8 correctly). It also has nothing to do with MSVC (as mentioned, I use the MinGW compiler). The code also notes that using std::wcout with UTF-16 does not work at all (due to another MinGW or Eclipse bug). The consequence of the bug is that the UI controls cannot handle what std::cout does (which may or may not be intentional). Furthermore, everything usually works fine, except for those UTF-8 symbols that are split into different characters (e.g. \u301A into \u0003 + \u001A) at indices that are multiples of 1024 (and only randomly so). This behavior already implies that most of the commenters' assumptions are wrong. Please consider the code, especially its comments, carefully instead of jumping to conclusions.

To clarify the display problem when calling myFaultyOutput():

By experimenting, I worked out a rather simple workaround which, to my surprise, nobody seemed to know (I could not find anything like it online).

N.m.'s attempted answer mentioned the platform-specific function _setmode, which was a good hint. What it does "by design" (according to this answer and this article) is set the file translation mode, which determines how the input and output streams of the process are handled. But at the same time, it renders the use of std::ostream / std::istream invalid and instead dictates the use of std::wostream / std::wistream for properly formatted input and output streams.

For example, using _setmode(_fileno(stdout), _O_U8TEXT) results in std::wcout now outputting std::wstring nicely as UTF-8, but std::cout printing garbage characters, even for ASCII arguments. However, I want to mainly use std::string, and especially std::cout, for output. As I mentioned, the failing formatting of std::cout is a rare case, so only in cases where I print strings that may cause this problem (potentially multi-char-encoded characters at indices of at least 1024) do I want to use a special output function, say coutUtf8String(string s).
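
To illustrate, here is a minimal sketch of this effect, assuming Windows with a CRT that provides _O_U8TEXT (the exact misbehavior of std::cout in that mode may differ between runtime versions):

#include <fcntl.h>  // _O_U8TEXT, _O_BINARY
#include <io.h>     // _setmode, _fileno
#include <cstdio>   // stdout
#include <iostream>

int main() {
    _setmode(_fileno(stdout), _O_U8TEXT);             // wide output is now emitted as UTF-8
    std::wcout << L"\u301A via wcout" << std::flush;  // displayed correctly
    std::cout << "ASCII via cout" << std::flush;      // prints garbage in this mode (as described above)
    std::wcout << std::flush;                         // flush before switching the mode back
    _setmode(_fileno(stdout), _O_BINARY);             // back to the untranslated default mode
    std::cout << "ASCII via cout again" << std::endl; // works again
    return 0;
}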

The default (untranslated) mode of _setmode is _O_BINARY, and we can switch modes temporarily. So why not simply switch to _O_U8TEXT, convert the UTF-8 encoded std::string object to a std::wstring, use std::wcout on it, and then switch back to _O_BINARY? To stay platform-independent, the usual std::cout call can be used when not on Windows. Here is the code:

#include <cstdint>  // uint32_t (used by utf8toWide below)
#include <iostream>
#include <string>
using namespace std;

wstring utf8toWide(const char* in); // Defined below.

#if defined(_WIN32) || defined(WIN32) || defined(__CYGWIN__)
#include <fcntl.h> // Also includes the non-standard file <io.h>
                   // (POSIX compatibility layer) to use _setmode on Windows NT.
#include <cstdio>  // stdout, used with _fileno below.
#ifndef _O_U8TEXT // Some GCC distributions such as TDM-GCC 9.2.0 require this explicit
                  // definition since, depending on __MSVCRT_VERSION__, they might
                  // not define it.
#define _O_U8TEXT 0x40000
#endif
#endif

void coutUtf8String(string s) {
#if defined(_WIN32) || defined(WIN32) || defined(__CYGWIN__)
    if (s.length() > 1024) {
        // Set translation mode of wcout to UTF-8, which renders cout unusable "by design"
        // (see https://developercommunity.visualstudio.com/solutions/411680/view.html).
        if (_setmode(_fileno(stdout), _O_U8TEXT) != -1) {
            wcout << utf8toWide(s.c_str()) << flush; // We must flush before resetting the mode.
            // Set translation mode of wcout back to untranslated, which renders cout usable again.
            _setmode(_fileno(stdout), _O_BINARY);
        } else
            // Use wcout anyway: when _setmode fails, no sink (such as Eclipse's console
            // window) is attached, and such sinks seem to be the cause of wcout failing
            // in default mode. The UI console view is filled properly like this,
            // regardless of translation modes.
            wcout << utf8toWide(s.c_str()) << flush;
    } else
        cout << s << flush;
#else
    cout << s << flush;
#endif
}

// Converts a UTF-8 encoded byte sequence to a UTF-16 wstring (as used on Windows).
wstring utf8toWide(const char* in) {
    wstring out;
    if (in == nullptr)
        return out;
    uint32_t codepoint = 0;
    while (*in != 0) {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)      // single byte (ASCII)
            codepoint = ch;
        else if (ch <= 0xbf) // continuation byte: append its six payload bits
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf) // leading byte of a two-byte sequence
            codepoint = ch & 0x1f;
        else if (ch <= 0xef) // leading byte of a three-byte sequence
            codepoint = ch & 0x0f;
        else                 // leading byte of a four-byte sequence
            codepoint = ch & 0x07;
        ++in;
        // Emit the code point as soon as the next byte no longer continues it.
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff)) {
            if (codepoint > 0xffff) {
                // Encode as a UTF-16 surrogate pair (0xd7c0 == 0xd800 - (0x10000 >> 10)).
                out.append(1, static_cast<wchar_t>(0xd7c0 + (codepoint >> 10)));
                out.append(1, static_cast<wchar_t>(0xdc00 + (codepoint & 0x03ff)));
            } else if (codepoint < 0xd800 || codepoint >= 0xe000)
                out.append(1, static_cast<wchar_t>(codepoint)); // skip invalid lone surrogates
        }
    }
    return out;
}

This solution is particularly convenient since it does not effectively deprecate UTF-8, std::string or std::cout, which are mostly used for good reasons, but simply keeps using std::string itself and sustains platform independence. I rather agree with this answer that adding wchar_t (and all the redundant garbage that comes with it, such as std::wstring, std::wstringstream, std::wostream, std::wistream and std::wstreambuf) to C++ was a mistake. Just because Microsoft made bad design decisions, one should not adopt their mistakes but circumvent them instead.

Visual confirmation: