Store each character from a std::string into a std::string

Question

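I would like to know what approach you would use to take each character from a std::string and store it in another std::string.

I run into a problem when the std::string contains special characters such as "á". If I do this: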

std::string test = "márcos";

std::string char1 = std::string(1, test.at(0));
std::string char2 = std::string(1, test.at(1));
std::string char3 = std::string(1, test.at(2));
std::string char4 = std::string(1, test.at(3));

std::cout << "Result: " << char1 << " -- " << char2 << " -- " << char3  << " -- " << char4 << std::endl;

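Output: Result: m -- �� -- �� -- r

As you can see, the desired result is "m - á - r - c", but that is not what I get, because the special character is stored as two chars.

How can we solve this? Thanks :)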

Answer 1

The number of bytes used to encode a code point in UTF-8 (between one and four) can be determined by looking at the high bits of the leading byte.

bytes    codepoints             byte 1    byte 2    byte 3    byte 4
  1      U+0000  .. U+007F      0xxxxxxx        
  2      U+0080  .. U+07FF      110xxxxx  10xxxxxx        
  3      U+0800  .. U+FFFF      1110xxxx  10xxxxxx  10xxxxxx        
  4      U+10000 .. U+10FFFF    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
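
As a worked example of the table, "á" is U+00E1, which falls in the two-byte row: its bits 00011 100001 fill the x positions to give 11000011 10100001, i.e. the bytes 0xC3 0xA1. That is why the question's test.at(1) and test.at(2) each hold only half of the character. The sketch below fills the x bits for all four rows; encode_utf8 is a hypothetical helper name used for illustration, not something from the standard library, and it does not reject surrogate code points.

```cpp
#include <string>

// Hypothetical helper: encodes one code point as UTF-8 by filling the
// x bits of the table row that covers it. Returns an empty string for
// values above U+10FFFF. Surrogates (U+D800..U+DFFF) are not rejected.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp <= 0x7F) {                       // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {               // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {              // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {            // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For instance, encode_utf8(0xE1) produces the two bytes 0xC3 0xA1, and encode_utf8(0x20AC) produces the three bytes 0xE2 0x82 0xAC for "€".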

The following breaks a UTF-8 encoded std::string into individual characters.

#include <cstddef>
#include <iostream>
#include <string>

// Returns the length in bytes of the UTF-8 sequence that starts with lead
// byte c, or -1 if c cannot start a sequence (e.g. a continuation byte).
int bytelen(char c)
{
    if(!(c & 0x80))         return 1;   // ascii char       ($)
    if((c & 0xE0) == 0xC0)  return 2;   // 2-byte codepoint (¢)
    if((c & 0xF0) == 0xE0)  return 3;   // 3-byte codepoint (€)
    if((c & 0xF8) == 0xF0)  return 4;   // 4-byte codepoint (𐍈)

    return -1;                          // error
}

int main()
{
    // The source file must be saved as UTF-8 for this literal to hold 10 bytes.
    std::string test = "$¢€𐍈";
    std::cout << "'" << test << "' length = " << test.length() << std::endl;

    for(std::size_t off = 0; off < test.length(); )
    {
        int len = bytelen(test[off]);
        if(len < 0) return 1;           // invalid lead byte

        // substr clamps the count, so a truncated final sequence cannot overrun.
        std::string chr = test.substr(off, len);
        std::cout << "'" << chr << "'" << std::endl;
        off += len;
    }

    return 0;
}

Output:

'$¢€𐍈' length = 10
'$'
'¢'
'€'
'𐍈'
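
To tie this back to the question, the same lead-byte test can collect each character into its own std::string. This is a minimal sketch, assuming the input is valid UTF-8; split_utf8 is a hypothetical name, not a standard library function. Bytes that cannot start a sequence are passed through as one-byte strings rather than reported as errors, and continuation bytes are not validated.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits a UTF-8 encoded string into one std::string
// per code point, using the high bits of each lead byte to find its length.
std::vector<std::string> split_utf8(const std::string& s)
{
    std::vector<std::string> chars;
    for (std::size_t off = 0; off < s.size(); ) {
        const unsigned char lead = static_cast<unsigned char>(s[off]);

        std::size_t len = 1;                        // ASCII, or an invalid lead byte
        if      ((lead & 0xE0) == 0xC0) len = 2;    // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) len = 3;    // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) len = 4;    // 11110xxx

        // substr clamps the count if the final sequence is truncated.
        chars.push_back(s.substr(off, len));
        off += len;
    }
    return chars;
}
```

With the question's string, split_utf8("márcos") yields six strings (m, á, r, c, o, s), provided the source file is saved as UTF-8. Note that these are code points, not user-perceived characters: an "á" written as "a" followed by the combining accent U+0301 would come out as two elements.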
