Index characters instead of bytes in a Delphi string

I am reading the documentation on indexing into Delphi strings, here:

http://docwiki.embarcadero.com/RADStudio/Tokyo/en/String_Types_(Delphi)

One statement says:

You can index a string variable just as you would an array. If S is a non-UnicodeString string variable and i, an integer expression, S[i] represents the ith byte in S, which may not be the ith character or an entire character at all for a multibyte character string (MBCS). Similarly, indexing a UnicodeString variable results in an element that may not be an entire character. If the string contains characters in the Basic Multilingual Plane (BMP), all characters are 2 bytes, so indexing the string gets characters. However, if some characters are not in the BMP, an indexed element may be a surrogate pair - not an entire character.

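If I understand correctly, S[i] indexes the i-th byte of the string. If S is a UnicodeString, then S[1] is the first byte of the first character, S[2] is the second byte of the first character, S[3] is the first byte of the second character, and so on. If that is the case, how do I index the characters of a string rather than its bytes? I need to index characters, not bytes.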

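In Delphi, S[i] is a Char, also known as WideChar. It is not a byte, and it is not a Unicode "character" either: it is one 16-bit (2-byte) UTF-16 code unit. In the last century, i.e. until 1996, Unicode was a 16-bit encoding, but that is no longer the case! Please read the Unicode FAQ carefully.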
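A minimal sketch of what indexing returns, assuming a Unicode-enabled compiler (Delphi 2009 or later, where string is UnicodeString) and the default 1-based string indexing of the desktop compilers. The program name and the sample string are made up for illustration.

```delphi
program CodeUnitDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils; // IntToHex

var
  S: string; // UnicodeString
  I: Integer;
begin
  // 'A' (U+0041) followed by U+1D11E, written as its surrogate pair.
  S := 'A'#$D834#$DD1E;

  Writeln('SizeOf(Char) = ', SizeOf(Char)); // 2: each element is 16 bits
  Writeln('Length(S)    = ', Length(S));    // 3: code units, not characters

  // Each S[I] is one UTF-16 code unit, never a single byte.
  for I := 1 to Length(S) do
    Writeln('S[', I, '] = $', IntToHex(Ord(S[I]), 4));
end.
```

Here S[2] and S[3] are the two halves of one code point, so S holds two code points in three Char elements.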

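You may need more than one WideChar (a surrogate pair) to form a whole Unicode code point, which is more or less what we usually call a "character". Even that may be wrong if diacritics are used, because a base letter plus combining marks spans several code points.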
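The diacritics caveat can be illustrated with a short sketch. It assumes the same Unicode-enabled compiler as above; the two sample strings are hypothetical and both render as "é".

```delphi
program CombiningDemo;
{$APPTYPE CONSOLE}

var
  Precomposed, Decomposed: string;
begin
  Precomposed := #$00E9;     // U+00E9, one code point
  Decomposed  := 'e'#$0301;  // U+0065 followed by U+0301 (combining acute accent)

  Writeln(Length(Precomposed)); // 1
  Writeln(Length(Decomposed));  // 2, although a reader sees a single character
end.
```

Neither string contains a surrogate pair, so counting code points instead of Char elements would not help here; what the user perceives as one character is a separate notion from both.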

UTF-16 uses a single 16-bit code unit to encode the most common 63K characters, and a pair of 16-bit code units, called surrogates, to encode the 1M less commonly used characters in Unicode.

Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.)

Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16.

See the UTF-16 FAQ.

To decode Unicode code points correctly in Delphi, see the link posted by @LURD in the comments.
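As a rough illustration of what such decoding involves, here is a hand-written sketch that walks a string one code point at a time. CodePointAt is a hypothetical helper written for this example, not an RTL function, and the sketch assumes 1-based string indexing. It passes unpaired surrogates through unchanged and does not group combining marks.

```delphi
program CodePointDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils; // IntToHex

// Hypothetical helper: returns the code point that starts at S[Index] and
// reports in Units how many Char elements (1 or 2) it occupies.
function CodePointAt(const S: string; Index: Integer; out Units: Integer): Integer;
var
  Hi, Lo: Integer;
begin
  Hi := Ord(S[Index]);
  Result := Hi;
  Units := 1;
  // A high surrogate ($D800..$DBFF) must be followed by a low one ($DC00..$DFFF).
  if (Hi >= $D800) and (Hi <= $DBFF) and (Index < Length(S)) then
  begin
    Lo := Ord(S[Index + 1]);
    if (Lo >= $DC00) and (Lo <= $DFFF) then
    begin
      Result := $10000 + ((Hi - $D800) shl 10) + (Lo - $DC00);
      Units := 2;
    end;
  end;
end;

var
  S: string;
  I, Units, CP: Integer;
begin
  S := 'A'#$D834#$DD1E'z'; // U+0041, U+1D11E, U+007A
  I := 1;
  while I <= Length(S) do
  begin
    CP := CodePointAt(S, I, Units);
    Writeln('U+', IntToHex(CP, 4)); // U+0041, then U+1D11E, then U+007A
    Inc(I, Units);                  // skip both halves of a surrogate pair
  end;
end.
```

Finding the n-th code point this way means scanning from the start of the string, since the number of Char elements per code point varies.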

delphi

delphi-xe3