Indexing characters instead of bytes in a Delphi string

I am reading the documentation on indexing into Delphi strings, found here:

http://docwiki.embarcadero.com/RADStudio/Tokyo/en/String_Types_(Delphi)

One statement there says:

You can index a string variable just as you would an array. If S is a non-UnicodeString string variable and i, an integer expression, S[i] represents the ith byte in S, which may not be the ith character or an entire character at all for a multibyte character string (MBCS). Similarly, indexing a UnicodeString variable results in an element that may not be an entire character. If the string contains characters in the Basic Multilingual Plane (BMP), all characters are 2 bytes, so indexing the string gets characters. However, if some characters are not in the BMP, an indexed element may be a surrogate pair - not an entire character.

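If I understand correctly, S[i] indexes the i-th byte of the string. If S is a UnicodeString, then S[1] is the first byte of the first character, S[2] is the second byte of the first character, S[3] is the first byte of the second character, and so on. If that is the case, how do I index characters rather than bytes in a string? I need to index characters, not bytes.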

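In Delphi, S[i] is a Char, also known as WideChar, so for a UnicodeString S[i] is the i-th 16-bit element, not the i-th byte. But that element is not a Unicode "character": it is a 16-bit (2-byte) UTF-16 code unit. In the last century, that is until 1996, Unicode was 16-bit, but this is no longer the case! Please read the Unicode FAQ carefully.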

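You may need two WideChar values (a surrogate pair) to get one complete Unicode code point, which is more or less what we usually call a "character". Even that may be wrong when diacritics are used, because a base letter followed by a combining mark is several code points that display as one character.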
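To make the difference between code units and code points concrete, here is a minimal sketch. The test string (the letter A followed by U+1D11E, MUSICAL SYMBOL G CLEF, written as its surrogate pair) is my own choice for illustration, and Low(S)/High(S) are used so the loop does not depend on whether the compiler uses 0-based or 1-based string indexing.

```delphi
program CodeUnitsDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  S: string;
  I: Integer;
begin
  // 'A' followed by U+1D11E, which lies outside the BMP and therefore
  // needs two UTF-16 code units: $D834 (high) and $DD1E (low).
  S := 'A'#$D834#$DD1E;

  // Length counts 16-bit code units, not characters: this prints 3,
  // although the string holds only 2 code points.
  Writeln(Length(S));

  // S[I] yields one code unit per index; the last two elements are
  // the two halves of a single code point.
  for I := Low(S) to High(S) do
    Writeln(I, ': $', IntToHex(Ord(S[I]), 4));
end.
```

Neither of the last two elements is a character on its own, which is what the documentation quoted in the question is warning about.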

UTF-16 uses a single 16-bit code unit to encode the most common 63K characters, and a pair of 16-bit code units, called surrogates, to encode the 1M less commonly used characters in Unicode.

Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.)

Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16.

UTF-16 FAQ

To decode Unicode code points correctly in Delphi, see (the link from @LURD in the comments).
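As a rough sketch of what such decoding involves, the following walks a string code point by code point using only the surrogate arithmetic defined by UTF-16. NextCodePoint is a hypothetical helper written for this example, not an RTL function; the System.Character unit ships its own helpers (for example IsHighSurrogate and ConvertToUtf32), which are probably the better choice in real code.

```delphi
program CodePointsDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: returns the code point that starts at S[Index]
// and advances Index past the one or two code units it consumed.
// An unpaired surrogate is returned unchanged as its own value.
function NextCodePoint(const S: string; var Index: Integer): Integer;
var
  HighUnit, LowUnit: Integer;
begin
  HighUnit := Ord(S[Index]);
  Inc(Index);
  Result := HighUnit;
  // A high surrogate ($D800..$DBFF) must be followed by a low one.
  if (HighUnit >= $D800) and (HighUnit <= $DBFF) and (Index <= High(S)) then
  begin
    LowUnit := Ord(S[Index]);
    if (LowUnit >= $DC00) and (LowUnit <= $DFFF) then
    begin
      // Combine the two 10-bit halves and add the $10000 offset.
      Result := $10000 + ((HighUnit - $D800) shl 10) + (LowUnit - $DC00);
      Inc(Index);
    end;
  end;
end;

var
  S: string;
  I: Integer;
begin
  S := 'A'#$D834#$DD1E;
  I := Low(S);
  while I <= High(S) do
    Writeln('U+', IntToHex(NextCodePoint(S, I), 4));
  // Prints U+0041 and then U+1D11E: two code points from three code units.
end.
```

This yields code points, not user-perceived characters: a base letter followed by a combining diacritic still comes out as two separate values, so grouping them would need additional logic.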