Index characters instead of bytes in a Delphi string
I am reading the documentation on indexing into Delphi strings, here:
http://docwiki.embarcadero.com/RADStudio/Tokyo/en/String_Types_(Delphi)
One statement says:
You can index a string variable just as you would an array. If S is a non-UnicodeString string variable and i, an integer expression, S[i] represents the ith byte in S, which may not be the ith character or an entire character at all for a multibyte character string (MBCS). Similarly, indexing a UnicodeString variable results in an element that may not be an entire character. If the string contains characters in the Basic Multilingual Plane (BMP), all characters are 2 bytes, so indexing the string gets characters. However, if some characters are not in the BMP, an indexed element may be a surrogate pair - not an entire character.
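If I understand correctly, S[i] indexes the i-th byte of the string. If S is a UnicodeString, then S[1] is the first byte of the first character, S[2] is the second byte of the first character, S[3] is the first byte of the second character, and so on. If that is the case, how do I index characters rather than bytes in the string? I need to index characters, not bytes.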
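In Delphi, S[i] is a char, also known as a widechar. It is not a byte, but it is not a Unicode "character" either: it is a 16-bit (2-byte) UTF-16 code unit. In the last century, that is until 1996, Unicode was 16-bit, but that is no longer the case! Please read the Unicode FAQ carefully.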
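You may need several widechar values to get one complete Unicode code point, which is more or less what we usually call a "character". Even that may be wrong if diacritics are used, because a base letter and a combining mark are separate code points.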
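To make the point above concrete, here is a minimal sketch of what indexing actually returns. It assumes a Unicode Delphi compiler with 1-based strings (the desktop default); the program name and the sample string are made up for illustration. The non-BMP character U+1F600 is written as its two surrogate code units so the result does not depend on the source file encoding.

```pascal
program CodeUnitsDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  S: string;
  I: Integer;
begin
  // 'a' followed by U+1F600, spelled out as its UTF-16 surrogate pair.
  S := 'a'#$D83D#$DE00;

  // Length counts UTF-16 code units: 3 here, although there are
  // only two code points in the string.
  Writeln(Length(S));

  // Each S[I] is one 16-bit code unit, so S[2] and S[3] are the
  // two halves of a single character, not characters themselves.
  for I := 1 to Length(S) do
    Writeln(Format('%d: $%.4x', [I, Ord(S[I])]));
end.
```

So S[i] never splits a code unit into bytes, but it can land in the middle of a surrogate pair.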
UTF-16 uses a single 16-bit code unit to encode the most common 63K characters, and a pair of 16-bit code units, called surrogates, to encode the 1M less commonly used characters in Unicode.
Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.)
Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16.
To decode Unicode code points correctly in Delphi, see the link posted by @LURD in the comments.
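As a rough illustration of what such decoding involves (this is not the code from the linked resource), the sketch below walks a string one code point at a time using only the UTF-16 surrogate arithmetic. NextCodePoint is a hypothetical helper named for this example, and 1-based string indexing is assumed.

```pascal
program CodePointsDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: returns the code point that starts at S[Index]
// and advances Index past the one or two code units it occupied.
function NextCodePoint(const S: string; var Index: Integer): Integer;
var
  Hi, Lo: Integer;
begin
  Hi := Ord(S[Index]);
  Inc(Index);
  // A high surrogate ($D800..$DBFF) must be followed by a low one ($DC00..$DFFF).
  if (Hi >= $D800) and (Hi <= $DBFF) and (Index <= Length(S)) then
  begin
    Lo := Ord(S[Index]);
    if (Lo >= $DC00) and (Lo <= $DFFF) then
    begin
      Inc(Index);
      Exit($10000 + ((Hi - $D800) shl 10) + (Lo - $DC00));
    end;
  end;
  // BMP code unit, or an unpaired surrogate passed through unchanged.
  Result := Hi;
end;

var
  S: string;
  I: Integer;
begin
  // 'a', then U+1F600 as a surrogate pair, then 'b': four code units.
  S := 'a'#$D83D#$DE00'b';
  I := 1;
  while I <= Length(S) do
    Writeln(Format('U+%.4x', [NextCodePoint(S, I)]));
end.
```

Finding the N-th code point this way means scanning from the start of the string, since UTF-16 is variable-width. The sketch also stops at code points: a letter followed by a combining diacritic still comes back as two separate values.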