How can there be a fixed-width Unicode encoding?

While reading about Unicode, I have heard several times that UTF-32 is a fixed-width encoding.

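If a fixed-width encoding is taken to mean "a code which maps source symbols to a set number of bits," and the source symbols in question are Unicode code points, this all makes sense. However, if you take the underlying language of source symbols to be graphemes, things get much more complicated.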

So my question is: is UTF-32 really a fixed-length encoding in the grapheme sense? If not, does a fixed-length encoding in that sense exist at all?

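One of the comments cites Joel Spolsky's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), which was written in 2003. At the time, it served as a wake-up call (and in some places it probably still does). However, it is not without (minor but important) technical problems, although the overall thesis ('you need to know about Unicode, and you need to know which encoding a string is in') remains valid. The comment then continues: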

And yes, UTF-16 and UTF-32 are both fixed width. UTF-8 … isn't.

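UTF-16 is not truly fixed width: some Unicode code points take one 16-bit code unit, and others need two. In the same way, UTF-8 is not fixed width: some Unicode code points need one 8-bit code unit, and others need two, three, or even four (but not five or six, despite the possibility mentioned in Joel's article). UTF-32, on the other hand, is fixed width: every Unicode code point can be encoded in a single 32-bit code unit. (In fact, the largest possible Unicode code point is U+10FFFF, so Unicode is a 21-bit code set, although it does not use every possible combination of 21 bits.)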
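
The code-unit counts described above can be checked directly. This is a minimal sketch using Python's built-in codecs; the helper name `code_units` and the four sample characters are illustrative choices, not part of the original answer.

```python
def code_units(ch: str) -> dict:
    """Count the code units needed to encode one code point in each UTF."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit code units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }


# One sample from each UTF-8 length class; the last lies outside the BMP.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", code_units(ch))
```

Only the UTF-32 column stays at one code unit for every sample; the UTF-8 count grows from one to four, and UTF-16 needs a surrogate pair (two units) for U+1F600.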

However, code points are not the same thing as characters, let alone graphemes. The Unicode FAQ has a section on Characters and Combining Marks that discusses graphemes, referencing the glossary definition.

The better word for what end-users think of as characters is grapheme (as defined in the Unicode glossary): a minimally distinctive unit of writing in the context of a particular writing system.

Graphemes are not necessarily combining character sequences, and combining character sequences are not necessarily graphemes.

Q: How are characters counted when measuring the length or position of a character in a string?

A: Computing the length or position of a "character" in a Unicode string can be a little complicated, as there are four different approaches to doing so, plus the potential confusion caused by combining characters. The correct choice of which counting method to use depends on what is being counted and what the count or position is used for.

To address the question here:

If what you mean has to do with 'it can take multiple Unicode code points to get a complete character (grapheme) with associated diacritics (combining markers, etc.)', then yes, even UTF-32 is not necessarily fixed width, and Unicode has no fixed-width encoding.

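UTF-32 uses a fixed-width encoding for each Unicode code point, but because it can take multiple code points to make up a complete grapheme, even UTF-32 has no 1:1 mapping between code points and graphemes.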
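
A small illustration of that point, as a sketch in Python: the same user-perceived character "é" can be written as one code point or as a base letter plus a combining mark, and the two spellings occupy different amounts of space even in UTF-32. The sample strings are chosen for illustration only.

```python
import unicodedata

decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

utf32_decomposed = decomposed.encode("utf-32-le")
utf32_precomposed = precomposed.encode("utf-32-le")

# Same grapheme on screen, different sizes in UTF-32.
print(len(decomposed), len(utf32_decomposed))    # 2 code points, 8 bytes
print(len(precomposed), len(utf32_precomposed))  # 1 code point, 4 bytes

# The two spellings are canonically equivalent: NFC composes the first
# into the second.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

Normalization helps only where a precomposed form exists; arbitrary stacks of combining marks have no single-code-point equivalent.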

Of course, you can also find amusing stacks of characters in some comments on SO. For example:



@̮̘̮̜̤͓͓̓ͪ̓͆͗̑ṷ̫̠̤̙̻͚̗ͭS̹͓̰̫͉̲̺̈̏̽̅̑ͩS̹͓̰̫͉̲̺̈̏̽̅̑ͩe̓̉e͖̝̦̦̿e͖̝̦̦̿ r͔̒̿̋̂̓n̹͖̥ͥͦͤ̍͊̏e̫̠͇̰̱̦̹͗͋̓̿͒m̭͇̂͆͋̋͒e̫̠͇̰̱̦̹͗͋̓̿͒b̜̥̣̬̮͈͒̄ͪ͊l̮͉̣̟̪̪̿̍ͫ͋͐̑a̜̦̪͗͗̈́ͣ͊ḫ̘̯͈̠̞͒ͯb̖̣͇̖̦̃̑ͬͭͥl̮͉̣̟̪̪̿̍ͫ͋͐̑a̜̦̪͗͗̈́ͣ͊ḫ̘̯͈̠̞͒ͯb̖̣͇̖̦̃̑ͬͭͥl͔͍͚͕̲̪̼͎ͧh̘͓͔̟͔͍̏ͣͦ̓̓h̘͓͔̟͔͍̏ͣͦ̓̓b̙͍̼̜͍̹̬̬͎ͥ̓ͯ̂ḽ̜̟̲̾̅̆ͦ̃ͨh̘͓͔̟͔͍̏ͣͦ̓̓b̙͍̼̜͍̹̬̬͎ͥ̓ͯ̂ḽ̜̟̲̾̅̆ͦ̃ͨa͇̰̝̺͊ͧͫ͛a͇̰̝̺͊ͧͫ͛h̯̻͉̉̒̉̈́́ͥ̀。

Why/how do "Zalgo pings" work?

How does Zalgo text work?



ȩ̸҉̟͎͚̹͚̙̟̖x̨͙̰͕̖͉̼̜̲̦̟͈́ͅͅą̷̘͕͈̹͓̣̮̼̣̠̹́c̼͙̠̭̫̰͈͍̮͢͡ţ̢̛̠͇̬̖̟̺͈̲̻̣̲͙͈̼͍̘̱ͅl̶͘‌ ̷̨̲͙͖̻̲̗̦͚͙̮͠ y̭̖̰͚̞̣̗̳̠͕̻̼͡ͅ!̛͖̮͔͍̰͉͢



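Of course, what you see depends on the quality of the Unicode support in your browser (which, in turn, depends partly on the quality of the O/S support). I saw different results on two different Macs running rather different versions of Firefox, even though both were running the same underlying O/S version (10.10.4 Yosemite).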

The second of these examples can be decoded from UTF-8 into the following sequence of Unicode code points; it is only 700 bytes on disk:

0xC8 0xA8 = U+0228
0xCC 0xB8 = U+0338
0xD2 0x89 = U+0489
0xCC 0x9F = U+031F
0xCD 0x8E = U+034E
0xCD 0x9A = U+035A
0xCC 0xB9 = U+0339
0xCD 0x9A = U+035A
0xCC 0x99 = U+0319
0xCC 0x9F = U+031F
0xCC 0x96 = U+0316
0x78 = U+0078
0xCC 0xA8 = U+0328
0xCD 0x99 = U+0359
0xCC 0xB0 = U+0330
0xCD 0x95 = U+0355
0xCC 0x96 = U+0316
0xCD 0x89 = U+0349
0xCC 0xBC = U+033C
0xCC 0x9C = U+031C
0xCC 0xB2 = U+0332
0xCC 0xA6 = U+0326
0xCC 0x9F = U+031F
0xCD 0x88 = U+0348
0xCC 0x81 = U+0301
0xCD 0x85 = U+0345
0xCD 0x85 = U+0345
0xC4 0x85 = U+0105
0xCC 0xB7 = U+0337
0xCC 0x98 = U+0318
0xCD 0x95 = U+0355
0xCD 0x88 = U+0348
0xCC 0xB9 = U+0339
0xCD 0x93 = U+0353
0xCC 0xA3 = U+0323
0xCC 0xAE = U+032E
0xCC 0xBC = U+033C
0xCC 0xA3 = U+0323
0xCC 0xA0 = U+0320
0xCC 0xB9 = U+0339
0xCC 0x81 = U+0301
0x63 = U+0063
0xCC 0xBC = U+033C
0xCD 0x99 = U+0359
0xCC 0xA0 = U+0320
0xCC 0xAD = U+032D
0xCC 0xAB = U+032B
0xCC 0xB0 = U+0330
0xCD 0x88 = U+0348
0xCD 0x8D = U+034D
0xCC 0xAE = U+032E
0xCD 0xA2 = U+0362
0xCD 0xA1 = U+0361
0xC5 0xA3 = U+0163
0xCC 0xA2 = U+0322
0xCC 0x9B = U+031B
0xCC 0xA0 = U+0320
0xCD 0x87 = U+0347
0xCC 0xAC = U+032C
0xCC 0x96 = U+0316
0xCC 0x9F = U+031F
0xCC 0xBA = U+033A
0xCD 0x88 = U+0348
0xCC 0xB2 = U+0332
0xCC 0xBB = U+033B
0xCC 0xA3 = U+0323
0xCC 0xB2 = U+0332
0xCD 0x99 = U+0359
0xCD 0x88 = U+0348
0xCC 0xBC = U+033C
0xCD 0x8D = U+034D
0xCC 0x98 = U+0318
0xCC 0xB1 = U+0331
0xCD 0x85 = U+0345
0x6C = U+006C
0xCC 0xB6 = U+0336
0xCD 0x98 = U+0358
0xE2 0x80 0x8C = U+200C
0xE2 0x80 0x8B = U+200B
0xCC 0xB7 = U+0337
0xCC 0xA8 = U+0328
0xCC 0xB2 = U+0332
0xCD 0x99 = U+0359
0xCD 0x96 = U+0356
0xCC 0xBB = U+033B
0xCC 0xB2 = U+0332
0xCC 0x97 = U+0317
0xCC 0xA6 = U+0326
0xCD 0x9A = U+035A
0xCD 0x99 = U+0359
0xCC 0xAE = U+032E
0xCD 0xA0 = U+0360
0x79 = U+0079
0xCC 0xAD = U+032D
0xCC 0x96 = U+0316
0xCC 0xB0 = U+0330
0xCD 0x9A = U+035A
0xCC 0x9E = U+031E
0xCC 0xA3 = U+0323
0xCC 0x97 = U+0317
0xCC 0xB3 = U+0333
0xCC 0xA0 = U+0320
0xCD 0x95 = U+0355
0xCC 0xBB = U+033B
0xCC 0xBC = U+033C
0xCD 0xA1 = U+0361
0xCD 0x85 = U+0345
0x21 = U+0021
0xCC 0x9B = U+031B
0xCD 0x96 = U+0356
0xCC 0xAE = U+032E
0xCD 0x94 = U+0354
0xCD 0x8D = U+034D
0xCC 0xB0 = U+0330
0xCD 0x89 = U+0349
0xCD 0xA2 = U+0362
0x20 = U+0020
0xCC 0xAD = U+032D
0xCC 0x99 = U+0319
0xCC 0x96 = U+0316
0xCD 0x94 = U+0354
0xCC 0xA9 = U+0329
0xCC 0x97 = U+0317
0xCC 0xA0 = U+0320
0xCD 0x95 = U+0355
0xCC 0xA6 = U+0326
0xCC 0xAC = U+032C
0xCD 0x93 = U+0353
0xCD 0x9E = U+035E
0xCD 0x9D = U+035D
0xCD 0x85 = U+0345
0x4F = U+004F
0xD2 0x89 = U+0489
0xD2 0x89 = U+0489
0xCC 0xA3 = U+0323
0xCC 0x9C = U+031C
0xCC 0xBA = U+033A
0xCC 0xAA = U+032A
0xCC 0xB3 = U+0333
0xCD 0x95 = U+0355
0xCC 0x96 = U+0316
0xCD 0x94 = U+0354
0xCC 0xA0 = U+0320
0xCD 0x99 = U+0359
0xCD 0x8E = U+034E
0xCD 0x95 = U+0355
0xCC 0x99 = U+0319
0xCC 0xA6 = U+0326
0xCD 0x85 = U+0345
0x6E = U+006E
0xCC 0xA9 = U+0329
0xCD 0x93 = U+0353
0xCD 0x96 = U+0356
0xCC 0x9D = U+031D
0xCC 0x9F = U+031F
0xCC 0xAD = U+032D
0xCD 0x99 = U+0359
0xCD 0x99 = U+0359
0xCD 0x93 = U+0353
0xCD 0x9A = U+035A
0xCC 0xBC = U+033C
0xCD 0x96 = U+0356
0xCD 0x96 = U+0356
0xCD 0x9C = U+035C
0xCD 0x9E = U+035E
0xC8 0xA9 = U+0229
0xCC 0xA7 = U+0327
0xCC 0xAC = U+032C
0xCC 0xB1 = U+0331
0xCC 0xA6 = U+0326
0xCC 0xA0 = U+0320
0xCC 0x99 = U+0319
0xCC 0xA5 = U+0325
0xCD 0x87 = U+0347
0xCD 0x94 = U+0354
0xCC 0xAA = U+032A
0xCC 0x81 = U+0301
0x20 = U+0020
0xD2 0x89 = U+0489
0xCC 0xB8 = U+0338
0xCC 0x97 = U+0317
0xCC 0xA6 = U+0326
0xCD 0x87 = U+0347
0xCC 0xB0 = U+0330
0xCC 0xAA = U+032A
0xCC 0xB0 = U+0330
0xCC 0xAD = U+032D
0xCC 0x98 = U+0318
0xCC 0xB9 = U+0339
0xCD 0x98 = U+0358
0xCD 0xA2 = U+0362
0x69 = U+0069
0xCC 0xB4 = U+0334
0xCD 0x9E = U+035E
0xCD 0x8F = U+034F
0xCC 0xA9 = U+0329
0xCC 0xA4 = U+0324
0xCC 0xB9 = U+0339
0xCC 0x97 = U+0317
0xCC 0x96 = U+0316
0xCC 0xB0 = U+0330
0xCD 0x8E = U+034E
0xCC 0x96 = U+0316
0xCC 0xB2 = U+0332
0xCC 0xB2 = U+0332
0xCC 0x98 = U+0318
0xCD 0x93 = U+0353
0xCC 0x97 = U+0317
0xCC 0xAF = U+032F
0xCD 0x9A = U+035A
0xCC 0x9E = U+031E
0xCD 0x96 = U+0356
0xCC 0xA5 = U+0325
0xCC 0xBB = U+033B
0xCD 0x9D = U+035D
0x73 = U+0073
0xCD 0x9E = U+035E
0xD2 0x89 = U+0489
0xCC 0xB2 = U+0332
0xCD 0x88 = U+0348
0xCC 0x99 = U+0319
0xCC 0xB9 = U+0339
0xCC 0xA4 = U+0324
0xCC 0xAB = U+032B
0xCD 0x87 = U+0347
0x20 = U+0020
0xCD 0x9A = U+035A
0xCC 0xAD = U+032D
0xCD 0x8E = U+034E
0xCD 0x89 = U+0349
0xCC 0xA0 = U+0320
0xCC 0xBA = U+033A
0xCD 0x89 = U+0349
0xCC 0xAE = U+032E
0xCC 0x9E = U+031E
0xCC 0xBB = U+033B
0xCC 0xA3 = U+0323
0xCC 0xB0 = U+0330
0xCC 0xBA = U+033A
0xCC 0x96 = U+0316
0xCD 0x96 = U+0356
0xCC 0x80 = U+0300
0xCC 0x81 = U+0301
0xCD 0xA2 = U+0362
0xCD 0x9E = U+035E
0x65 = U+0065
0xCC 0xB7 = U+0337
0xCC 0xAA = U+032A
0xCC 0xAD = U+032D
0xCC 0xAF = U+032F
0xCC 0xBC = U+033C
0xCD 0x93 = U+0353
0xCD 0x8E = U+034E
0xCC 0xB9 = U+0339
0xCC 0xA0 = U+0320
0xCD 0x96 = U+0356
0xCC 0xB2 = U+0332
0xCD 0x94 = U+0354
0xCC 0xAA = U+032A
0xCD 0x88 = U+0348
0xCC 0xA6 = U+0326
0xCD 0x88 = U+0348
0xCC 0xB1 = U+0331
0xCD 0x8D = U+034D
0xCC 0xAD = U+032D
0xCC 0xA9 = U+0329
0xCD 0xA0 = U+0360
0xC5 0x86 = U+0146
0xCD 0x9E = U+035E
0xD2 0x89 = U+0489
0xCC 0xAE = U+032E
0xCC 0xB3 = U+0333
0xCD 0x93 = U+0353
0xCD 0x99 = U+0359
0xCD 0x88 = U+0348
0xCC 0xBC = U+033C
0xCD 0x89 = U+0349
0xCC 0xAC = U+032C
0xCD 0x95 = U+0355
0xCD 0x88 = U+0348
0xCC 0xBA = U+033A
0xCD 0x88 = U+0348
0xCC 0xAD = U+032D
0xCC 0xA9 = U+0329
0xCC 0xAA = U+032A
0x6F = U+006F
0xCD 0x87 = U+0347
0xCC 0x97 = U+0317
0xCC 0xB1 = U+0331
0xCC 0xA0 = U+0320
0xCC 0xB1 = U+0331
0xCC 0xA0 = U+0320
0xCC 0xAF = U+032F
0xCC 0x95 = U+0315
0xCD 0xA2 = U+0362
0x75 = U+0075
0xCC 0xB8 = U+0338
0xCC 0xB3 = U+0333
0xCC 0xA6 = U+0326
0xCC 0xA9 = U+0329
0xCC 0xB3 = U+0333
0xCC 0xAB = U+032B
0xCC 0x96 = U+0316
0xCC 0x9C = U+031C
0xCD 0x85 = U+0345
0xE2 0x80 0x8C = U+200C
0xE2 0x80 0x8B = U+200B
0xC7 0xB5 = U+01F5
0xCC 0xA2 = U+0322
0xCC 0xB2 = U+0332
0xCC 0xA3 = U+0323
0xCD 0x8E = U+034E
0xCC 0xAE = U+032E
0xCC 0xAE = U+032E
0xCC 0xBC = U+033C
0xCC 0xAB = U+032B
0xCC 0xA5 = U+0325
0xCC 0xA0 = U+0320
0xCD 0x99 = U+0359
0xCC 0xB1 = U+0331
0xCC 0x9D = U+031D
0xCC 0x98 = U+0318
0xCD 0x95 = U+0355
0xCD 0x8E = U+034E
0xCC 0xB3 = U+0333
0xCC 0x9C = U+031C
0xCC 0xB2 = U+0332
0xCC 0x96 = U+0316
0x68 = U+0068
0xCC 0xB8 = U+0338
0xCC 0x9B = U+031B
0xCC 0xA9 = U+0329
0xCD 0x9A = U+035A
0xCC 0xAE = U+032E
0xCC 0xA4 = U+0324
0xCC 0x96 = U+0316
0xCC 0xB9 = U+0339
0xCD 0x99 = U+0359
0x2E = U+002E
0xCC 0xB6 = U+0336
0xCC 0xA8 = U+0328
0xCC 0xB3 = U+0333
0xCC 0x96 = U+0316
0xCC 0xA0 = U+0320
0xCC 0x97 = U+0317
0xCC 0xBC = U+033C
0xCC 0xA9 = U+0329
0xCD 0x95 = U+0355
0xCD 0x87 = U+0347
0xCD 0x89 = U+0349
0xCD 0x93 = U+0353
0xCC 0x9F = U+031F
0xCC 0xA6 = U+0326
0xCD 0x9C = U+035C
0xCD 0x9E = U+035E
0xCD 0x85 = U+0345
0x0A = U+000A
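
A listing in the format shown above can be regenerated with a few lines of Python. This is only a sketch: the function name `dump_utf8` is made up for illustration, and the sample bytes are a handful of sequences copied from the listing rather than the full 700-byte text.

```python
def dump_utf8(data: bytes) -> list:
    """Format UTF-8 bytes as one '0xNN ... = U+XXXX' line per code point."""
    lines = []
    for ch in data.decode("utf-8"):
        encoded = ch.encode("utf-8")  # re-encode to recover this code point's bytes
        hex_bytes = " ".join(f"0x{b:02X}" for b in encoded)
        lines.append(f"{hex_bytes} = U+{ord(ch):04X}")
    return lines


# A few byte sequences taken from the listing: 2-byte, 1-byte and 3-byte forms.
sample = bytes([0xC8, 0xA8, 0xCC, 0xB8, 0xD2, 0x89, 0x78, 0xE2, 0x80, 0x8C])
for line in dump_utf8(sample):
    print(line)
```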

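It is hard to work out which parts of that are graphemes, but with all the stacked characters it is clearly not a fixed amount of data per grapheme. There is also no sensible way to make Unicode work with a fixed-width-per-grapheme encoding because, as the 'Zalgo' examples show, combining marks can be combined in essentially arbitrary sequences.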

The first grapheme in the second 'Zalgo' example consists of:

0xC8 0xA8 = U+0228    LATIN CAPITAL LETTER E WITH CEDILLA
0xCC 0xB8 = U+0338    COMBINING LONG SOLIDUS OVERLAY
0xD2 0x89 = U+0489    COMBINING CYRILLIC MILLIONS SIGN
0xCC 0x9F = U+031F    COMBINING PLUS SIGN BELOW
0xCD 0x8E = U+034E    COMBINING UPWARDS ARROW BELOW
0xCD 0x9A = U+035A    COMBINING DOUBLE RING BELOW
0xCC 0xB9 = U+0339    COMBINING RIGHT HALF RING BELOW
0xCD 0x9A = U+035A    COMBINING DOUBLE RING BELOW
0xCC 0x99 = U+0319    COMBINING RIGHT TACK BELOW
0xCC 0x9F = U+031F    COMBINING PLUS SIGN BELOW
0xCC 0x96 = U+0316    COMBINING GRAVE ACCENT BELOW

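The next code point is U+0078 LATIN SMALL LETTER X, which starts a new grapheme. Note that some of the combining marks are repeated in that list: U+031F and U+035A each appear twice.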
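
The grouping described above can be approximated in code. The sketch below assumes a deliberately simplified rule (a base code point followed by every code point whose general category is a mark, `M*`); it is not the full grapheme cluster algorithm of Unicode Standard Annex #29 and will mis-handle cases such as zero-width joiners or Hangul syllables. The name `rough_clusters` is hypothetical.

```python
import unicodedata


def rough_clusters(text: str) -> list:
    """Group each base code point with the marks (category M*) that follow it."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch      # attach the mark to the current cluster
        else:
            clusters.append(ch)     # a non-mark starts a new cluster
    return clusters


# The eleven code points of the first grapheme listed above, then U+0078 ("x").
text = "\u0228\u0338\u0489\u031F\u034E\u035A\u0339\u035A\u0319\u031F\u0316x"
for cluster in rough_clusters(text):
    size = len(cluster.encode("utf-32-le"))
    print(len(cluster), "code points,", size, "bytes in UTF-32")
```

Under this approximation the first cluster takes 44 bytes in UTF-32 and the second takes 4, so the per-grapheme width varies even though the per-code-point width does not.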

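UTF-32 is a fixed-width encoding and, by the way, it is the only Unicode encoding that maps DWORD values directly to Unicode code points. There is a limit on the values, though: the highest value is 0x10FFFF, and the entire high-surrogate and low-surrogate ranges are invalid in UTF-32.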
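
Those limits can be expressed as a simple range check. The sketch below is illustrative: `is_valid_utf32_unit` and `decodes` are hypothetical helper names, and the second function merely cross-checks the arithmetic against the UTF-32 decoder in CPython 3.4 or later, which is assumed to reject surrogates and out-of-range values.

```python
def is_valid_utf32_unit(value: int) -> bool:
    """True if a 32-bit value is a Unicode scalar value, i.e. valid in UTF-32."""
    in_range = 0 <= value <= 0x10FFFF
    is_surrogate = 0xD800 <= value <= 0xDFFF  # high and low surrogate ranges
    return in_range and not is_surrogate


def decodes(value: int) -> bool:
    """Check the same value against Python's own UTF-32 decoder."""
    try:
        value.to_bytes(4, "little").decode("utf-32-le")
        return True
    except UnicodeDecodeError:
        return False


# Boundary values around the surrogate block and the top of the code space.
for value in (0x41, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x110000):
    print(f"0x{value:06X}", is_valid_utf32_unit(value), decodes(value))
```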