Why is UTF-8 encoded the way it is?

If I understand correctly, UTF-8 uses the following pattern to tell the computer how many bytes are used to encode a character:

Byte 1	Byte 2	Byte 3	Byte 4
0xxxxxxx
110xxxxx	10xxxxxx
1110xxxx	10xxxxxx	10xxxxxx
11110xxx	10xxxxxx	10xxxxxx	10xxxxxx

And so on. But isn't there a more compact pattern? For example, what stops us from using something like this:

Byte 1	Byte 2	Byte 3	Byte 4
0xxxxxxx
10xxxxxx	xxxxxxxx
110xxxxx	xxxxxxxx	xxxxxxxx
1110xxxx	xxxxxxxx	xxxxxxxx	xxxxxxxx

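The encoding you propose would not be self-synchronizing. If you land in the middle of the stream on an xxxxxxxx byte, you have no way of knowing whether it sits in the middle of a character. If that arbitrary byte happens to look like 10xxxxxx, you could mistake it for the start of a character. The only way to avoid this mistake is to read the entire stream from the very beginning, without errors.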
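To make the ambiguity concrete, here is a minimal Python sketch of the compact scheme from the question. The names `proposed_length` and `boundaries` and the sample bytes are made up for illustration; they do not belong to any real encoding or library.

```python
def proposed_length(lead: int) -> int:
    """Character length in the question's compact scheme, judged from the first byte."""
    if lead < 0x80:   # 0xxxxxxx
        return 1
    if lead < 0xC0:   # 10xxxxxx
        return 2
    if lead < 0xE0:   # 110xxxxx
        return 3
    if lead < 0xF0:   # 1110xxxx
        return 4
    raise ValueError("not a valid lead byte in this hypothetical scheme")


def boundaries(stream: bytes, start: int) -> list:
    """Offsets a decoder would treat as character starts when it begins at `start`."""
    positions = []
    i = start
    while i < len(stream):
        positions.append(i)
        i += proposed_length(stream[i])
    return positions


# One two-byte character (0x81 0x82) followed by ASCII 'A' and 'B'.
stream = bytes([0x81, 0x82, 0x41, 0x42])

from_start = boundaries(stream, 0)   # [0, 2, 3]: the correct character starts
from_middle = boundaries(stream, 1)  # [1, 3]: 0x82 is misread as a lead byte
```

Starting at offset 1, the payload byte 0x82 is taken for the lead byte of a two-byte character, so the 'A' that follows is silently swallowed and nothing signals an error.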

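Being self-synchronizing is an explicit design goal of UTF-8. Wherever you land in a UTF-8 stream, you can tell whether you are in the middle of a character, and you need to skip at most 3 bytes to reach the start of the next complete character.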
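For comparison, resynchronizing in real UTF-8 only requires skipping continuation bytes, which always match 10xxxxxx. The following is a small sketch in Python; the helper name `next_char_start` is hypothetical, and it assumes the input is well-formed UTF-8 with characters of at most 4 bytes.

```python
def next_char_start(data: bytes, pos: int) -> int:
    """Return the index of the first byte at or after `pos` that is not a continuation byte."""
    # Continuation bytes have the top two bits 10, i.e. (byte & 0xC0) == 0x80.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


# 'a' (1 byte), U+20AC (3 bytes), U+1F600 (4 bytes), 'b' (1 byte)
text = "a\u20ac\U0001F600b".encode("utf-8")

mid_three_byte = next_char_start(text, 2)  # 4: start of the 4-byte character
mid_four_byte = next_char_start(text, 5)   # 8: skipped 3 continuation bytes
on_lead_byte = next_char_start(text, 4)    # 4: already at a character start
```

The worst case is landing on the first continuation byte of a 4-byte character, which is where the limit of 3 skipped bytes comes from.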