How does UTF-16 achieve self-synchronization?

I know that UTF-16 is a self-synchronizing encoding scheme. I also read the Wikipedia passage below, but did not quite get it.

Can you please explain it with an example in UTF-16?

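In UTF-16, characters outside the BMP are represented by a surrogate pair in which the first code unit (CU) lies in the range 0xD800 to 0xDBFF and the second in the range 0xDC00 to 0xDFFF. Each CU of the pair carries 10 bits of the code point (more precisely, of the code point minus 0x10000). Characters in the BMP are encoded as themselves, in a single CU.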
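The bit split described above can be sketched in a few lines of Python. This is only an illustration: the function name `encode_utf16` is made up for this example, and code units are modeled as plain integers rather than bytes.

```python
def encode_utf16(cp):
    """Return the UTF-16 code units (as ints) for one Unicode code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % cp)
    if cp < 0x10000:
        return [cp]                      # BMP: encoded as itself
    v = cp - 0x10000                     # 20-bit value to split
    high = 0xD800 | (v >> 10)            # top 10 bits    -> 0xD800..0xDBFF
    low = 0xDC00 | (v & 0x3FF)           # bottom 10 bits -> 0xDC00..0xDFFF
    return [high, low]


print([hex(u) for u in encode_utf16(0x1F600)])  # ['0xd83d', '0xde00']
print([hex(u) for u in encode_utf16(0x4E2D)])   # ['0x4e2d']
```

Because the two halves of a pair always land in two separate ranges that no BMP character uses, a single code unit is enough to tell which role it plays.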

Now synchronization is easy. Given the position of any code unit:

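If the code unit is in the range 0xD800 to 0xDBFF, it is the first of the two code units (the high surrogate). Just read the next one and decode the pair. Voilà, we have a complete character outside the BMP.
If the code unit is in the range 0xDC00 to 0xDFFF, it is the second of the two code units (the low surrogate). Either go back one unit to read the first half, or advance to the next unit to skip the current character.
If it is in neither range, it is a character in the BMP and nothing more needs to be done.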
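The three cases above can be written down as a small sketch. The helper names `char_start` and `decode_at` are hypothetical, code units are modeled as a list of integers, and unpaired surrogates are simply returned as is rather than reported as errors.

```python
def is_high(u):
    return 0xD800 <= u <= 0xDBFF


def is_low(u):
    return 0xDC00 <= u <= 0xDFFF


def char_start(units, i):
    """Index of the first code unit of the character that covers units[i]."""
    if is_low(units[i]) and i > 0 and is_high(units[i - 1]):
        return i - 1                     # we landed on the second half
    return i                             # high surrogate or BMP character


def decode_at(units, i):
    """Return (code_point, start_index, length) for the character at index i."""
    start = char_start(units, i)
    u = units[start]
    if is_high(u) and start + 1 < len(units) and is_low(units[start + 1]):
        cp = 0x10000 + ((u - 0xD800) << 10) + (units[start + 1] - 0xDC00)
        return cp, start, 2
    return u, start, 1                   # BMP character (or lone surrogate)


units = [0x0041, 0xD83D, 0xDE00, 0x4E2D]   # "A", U+1F600, U+4E2D
print(decode_at(units, 2))                 # (128512, 1, 2), i.e. U+1F600
print(decode_at(units, 3))                 # (20013, 3, 1), i.e. U+4E2D
```

Note that `char_start` looks at most one code unit back, no matter how long the string is. That bounded lookup is what self-synchronization buys.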

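In UTF-16 the CU is the unit, i.e. the smallest element. We work at the CU level and read CUs one by one instead of byte by byte. Because of that, along with historical reasons, UTF-16 is only self-synchronizing at the CU level.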
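To see why synchronization holds only at the code-unit level, here is a rough sketch that reads the same little-endian byte stream from an even and from an odd byte offset. The helper `units_from` is made up for this example, and UTF-16LE is assumed for the byte order.

```python
import struct

data = "AB\u4e2d".encode("utf-16-le")    # bytes: 41 00 42 00 2D 4E


def units_from(buf, offset):
    """Read as many 16-bit little-endian units as fit, starting at a byte offset."""
    count = (len(buf) - offset) // 2
    return list(struct.unpack_from("<%dH" % count, buf, offset))


print([hex(u) for u in units_from(data, 0)])  # ['0x41', '0x42', '0x4e2d']
print([hex(u) for u in units_from(data, 1)])  # ['0x4200', '0x2d00']
```

Starting one byte off yields 0x4200 and 0x2D00, which are perfectly ordinary BMP values, so nothing in the data itself reveals that the reader is misaligned.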

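The point of self-synchronization is to know immediately whether we are in the middle of something, instead of having to read again from the start and check. UTF-16 allows us to do that.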

Since the ranges for the high surrogates, low surrogates, and valid BMP characters are disjoint, it is not possible for a surrogate to match a BMP character, or for (parts of) two adjacent characters to look like a legal surrogate pair. This simplifies searches a great deal. It also means that UTF-16 is self-synchronizing on 16-bit words: whether a code unit starts a character can be determined without examining earlier code units. UTF-8 shares these advantages, but many earlier multi-byte encoding schemes (such as Shift JIS and other Asian multi-byte encodings) did not allow unambiguous searching and could only be synchronized by re-parsing from the start of the string (UTF-16 is not self-synchronizing if one byte is lost or if traversal starts at a random byte).

https://en.wikipedia.org/wiki/UTF-16#Description

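Of course, that means UTF-16 might not be suitable for a medium without error correction or detection, such as a bare network environment. In a proper local environment, however, it is a lot better than working without self-synchronization. For example, in DOS/V for Japanese, every time you press Backspace you must iterate from the start to know which character was deleted, because in the awful Shift-JIS encoding there is no way to know how long the character before the cursor is without a length map.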
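For contrast with the Shift-JIS case, a Backspace over a UTF-16 buffer only ever needs to inspect the last one or two code units. The following is a minimal sketch under the assumption that the edit buffer is a Python list of code units; `backspace` is a hypothetical name, not an API of any real editor.

```python
def backspace(units):
    """Delete the last character from a list of UTF-16 code units, in place."""
    if not units:
        return
    last = units.pop()
    # If we just removed a low surrogate, its high surrogate goes with it.
    if 0xDC00 <= last <= 0xDFFF and units and 0xD800 <= units[-1] <= 0xDBFF:
        units.pop()


buf = [0x0041, 0xD83D, 0xDE00]   # "A" followed by U+1F600
backspace(buf)
print([hex(u) for u in buf])     # ['0x41']
```

The cost is constant regardless of how much text precedes the cursor, with no rescan from the beginning of the line.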

unicode

utf-16

character-encoding

data-synchronization