What does the notation "U+" mean when discussing Unicode encoding?

Question

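I realize this is very basic, since I have been reading about Unicode on Wikipedia and everywhere it points to. But this "U+0000" notation is never fully explained. It looks to me as if "U" always equals 0.

Why is "U+" part of the notation? What exactly does it mean? (It seems to be some kind of base value, but I don't understand when or why it would be non-zero.)

Also, if I receive a string of text from some other source, how do I know whether it is encoded as UTF-8, UTF-16, or UTF-32? Is there a way to determine that automatically from context?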

Answer 1

From Wikipedia, article Unicode, section Architecture and Terminology:

Unicode defines a codespace of 1,114,112 code points in the range 0 to 10FFFF (hexadecimal). Normally a Unicode code point is referred to by writing "U+" followed by its hexadecimal number. For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g., U+0058 for the character LATIN CAPITAL LETTER X); for code points outside the BMP, five or six digits are used.

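This convention was introduced so that readers understand the code point is specifically a Unicode code point. For example, the letter ă (LATIN SMALL LETTER A WITH BREVE) is U+0103; in code page 852 its code is 0xC7, and in code page 1250 its code is 0xE3, but when I write U+0103 everyone understands that I mean the Unicode code point and can look it up.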
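
To make the distinction concrete, here is a small Python sketch (the helper name `u_plus` is made up for illustration). It prints the U+ notation for a character and then the byte that the same character gets in the two legacy code pages mentioned above, using Python's built-in `cp852` and `cp1250` codecs.

```python
# Illustrative sketch: one character, one Unicode code point,
# but different byte values in different legacy code pages.

def u_plus(char: str) -> str:
    """Format a single character in the conventional U+XXXX notation.

    At least four hex digits are used; code points outside the BMP
    naturally come out with five or six digits.
    """
    return f"U+{ord(char):04X}"

ch = "\u0103"  # LATIN SMALL LETTER A WITH BREVE

print(u_plus(ch))                 # U+0103
print(ch.encode("cp852").hex())   # c7
print(ch.encode("cp1250").hex())  # e3
print(u_plus("\U0001F600"))       # U+1F600 (outside the BMP, five digits)
```

So "U+" is just a prefix marking the hexadecimal number as a Unicode code point, not a base value that gets added to anything.
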
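
As for your second question: for languages written in the Latin alphabet, UTF-16 and UTF-32 strings will most likely contain a lot of bytes with the value 0, which should not normally appear in UTF-8 encoded strings. By looking at which bytes are zero, you can also infer the byte order of UTF-16 and UTF-32 strings, even without a Byte Order Mark.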

For example, if you get the bytes
```
 0xC3 0x89 0x70 0xC3 0xA9 0x65
```
this is most likely "Épée" encoded in UTF-8. In big-endian UTF-16, it would be
```
 0x00 0xC9 0x00 0x70 0x00 0xE9 0x00 0x65
```
(Note that every other byte, starting with the first, is zero.)
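
The zero-byte heuristic can be sketched in Python roughly as follows. This is only an illustration under strong assumptions: the text is mostly Latin-script, there is no Byte Order Mark, and the function name `guess_utf` and the "more than half" threshold are my own choices, not part of any standard. It cannot tell UTF-8 apart from other 8-bit encodings beyond checking that the bytes decode as valid UTF-8.

```python
def guess_utf(data: bytes) -> str:
    """Guess the UTF flavour of mostly-Latin text from where zero bytes fall."""
    # Count zero bytes at each position within a 4-byte group.
    zeros = [0, 0, 0, 0]
    for i, b in enumerate(data):
        if b == 0:
            zeros[i % 4] += 1

    groups = max(1, len(data) // 4)
    # A position counts as "mostly zero" if over half the groups have 0 there.
    mostly_zero = tuple(z > groups / 2 for z in zeros)

    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 XX
        (False, True, True, True): "utf-32-le",   # XX 00 00 00
        (True, False, True, False): "utf-16-be",  # 00 XX 00 XX
        (False, True, False, True): "utf-16-le",  # XX 00 XX 00
    }
    if mostly_zero in patterns:
        return patterns[mostly_zero]

    if not any(mostly_zero):
        # Few or no zero bytes: plausible UTF-8 if the bytes decode cleanly.
        try:
            data.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            return "unknown"
    return "unknown"


sample = "\u00c9p\u00e9e"  # "Épée"
for enc in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    print(enc, "->", guess_utf(sample.encode(enc)))
```

For the sample word each encoding is guessed correctly, but text in non-Latin scripts has few zero bytes even in UTF-16, so a guess like this should be treated as a hint rather than a reliable answer.
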
