How to initialize a UTF-16 character in code?

Question

I'm using Python 3 to ease the pain of dealing with Unicode, and I can print a UTF-8 character like this:

>>> print (u'\u1010')
တ

But when I try to do the same with UTF-16, say for U+20000, u'\u20000' turns out to be the wrong way to initialize the character:

>>> print (u'\u20000')
    0
>>> print (list(u'\u20000'))
['\u2000', '0']

It is read as 2 UTF-8 characters instead.

I also tried a capital U, i.e. u'\U20000', but that throws an escape error:

>>> print (u'\U20000')
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-6: truncated \UXXXXXXXX escape

A capital U outside the string doesn't work either:

>>> print (U'\u20000')
 0
>>> print (U'\U20000')
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-6: truncated \UXXXXXXXX escape

Answer 1

As @Mark Ransom commented, Python's \U escape needs exactly eight hexadecimal digits to work.

So the Python code to use is:

u"\U00020000"

As listed on this page:

Python source code u"\U00020000"
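
As a quick check, here is a minimal sketch for Python 3 (the variable names are only illustrative, and the u prefix is optional there) suggesting that the eight-digit form yields a single code point:

```python
# Eight hex digits after \U give one code point, U+20000.
ch = "\U00020000"

print(len(ch))       # number of code points in the string
print(hex(ord(ch)))  # numeric value of that code point

# chr() builds the same character from an integer, with no escape needed.
same = chr(0x20000)
print(ch == same)
```

chr() can be handy when the code point is only known at run time.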

Answer 2

Those aren't UTF-8 and UTF-16 literals, just Unicode literals, and they mean the same thing:

>>> print(u'\u1010')
တ
>>> print(u'\U00001010')
တ
>>> print(u'\u1010' == u'\U00001010')
True

The second form simply lets you specify a code point above U+FFFF.
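
To see why the attempt in the question produced two characters, the following sketch (Python 3 assumed, names illustrative) compares how the two escapes consume hex digits:

```python
# \u takes exactly four hex digits, so the trailing "0" is a separate character.
four_digit = "\u20000"
print(len(four_digit))
print(four_digit == "\u2000" + "0")

# \U takes exactly eight hex digits, so it can reach code points above U+FFFF.
eight_digit = "\U00020000"
print(len(eight_digit))
```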

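The simplest way: encode your source file as UTF-8 (or UTF-16), and then you can just write u"တ" and u"𠀀".

UTF-8 and UTF-16 are ways of encoding those code points as bytes. Technically, in UTF-8 this character is b"\xf0\xa0\x80\x80" (which I would probably write as u"𠀀".encode("utf-8")).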
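
A small sketch of that distinction, assuming Python 3: the string holds one code point, and UTF-8 or UTF-16 only come into play when it is encoded to bytes. The "utf-16-be" codec is chosen here purely so the output has no byte order mark.

```python
ch = "\U00020000"  # one code point, independent of any encoding

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")  # big-endian, no byte order mark

print(utf8)         # b'\xf0\xa0\x80\x80'
print(utf16.hex())  # d840dc00, i.e. the surrogate pair D840 DC00

# Decoding the bytes gives back the same single-character string.
print(utf16.decode("utf-16-be") == ch)
```

In UTF-16 a code point above U+FFFF is stored as two 16-bit units (a surrogate pair), which is why the encoded form is four bytes long.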

Tags: python, string, unicode, character, utf-16