Python 3.8: Escape non-ascii characters as unicode

Question

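I have input and output text files that can contain non-ASCII characters. Sometimes I need to escape them, and sometimes I need to write the non-ASCII characters themselves. Basically, if I get "Bürgerhaus", I need to output "B\u00FCrgerhaus". If I get "B\u00FCrgerhaus", I need to output "Bürgerhaus".

One direction works fine: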

>>> s1 = "B\u00FCrgerhaus"
>>> print(s1)
Bürgerhaus

In the other direction, however, I do not get the expected result ('B\u00FCrgerhaus'):

>>> s2 = "Bürgerhaus"
>>> s2_trans = s2.encode('utf8').decode('unicode_escape')
>>> print(s2_trans)
BÃ¼rgerhaus

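I read that unicode-escape needs latin-1, so I tried encoding to that instead, but that did not produce the expected result either. What am I doing wrong?

(PS: Thanks to Matthias for reminding me that the conversion in the first example was unnecessary.)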

Answer 1

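You can only decode() byte strings (bytes) to [unicode] strings and, conversely, encode() [unicode] strings to bytes.

So if you want to decode a string that was escaped with unicode-escape, you first need to convert (encode()) it to a byte string, for example using latin1 as you mentioned in your question.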

>>> encoded_str = 'B\\xfcrgerhaus'
>>> encoded = encoded_str.encode('latin-1')
>>> encoded
b'B\\xfcrgerhaus'
>>> encoded.decode('unicode-escape')
'Bürgerhaus'
>>> _.encode('unicode-escape')
b'B\\xfcrgerhaus'
>>> _ == encoded
True

See also: how do I .decode('string-escape') in Python3?
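As a rough sketch of how the two directions from the question could be wrapped up, the helpers below use only str.encode() and bytes.decode() with the standard unicode_escape codec. The names escape_non_ascii and unescape are made up for this illustration. Note that unicode_escape writes ü as \xfc rather than \u00FC, and it also escapes backslashes, newlines and tabs, so it may not match the exact output format the question asks for.

```python
def escape_non_ascii(text: str) -> str:
    """Hypothetical helper: write non-ASCII characters as backslash escapes."""
    # unicode_escape yields ASCII-only bytes, so decoding them as ASCII is safe.
    return text.encode("unicode_escape").decode("ascii")


def unescape(text: str) -> str:
    """Hypothetical helper: turn backslash escapes back into characters."""
    # latin-1 maps code points 0-255 to single bytes one-to-one, which is the
    # byte layout the unicode_escape decoder expects. Input that contains
    # characters above U+00FF would raise UnicodeEncodeError here.
    return text.encode("latin-1").decode("unicode_escape")


escaped = escape_non_ascii("Bürgerhaus")
print(escaped)                       # B\xfcrgerhaus
print(unescape(escaped))             # Bürgerhaus
print(unescape("B\\u00FCrgerhaus"))  # Bürgerhaus
```

The decoder accepts both the \xfc and the \u00FC spelling, so unescape() should work on either form as long as the input itself is plain ASCII or latin-1.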

Answer 2

You can do it like this:

charList = []
s1 = "Bürgerhaus"

for i in [ord(x) for x in s1]:
    # Keep ASCII characters as they are; write every other character
    # as a \uXXXX escape built from its code point in hex.
    if i < 128:  # not sure if that is right or can be made easier!
        charList.append(chr(i))
    else:
        # The backslash must be doubled, otherwise '\u' is a syntax error.
        # %04x only covers code points up to U+FFFF.
        charList.append('\\u%04x' % i)

res = ''.join(charList)
print(f"Mixed up string: {res}")

for myStr in (res, s1):
    # Only decode strings that actually contain a literal \u escape.
    if '\\u' in myStr:
        print(myStr.encode().decode('unicode-escape'))
    else:
        print(myStr)

Output:

Mixed up string: B\u00fcrgerhaus
Bürgerhaus
Bürgerhaus

Explanation:

We convert each character to its corresponding Unicode code point.

print([(c, ord(c)) for c in s1])
[('B', 66), ('ü', 252), ('r', 114), ('g', 103), ('e', 101), ('r', 114), ('h', 104), ('a', 97), ('u', 117), ('s', 115)]

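Regular ASCII characters have decimal values < 128; other characters, such as the Euro sign or German umlauts, have values >= 128 (detailed table here).

Now we 'encode' every character >= 128 as its corresponding unicode escape.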
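The same idea can be packaged as a small function. This is only a sketch: the name escape_to_u is invented here, the uppercase hex digits are chosen to match the 'B\u00FCrgerhaus' format from the question, and the \UXXXXXXXX branch for code points above U+FFFF is an addition that the loop above does not have.

```python
def escape_to_u(text: str) -> str:
    """Hypothetical helper: keep ASCII, write everything else as \\uXXXX."""
    parts = []
    for ch in text:
        cp = ord(ch)  # Unicode code point of the character
        if cp < 128:
            parts.append(ch)               # plain ASCII stays as it is
        elif cp <= 0xFFFF:
            parts.append(f"\\u{cp:04X}")   # four hex digits
        else:
            parts.append(f"\\U{cp:08X}")   # eight hex digits beyond U+FFFF
    return "".join(parts)


escaped = escape_to_u("Bürgerhaus")
print(escaped)  # B\u00FCrgerhaus

# The result is pure ASCII, so it can be decoded back with unicode_escape.
print(escaped.encode("ascii").decode("unicode_escape"))  # Bürgerhaus
```

Unlike the unicode_escape codec, this sketch leaves backslashes that are already in the input untouched, so a string containing a literal backslash would not round-trip cleanly.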
