Python: convert strings containing unicode code points back into normal characters

Question

I am using the requests module to scrape text from a website and store it in a txt file, like this:

r = requests.get(url)
with open("file.txt","w") as filename:
        filename.write(r.text)

With this method, if "送分200000" is the only string that requests gets from the url, it is decoded and stored in file.txt as shown below.

\u9001\u5206200000

When I later read the string back from file.txt, it is not converted back to "送分200000". Instead, it stays as "\u9001\u5206200000" when I try to print it. For example:


with open("file.txt", "r") as filename:
        mystring = filename.readline()
        print(mystring)

Output:
"\u9001\u5206200000"

Is there a way to convert this string, and others like it, back to the original string with unicode characters?

Answer 1

It is better to use the io module. Try adapting the following code to your problem.

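import io

filename = "file.txt"  # path of the file to read and rewrite

# Read the whole file, decoding its bytes as UTF-8
with io.open(filename, 'r', encoding='utf8') as f:
    text = f.read()
# process Unicode text
# Write the text back, encoding it as UTF-8
with io.open(filename, 'w', encoding='utf8') as f:
    f.write(text)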

Taken from https://www.tutorialspoint.com/How-to-read-and-write-unicode-UTF-8-files-in-Python

Answer 2

"convert this string, and others like it, back to the original string with unicode characters?"

Yes. Let the content of file.txt be

\u9001\u5206200000

then

with open("file.txt","rb") as f:
    content = f.read()
text = content.decode("unicode_escape")
print(text)

outputs

送分200000

If you want to know more, read the Text Encodings section in the docs for the built-in codecs module.
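The same decoding step can be sketched for a string that is already in memory rather than read from a file. This is a minimal illustration, not a general solution: the helper name `unescape` is hypothetical, and it assumes the text is pure ASCII plus `\uXXXX` escapes. The `unicode_escape` codec treats its input bytes as Latin-1, so real non-ASCII characters mixed into the text would be mangled, which is why the sketch encodes with `"ascii"` and fails loudly instead.

```python
def unescape(escaped: str) -> str:
    """Turn literal \\uXXXX escape sequences into the characters they name.

    Assumes `escaped` is pure ASCII; str.encode("ascii") raises
    UnicodeEncodeError if it is not.
    """
    return escaped.encode("ascii").decode("unicode_escape")


# Raw string literal: the backslashes are kept as literal characters,
# just as they would be after reading the line from file.txt.
raw = r"\u9001\u5206200000"
text = unescape(raw)
print(text)  # 送分200000
```

Here `raw` is 18 literal characters long, while `text` is the 8-character string the question was after.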

Answer 3

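I'm guessing you are using Windows. When you open a file without specifying an encoding, you get the platform's default encoding, which on Windows is typically Windows-1252. Specify the encoding when opening the file: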

with open("file.txt","w", encoding="UTF-8") as filename:
        filename.write(r.text)
with open("file.txt", "r", encoding="UTF-8") as filename:
        mystring = filename.readline()
        print(mystring)

This works as you want regardless of platform.
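A small round-trip sketch of this approach, under some assumptions: the string is written to a scratch file in a temporary directory (a hypothetical path, so no real file.txt is touched), and the text being written already holds the real characters. If the server itself sent literal `\u` escapes, the file would still contain them, and the decoding step from Answer 2 would still be needed.

```python
import os
import tempfile

original = "送分200000"

# Hypothetical scratch location used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "file.txt")

# Write with an explicit encoding instead of the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write(original)

# Read back with the same explicit encoding.
with open(path, "r", encoding="utf-8") as f:
    restored = f.readline()

# Inspect the raw bytes on disk: the two Chinese characters take
# 3 bytes each in UTF-8, and the six digits take 1 byte each.
with open(path, "rb") as f:
    raw_bytes = f.read()

print(restored == original)  # True
print(len(raw_bytes))        # 12
```

Passing the same `encoding` on both the write and the read is what makes the result independent of the platform default.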
