How to convert UTF-8 code to symbol characters in Python

Question

I used Python's urllib.request API to fetch some web pages and saved what I read to a new file.

        f = open(docId + ".html", "w+")
        with urllib.request.urlopen('http://whosebug.com') as u:
              s = u.read()
              f.write(str(s))

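But when I open the saved file, I see many strings such as \xe2\x86\x90, which was an arrow symbol in the original page. It looks like the UTF-8 encoding of the symbol, but how do I convert the encoding back to the symbol?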

Answer 1

Try this (Python 2 code; in Python 3, urllib2.urlopen became urllib.request.urlopen):

import urllib2, io

with io.open("test.html", "w", encoding='utf8') as fout:
    s = urllib2.urlopen('http://whosebug.com').read()
    s = s.decode('utf8', 'ignore') # or s.decode('utf8', 'replace')
    fout.write(s)

See https://docs.python.org/2/howto/unicode.html
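
For Python 3, the same decode-then-write idea might look like the sketch below. It works on an in-memory byte string instead of a live download so it runs offline; `decode_page` is a hypothetical helper name, and the sample bytes (an arrow followed by one invalid byte) are made up to show how 'replace' and 'ignore' differ.

```python
import tempfile
from pathlib import Path


def decode_page(raw, encoding="utf-8", errors="replace"):
    """Turn downloaded bytes into text; `errors` decides what happens to invalid bytes."""
    return raw.decode(encoding, errors)


# In real use this would be urllib.request.urlopen(url).read().
# Hypothetical sample: the UTF-8 arrow, some ASCII, then one invalid byte.
raw = b"\xe2\x86\x90 back \xff"

replaced = decode_page(raw, errors="replace")  # invalid byte becomes U+FFFD
ignored = decode_page(raw, errors="ignore")    # invalid byte is dropped

# Write the text with an explicit encoding, then read it back to check it.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "test.html"
    path.write_text(replaced, encoding="utf-8")
    round_trip = path.read_text(encoding="utf-8")
```

'replace' keeps a visible marker where data was lost, while 'ignore' silently drops it, so 'replace' is usually easier to debug.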

Answer 2

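Your code is broken: u.read() returns a bytes object. str(bytes_object) returns the string representation of the object (what the bytes literal looks like in source code), which is not what you want: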

>>> str(b'\xe2\x86\x90')
"b'\\xe2\\x86\\x90'"
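
If a file has already been written this way and re-downloading is not an option, the text can probably still be recovered, because what was saved is a valid bytes literal. The sketch below is one possible approach, not something from the answer itself: `recover_text` is a hypothetical helper, and it assumes the whole file content is a single `b'...'` literal encoded as UTF-8.

```python
import ast


def recover_text(damaged, encoding="utf-8"):
    """Parse the saved "b'...'" literal back into bytes, then decode those bytes."""
    raw = ast.literal_eval(damaged)  # safely evaluates literals only, no arbitrary code
    if not isinstance(raw, bytes):
        raise ValueError("expected the text of a bytes literal")
    return raw.decode(encoding)


original = "\u2190 back".encode("utf-8")  # what u.read() would have returned
damaged = str(original)                   # what the question's code wrote to disk
recovered = recover_text(damaged)
```

Here `damaged` contains literal backslash sequences such as \xe2, and `recovered` contains the actual arrow character again.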

Save the bytes to disk as-is:

import urllib.request

urllib.request.urlretrieve('http://whosebug.com', 'so.html')

Or open the file in binary mode ('wb') and save it manually:

import shutil
from urllib.request import urlopen

with urlopen('http://whosebug.com') as u, open('so.html', 'wb') as file:
    shutil.copyfileobj(u, file)

Or convert the bytes to Unicode and save them to disk using any encoding you like:

import io
import shutil
from urllib.request import urlopen

with urlopen('http://whosebug.com') as u, \
     open('so.html', 'w', encoding='utf-8', newline='') as file, \
     io.TextIOWrapper(u, encoding=u.headers.get_content_charset('utf-8'), newline='') as t:
    shutil.copyfileobj(t, file)
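
To see what the two key pieces of that snippet do without touching the network, here is a rough offline sketch. It relies on `u.headers` being an `email.message.Message` subclass, so a plain `Message` stands in for it; the Content-Type value and the sample bytes are invented for illustration, and `decode_stream` is a hypothetical helper.

```python
import io
import shutil
from email.message import Message


def decode_stream(binary_stream, charset):
    """Wrap a binary stream in a text layer and copy the decoded text out."""
    out = io.StringIO()
    # newline="" leaves the page's line endings untouched while reading.
    with io.TextIOWrapper(binary_stream, encoding=charset, newline="") as text:
        shutil.copyfileobj(text, out)
    return out.getvalue()


# Stand-in for u.headers, with a made-up Content-Type header.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"

# The argument is the fallback used when the server declares no charset.
charset = headers.get_content_charset("utf-8")
fallback = Message().get_content_charset("utf-8")

# Stand-in for the response body: Latin-1 bytes with a Windows line ending.
body = io.BytesIO("caf\u00e9\r\n".encode("latin-1"))
text = decode_stream(body, charset)
```

The declared charset decides how the bytes are decoded; the encoding passed to open() for the output file is an independent choice.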
