How to open an HTML file as UTF-8 for parsing?

Question

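I am trying to parse an HTML file with BeautifulSoup and Python 3, but I get a UTF-8 decoding error. I have tried decoding the file contents as UTF-8 after opening it, but the error still occurs.

How can I fix this?

This is what I have so far.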

from bs4 import BeautifulSoup

with open("file.html") as fp:
    unicode_html = fp.read().decode('utf-8', 'ignore')

soup = BeautifulSoup(unicode_html)

Traceback (most recent call last):
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 30287: invalid continuation byte

Answer 1

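The default mode of open() is rt, which reads the file in text mode. Use rb to read it in binary mode. As written, the decoder is being fed text that has already been decoded, which it is unlikely to accept.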
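The difference between the two modes can be seen in a small sketch. The sample bytes and the temporary file are assumptions made for illustration, not the asker's data: the byte 0xf3 is "ó" in Latin-1 and is not valid UTF-8 when followed by an ASCII character, which matches the byte named in the traceback. In text mode the decoding happens inside fp.read(), which is consistent with the traceback pointing into codecs.py.

```python
import os
import tempfile

# Hypothetical sample: "ó" encoded as Latin-1 is the single byte 0xf3.
raw = "<p>canci\u00f3n</p>".encode("latin-1")

with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

text_mode_error = None
try:
    # Binary mode: read() returns bytes, and bytes have a .decode() method.
    with open(path, "rb") as fp:
        data = fp.read()
    print(type(data).__name__)  # bytes

    # 'ignore' silently drops the byte that is not valid UTF-8.
    text = data.decode("utf-8", "ignore")
    print(text)

    # Text mode: decoding happens while reading, so read() itself raises.
    try:
        with open(path, encoding="utf-8") as fp:
            fp.read()
    except UnicodeDecodeError as exc:
        text_mode_error = exc
        print("text mode failed:", exc.reason)
finally:
    os.remove(path)
```

Had the text-mode read succeeded, it would have returned a str, and calling .decode() on a str is what produces the AttributeError.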

The UnicodeDecodeError may be caused by the output device (such as the console) not supporting the encoding.

When I run it from the command prompt, the error output is

AttributeError: 'str' object has no attribute 'decode'

which looks like the more accurate error. I also used a shebang of

#!/usr/bin/env python3 -X utf8

which makes Python output UTF-8, and that is how I got the AttributeError.

Change the line:

with open("file.html") as fp:

to

with open("file.html", "rb") as fp:
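Put together, the change might look like the sketch below. The helper names decode_html and load_soup are hypothetical, and choosing errors="replace" and the "html.parser" backend are assumptions rather than part of the original answer. If the file is not actually UTF-8 (a byte 0xf3 is "ó" in Latin-1, for example), passing the real encoding would preserve the characters instead of replacing them.

```python
def decode_html(data, encoding="utf-8", errors="replace"):
    """Decode raw bytes to str.

    With errors="replace", bytes that are invalid in the given encoding
    become U+FFFD instead of raising UnicodeDecodeError.
    """
    return data.decode(encoding, errors)


def load_soup(path, encoding="utf-8"):
    """Hypothetical helper: read a file as bytes, decode it, and parse it."""
    # Third-party dependency (beautifulsoup4), imported here so that
    # decode_html can be used without it.
    from bs4 import BeautifulSoup

    with open(path, "rb") as fp:  # "rb": fp.read() returns bytes
        html = decode_html(fp.read(), encoding)
    return BeautifulSoup(html, "html.parser")
```

Calling load_soup("file.html") would then return the parsed document. BeautifulSoup also accepts bytes directly and tries to detect the encoding itself, so passing fp.read() without decoding is another option.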


Tags: beautifulsoup, html-parsing, python-3.x