How can I encode an HTML file after reading it with ZipFile?

Question

I am reading a zip file from a URL. The zip file contains an HTML file. Reading the file works fine, but when I print the text I run into a Unicode problem. Python version: 3.8

import requests
from zipfile import ZipFile
from io import BytesIO
from bs4 import BeautifulSoup
from lxml import html

# placeholder URL; requests needs an explicit scheme such as http://
content = requests.get("http://www.url.com")
zf = ZipFile(BytesIO(content.content))
file_name = zf.namelist()[0]
file = zf.open(file_name)

soup = BeautifulSoup(file.read(),'html.parser',from_encoding='utf-8',exclude_encodings='utf-8')
for product in soup.find_all('tr'):
    product = product.find_all('td')
    if len(product) < 2: continue
    print(product[1].text)

I have already tried decoding the file contents with .decode('utf-8') and printing the text, but I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 0: invalid continuation byte

I added from_encoding and exclude_encodings to the BeautifulSoup call, but nothing changed, and I did not get an error either.

Expected output:

ÇEŞİTLİ MADDELER TOPLAMI
Tarçın
Fidanı

What I get:

ÇEÞÝTLÝ MADDELER TOPLAMI
Tarçýn
Fidaný

Answer 1

I had a look at the file: the encoding is not utf-8 but iso-8859-9 (Latin-5, Turkish). Change the encoding and everything will be fine:

soup = BeautifulSoup(file.read(),'html.parser',from_encoding='iso-8859-9')

This outputs: ÇEŞİTLİ MADDELER TOPLAMI
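
To see why the encoding matters, here is a minimal, self-contained sketch that needs no network access. It builds a small zip in memory, so the sample HTML, the member name page.html, and the helper names build_sample_zip and read_first_member are all assumptions made for illustration, not part of the original file. The garbled output in the question is consistent with iso-8859-9 bytes being decoded as Latin-1 (or the closely related Windows-1252), which is what the sketch reproduces:

```python
import io
import zipfile

# Hypothetical sample data standing in for the downloaded archive.
SAMPLE_HTML = "<table><tr><td>1</td><td>ÇEŞİTLİ MADDELER TOPLAMI</td></tr></table>"


def build_sample_zip(text: str, encoding: str) -> bytes:
    """Create an in-memory zip holding one HTML member in the given encoding."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("page.html", text.encode(encoding))
    return buf.getvalue()


def read_first_member(zip_bytes: bytes) -> bytes:
    """Return the raw, undecoded bytes of the first member of the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.read(zf.namelist()[0])


raw = read_first_member(build_sample_zip(SAMPLE_HTML, "iso-8859-9"))

# Wrong codec: Latin-1 maps the Turkish letters' bytes to other characters.
garbled = raw.decode("latin-1")

# Right codec: the bytes are interpreted as they were written.
correct = raw.decode("iso-8859-9")

# UTF-8 rejects these bytes outright, as in the question's traceback.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(garbled)
print(correct)
print(utf8_ok)
```

The same raw bytes can be handed to BeautifulSoup with from_encoding='iso-8859-9', as in the line above. If the encoding of a file is not known in advance, it has to be determined first (for example from a meta charset declaration in the HTML, if one is present) rather than guessed.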


python

unicode

beautifulsoup

character-encoding