Issue with parsing special characters in a utf-8 encoded page with bs4

Question

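I am trying to parse a page, but I am running into problems with special characters such as é, è, à, etc.

According to the Firefox Page Info tool, the page is encoded in UTF-8.

My code is as follows: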

import bs4
import requests


url = 'https://www.registreentreprises.gouv.qc.ca/RQEntrepriseGRExt/GR/GR99/GR99A2_05A_PIU_AfficherMessages_PC/ActiEcon.html'

page = requests.get(url)

cae_obj_soup = bs4.BeautifulSoup(page.text, 'lxml', from_encoding='utf-8')
list_all_domain = cae_obj_soup.find_all('th')

for element in list_all_domain:
    print(element.get_text())

The output is:

PÃªche et piÃ©geage
Exploitation forestiÃ¨re

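I tried changing the encoding to iso-8859-1 (French encoding) and a few other encodings, but without success. I read several posts about parsing special characters, and they basically all say it is a matter of choosing the right encoding. Is it possible that the special characters on some specific web pages cannot be decoded properly, or am I doing something wrong?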

Answer 1

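The requests library decodes the response itself to build page.text: it uses the charset declared in the HTTP headers and, when a text response declares none, falls back to ISO-8859-1, which would produce exactly the Ãª and Ã© sequences shown in the question. On the other hand, BeautifulSoup has powerful tools for determining the encoding of text. So it's better to pass the raw response bytes (r.content) from the request to BeautifulSoup, and let BeautifulSoup try to determine the encoding.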

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('https://www.registreentreprises.gouv.qc.ca/RQEntrepriseGRExt/GR/GR99/GR99A2_05A_PIU_AfficherMessages_PC/ActiEcon.html')
>>> soup = BeautifulSoup(r.content, 'lxml')  # r.content is bytes, so bs4 detects the encoding itself
>>> list_all_domain = soup.find_all('th')
>>> [e.get_text() for e in list_all_domain]
['Agriculture', "Services relatifs à l'agriculture", 'Pêche et piégeage', ...]
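
To see why the question's output looks the way it does, the mechanism can be reproduced without any network access. The snippet below is only a sketch using the standard library: the sample string is copied from the output above, and the variable names are made up for illustration. It assumes the garbling comes from UTF-8 bytes being decoded as ISO-8859-1, which matches the output in the question but has not been checked against the server's actual headers.

```python
# Sample text taken from the expected output; stands in for the page body.
original = 'Pêche et piégeage'

# What the server sends: the text encoded as UTF-8 bytes
# (the equivalent of page.content in requests).
raw = original.encode('utf-8')

# Decoding those bytes with the wrong codec gives the garbled text
# (the equivalent of page.text when ISO-8859-1 is assumed).
wrong_text = raw.decode('iso-8859-1')

# Decoding with the right codec gives the intended text.
right_text = raw.decode('utf-8')

# The garbled string can be repaired by reversing the wrong decode,
# because ISO-8859-1 maps every byte to exactly one character.
repaired = wrong_text.encode('iso-8859-1').decode('utf-8')

print(wrong_text)   # PÃªche et piÃ©geage
print(right_text)   # Pêche et piégeage
print(repaired)     # Pêche et piégeage
```

This also suggests why from_encoding='utf-8' did not help in the question: page.text is already a decoded str, and BeautifulSoup only applies from_encoding to bytes input. Besides passing page.content as shown above, setting page.encoding = 'utf-8' before reading page.text should have the same effect.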


Tags: python, beautifulsoup, character-encoding