Encoding error whilst asynchronously scraping website

I have the following code:

import aiohttp
import asyncio
from bs4 import BeautifulSoup


async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('https://www.pro-football-reference.com/years/2021/defense.htm') as response:
            soup = BeautifulSoup(await response.text(), features="lxml")
            print(soup)

asyncio.run(main())

However, it gives me the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdf in position 2170985: invalid continuation byte on the await response.text() line. I think the problem is that the url ends in .htm rather than .com.

Is there any workaround for this?
Note: I don't want to use response.read()

The site's headers claim the page is encoded as UTF-8, but evidently it is not:

$ curl --head --silent https://www.pro-football-reference.com/years/2021/defense.htm  | grep -i charset
content-type: text/html; charset=UTF-8

Let's examine the content:

>>> r = requests.get('https://www.pro-football-reference.com/years/2021/defense.htm')
>>> r.content[2170980:2170990]
b'/">Fu\xdfball'

It looks like this should be "Fußball", which, encoded as UTF-8, would be b'Fu\xc3\x9fball'.
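A quick check in the interpreter (my own illustration, not from the original post) confirms that:

>>> 'Fußball'.encode('utf-8')
b'Fu\xc3\x9fball'
>>> 'Fußball'.encode('latin-1')
b'Fu\xdfball'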

If we look up 0xdf in Triplee's Table of Legacy 8-bit Encodings, we find that it represents "ß" in any of the following encodings:

cp1250, cp1252, cp1254, cp1257, cp1258, iso8859_10, iso8859_13, iso8859_14, iso8859_15, iso8859_16, iso8859_2, iso8859_3, iso8859_4, iso8859_9, latin_1, palmos

Absent any other information, I would go with latin-1 as the encoding; however, it may be simpler to pass request.content to Beautiful Soup and let it handle the decoding.
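Applied to the aiohttp code from the question, that means either overriding the encoding when decoding the body, or handing the raw bytes to Beautiful Soup. A minimal sketch, assuming aiohttp's response.text(encoding=...) override and the lxml parser from the original snippet:

import aiohttp
import asyncio
from bs4 import BeautifulSoup


async def main():
    url = 'https://www.pro-football-reference.com/years/2021/defense.htm'
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            # Override the (incorrect) charset advertised in the response headers.
            html = await response.text(encoding='latin-1')
            # Alternatively: BeautifulSoup(await response.read(), features="lxml")
            # hands the raw bytes to Beautiful Soup and lets it sniff the encoding.
            soup = BeautifulSoup(html, features="lxml")
            print(soup.title)

asyncio.run(main())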

I wonder why you don't just use pandas here:

import pandas as pd

url = 'https://www.pro-football-reference.com/years/2021/defense.htm'

df = pd.read_html(url, header=1)[0]
df = df[df['Rk'].ne('Rk')]  # drop the repeated header rows embedded in the table
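If the fetch still needs to be asynchronous, the two approaches combine naturally. A sketch under my own assumptions (the latin-1 override and the StringIO wrapper are mine, not part of the answer above):

import asyncio
from io import StringIO

import aiohttp
import pandas as pd

url = 'https://www.pro-football-reference.com/years/2021/defense.htm'


async def fetch_defense_table():
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            # Decode explicitly; latin-1 maps every byte, so it cannot raise.
            html = await response.text(encoding='latin-1')
    # read_html parses tables from an HTML string just as it does from a URL.
    df = pd.read_html(StringIO(html), header=1)[0]
    return df[df['Rk'].ne('Rk')]


df = asyncio.run(fetch_defense_table())
print(df.head())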