Encoding error whilst asynchronously scraping website

I have the following code:

import aiohttp
import asyncio
from bs4 import BeautifulSoup


async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('https://www.pro-football-reference.com/years/2021/defense.htm') as response:
            soup = BeautifulSoup(await response.text(), features="lxml")
            print(soup)

asyncio.run(main())

However, it gives me the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdf in position 2170985: invalid continuation byte on the await response.text() line. I think the problem is that the url ends in .htm rather than .com.

Is there any workaround for this?
Note: I don't want to use response.read()

The site's headers claim the page is encoded as UTF-8, but evidently it is not:

$ curl --head --silent https://www.pro-football-reference.com/years/2021/defense.htm  | grep -i charset
content-type: text/html; charset=UTF-8

Let's examine the content:

>>> r = requests.get('https://www.pro-football-reference.com/years/2021/defense.htm')
>>> r.content[2170980:2170990]
b'/">Fu\xdfball'

It looks like this should be "Fußball", which, encoded as UTF-8, would be b'Fu\xc3\x9fball'.
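A quick check in the interpreter (my own illustration, not from the original post) confirms that:

>>> 'Fußball'.encode('utf-8')
b'Fu\xc3\x9fball'
>>> 'Fußball'.encode('latin-1')
b'Fu\xdfball'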

If we look up 0xdf in Triplee's Table of Legacy 8-bit Encodings, we find that it represents "ß" in any of the following encodings:

cp1250, cp1252, cp1254, cp1257, cp1258, iso8859_10, iso8859_13, iso8859_14, iso8859_15, iso8859_16, iso8859_2, iso8859_3, iso8859_4, iso8859_9, latin_1, palmos

Absent any other information, I would go with latin-1 as the encoding; however, it may be simpler to pass request.content to Beautiful Soup and let it handle the decoding.
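Applied to the aiohttp code from the question, that means either overriding the encoding when decoding the body, or handing the raw bytes to Beautiful Soup. A minimal sketch, assuming aiohttp's response.text(encoding=...) override and the lxml parser from the original snippet:

import aiohttp
import asyncio
from bs4 import BeautifulSoup


async def main():
    url = 'https://www.pro-football-reference.com/years/2021/defense.htm'
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            # Override the (incorrect) charset advertised in the response headers.
            html = await response.text(encoding='latin-1')
            # Alternatively: BeautifulSoup(await response.read(), features="lxml")
            # hands the raw bytes to Beautiful Soup and lets it sniff the encoding.
            soup = BeautifulSoup(html, features="lxml")
            print(soup.title)

asyncio.run(main())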

I wonder why you don't just use pandas here:

import pandas as pd

url = 'https://www.pro-football-reference.com/years/2021/defense.htm'

df = pd.read_html(url, header=1)[0]
df = df[df['Rk'].ne('Rk')]  # drop the repeated header rows embedded in the table
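If the fetch still needs to be asynchronous, the two approaches combine naturally. A sketch under my own assumptions (the latin-1 override and the StringIO wrapper are mine, not part of the answer above):

import asyncio
from io import StringIO

import aiohttp
import pandas as pd

url = 'https://www.pro-football-reference.com/years/2021/defense.htm'


async def fetch_defense_table():
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            # Decode explicitly; latin-1 maps every byte, so it cannot raise.
            html = await response.text(encoding='latin-1')
    # read_html parses tables from an HTML string just as it does from a URL.
    df = pd.read_html(StringIO(html), header=1)[0]
    return df[df['Rk'].ne('Rk')]


df = asyncio.run(fetch_defense_table())
print(df.head())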