Encoding error whilst asynchronously scraping website
I have the following code:
import aiohttp
import asyncio
from bs4 import BeautifulSoup

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('https://www.pro-football-reference.com/years/2021/defense.htm') as response:
            soup = BeautifulSoup(await response.text(), features="lxml")
            print(soup)

asyncio.run(main())
However, it gives me the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdf in position 2170985: invalid continuation byte on the await response.text() line. I thought the problem might be that the url ends in .htm rather than .com.
Is there a workaround?
Note: I do not want to use response.read().
The site's headers claim the page is encoded as UTF-8, but it clearly isn't:
$ curl --head --silent https://www.pro-football-reference.com/years/2021/defense.htm | grep -i charset
content-type: text/html; charset=UTF-8
Let's inspect the content:
>>> r = requests.get('https://www.pro-football-reference.com/years/2021/defense.htm')
>>> r.content[2170980:2170990]
b'/">Fu\xdfball'
It looks like this should be “Fußball”, which, if encoded as UTF-8, would be b'Fu\xc3\x9fball'.
If we look up 0xdf in Triplee's Table of Legacy 8-bit Encodings, we find that it represents “ß” in any of the following encodings:
cp1250, cp1252, cp1254, cp1257, cp1258, iso8859_10, iso8859_13, iso8859_14, iso8859_15, iso8859_16, iso8859_2, iso8859_3, iso8859_4, iso8859_9, latin_1, palmos
Absent any other information, I would pick latin-1 as the encoding; however, it may be simpler to pass r.content to Beautiful Soup and let it handle the decoding.
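A minimal sketch of both routes, using the problematic bytes from the page as a stand-alone sample (html.parser is used here instead of lxml so the snippet has no extra dependency; Beautiful Soup's own detection result is not guaranteed and is only printed, not relied on):

```python
from bs4 import BeautifulSoup

# The bytes in question: 0xdf is valid latin-1 ("ß")
# but an invalid UTF-8 continuation byte.
raw = b'<a href="/">Fu\xdfball</a>'

# Decoding as UTF-8 reproduces the question's error...
try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)

# ...while latin-1 maps every possible byte value, so it always succeeds.
print(raw.decode('latin-1'))

# Alternatively, hand Beautiful Soup the raw bytes and let its
# encoding detection (UnicodeDammit) work out the charset itself.
soup = BeautifulSoup(raw, features="html.parser")
print(soup.get_text())
```

For the asker's aiohttp code specifically, an explicit encoding can also be passed directly: await response.text(encoding='latin-1').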
Wondering why not just use pandas here?
import pandas as pd

url = 'https://www.pro-football-reference.com/years/2021/defense.htm'
df = pd.read_html(url, header=1)[0]
df = df[df['Rk'].ne('Rk')]
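The last line drops the header rows that read_html picks up because the site repeats its column headers inside the table body. A self-contained illustration with made-up data:

```python
import pandas as pd

# Toy frame imitating read_html output where the header row
# ('Rk', 'Player') is repeated mid-table.
df = pd.DataFrame({'Rk': ['1', '2', 'Rk', '3'],
                   'Player': ['Smith', 'Jones', 'Player', 'Brown']})

# Keep only real data rows: drop any row whose 'Rk' cell
# is the literal header text.
df = df[df['Rk'].ne('Rk')]
print(df['Rk'].tolist())  # ['1', '2', '3']
```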