BeautifulSoup parsed document differs from the original HTML page's code

Question

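I am using BeautifulSoup4 to scrape some content from the Spotify charts.

My code had been running fine for several weeks, but today it suddenly started failing: every entry now comes back as NaN.

I think the problem is in the parsed HTML. The HTML that BeautifulSoup produces is different from the HTML of the original web page.

I have tried 'html.parser', 'lxml', and 'html5lib'. I also updated BeautifulSoup and all the parser packages, but nothing helped.

What could the problem be? I have no idea what the root cause might be. My Windows 10 updated yesterday; could that be related?

Here is the relevant part of the code: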

from bs4 import BeautifulSoup as bs
import requests

u = 'https://spotifycharts.com/regional/us/daily/2021-04-18'

x = requests.get(u)  
a = bs(x.content,'html.parser')
tracks = a.find_all('td',class_='chart-table-position')

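tracks always comes back empty, because the element does not exist in 'a'. But it should, since it is present in the web page's HTML and it was there a few days ago.

Thanks in advance for your help.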

Answer 1

Adding a user-agent to the headers solved the problem on my end. Give it a try:

from bs4 import BeautifulSoup as bs
import requests


u = 'https://spotifycharts.com/regional/us/daily/2021-04-18'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}
x = requests.get(u, headers=headers)  
a = bs(x.content,'html.parser')
tracks = a.find_all('td',class_='chart-table-position')

Output:

print(len(tracks))
200
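
To check whether the User-Agent really is what changes the response, it can help to separate the parsing step from the request and compare both variants. By default requests identifies itself as python-requests/<version>, and a site may answer that client with a different page than it sends to a browser. The following is only a minimal sketch: the helper names extract_positions, fetch_html, and compare_user_agents are hypothetical, the timeout value is an arbitrary assumption, and the browser string is simply the one used above.

```python
import requests
from bs4 import BeautifulSoup

# Assumption: the browser string from the answer above; any common one may work.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36"
)


def extract_positions(html):
    """Return the text of every <td class="chart-table-position"> cell."""
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.find_all("td", class_="chart-table-position")
    return [cell.get_text(strip=True) for cell in cells]


def fetch_html(url, user_agent=None, timeout=10):
    """Hypothetical helper: GET the page, optionally sending a User-Agent."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise on 4xx/5xx instead of silently parsing an error page.
    response.raise_for_status()
    return response.text


def compare_user_agents(url):
    """Print how many position cells each request variant yields."""
    for label, user_agent in (("default", None), ("browser", BROWSER_UA)):
        try:
            html = fetch_html(url, user_agent)
            print(label, len(extract_positions(html)))
        except requests.RequestException as exc:
            print(label, "request failed:", exc)
```

Calling compare_user_agents(u) with the URL from the question would show whether the two requests receive different HTML. Note that find_all returns an empty list rather than None when nothing matches, so checking the length (or the HTTP status code) is a more reliable signal than testing for None.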

