How to scrape an HTML page that provides more information when scrolling down, using Python lxml

Question

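I am scraping text from https://www.basketball-reference.com/players/p/parsoch01.html, but I cannot scrape anything from the "Total" table onward. I want to get numbers from the "Total" and "Advanced" tables, but the code returns nothing. The page seems to load additional information when the user scrolls down.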

Running the code below, I successfully get data from the player's profile section and the "Per Game" table, but I cannot get values from the "Total" table.

from lxml import html
from urllib.request import urlopen

playerURL = urlopen("https://www.basketball-reference.com/players/p/parsoch01.html")
# Parse the downloaded HTML into an element tree.
playerPage = html.fromstring(playerURL.read())
# Use xpath to parse points per game.
ppg = playerPage.xpath('//tr[@id="per_game.2019"]//td[@data-stat="pts_per_g"]//text()')[0]  # succeeds in getting the value
total = playerPage.xpath('//tr[@id="totals.2019"]//td[@data-stat="fga"]//text()')// I expect 182 to be returned but nothing is returned.

Is there any way to get data from the lower part of this page?

Answer 1

Open your web browser's console and test the XPath to see whether it finds the element you are looking for.

$x("//tr[@id='totals.2019']//td[@data-stat='fga']//text()")

This returns an array object.

$x("//tr[@id='totals.2019']//td[@data-stat='fga']//text()")[0]

This accesses the value you want.

Also:

# comments in python start with '#' not '//'
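
A match in the browser console does not guarantee a match in lxml, because the console queries the live DOM (after any scripts have run) while lxml only sees the raw HTML that was downloaded. A minimal sketch of a check for that difference follows. The SAMPLE string and the diagnose helper are hypothetical and only illustrate the idea; SAMPLE is a tiny stand-in for a downloaded page, not the real site markup.

```python
from lxml import html

# Hypothetical stand-in for a downloaded page: the table markup is present
# in the source text, but only inside an HTML comment.
SAMPLE = (
    '<html><body><!-- <table><tr id="totals.2019">'
    '<td data-stat="fga">182</td></tr></table> --></body></html>'
)

def diagnose(page_source, element_id):
    """Return (id appears in raw text, id is reachable as an element)."""
    tree = html.fromstring(page_source)
    in_raw_source = element_id in page_source
    # '*' matches element nodes only, so markup inside comments is not found.
    in_parsed_tree = bool(tree.xpath('//*[@id=$eid]', eid=element_id))
    return in_raw_source, in_parsed_tree

print(diagnose(SAMPLE, "totals.2019"))
```

If the first value is true and the second is false, the markup was downloaded but is not part of the element tree, which would explain an empty XPath result.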

Answer 2

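This happens because the content you want to extract from that site is inside HTML comments. BeautifulSoup does not parse markup inside comments as elements. To get the result, you first need to strip the comment markers so that BeautifulSoup can access the content. The following script does exactly that: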

import requests
from bs4 import BeautifulSoup

URL = "https://www.basketball-reference.com/players/p/parsoch01.html"

r = requests.get(URL).text
#kick out the comment signs from html elements so that BeautifulSoup can access them
comment = r.replace("-->", "").replace("<!--", "")
soup = BeautifulSoup(comment,"lxml")
total = soup.select_one("[id='totals.2019'] > [data-stat='fga']").text
print(total)

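Output: the script prints the fga value from the totals.2019 row, which the question expects to be 182.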
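
Since the question uses lxml rather than BeautifulSoup, the same idea can be sketched with lxml alone: instead of deleting the comment markers from the text, parse the contents of each comment as its own HTML fragment and run the XPath there. This is a sketch under assumptions, not the answerer's method. SAMPLE is a hypothetical miniature of the page and find_in_comments is a made-up helper name.

```python
from lxml import html

# Hypothetical miniature of the page: the per-game table is regular markup,
# while the totals table sits inside an HTML comment.
SAMPLE = """
<html><body>
<table id="per_game"><tr id="per_game.2019"><td data-stat="pts_per_g">14.2</td></tr></table>
<div id="all_totals">
<!--
<table id="totals"><tr id="totals.2019"><td data-stat="fga">182</td></tr></table>
-->
</div>
</body></html>
"""

def find_in_comments(page_source, xpath_expr):
    """Run xpath_expr against the markup hidden inside each HTML comment."""
    tree = html.fromstring(page_source)
    results = []
    for comment in tree.xpath("//comment()"):
        text = comment.text
        # Skip comments that hold plain text rather than markup.
        if not text or "<" not in text:
            continue
        fragment = html.fromstring(text)
        results.extend(fragment.xpath(xpath_expr))
    return results

fga = find_in_comments(
    SAMPLE, '//tr[@id="totals.2019"]//td[@data-stat="fga"]//text()'
)
print(fga)
```

For a real page, page_source would be the downloaded HTML text. Parsing each comment separately leaves the rest of the document untouched, whereas the string replacement above is shorter but also alters any ordinary comments in the page.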
