Scrape Table Data Into Dataframe

Question

An example URL is 'http://www.hockey-reference.com/players/c/crosbsi01/gamelog/2016'

The table I want to get is named Regular Season.

Here is what I have tried so far:

import requests
from bs4 import *
from bs4 import NavigableString
import pandas as pd


url = 'http://www.hockey-reference.com/players/o/ovechal01/gamelog/2016'
resultsPage = requests.get(url)
soup = BeautifulSoup(resultsPage.text, "html5lib")
comment = soup.find(text=lambda x: isinstance(x, NavigableString) and "Regular Season  Table" in x)
df = pd.read_html(comment)

This is the kind of approach I have used on similar sites, but on this page I cannot locate the table correctly. I am not sure what I am missing.

Answer 1

The table has an id you can select it by:

import requests
from bs4 import BeautifulSoup


url = 'http://www.hockey-reference.com/players/o/ovechal01/gamelog/2016'
resultsPage = requests.get(url)
soup = BeautifulSoup(resultsPage.text, "html5lib")
table = soup.select_one("#gamelog")
print(table)
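
One detail worth keeping in mind with select_one is that it returns None when nothing matches, so a mistyped id fails silently until the result is used. The following is a minimal sketch on a made-up HTML string (the markup and the "playoffs" id are assumptions for illustration, not taken from the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page; only the id attribute matters here.
html = '<div><table id="gamelog"><tr><td>1</td></tr></table></div>'
soup = BeautifulSoup(html, "html.parser")

table = soup.select_one("#gamelog")      # Tag for the matching table
missing = soup.select_one("#playoffs")   # None: no element has this id

if missing is None:
    print("no table with id 'playoffs'")
print(table["id"])
```

Checking for None before passing the result on makes a wrong selector easier to spot.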

Or using only pandas:

import pandas as pd
df = pd.read_html(url, attrs={'id': 'gamelog'})[0]  # read_html returns a list of DataFrames
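
To see what read_html hands back without hitting the network, here is a small sketch that parses an inline HTML string instead of the URL. The table contents are invented for illustration, and it assumes an HTML parser that pandas can use (lxml, or bs4 with html5lib) is installed:

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the real page; the numbers are made up.
html = """
<table id="other"><tr><th>X</th></tr><tr><td>9</td></tr></table>
<table id="gamelog">
  <caption>Regular Season Table</caption>
  <thead><tr><th>Date</th><th>G</th><th>A</th></tr></thead>
  <tbody>
    <tr><td>2015-10-10</td><td>1</td><td>0</td></tr>
    <tr><td>2015-10-13</td><td>0</td><td>2</td></tr>
  </tbody>
</table>
"""

# attrs filters the tables by HTML attribute; the result is always a list.
tables = pd.read_html(StringIO(html), attrs={"id": "gamelog"})
df = tables[0]
print(len(tables))
print(df)
```

Only the table whose id matches is returned, which is why indexing the list with [0] is enough here.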

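Your code could never work, because the NavigableString you are looking for is inside the caption tag, <caption>Regular Season Table</caption>, not the table itself, so pd.read_html only ever receives the caption text. You need to call find_previous to get the table: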

comment = soup.find(text=lambda x: isinstance(x, NavigableString) and "Regular Season  Table" in x)
table = comment.find_previous("table")

You could also use table = comment.parent.parent, but find_previous is the better approach.
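
The relationship between the caption text, find_previous, and parent.parent can be checked on a small inline document. This is only a sketch: the HTML below is a made-up stand-in for the real page, and the DataFrame is built by hand from the cells so that the example depends on nothing beyond bs4 and pandas:

```python
import pandas as pd
from bs4 import BeautifulSoup, NavigableString

# Hypothetical stand-in for the real page; the values are made up.
html = """
<table id="gamelog">
  <caption>Regular Season Table</caption>
  <thead><tr><th>Date</th><th>G</th><th>A</th></tr></thead>
  <tbody>
    <tr><td>2015-10-10</td><td>1</td><td>0</td></tr>
    <tr><td>2015-10-13</td><td>0</td><td>2</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the caption text node, then walk back to the enclosing table.
caption_text = soup.find(
    string=lambda s: isinstance(s, NavigableString) and "Regular Season" in s
)
table = caption_text.find_previous("table")

# caption_text.parent is <caption>; its parent is the same <table>.
same_table = caption_text.parent.parent

# Build the DataFrame from the header and body cells (all values stay strings).
columns = [th.get_text(strip=True) for th in table.select("thead th")]
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.select("tbody tr")
]
df = pd.DataFrame(rows, columns=columns)
print(df)
```

parent.parent works only while the text sits exactly two levels below the table, whereas find_previous("table") keeps working if the markup gains an extra wrapper.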

python

beautifulsoup

bs4