Python: pulling an HTML table from a webpage
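The table on this page needs to be scraped daily. We are trying to keep the scrape as simple (and robust) as possible so that the code runs on our server without problems. We would like to avoid Selenium: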
import requests
import pandas as pd
page_list = pd.read_html('https://www.ncaa.com/rankings/basketball-women/d1/ncaa-womens-basketball-net-rankings')
page_df = pd.DataFrame(page_list)
# won't convert to df (ValueError: Must pass 2-d input. shape=(1, 356, 9))
r = requests.get('https://www.ncaa.com/rankings/basketball-women/d1/ncaa-womens-basketball-net-rankings')
# not sure what to do with response
page_list is close, but it is a 3-dimensional list. How can we get it into a 2-D list or a pandas DataFrame?
pd.read_html does not return a DataFrame; it returns a list of DataFrames. Use page_list[0] to get the first DataFrame:
page_df = pd.DataFrame(page_list[0])
The pandas documentation describes read_html as: "Read HTML tables into a list of DataFrame objects."
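To see the list behaviour without hitting the network, here is a minimal sketch that parses a small inline table. The HTML and its team names are made up for illustration and are not taken from the NCAA page; it also assumes an HTML parser such as lxml is installed for pandas to use.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row table standing in for the real page (made-up data).
html = """
<table>
  <tr><th>Rank</th><th>School</th></tr>
  <tr><td>1</td><td>Team A</td></tr>
  <tr><td>2</td><td>Team B</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))
print(type(tables), len(tables))

# Index into the list to get an ordinary 2-D DataFrame.
first = tables[0]
print(first.shape)
```

Wrapping that list in pd.DataFrame(...) is what produced the 3-D shape (1, 356, 9) in the question: one table, 356 rows, 9 columns.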
There is no need for page_df = pd.DataFrame(page_list[0]), because page_list[0] is already a DataFrame. You can simply write page_df = page_list[0]:
page_list = pd.read_html('https://www.ncaa.com/rankings/basketball-women/d1/ncaa-womens-basketball-net-rankings')
page_df = page_list[0]
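As for the requests response in the question, one possible approach is to pass its text to read_html yourself, which lets you set a timeout and fail loudly on HTTP errors. The sketch below is an assumption, not part of the original answers: the helper names parse_first_table and fetch_first_table, the User-Agent value, the timeout, and the output filename are all hypothetical, and whether the site needs a custom User-Agent is untested.

```python
from io import StringIO

import pandas as pd
import requests

URL = "https://www.ncaa.com/rankings/basketball-women/d1/ncaa-womens-basketball-net-rankings"


def parse_first_table(html_text: str) -> pd.DataFrame:
    """Return the first <table> found in html_text as a DataFrame."""
    # read_html raises ValueError if the markup contains no table at all.
    tables = pd.read_html(StringIO(html_text))
    return tables[0]


def fetch_first_table(url: str, timeout: float = 30.0) -> pd.DataFrame:
    """Download url and parse its first table (hypothetical helper)."""
    # Assumption: a browser-like User-Agent may help if the site rejects
    # the default client string; drop the header if it is not needed.
    response = requests.get(
        url, headers={"User-Agent": "Mozilla/5.0"}, timeout=timeout
    )
    response.raise_for_status()  # surface 4xx/5xx instead of parsing an error page
    return parse_first_table(response.text)


# Example daily run (needs network access):
# page_df = fetch_first_table(URL)
# page_df.to_csv("net_rankings.csv", index=False)
```

Keeping the parsing step separate from the download makes it possible to test the parsing against a saved copy of the page, which helps when the scheduled job breaks because the site layout changed.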