Why can't I webscrape the table that I want?
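I'm new to BeautifulSoup and wanted to try some web scraping. For my small project, I'm trying to get the Golden State Warriors' win rate from Wikipedia. I was planning to take the table that holds that information and turn it into a pandas DataFrame so I can graph it over the years. However, my code selects the Table Key table instead of the Seasons table. I know this is because they are the same type of table (wikitable), but I don't know how to solve this. I'm sure there's an easy explanation I'm missing. Can someone explain how to fix my code, and how I can choose which tables to scrape in the future? Thanks!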
import urllib.request
from bs4 import BeautifulSoup

c_data = "https://en.wikipedia.org/wiki/List_of_Golden_State_Warriors_seasons"  # Wikipedia page
c_page = urllib.request.urlopen(c_data)
c_soup = BeautifulSoup(c_page, "lxml")
c_table = c_soup.find('table', class_='wikitable')  # this is the problem: find() returns only the first match
c_year = []
c_rate = []
for row in c_table.find_all('tr'):  # setup for dataframe
    cells = row.find_all('td')
    if len(cells) == 13:
        # list.append mutates the list in place and returns None, so its result must not be reassigned
        c_year.append(cells[0].find(text=True))
        c_rate.append(cells[9].find(text=True))
print(c_year, c_rate)
Use pd.read_html to get all the tables. This function returns a list of DataFrames, tables[0] through tables[17] in this case.
import pandas as pd
# read tables
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_Golden_State_Warriors_seasons')
print(len(tables))
>>> 18
tables[0]
0 1
0 AHC NBA All-Star Game Head Coach
1 AMVP All-Star Game Most Valuable Player
2 COY Coach of the Year
3 DPOY Defensive Player of the Year
4 Finish Final position in division standings
5 GB Games behind first-place team in division[b]
6 Italics Season in progress
7 Losses Number of regular season losses
8 EOY Executive of the Year
9 FMVP Finals Most Valuable Player
10 MVP Most Valuable Player
11 ROY Rookie of the Year
12 SIX Sixth Man of the Year
13 SPOR Sportsmanship Award
14 Wins Number of regular season wins
# display all dataframes in tables
# (display() is available in Jupyter/IPython; use print(table) in a plain script)
for i, table in enumerate(tables):
    print(f'Table {i}')
    display(table)
    print('\n')
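As an alternative to checking each displayed table by eye, the table can be picked by its column labels. The following is only a sketch: `pick_table` is a hypothetical helper, the two small DataFrames stand in for the list that `pd.read_html` returns, and the column names `Season`, `Wins` and `Losses` are assumptions about the page's header row that should be checked against the real output.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose column labels contain every required name."""
    for df in tables:
        labels = set()
        for col in df.columns:
            # read_html can return MultiIndex headers (tuples), so flatten them to strings
            parts = col if isinstance(col, tuple) else (col,)
            labels.update(str(part) for part in parts)
        if all(name in labels for name in required_columns):
            return df
    raise ValueError(f"no table has all of the columns {required_columns}")


# Stand-ins for the list returned by pd.read_html (assumed column names)
key_table = pd.DataFrame({
    0: ["AHC", "AMVP"],
    1: ["NBA All-Star Game Head Coach", "All-Star Game Most Valuable Player"],
})
seasons_table = pd.DataFrame({
    "Season": ["2015-16", "2016-17"],
    "Wins": [73, 67],
    "Losses": [9, 15],
})
tables = [key_table, seasons_table]

seasons = pick_table(tables, ["Season", "Wins", "Losses"])
# Win rate computed from wins and losses rather than read from a text column
win_rate = seasons["Wins"] / (seasons["Wins"] + seasons["Losses"])
print(seasons.assign(win_rate=win_rate))
```

Selecting by column labels keeps working if tables are added or reordered on the page, whereas a hard-coded index does not.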
Select a specific table
df_i_want = tables[x]  # x is the index of the table you want (0-indexed)
# delete the list of tables once it is no longer needed
del tables
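The same idea can be applied to the original BeautifulSoup code. `find` returns only the first element that matches, which is why the key table was selected; `find_all` returns every match, so the right one can be chosen by its header cells. A minimal sketch, assuming a small inline HTML snippet in place of the real page, a hypothetical helper `find_table_with_headers`, and assumed header names:

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the page: two tables that share the class "wikitable"
html = """
<table class="wikitable">
  <tr><th>Abbreviation</th><th>Meaning</th></tr>
  <tr><td>AHC</td><td>NBA All-Star Game Head Coach</td></tr>
</table>
<table class="wikitable">
  <tr><th>Season</th><th>Wins</th><th>Losses</th></tr>
  <tr><td>2015-16</td><td>73</td><td>9</td></tr>
  <tr><td>2016-17</td><td>67</td><td>15</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def find_table_with_headers(soup, wanted):
    """Return the first wikitable whose <th> texts include every wanted header."""
    for table in soup.find_all("table", class_="wikitable"):
        headers = {th.get_text(strip=True) for th in table.find_all("th")}
        if set(wanted) <= headers:
            return table
    return None


table = find_table_with_headers(soup, ["Season", "Wins", "Losses"])

years, rates = [], []
for row in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) == 3:  # the header row holds <th> cells only, so it is skipped
        wins, losses = int(cells[1]), int(cells[2])
        years.append(cells[0])
        rates.append(wins / (wins + losses))

print(years, rates)
```

On the real page the header texts and the number of cells per row will differ, so both the `wanted` list and the length check would need adjusting.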