Why can't I webscrape the table that I want?

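I'm new to BeautifulSoup and wanted to try some web scraping. For a small project, I want to get the Golden State Warriors' win rate from Wikipedia. My plan is to grab the table that has this information and turn it into a pandas DataFrame so I can plot it over the years. However, my code selects the "Table key" table instead of the seasons table. I know this is because both are the same type of table (wikitable), but I don't know how to get around it. I'm sure there's a simple explanation that I'm missing. Can someone explain how to fix my code, and how I can choose which tables to scrape in the future? Thanks!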

import urllib.request
from bs4 import BeautifulSoup

c_data = "https://en.wikipedia.org/wiki/List_of_Golden_State_Warriors_seasons"  # Wikipedia page
c_page = urllib.request.urlopen(c_data)
c_soup = BeautifulSoup(c_page, "lxml")
c_table = c_soup.find('table', class_='wikitable')  # this is the problem
c_year = []
c_rate = []
for row in c_table.find_all('tr'):  # setup for dataframe
    cells = row.find_all('td')
    if len(cells) == 13:
        # list.append() works in place and returns None, so the result is not reassigned
        c_year.append(cells[0].find(string=True))
        c_rate.append(cells[9].find(string=True))
print(c_year, c_rate)

Use pd.read_html to get all the tables

  • This function returns a list of DataFrames
    • tables[0] through tables[17], in this case
import pandas as pd

# read tables
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_Golden_State_Warriors_seasons')

print(len(tables))
>>> 18

tables[0]
          0                                             1
0       AHC                  NBA All-Star Game Head Coach
1      AMVP            All-Star Game Most Valuable Player
2       COY                             Coach of the Year
3      DPOY                  Defensive Player of the Year
4    Finish          Final position in division standings
5        GB  Games behind first-place team in division[b]
6   Italics                            Season in progress
7    Losses               Number of regular season losses
8       EOY                         Executive of the Year
9      FMVP                   Finals Most Valuable Player
10      MVP                          Most Valuable Player
11      ROY                            Rookie of the Year
12      SIX                         Sixth Man of the Year
13     SPOR                           Sportsmanship Award
14     Wins                 Number of regular season wins

# display all dataframes in tables (display() is available in Jupyter/IPython; use print() elsewhere)
for i, table in enumerate(tables):
    print(f'Table {i}')
    display(table)
    print('\n')
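
Instead of looking through every table by eye, the list can be narrowed with the match argument of pd.read_html, which keeps only tables containing text that matches a string or regex. The sketch below runs on a small made-up HTML snippet rather than the live page; the column names ("Season", "Win%") and the two sample rows are assumptions for illustration, so on the real page pick a string that appears only in the table you want (note that "Season" also appears in the key table shown above). pd.read_html needs an HTML parser such as lxml installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical miniature of a page with two "wikitable" tables:
# a key/legend table followed by a season table.
html = """
<table class="wikitable">
  <tr><th>Abbreviation</th><th>Meaning</th></tr>
  <tr><td>GB</td><td>Games behind</td></tr>
</table>
<table class="wikitable">
  <tr><th>Season</th><th>Wins</th><th>Losses</th><th>Win%</th></tr>
  <tr><td>2014-15</td><td>67</td><td>15</td><td>.817</td></tr>
  <tr><td>2015-16</td><td>73</td><td>9</td><td>.890</td></tr>
</table>
"""

# match keeps only the tables whose text matches the given string/regex
season_tables = pd.read_html(StringIO(html), match="Win%")

print(len(season_tables))        # number of tables that matched
season_df = season_tables[0]
print(season_df.columns.tolist())
```

If more than one table still matches, the remaining candidates can be inspected by index as before.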

Select a specific table

df_i_want = tables[x]  # x is the index of the table you want (0-indexed)

# delete the list of tables once it is no longer needed
del tables
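
If you would rather stay with BeautifulSoup, the same idea applies: find() returns only the first matching table, so one option is to loop over find_all() and pick the table by something unique to it, such as a header cell. This is a minimal sketch on a made-up HTML snippet, not the real page; the helper find_table_by_header, the "Win%" header, and the four-column layout are all assumptions for illustration (the real season table has 13 cells per row, as in the question).

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of a page with two "wikitable" tables.
html = """
<table class="wikitable">
  <tr><th>Abbreviation</th><th>Meaning</th></tr>
  <tr><td>GB</td><td>Games behind</td></tr>
</table>
<table class="wikitable">
  <tr><th>Season</th><th>Wins</th><th>Losses</th><th>Win%</th></tr>
  <tr><td>2014-15</td><td>67</td><td>15</td><td>.817</td></tr>
  <tr><td>2015-16</td><td>73</td><td>9</td><td>.890</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def find_table_by_header(soup, header_text):
    """Return the first wikitable with a header cell equal to header_text, else None."""
    for table in soup.find_all("table", class_="wikitable"):
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        if header_text in headers:
            return table
    return None


season_table = find_table_by_header(soup, "Win%")

years, rates = [], []
for row in season_table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == 4:  # the header row has no <td> cells, so it is skipped
        years.append(cells[0].get_text(strip=True))
        rates.append(cells[3].get_text(strip=True))

print(years, rates)
```

Selecting by header text tends to be more robust than a fixed position such as find_all(...)[1], since the order of tables on a page can change.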