BeautifulSoup and urlopen aren't fetching the right table

I'm practicing with BeautifulSoup and urlopen on Basketball-Reference data. Fetching a single player's statistics works fine, but when I run the same code against the team statistics, it apparently can't find the right table.

The code below fetches the headers from the page.


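from urllib.request import urlopen
from bs4 import BeautifulSoup

def fetch_years():

  # Determine the URL
  url = "https://www.basketball-reference.com/leagues/NBA_2000.html?sr&utm_source=direct&utm_medium=Share&utm_campaign=ShareTool#team-stats-per_game::none"

  html = urlopen(url)

  # Name the parser explicitly to avoid bs4's "no parser specified" warning
  soup = BeautifulSoup(html, "html.parser")

  # Take the <th> cells of the first <tr> on the page, skipping the first cell
  headers = [th.get_text() for th in soup.find_all('tr')[0].find_all('th')]
  headers = headers[1:]
  print(headers)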

I'm trying to get the teams' per-game statistics, with headers like this:

['Tm', 'G', 'MP', 'FG', ...]

Instead, the headers I get are:

['W', 'L', 'W/L%', ...] 

These come from the first table on the 1999-2000 season page (under the heading 'Division Standings').

If you use the same code on a player's page, for example this one, you get the result I'm looking for:

  Age   Tm   Lg Pos   G  GS    MP   FG  ...  DRB  TRB  AST  STL  BLK  TOV   PF   PTS
0  20  OKC  NBA  PG  82  65  32.5  5.3  ...  2.7  4.9  5.3  1.3  0.2  3.3  2.3  15.3
1  21  OKC  NBA  PG  82  82  34.3  5.9  ...  3.1  4.9  8.0  1.3  0.4  3.3  2.5  16.1
2  22  OKC  NBA  PG  82  82  34.7  7.5  ...  3.1  4.6  8.2  1.9  0.4  3.9  2.5  21.9
3  23  OKC  NBA  PG  66  66  35.3  8.8  ...  3.1  4.6  5.5  1.7  0.3  3.6  2.2  23.6
4  24  OKC  NBA  PG  82  82  34.9  8.2  ...  3.9  5.2  7.4  1.8  0.3  3.3  2.3  23.2

The web-scraping code originally came from here.

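The sports-reference.com sites are trickier than a typical website. Most of the tables are rendered after the page loads (only a few are present in the initial HTML), so you would need Selenium to let the page render first and then pull the HTML source.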

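Alternatively, if you look at the HTML source, you'll see that those tables sit inside comments. You can use BeautifulSoup to pull out the comment nodes and then search them for table tags.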
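To see why the question's code never reaches the hidden table, here is a minimal sketch on a made-up page. The `SAMPLE_HTML` string, the table ids `standings` and `per_game`, and the helper `headers_from_commented_table` are all hypothetical and only stand in for the real page; the point is just that `find_all('table')` skips anything inside a comment until the comment text is parsed again.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical miniature page: one visible table, one table hidden in a comment.
SAMPLE_HTML = """
<html><body>
<table id="standings"><tr><th>Team</th><th>W</th><th>L</th></tr></table>
<!--
<table id="per_game"><tr><th>Rk</th><th>Team</th><th>G</th><th>MP</th></tr></table>
-->
</body></html>
"""

def headers_from_commented_table(html, table_id):
    """Return the header texts of the first commented-out table with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    for comment in comments:
        # Re-parse the comment body as HTML so its tags become searchable.
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            first_row = table.find("tr")
            return [th.get_text(strip=True) for th in first_row.find_all("th")]
    return None

# A plain search only sees the table that is outside the comment.
visible_ids = [t.get("id") for t in BeautifulSoup(SAMPLE_HTML, "html.parser").find_all("table")]
print(visible_ids)
print(headers_from_commented_table(SAMPLE_HTML, "per_game"))
```

On the real page you would have to check the HTML source for the actual id of the table you want; the ids above are placeholders.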

This returns a list of DataFrames; the team per-game data is the table at index 1:

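from io import StringIO

import requests
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd

def fetch_years():

    # Determine the URL
    url = "https://www.basketball-reference.com/leagues/NBA_2000.html?sr&utm_source=direct&utm_medium=Share&utm_campaign=ShareTool#team-stats-per_game::none"
    html = requests.get(url)

    soup = BeautifulSoup(html.text, "html.parser")
    # Collect every HTML comment node; most of the tables are hidden inside them
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))

    tables = []
    for each in comments:
        if 'table' in each:
            try:
                # Wrap in StringIO: newer pandas versions deprecate passing literal HTML strings
                tables.append(pd.read_html(StringIO(each))[0])
            except ValueError:
                # read_html raises ValueError when it finds no table in the comment
                continue
    return tables

tables = fetch_years()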

Output:

print(tables[1].to_string())
      Rk                     Team   G     MP    FG   FGA    FG%   3P   3PA    3P%    2P   2PA    2P%    FT   FTA    FT%   ORB   DRB   TRB   AST  STL  BLK   TOV    PF    PTS
0    1.0        Sacramento Kings*  82  241.5  40.0  88.9  0.450  6.5  20.2  0.322  33.4  68.7  0.487  18.5  24.6  0.754  12.9  32.1  45.0  23.8  9.6  4.6  16.2  21.1  105.0
1    2.0         Detroit Pistons*  82  241.8  37.1  80.9  0.459  5.4  14.9  0.359  31.8  66.0  0.481  23.9  30.6  0.781  11.2  30.0  41.2  20.8  8.1  3.3  15.7  24.5  103.5
2    3.0         Dallas Mavericks  82  240.6  39.0  85.9  0.453  6.3  16.2  0.391  32.6  69.8  0.468  17.2  21.4  0.804  11.4  29.8  41.2  22.1  7.2  5.1  13.7  21.6  101.4
3    4.0          Indiana Pacers*  82  240.6  37.2  81.0  0.459  7.1  18.1  0.392  30.0  62.8  0.478  19.9  24.5  0.811  10.3  31.9  42.1  22.6  6.8  5.1  14.1  21.8  101.3
4    5.0         Milwaukee Bucks*  82  242.1  38.7  83.3  0.465  4.8  13.0  0.369  33.9  70.2  0.483  19.0  24.2  0.786  12.4  28.9  41.3  22.6  8.2  4.6  15.0  24.6  101.2
5    6.0      Los Angeles Lakers*  82  241.5  38.3  83.4  0.459  4.2  12.8  0.329  34.1  70.6  0.482  20.1  28.9  0.696  13.6  33.4  47.0  23.4  7.5  6.5  13.9  22.5  100.8
6    7.0            Orlando Magic  82  240.9  38.6  85.5  0.452  3.6  10.6  0.338  35.1  74.9  0.468  19.2  26.1  0.735  14.0  31.0  44.9  20.8  9.1  5.7  17.6  24.0  100.1
7    8.0          Houston Rockets  82  241.8  36.6  81.3  0.450  7.1  19.8  0.358  29.5  61.5  0.480  19.2  26.2  0.733  12.3  31.5  43.8  21.6  7.5  5.3  17.4  20.3   99.5
8    9.0           Boston Celtics  82  240.6  37.2  83.9  0.444  5.1  15.4  0.331  32.2  68.5  0.469  19.8  26.5  0.745  13.5  29.5  43.0  21.2  9.7  3.5  15.4  27.1   99.3
9   10.0     Seattle SuperSonics*  82  241.2  37.9  84.7  0.447  6.7  19.6  0.339  31.2  65.1  0.480  16.6  23.9  0.695  12.7  30.3  43.0  22.9  8.0  4.2  14.0  21.7   99.1
10  11.0           Denver Nuggets  82  242.1  37.3  84.3  0.442  5.7  17.0  0.336  31.5  67.2  0.469  18.7  25.8  0.724  13.1  31.6  44.7  23.3  6.8  7.5  15.6  23.9   99.0
11  12.0            Phoenix Suns*  82  241.5  37.7  82.6  0.457  5.6  15.2  0.368  32.1  67.4  0.477  17.9  23.6  0.759  12.5  31.2  43.7  25.6  9.1  5.3  16.7  24.1   98.9
12  13.0  Minnesota Timberwolves*  82  242.7  39.3  84.3  0.467  3.0   8.7  0.346  36.3  75.5  0.481  16.8  21.6  0.780  12.4  30.1  42.5  26.9  7.6  5.4  13.9  23.3   98.5
13  14.0       Charlotte Hornets*  82  241.2  35.8  79.7  0.449  4.1  12.2  0.339  31.7  67.5  0.469  22.7  30.0  0.758  10.8  32.1  42.9  24.7  8.9  5.9  14.7  20.4   98.4
14  15.0          New Jersey Nets  82  241.8  36.3  83.9  0.433  5.8  16.8  0.347  30.5  67.2  0.454  19.5  24.9  0.784  12.7  28.2  40.9  20.6  8.8  4.8  13.6  23.3   98.0
15  16.0  Portland Trail Blazers*  82  241.2  36.8  78.4  0.470  5.0  13.8  0.361  31.9  64.7  0.493  18.8  24.7  0.760  11.8  31.2  43.0  23.5  7.7  4.8  15.2  22.7   97.5
16  17.0         Toronto Raptors*  82  240.9  36.3  83.9  0.433  5.2  14.3  0.363  31.2  69.6  0.447  19.3  25.2  0.765  13.4  29.9  43.3  23.7  8.1  6.6  13.9  24.3   97.2
17  18.0      Cleveland Cavaliers  82  242.1  36.3  82.1  0.442  4.2  11.2  0.373  32.1  70.9  0.453  20.2  26.9  0.750  12.3  30.5  42.8  23.7  8.7  4.4  17.4  27.1   97.0
18  19.0       Washington Wizards  82  241.5  36.7  81.5  0.451  4.1  10.9  0.376  32.6  70.6  0.462  19.1  25.7  0.743  13.0  29.7  42.7  21.6  7.2  4.7  16.1  26.2   96.6
19  20.0               Utah Jazz*  82  240.9  36.1  77.8  0.464  4.0  10.4  0.385  32.1  67.4  0.476  20.3  26.2  0.773  11.4  29.6  41.0  24.9  7.7  5.4  14.9  24.5   96.5
20  21.0       San Antonio Spurs*  82  242.1  36.0  78.0  0.462  4.0  10.8  0.374  32.0  67.2  0.476  20.1  27.0  0.746  11.3  32.5  43.8  22.2  7.5  6.7  15.0  20.9   96.2
21  22.0    Golden State Warriors  82  240.9  36.5  87.1  0.420  4.2  13.0  0.323  32.3  74.0  0.437  18.3  26.2  0.697  15.9  29.7  45.6  22.6  8.9  4.3  15.9  24.9   95.5
22  23.0      Philadelphia 76ers*  82  241.8  36.5  82.6  0.442  2.5   7.8  0.323  34.0  74.8  0.454  19.2  27.1  0.708  14.0  30.1  44.1  22.2  9.6  4.7  15.7  23.6   94.8
23  24.0              Miami Heat*  82  241.8  36.3  78.8  0.460  5.4  14.7  0.371  30.8  64.1  0.481  16.4  22.3  0.736  11.2  31.9  43.2  23.5  7.1  6.4  15.0  23.7   94.4
24  25.0            Atlanta Hawks  82  241.8  36.6  83.0  0.441  3.1   9.9  0.317  33.4  73.1  0.458  18.0  24.2  0.743  14.0  31.3  45.3  18.9  6.1  5.6  15.4  21.0   94.3
25  26.0      Vancouver Grizzlies  82  242.1  35.3  78.5  0.449  4.0  11.0  0.361  31.3  67.6  0.463  19.4  25.1  0.774  12.3  28.3  40.6  20.7  7.4  4.2  16.8  22.9   93.9
26  27.0         New York Knicks*  82  241.8  35.3  77.7  0.455  4.3  11.4  0.375  31.0  66.3  0.468  17.2  22.0  0.781   9.8  30.7  40.5  19.4  6.3  4.3  14.6  24.2   92.1
27  28.0     Los Angeles Clippers  82  240.3  35.1  82.4  0.426  5.2  15.5  0.339  29.9  67.0  0.446  16.6  22.3  0.746  11.6  29.0  40.6  18.0  7.0  6.0  16.2  22.2   92.0
28  29.0            Chicago Bulls  82  241.5  31.3  75.4  0.415  4.1  12.6  0.329  27.1  62.8  0.432  18.1  25.5  0.709  12.6  28.3  40.9  20.1  7.9  4.7  19.0  23.3   84.8
29   NaN           League Average  82  241.5  36.8  82.1  0.449  4.8  13.7  0.353  32.0  68.4  0.468  19.0  25.3  0.750  12.4  30.5  42.9  22.3  7.9  5.2  15.5  23.3   97.5
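
Relying on index position 1 is fragile if the page layout changes. As a rough sketch, you could instead pick the table by the columns it contains. The helper `find_table_with_columns` is hypothetical, and the two small DataFrames are placeholder stand-ins for the list returned by `fetch_years()`, not real scraped data.

```python
import pandas as pd

def find_table_with_columns(tables, required):
    """Return the first DataFrame whose columns include every name in `required`, else None."""
    wanted = set(required)
    for table in tables:
        if wanted <= set(table.columns):
            return table
    return None

# Placeholder stand-ins for the DataFrames returned by fetch_years().
standings = pd.DataFrame({"W": [1], "L": [0], "W/L%": [1.0]})
per_game = pd.DataFrame({"Rk": [1.0], "Team": ["Sacramento Kings*"], "G": [82], "PTS": [105.0]})

match = find_table_with_columns([standings, per_game], ["Team", "G", "PTS"])
print(list(match.columns))
```

With the real data, `find_table_with_columns(tables, ["Team", "G", "MP"])` would play the role of `tables[1]`, assuming the per-game table is the first one with those columns.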