Python:BeautifulSoup 结合不同的 table headers 来自相同的 table

Python: BeautifulSoup combine different table headers from same table

Python 的新手,所以这可能是一个基本问题,但我有以下 table:

https://www.sports-reference.com/cfb/years/1991-passing.html

我想用 BeautifulSoup 抓取它以获得如下输出:

Player School Conf all the way to TD under Rushing
Ty Detmer Brigham Young WAC 7
Player Two School Two Conf 2 5

问题 1:如果您查看上面的 URL,每第 21 行是一个应该被忽略的 header 行 问题 2:“Rushing”似乎是另一个 th 所以我的代码和下面的输出目前是这样的:

import requests
import lxml.html as lh
import pandas as pd
from bs4 import BeautifulSoup

data_universe = {}
years = list(range(1990,1991))
COLUMNS = ['Player', 'School', 'Conf', 'G', 'Cmp', 'Att', 'Pct', 'Yds', 'Y/A', 'AY/A', 'TD', 'Int', 'Rate', 'Rush_Att', 'Rush_Yds', 'Avg', 'Rush_TD']

for year in years:

  url = 'https://www.sports-reference.com/cfb/years/%s-passing.html' %year
  r = requests.get(url)
  soup = BeautifulSoup(r.content, 'html.parser')
  parsed_table = soup.find_all('table')[0]
  rows = parsed_table.find_all("tr") 

  cy_data = []
  for row in rows[2:]:
      cells = row.find_all("td")
      cells = cells[0:18] 
      cy_data.append([cell.text for cell in cells]) # For each "td" tag, get the text inside it
  cy_data = pd.DataFrame(cy_data, columns=COLUMNS)
  pd.set_option("display.max_rows", None, "display.max_columns", None)
  print(cy_data.head())

输出:

如何才能像在网站上一样在数据框中整齐地格式化此 table?

您可以使用 read_html 将 html table 直接加载到 pandas,无需使用 BeautifulSoup。然后,您可以通过删除顶部 header 行和 mid-table header 行来处理数据框:

df = pd.read_html('https://www.sports-reference.com/cfb/years/1991-passing.html')[0]
df.columns = df.columns.droplevel(0) # drop top header row
df = df[df['Rk'].ne('Rk')] # remove mid-table header rows 

输出:

|    |   Rk | Player         | School        | Conf     |   G |   Cmp |   Att |   Pct |   Yds |   Y/A |   AY/A |   TD |   Int |   Rate |   Att |   Yds |   Avg |   TD |
|---:|-----:|:---------------|:--------------|:---------|----:|------:|------:|------:|------:|------:|-------:|-----:|------:|-------:|------:|------:|------:|-----:|
|  0 |    1 | Ty Detmer      | Brigham Young | WAC      |  12 |   249 |   403 |  61.8 |  4031 |  10   |   10.4 |   35 |    12 |  168.5 |    75 |   -30 |  -0.4 |    4 |
|  1 |    2 | Rick Mirer     | Notre Dame    | Ind      |  12 |   132 |   234 |  56.4 |  2117 |   9   |    8.7 |   18 |    10 |  149.2 |    75 |   306 |   4.1 |    9 |
|  2 |    3 | J.J. Joe       | Baylor        | SWC      |  11 |   109 |   206 |  52.9 |  1853 |   9   |    7.9 |    7 |     8 |  131.9 |   116 |   147 |   1.3 |    6 |
|  3 |    4 | Shane Matthews | Florida       | SEC      |  11 |   218 |   361 |  60.4 |  3130 |   8.7 |    8   |   28 |    18 |  148.8 |    50 |    10 |   0.2 |    1 |
|  4 |    5 | Marvin Graves  | Syracuse      | Big East |  11 |   131 |   221 |  59.3 |  1912 |   8.7 |    7.3 |   10 |    11 |  136.9 |    99 |  -148 |  -1.5 |    1 |