Python: BeautifulSoup combine different table headers from same table
I'm new to Python, so this may be a basic question, but I have the following table:
https://www.sports-reference.com/cfb/years/1991-passing.html
I want to scrape it with BeautifulSoup to get output like this:
| Player | School | Conf | all the way to TD under Rushing |
|---|---|---|---|
| Ty Detmer | Brigham Young | WAC | 7 |
| Player Two | School Two | Conf 2 | 5 |
Problem 1: If you look at the URL above, every 21st row is a header row that should be ignored.
Problem 2: "Rushing" appears to be a separate th.
Here is my current code and its output:
import requests
import lxml.html as lh
import pandas as pd
from bs4 import BeautifulSoup

data_universe = {}
years = list(range(1990,1991))
COLUMNS = ['Player', 'School', 'Conf', 'G', 'Cmp', 'Att', 'Pct', 'Yds', 'Y/A', 'AY/A', 'TD', 'Int', 'Rate', 'Rush_Att', 'Rush_Yds', 'Avg', 'Rush_TD']

for year in years:
    url = 'https://www.sports-reference.com/cfb/years/%s-passing.html' %year
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    parsed_table = soup.find_all('table')[0]
    rows = parsed_table.find_all("tr")
    cy_data = []
    for row in rows[2:]:
        cells = row.find_all("td")
        cells = cells[0:18]
        cy_data.append([cell.text for cell in cells]) # For each "td" tag, get the text inside it
    cy_data = pd.DataFrame(cy_data, columns=COLUMNS)
    pd.set_option("display.max_rows", None, "display.max_columns", None)
    print(cy_data.head())
Output:
How can I format this table neatly in a DataFrame, the way it appears on the website?
You can use read_html to load the HTML table directly into pandas, without BeautifulSoup. You can then clean up the DataFrame by dropping the top header row and removing the mid-table header rows:
import pandas as pd

df = pd.read_html('https://www.sports-reference.com/cfb/years/1991-passing.html')[0]
df.columns = df.columns.droplevel(0) # drop top header row
df = df[df['Rk'].ne('Rk')] # remove mid-table header rows
Output:
| | Rk | Player | School | Conf | G | Cmp | Att | Pct | Yds | Y/A | AY/A | TD | Int | Rate | Att | Yds | Avg | TD |
|---:|-----:|:---------------|:--------------|:---------|----:|------:|------:|------:|------:|------:|-------:|-----:|------:|-------:|------:|------:|------:|-----:|
| 0 | 1 | Ty Detmer | Brigham Young | WAC | 12 | 249 | 403 | 61.8 | 4031 | 10 | 10.4 | 35 | 12 | 168.5 | 75 | -30 | -0.4 | 4 |
| 1 | 2 | Rick Mirer | Notre Dame | Ind | 12 | 132 | 234 | 56.4 | 2117 | 9 | 8.7 | 18 | 10 | 149.2 | 75 | 306 | 4.1 | 9 |
| 2 | 3 | J.J. Joe | Baylor | SWC | 11 | 109 | 206 | 52.9 | 1853 | 9 | 7.9 | 7 | 8 | 131.9 | 116 | 147 | 1.3 | 6 |
| 3 | 4 | Shane Matthews | Florida | SEC | 11 | 218 | 361 | 60.4 | 3130 | 8.7 | 8 | 28 | 18 | 148.8 | 50 | 10 | 0.2 | 1 |
| 4 | 5 | Marvin Graves | Syracuse | Big East | 11 | 131 | 221 | 59.3 | 1912 | 8.7 | 7.3 | 10 | 11 | 136.9 | 99 | -148 | -1.5 | 1 |
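Note that dropping the top header level leaves duplicate column names: Att, Yds and TD appear under both Passing and Rushing in the output above. One possible alternative is to join the two header levels instead of dropping one, which is roughly what the question asks for. The following is only a minimal sketch under assumptions. The small DataFrame is a hypothetical stand-in for what read_html returns for this page, including the "Unnamed: ..." placeholder labels that pandas is assumed to assign to the blank top-level header cells. The helper flatten is illustrative, not part of pandas.

```python
import pandas as pd

# Hypothetical stand-in for the two-level header produced by read_html:
# level 0 is the group ("Passing", "Rushing"), level 1 is the stat name.
# The "Unnamed: ..." labels are an assumption about how blank cells are named.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Rk"),
    ("Unnamed: 1_level_0", "Player"),
    ("Passing", "Att"),
    ("Passing", "TD"),
    ("Rushing", "Att"),
    ("Rushing", "TD"),
])
rows = [
    ["1", "Ty Detmer", "403", "35", "75", "4"],
    ["Rk", "Player", "Att", "TD", "Att", "TD"],  # repeated mid-table header row
    ["2", "Rick Mirer", "234", "18", "75", "9"],
]
df = pd.DataFrame(rows, columns=columns)


def flatten(col):
    """Join group and stat name, skipping the 'Unnamed: ...' placeholders."""
    group, name = col
    return name if group.startswith("Unnamed") else f"{group}_{name}"


# Iterating a MultiIndex yields (level 0, level 1) tuples.
df.columns = [flatten(col) for col in df.columns]

# Remove the repeated header rows, then renumber the index.
df = df[df["Rk"].ne("Rk")].reset_index(drop=True)

print(df.columns.tolist())
```

With unique names such as Passing_Att and Rushing_Att, selecting a single column returns one Series rather than the two-column DataFrame that duplicate labels would produce.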