Appending pandas dataframes after looping through layers
I need some help with the last step of formatting API results for import into a PostgreSQL database. The data structure is:
[
  {
    "season": 0,
    "seasonType": "string",
    "week": 0,
    "polls": [
      {
        "poll": "string",
        "ranks": [
          {
            "rank": 0,
            "school": "string",
            "conference": "string",
            "firstPlaceVotes": 0,
            "points": 0
          }
        ]
      }
    ]
  }
]
Here is the code I am using to unpack it (of course, if there is a better, more efficient way to do this, I am all ears):
import json

import pandas as pd
import requests
from tqdm import tqdm

year = list(range(2020,2021))
req = []
pbp = pd.DataFrame()
headers = {
    "Accept": "application/json",
    "Authorization": "Bearer 1a2b3c4d"}

for year in tqdm(year, desc = 'fetch record'):
    parameters = {"year":year, "seasonType":"regular"}
    req = requests.get("https://api.collegefootballdata.com/rankings", headers=headers, params = parameters)
    r = req.json()
    print(type(r))
    df1 = pd.DataFrame(r, columns = ['season', 'seasonType', 'week'], dtype = int)
    pbp = pbp.append(json.loads(req.text))
    for polls in pbp["polls"]:
        try:
            p1 = polls[1]
        except IndexError:
            continue
        df2 = pd.DataFrame.from_dict(p1)
        poll = df2["poll"]
        y = df1.append(poll)
        for rank in df2["ranks"]:
            df3 = pd.DataFrame(rank, index=[0])
            z = y.append(df3)
When I append, the data comes out like this:
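| year | season | week | poll | rank | team |
|---|---|---|---|---|---|
| 2020 | regular | 1 | | | |
| 2020 | regular | 2 | | | |
| AP | | | | | |
| 1 | Alabama | | | | |
| 2020 | regular | 1 | | | |
| 2020 | regular | 2 | | | |
| AP | | | | | |
| 2 | Clemson | | | | |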
And I would like it to look like this:
| year | season | week | poll | rank | team |
|---|---|---|---|---|---|
| 2020 | regular | 1 | AP | 1 | Alabama |
| 2020 | regular | 1 | AP | 2 | Clemson |
| 2020 | regular | 2 | AP | 1 | Alabama |
| 2020 | regular | 2 | AP | 2 | Clemson |
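The problem is that you call append() too many times.
You should first collect all the values of a row in a list/dictionary, and only append that row once it is complete.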
import pandas as pd
from tqdm import tqdm
import requests

years = range(2020, 2021)
df = pd.DataFrame()

headers = {
    "Accept": "application/json",
    "Authorization": "Bearer XXXXX"
}

for year in tqdm(years, desc='fetch record'):
    parameters = {
        "year": year,
        "seasonType": "regular"
    }
    url = "https://api.collegefootballdata.com/rankings"
    response = requests.get(url, params=parameters, headers=headers)
    data = response.json()
    #print(data[0])

    for item in data:
        # values shared by every row of this week
        row = {
            'year': item['season'],
            'season': item['seasonType'],
            'week': item['week'],
        }
        for poll in item["polls"]:
            row['poll'] = poll["poll"]
            for rank in poll["ranks"]:
                row['rank'] = rank["rank"]
                row['team'] = rank["school"]
                #print(row)
                # the row is complete only here, so append once per rank
                # (DataFrame.append was removed in pandas 2.0; this line needs pandas < 2.0)
                df = df.append(row, ignore_index=True)

print(df)
Result:
       year   season  week                        poll  rank            team
0    2020.0  regular   1.0                   AP Top 25   1.0         Clemson
1    2020.0  regular   1.0                   AP Top 25   2.0      Ohio State
2    2020.0  regular   1.0                   AP Top 25   3.0         Alabama
3    2020.0  regular   1.0                   AP Top 25   4.0         Georgia
4    2020.0  regular   1.0                   AP Top 25   5.0        Oklahoma
..      ...      ...   ...                         ...   ...             ...
845  2020.0  regular  16.0  Playoff Committee Rankings  21.0  Oklahoma State
846  2020.0  regular  16.0  Playoff Committee Rankings  22.0        NC State
847  2020.0  regular  16.0  Playoff Committee Rankings  23.0           Tulsa
848  2020.0  regular  16.0  Playoff Committee Rankings  24.0  San José State
849  2020.0  regular  16.0  Playoff Committee Rankings  25.0        Colorado

[850 rows x 6 columns]
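On pandas 2.0 and later, DataFrame.append no longer exists, so the same idea can be written by collecting the rows in a plain list and building the DataFrame once at the end. The following is only a sketch under assumptions: the `sample` data is made up to mimic the response shape from the question (no request is sent), and `flatten_rankings` is a hypothetical helper name, not part of any library.

```python
import pandas as pd

# Made-up sample in the shape of the /rankings response (assumption, not real API output).
sample = [
    {
        "season": 2020, "seasonType": "regular", "week": 1,
        "polls": [{"poll": "AP", "ranks": [
            {"rank": 1, "school": "Alabama"},
            {"rank": 2, "school": "Clemson"},
        ]}],
    },
    {
        "season": 2020, "seasonType": "regular", "week": 2,
        "polls": [{"poll": "AP", "ranks": [
            {"rank": 1, "school": "Alabama"},
            {"rank": 2, "school": "Clemson"},
        ]}],
    },
]

COLUMNS = ["year", "season", "week", "poll", "rank", "team"]


def flatten_rankings(data):
    """Return one DataFrame row per (week, poll, rank) combination."""
    rows = []
    for item in data:
        for poll in item["polls"]:
            for rank in poll["ranks"]:
                # Build a fresh dict per row so earlier rows are never mutated.
                rows.append({
                    "year": item["season"],
                    "season": item["seasonType"],
                    "week": item["week"],
                    "poll": poll["poll"],
                    "rank": rank["rank"],
                    "team": rank["school"],
                })
    # A single constructor call instead of one append per row.
    return pd.DataFrame(rows, columns=COLUMNS)


df = flatten_rankings(sample)
print(df)
```

Building the frame once also avoids copying the whole DataFrame on every iteration, and the integer columns keep their integer dtype instead of being turned into floats.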
EDIT

You can get the same result with dedicated pandas functions such as .read_json(), .explode() and .apply(pd.Series):
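from io import StringIO

# ... code ...

response = requests.get(url, params=parameters, headers=headers)

# wrap the text in StringIO: recent pandas versions deprecate passing a raw JSON string
df = pd.read_json(StringIO(response.text))

# one row per poll, then pull the poll name and its ranks out of each dict
df = df.explode(['polls'])
df['poll'] = df['polls'].str['poll']
df['ranks'] = df['polls'].str['ranks']

# one row per rank entry
df = df.explode(['ranks'])
df = df.reset_index(drop=True)

# expand each rank dict into its own columns
df = df.join(df['ranks'].apply(pd.Series))
df.drop(columns=['polls', 'ranks'], inplace=True)

print(df)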
Result:
     season  seasonType  week  ...         conference  firstPlaceVotes  points
0      2020     regular     1  ...                ACC             38.0  1520.0
1      2020     regular     1  ...            Big Ten             21.0  1504.0
2      2020     regular     1  ...                SEC              2.0  1422.0
3      2020     regular     1  ...                SEC              0.0  1270.0
4      2020     regular     1  ...             Big 12              0.0  1269.0
..      ...         ...   ...  ...                ...              ...     ...
845    2020     regular    16  ...             Big 12              NaN     NaN
846    2020     regular    16  ...                ACC              NaN     NaN
847    2020     regular    16  ...  American Athletic              NaN     NaN
848    2020     regular    16  ...      Mountain West              NaN     NaN
849    2020     regular    16  ...             Pac-12              NaN     NaN

[850 rows x 9 columns]
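Another option that may be worth trying is pd.json_normalize, which can walk the nested lists in one call. This is a sketch, not something tested against the real API: the `sample` below is made-up data in the shape described in the question, and the rename of the generated `polls.poll` column is just a cosmetic choice.

```python
import pandas as pd

# Made-up sample in the shape of the /rankings response (assumption, not real API output).
sample = [
    {
        "season": 2020, "seasonType": "regular", "week": 1,
        "polls": [{"poll": "AP", "ranks": [
            {"rank": 1, "school": "Alabama", "conference": "SEC"},
            {"rank": 2, "school": "Clemson", "conference": "ACC"},
        ]}],
    },
    {
        "season": 2020, "seasonType": "regular", "week": 2,
        "polls": [{"poll": "AP", "ranks": [
            {"rank": 1, "school": "Alabama", "conference": "SEC"},
            {"rank": 2, "school": "Clemson", "conference": "ACC"},
        ]}],
    },
]

df = pd.json_normalize(
    sample,
    # path to the innermost list: each element becomes one row
    record_path=["polls", "ranks"],
    # values copied from the outer levels onto every row
    meta=["season", "seasonType", "week", ["polls", "poll"]],
)

# The nested meta field is named "polls.poll" by default; shorten it.
df = df.rename(columns={"polls.poll": "poll"})
print(df)
```

With real data, `sample` would be replaced by `response.json()`. The record fields (rank, school, conference, ...) come first in the result, followed by the meta columns.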