How to populate a df with rolling head to head between teams?
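I have a df containing data about matches between teams, and I want to create a new column holding the h2h record between the two teams before each match.
For example: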
df = pd.DataFrame(data = [['LAC','LAL', 1, '15/02/2022'], ['LAC','LAL', 1, '16/02/2022'], ['LAL','LAC', 1, '17/02/2022'],
['LAL','LAC', 1, '18/02/2022'], ['LAL','LAC', 1, '19/02/2022'], ['LAC','LAL', 1, '20/02/2022'],
['LAL','LAC', 1, '21/02/2022'], ['LAC','LAL', 1, '22/02/2022']],
columns = ['winner', 'loser', 'won', 'date'])
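In this example, the h2h before each match should be: 0-0, 1-0, 2-0, 1-2, 2-2, 3-3, 3-4
I want to calculate h2h % wins, but I think getting the number of wins one team has against the other is the first step. I can calculate the final h2h with groupby, but I'm not sure how to calculate it for each match, because a team can be in either of the two columns. Note that this df follows a winner/loser format, so 'won' is always 1. Alternatively, I could change the df to a long version (one match = two rows), but I'm not sure whether that would help. I also have other columns, but I'm not sure they are relevant to this question (more stats, IDs, etc.).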
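As a rough illustration of the long version mentioned above (one match = two rows), a minimal sketch might look like the following. The `match_id`, `team`, `opponent` and `h2h_wins_before` names are made up for this example, and it assumes the dates are day-first strings as in the sample data.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['LAC', 'LAL', 1, '15/02/2022'], ['LAC', 'LAL', 1, '16/02/2022'],
          ['LAL', 'LAC', 1, '17/02/2022'], ['LAL', 'LAC', 1, '18/02/2022'],
          ['LAL', 'LAC', 1, '19/02/2022'], ['LAC', 'LAL', 1, '20/02/2022'],
          ['LAL', 'LAC', 1, '21/02/2022'], ['LAC', 'LAL', 1, '22/02/2022']],
    columns=['winner', 'loser', 'won', 'date'],
)

# Parse the day-first strings so that sorting is chronological
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
df = df.sort_values('date', ignore_index=True)
df['match_id'] = df.index  # hypothetical identifier, one per match

# One row per (match, team): the winner's row keeps won=1, the loser's row gets won=0
win_rows = df.rename(columns={'winner': 'team', 'loser': 'opponent'})
lose_rows = df.rename(columns={'loser': 'team', 'winner': 'opponent'}).assign(won=0)
long = (
    pd.concat([win_rows, lose_rows])
    .sort_values(['date', 'match_id'], kind='stable')
    .reset_index(drop=True)
)

# Wins of `team` over `opponent` before this match:
# running total within the pairing minus the current match's own result
long['h2h_wins_before'] = (
    long.groupby(['team', 'opponent'])['won'].cumsum() - long['won']
)

print(long)
```

In this layout each team always sits in the same column, so the winner/loser ambiguity goes away, at the cost of doubling the number of rows.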
Based on @拟人's reply, I can do the following:
df['winner_wins'] = df.groupby(['winner', 'loser'])['won'].cumsum()
df['winner_wins'] = df.groupby(['winner', 'loser'])['winner_wins'].shift(1)
This accurately records the number of wins of the 'winner' team before the match. But I don't know how I should get the same thing for the 'loser' team.
If I understand your question correctly, the cumsum and expanding methods may be useful to you.
Code:
import pandas as pd
# Create a sample dataframe
df = pd.DataFrame(data = [['LAC','LAL', 1, '15/02/2022'], ['LAC','LAL', 1, '16/02/2022'], ['LAL','LAC', 1, '17/02/2022'], ['LAL','LAC', 1, '18/02/2022'], ['LAL','LAC', 1, '19/02/2022'], ['LAC','LAL', 1, '20/02/2022'], ['LAL','LAC', 1, '21/02/2022'], ['LAC','LAL', 1, '22/02/2022']], columns = ['winner', 'loser', 'won', 'date'])
# Calculate h2h records
df = df.sort_values('date').assign(
LAC_h2h_wins=(df.winner=='LAC').cumsum(),
LAL_h2h_wins=(df.winner=='LAL').cumsum(),
LAC_h2h_wins_pct=(df.winner=='LAC').expanding().agg(lambda s: 100 * s.sum() / len(s)),
LAL_h2h_wins_pct=(df.winner=='LAL').expanding().agg(lambda s: 100 * s.sum() / len(s)),
)
print(df)
Output:
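| | winner | loser | won | date | LAC_h2h_wins | LAL_h2h_wins | LAC_h2h_wins_pct | LAL_h2h_wins_pct |
|---|---|---|---|---|---|---|---|---|
| 0 | LAC | LAL | 1 | 15/02/2022 | 1 | 0 | 100 | 0 |
| 1 | LAC | LAL | 1 | 16/02/2022 | 2 | 0 | 100 | 0 |
| 2 | LAL | LAC | 1 | 17/02/2022 | 2 | 1 | 66.6667 | 33.3333 |
| 3 | LAL | LAC | 1 | 18/02/2022 | 2 | 2 | 50 | 50 |
| 4 | LAL | LAC | 1 | 19/02/2022 | 2 | 3 | 40 | 60 |
| 5 | LAC | LAL | 1 | 20/02/2022 | 3 | 3 | 50 | 50 |
| 6 | LAL | LAC | 1 | 21/02/2022 | 3 | 4 | 42.8571 | 57.1429 |
| 7 | LAC | LAL | 1 | 22/02/2022 | 4 | 4 | 50 | 50 |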
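The counts in the output above include the game on the current row. If the record before each game is wanted, as the question describes, one possible tweak is to subtract the current game's result from the running total. This is only a sketch: the `*_wins_before` column names are made up, and it assumes the two-team sample data with day-first date strings.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['LAC', 'LAL', 1, '15/02/2022'], ['LAC', 'LAL', 1, '16/02/2022'],
          ['LAL', 'LAC', 1, '17/02/2022'], ['LAL', 'LAC', 1, '18/02/2022'],
          ['LAL', 'LAC', 1, '19/02/2022'], ['LAC', 'LAL', 1, '20/02/2022'],
          ['LAL', 'LAC', 1, '21/02/2022'], ['LAC', 'LAL', 1, '22/02/2022']],
    columns=['winner', 'loser', 'won', 'date'],
)

# Sorting the raw strings would be lexicographic, so parse them first
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
df = df.sort_values('date', ignore_index=True)

for team in ['LAC', 'LAL']:
    won = (df['winner'] == team).astype(int)
    # Running total minus the current result = wins before this game
    df[f'{team}_wins_before'] = won.cumsum() - won

print(df)
```

Subtracting the current result is equivalent to shifting the cumulative sum down by one row and filling the first row with 0.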
[Edit]
In response to the OP's comment.
Code:
import pandas as pd
# Create a sample dataframe with more data points
df = pd.DataFrame(data = [['LAC','LAL', 1, '15/02/2022'], ['LAC','LAL', 1, '16/02/2022'], ['LAL','LAC', 1, '17/02/2022'], ['LAL','LAC', 1, '18/02/2022'], ['LAL','LAC', 1, '19/02/2022'], ['LAC','LAL', 1, '20/02/2022'], ['LAL','LAC', 1, '21/02/2022'], ['LAC','LAL', 1, '22/02/2022'], ['ABC','LAL', 1, '15/02/2022'], ['ABC','LAL', 1, '16/02/2022'], ['LAL','ABC', 1, '17/02/2022'], ['LAL','ABC', 1, '18/02/2022'], ['LAL','ABC', 1, '19/02/2022'], ['ABC','LAL', 1, '20/02/2022'], ['LAL','ABC', 1, '21/02/2022'], ['ABC','LAL', 1, '22/02/2022'], ['ABC','XYZ', 1, '15/02/2022'], ['ABC','XYZ', 1, '16/02/2022'], ['XYZ','ABC', 1, '17/02/2022'], ['XYZ','ABC', 1, '18/02/2022'], ['XYZ','ABC', 1, '19/02/2022'], ['ABC','XYZ', 1, '20/02/2022'], ['XYZ','ABC', 1, '21/02/2022'], ['ABC','XYZ', 1, '22/02/2022'], ['LAC','XYZ', 1, '15/02/2022'], ['LAC','XYZ', 1, '16/02/2022'], ['XYZ','LAC', 1, '17/02/2022'], ['XYZ','LAC', 1, '18/02/2022'], ['XYZ','LAC', 1, '19/02/2022'], ['LAC','XYZ', 1, '20/02/2022'], ['XYZ','LAC', 1, '21/02/2022'], ['LAC','XYZ', 1, '22/02/2022']], columns = ['winner', 'loser', 'won', 'date'])
# In order to group by games, make sorted game titles like "LAC-LAL"
df['game'] = df.apply(lambda r: '-'.join(sorted([r.winner, r.loser])), axis=1)
# Ensure that df is sorted by game and date (dates must be in ascending order)
df = df.sort_values(['game', 'date'], ignore_index=True)
# Flag True if the left team in the game title won, otherwise False. For example, "LAC" is the left team in the game title "LAC-LAL"
df['left_win'] = df.apply(lambda r: f'{r.winner}-{r.loser}'==r.game, axis=1)
# Do the same thing on the right team.
df['right_win'] = ~df.left_win
# Calculate the cumulative sums.
df[['left_win_cumsum', 'right_win_cumsum']] = df.groupby('game')[['left_win', 'right_win']].cumsum()
# Shift and fill the first games as 0
df[['h2h_winner', 'h2h_loser']] = df.groupby('game')[['left_win_cumsum', 'right_win_cumsum']].shift().fillna(0).astype(int)
# Check the order in a pair of winner and loser columns. If the order is different from the game title, reverse the cumsum values
f = lambda r: [r.h2h_winner, r.h2h_loser] if f'{r.winner}-{r.loser}'==r.game else [r.h2h_loser, r.h2h_winner]
df[['h2h_winner', 'h2h_loser']] = df.apply(f, axis=1).apply(pd.Series)
# Drop all the temporary columns
df = df.drop(['game', 'left_win', 'right_win', 'left_win_cumsum', 'right_win_cumsum'], axis=1)
print(df.to_markdown(stralign='center', numalign='center'))
Output (only the LAC - LAL games extracted):
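| | winner | loser | won | date | h2h_winner | h2h_loser |
|---|---|---|---|---|---|---|
| 16 | LAC | LAL | 1 | 15/02/2022 | 0 | 0 |
| 17 | LAC | LAL | 1 | 16/02/2022 | 1 | 0 |
| 18 | LAL | LAC | 1 | 17/02/2022 | 0 | 2 |
| 19 | LAL | LAC | 1 | 18/02/2022 | 1 | 2 |
| 20 | LAL | LAC | 1 | 19/02/2022 | 2 | 2 |
| 21 | LAC | LAL | 1 | 20/02/2022 | 2 | 3 |
| 22 | LAL | LAC | 1 | 21/02/2022 | 3 | 3 |
| 23 | LAC | LAL | 1 | 22/02/2022 | 3 | 4 |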
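For larger frames, the row-wise `apply` calls can probably be avoided. The sketch below is an alternative formulation, not the code above: it relies on the observation that every earlier game in a pairing was won either by the current winner or by the current loser, so two `cumcount` calls are enough. The `pair` and `games_before` names are made up, and the shortened sample data is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=[['LAC', 'LAL', 1, '15/02/2022'], ['LAC', 'LAL', 1, '16/02/2022'],
          ['LAL', 'LAC', 1, '17/02/2022'], ['LAL', 'LAC', 1, '18/02/2022'],
          ['LAL', 'LAC', 1, '19/02/2022'], ['LAC', 'LAL', 1, '20/02/2022'],
          ['LAL', 'LAC', 1, '21/02/2022'], ['LAC', 'LAL', 1, '22/02/2022'],
          ['ABC', 'LAL', 1, '15/02/2022'], ['LAL', 'ABC', 1, '17/02/2022']],
    columns=['winner', 'loser', 'won', 'date'],
)

# Parse day-first dates so the chronological sort is reliable
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
df = df.sort_values('date', kind='stable', ignore_index=True)

# Order-independent key per pairing, e.g. "LAC-LAL" for both LAC/LAL and LAL/LAC
w, l = df['winner'], df['loser']
pair = np.where(w < l, w + '-' + l, l + '-' + w)

# Number of earlier games in this pairing (0 for the first meeting)
games_before = df.groupby(pair).cumcount()
# Earlier games in which the current winner beat the current loser
df['h2h_winner'] = df.groupby(['winner', 'loser']).cumcount()
# Every other earlier game in the pairing was won by the current loser
df['h2h_loser'] = games_before - df['h2h_winner']

print(df)
```

This assumes there are no draws and that rows are in chronological order within each pairing.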
Try:
import numpy as np
import pandas as pd
tmp = pd.crosstab(df.index, df["winner"]).shift(fill_value=0).cumsum()
# prevent a KeyError if there's a team that never wins (it has no column in the crosstab):
tmp = tmp.merge(
pd.DataFrame(columns=np.unique(df[["winner", "loser"]])), how="outer"
).fillna(0)
df[["winner_cnt", "loser_cnt"]] = df.apply(
lambda x: tmp.loc[x.name, x[["winner", "loser"]].values].values, axis=1
).apply(pd.Series)
print(df)
Prints:
winner loser won date winner_cnt loser_cnt
0 LAC LAL 1 15/02/2022 0 0
1 LAC LAL 1 16/02/2022 1 0
2 LAL LAC 1 17/02/2022 0 2
3 LAL LAC 1 18/02/2022 1 2
4 LAL LAC 1 19/02/2022 2 2
5 LAC LAL 1 20/02/2022 2 3
6 LAL LAC 1 21/02/2022 3 3
7 LAC LAL 1 22/02/2022 3 4
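Since the question's end goal is the h2h win percentage, the two count columns can be turned into a percentage with a plain division. The sketch below hard-codes the counts printed above, and the `winner_h2h_pct` name is made up. Leaving the first meeting as NaN rather than 0 is a choice made for this example, not something the question specifies.

```python
import numpy as np
import pandas as pd

# Pre-match counts copied from the printed output above
df = pd.DataFrame({
    'winner': ['LAC', 'LAC', 'LAL', 'LAL', 'LAL', 'LAC', 'LAL', 'LAC'],
    'loser': ['LAL', 'LAL', 'LAC', 'LAC', 'LAC', 'LAL', 'LAC', 'LAL'],
    'winner_cnt': [0, 1, 0, 1, 2, 2, 3, 3],
    'loser_cnt': [0, 0, 2, 2, 2, 3, 3, 4],
})

total = df['winner_cnt'] + df['loser_cnt']
# With no earlier meetings the percentage is undefined, so the
# zero totals are masked to NaN before dividing
df['winner_h2h_pct'] = 100 * df['winner_cnt'] / total.where(total > 0)

print(df)
```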