Grouping by unique IDs, applying a function, and updating a certain column for next groups
I have a dataframe that looks like this:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'match_id': ['m1', 'm1', 'm1', 'm1', 'm2', 'm2', 'm2', 'm2', 'm3', 'm3', 'm3', 'm3'],
   ...:                    'name': ['peter', 'mike', 'jeff', 'john', 'alex', 'joe', 'jeff', 'peter', 'alex', 'peter', 'joe', 'tom'],
   ...:                    'rank': [2, 3, 1, 4, 3, 1, 2, 4, 4, 3, 1, 2],
   ...:                    'rating': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]})
In [3]: df
Out[3]:
   match_id   name  rank  rating
0        m1  peter     2     100
1        m1   mike     3     100
2        m1   jeff     1     100
3        m1   john     4     100
4        m2   alex     3     100
5        m2    joe     1     100
6        m2   jeff     2     100
7        m2  peter     4     100
8        m3   alex     4     100
9        m3  peter     3     100
10       m3    joe     1     100
11       m3    tom     2     100
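It describes three matches, each with a unique "match_id", along with the names of the players who took part, their ranks at the end of the match, and a default rating that is manually set to 100 for the whole dataframe.
I want to group the data by "match_id" and run a function on each match separately, but the output of that function should be used to update a column for the following matches.
I want to use a function that calculates each player's updated rating after a match and puts it in a new column called "updated_rating". The function I tried looks like this for the first match: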
    df = df.loc[df['match_id'] == 'm1'].copy()  # copy so the assignments below do not hit a view
    N = len(df)
    df['S'] = (N - df['rank']) / ((N*(N-1))/2)  # does not depend on i or j, so computed once
    df['win_prob'] = 0.0
    for i in df.index:
        for j in df.index:
            if i != j:
                # .loc avoids chained assignment (df['win_prob'][i] += ...)
                df.loc[i, 'win_prob'] += 1 / (1 + 10 ** ((df.loc[i, 'rating'] - df.loc[j, 'rating'])/400))
    df['normalized_win_prob'] = df['win_prob']/(N*(N-1)/2)
    df['updated_rating'] = round(df['rating'] + (20 * (df['S'] - df['normalized_win_prob'])), 1)
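This works for the first match and calculates each player's updated rating from their original rating and their win probability. However, I could not extend it to take the subsequent matches into account.
Since some players appear again in later matches, I want to update their ratings (based on the "updated_rating" column calculated in the previous step) and have the function do the same job for the second match, then the third, and so on.
So, for example, the output after the calculation for the first match would look like this: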
  match_id   name  rank  rating  win_prob         S  normalized_win_prob  updated_rating
0       m1  peter     2     100       1.5  0.333333                 0.25           101.7
1       m1   mike     3     100       1.5  0.166667                 0.25            98.3
2       m1   jeff     1     100       1.5  0.500000                 0.25           105.0
3       m1   john     4     100       1.5  0.000000                 0.25            95.0
Any ideas on how to do this efficiently?
My original dataframe is much larger than this sample, so the solution needs to be efficient.
Thanks
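Here is my solution. Since your algorithm has to go through the match_ids one at a time, we first need a for-loop over the grouped data. Then, to calculate win_prob, you have to iterate over each row and compute its probability of beating each of the other rows in the same match. It's not very pretty, but I could not think of a better way :(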
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'match_id': ['m1', 'm1', 'm1', 'm1', 'm2', 'm2', 'm2', 'm2', 'm3', 'm3', 'm3', 'm3', 'm4', 'm4', 'm4', 'm4'],
                       'name': ['peter', 'mike', 'jeff', 'john', 'alex', 'joe', 'jeff', 'peter', 'alex', 'peter', 'joe', 'tom', 'mike', 'john', 'tom', 'peter'],
                       'rank': [2, 3, 1, 4, 3, 1, 2, 4, 4, 3, 1, 2, 1, 3, 4, 2],
                       'rating': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]})

    # Pre-compute variables that don't depend on ratings
    df['N'] = df.groupby('match_id')['name'].transform('count')
    df['total_comb'] = (df['N'] * (df['N'] - 1)) / 2
    df['S'] = (df['N'] - df['rank']) / df['total_comb']

    # Initialize win_prob and updated_rating
    df['win_prob'] = np.zeros(len(df))
    df['updated_rating'] = df['rating']
    df['prev_rating'] = df['rating']

    grouped = df.groupby('match_id', sort=True)
    dfa = pd.DataFrame()  # Final results will be stored here
    last_names = []       # match_ids processed so far

    # Loop through the match_ids in sorted order (m1, m2, m3, ...).
    # groupby(sort=True) sorts the keys; string ids sort lexicographically.
    for name, dfg in grouped:
        dfm = dfg.copy()
        # Bring in each player's latest 'updated_rating' from the earlier matches
        if len(last_names) > 0:
            dfm = dfm.drop(columns=['updated_rating'])
            df_last = dfa.loc[dfa['match_id'].isin(last_names), ['name', 'updated_rating']]
            df_last = df_last.drop_duplicates(subset=['name'], keep='last')
            dfm = dfm.merge(df_last, on='name', how='left')
            # Players with no earlier match keep their initial 'rating'
            dfm['prev_rating'] = np.where(dfm['updated_rating'].isna(), dfm['rating'], dfm['updated_rating'])
        # Compute current 'updated_rating'
        win_prob = []
        for index, row in dfm.iterrows():
            # Sum over all rows in the match; subtract 0.5 to account for self
            prob = np.sum(1.0 / (1 + 10 ** ((row['prev_rating'] - dfm['prev_rating']) / 400))) - 0.5
            win_prob.append(prob)
        dfm['win_prob'] = win_prob
        dfm['normalized_win_prob'] = dfm['win_prob'] / dfm['total_comb']
        dfm['updated_rating'] = round(dfm['prev_rating'] + (20 * (dfm['S'] - dfm['normalized_win_prob'])), 1)
        last_names.append(name)
        dfa = pd.concat([dfa, dfm], sort=True)

    dfa
Output:
N  S            match_id  name   normalized_win_prob  prev_rating  rank  rating  total_comb  updated_rating  win_prob
4  0.333333333  m1        peter  0.25                 100          2     100     6           101.7           1.5
4  0.166666667  m1        mike   0.25                 100          3     100     6           98.3            1.5
4  0.5          m1        jeff   0.25                 100          1     100     6           105             1.5
4  0            m1        john   0.25                 100          4     100     6           95              1.5
4  0.166666667  m2        alex   0.251606926          100          3     100     6           98.3            1.509641559
4  0.5          m2        joe    0.251606926          100          1     100     6           105             1.509641559
4  0.333333333  m2        jeff   0.24681015           105          2     100     6           106.7           1.480860898
4  0            m2        peter  0.249975997          101.7        4     100     6           96.7            1.499855985
4  0            m3        alex   0.251630798          98.3         4     100     6           93.3            1.509784788
4  0.166666667  m3        peter  0.253165649          96.7         3     100     6           95              1.518993896
4  0.5          m3        joe    0.245203608          105          1     100     6           110.1           1.47122165
4  0.333333333  m3        tom    0.249999944          100          2     100     6           101.7           1.499999666
4  0.5          m4        mike   0.249232493          98.3         1     100     6           103.3           1.495394959
4  0.166666667  m4        john   0.252398303          95           3     100     6           93.3            1.514389819
4  0            m4        tom    0.2459709            101.7        4     100     6           96.8            1.475825403
4  0.333333333  m4        peter  0.252398303          95           2     100     6           96.6            1.514389819
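If the per-row loop and the repeated merge become a bottleneck on a larger dataframe, one possible variation is sketched below. It keeps the same formulas as the code above but computes all pairwise win probabilities of a match at once with NumPy broadcasting, and carries the latest ratings in a plain dict instead of merging against earlier results. The names update_match, run_matches and the parameter k are made up for this illustration, and the sketch assumes each player appears at most once per match and that sorting match_id gives the chronological order. Matches still have to be processed one after another, because each one depends on the ratings produced by the previous ones.

```python
import numpy as np
import pandas as pd


def update_match(ratings, ranks, k=20.0):
    """Hypothetical helper: win probabilities and new ratings for ONE match."""
    ratings = np.asarray(ratings, dtype=float)
    ranks = np.asarray(ranks, dtype=float)
    n = len(ratings)
    total_comb = n * (n - 1) / 2
    # diff[i, j] = ratings[i] - ratings[j], same sign convention as the loop version
    diff = ratings[:, None] - ratings[None, :]
    pairwise = 1.0 / (1.0 + 10.0 ** (diff / 400.0))
    # The diagonal (a player against themselves) contributes exactly 0.5 per row
    win_prob = pairwise.sum(axis=1) - 0.5
    s = (n - ranks) / total_comb
    new_ratings = np.round(ratings + k * (s - win_prob / total_comb), 1)
    return win_prob, new_ratings


def run_matches(df):
    """Hypothetical driver: process matches in sorted match_id order."""
    current = {}  # player name -> most recent updated rating
    parts = []
    for _, dfm in df.groupby('match_id', sort=True):
        dfm = dfm.copy()
        # Players not seen before fall back to the row's initial 'rating'
        dfm['prev_rating'] = [float(current.get(p, r))
                              for p, r in zip(dfm['name'], dfm['rating'])]
        win_prob, new = update_match(dfm['prev_rating'].to_numpy(),
                                     dfm['rank'].to_numpy())
        dfm['win_prob'] = win_prob
        dfm['updated_rating'] = new
        current.update(zip(dfm['name'], new))
        parts.append(dfm)
    return pd.concat(parts)


# Same sample data as in the answer above
df = pd.DataFrame({
    'match_id': ['m1'] * 4 + ['m2'] * 4 + ['m3'] * 4 + ['m4'] * 4,
    'name': ['peter', 'mike', 'jeff', 'john', 'alex', 'joe', 'jeff', 'peter',
             'alex', 'peter', 'joe', 'tom', 'mike', 'john', 'tom', 'peter'],
    'rank': [2, 3, 1, 4, 3, 1, 2, 4, 4, 3, 1, 2, 1, 3, 4, 2],
    'rating': [100] * 16,
})
out = run_matches(df)
print(out[['match_id', 'name', 'prev_rating', 'win_prob', 'updated_rating']])
```

The pairwise matrix needs memory proportional to the square of the number of players in a single match, which is usually small; the number of matches only affects how many times the outer loop runs.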