按唯一 ID 分组,应用函数,并为下一组更新特定列

Grouping by unique IDs, applying a function, and updating a certain column for next groups

我有一个如下所示的数据框:


In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'match_id': ['m1', 'm1', 'm1', 'm1', 'm2', 'm2', 'm2', 'm2', 'm3', 'm3', 'm3', 'm3'],
   ...:                     'name':['peter', 'mike', 'jeff', 'john', 'alex', 'joe', 'jeff', 'peter', 'alex', 'peter', '
   ...: joe', 'tom' ],
   ...:                     'rank': [2, 3, 1, 4, 3, 1, 2, 4, 4, 3, 1, 2],
   ...:                     'rating': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]})

In [3]: df
Out[3]:
    match_id    name    rank  rating
0          m1  peter     2     100
1          m1   mike     3     100
2          m1   jeff     1     100
3          m1   john     4     100
4          m2   alex     3     100
5          m2    joe     1     100
6          m2   jeff     2     100
7          m2  peter     4     100
8          m3   alex     4     100
9          m3  peter     3     100
10         m3    joe     1     100
11         m3    tom     2     100

大约是三场具有唯一 "match_id" 的比赛,参赛者姓名,他们在比赛结束时的排名,以及为整个数据框手动设置为 100 的默认评分。

我想根据 "match_id"s 和 运行 分别为每个匹配的函数对数据进行分组,但该函数的输出应该用于更新下一个匹配的列。

我想使用一个函数来计算球员在每场比赛后更新的评分并将其放入名为 "updated_rating" 的新列中。我厌倦的功能在第一场比赛中看起来像这样:

df = df.loc[df['match_id'] == 'm1']
N = len(df)
df['win_prob'] = 0.0
for i in range(N):
    for j in range(N):
        if i != j:
            df['S'] = (N - df['rank']) / ((N*(N-1))/2)
            df['win_prob'][i] += (1 / (1 + (10 ** ((df['rating'][i] - df['rating'][j])/400))))
            df['normalized_win_prob'] = df['win_prob']/(N*(N-1)/2)
            df['updated_rating'] = round(df['rating'] + (20 * (df['S'] - df['normalized_win_prob'])), 1)

这将在第一场比赛中发挥作用,并根据每个玩家的原始评分计算更新后的评分以及获胜的概率。但是,我无法将其扩展到考虑以下匹配项。

由于一些球员在接下来的比赛中再次出现,我想更新他们的评分(基于前一阶段计算的 "updated_rating" 列)并让该函数完成第二场比赛和第三场比赛的工作之后匹配。

因此,例如,第一个匹配 计算后的输出 将如下所示:


match_id name rank rating  win_prob    S    normalized_win_prob  updated_rating
0   m1  peter   2   100     1.5     0.333333          0.25            101.7
1   m1  mike    3   100     1.5     0.166667          0.25             98.3
2   m1  jeff    1   100     1.5     0.500000          0.25            105.0
3   m1  john    4   100     1.5     0.000000          0.25             95.0

关于如何有效地执行此操作的任何想法? 我的原始数据框比这个示例数据框大得多,所以我的解决方案需要高效。

谢谢

这是我的解决方案。由于您的算法必须逐一循环遍历 match_ids,因此我们首先需要对分组数据进行 for-loop。然后要计算 win_prob,您必须遍历每一行并计算其在同一场比赛中战胜其他行的相关概率。这不是很漂亮。虽然想不出更好的方法:(

df = pd.DataFrame({'match_id': ['m1', 'm1', 'm1', 'm1', 'm2', 'm2', 'm2', 'm2', 'm3', 'm3', 'm3', 'm3', 'm4', 'm4', 'm4', 'm4'],
                   'name':['peter', 'mike', 'jeff', 'john', 'alex', 'joe', 'jeff', 'peter', 'alex', 'peter', 'joe', 'tom', 'mike', 'john', 'tom', 'peter'],
                   'rank': [2, 3, 1, 4, 3, 1, 2, 4, 4, 3, 1, 2, 1, 3, 4, 2],
                   'rating': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]})

# Pre-compute variables that don't depend on ratings
df['N'] = df.groupby('match_id')['name'].transform('count')
df['total_comb'] = ((df['N']*(df['N']-1))/2)
df['S'] = (df['N'] - df['rank']) / df['total_comb']

# Initialize win_prob and updated_rating
df['win_prob'] = np.zeros(len(df))
df['updated_rating'] = df['rating']
df['prev_rating'] = df['rating']

grouped = df.groupby('match_id', sort=True)

dfa = pd.DataFrame() #Final results will be stored here
last_names = []
#Loop through the match_ids from m1 to m2, m3. Note you can sort them when use 'groupby'
for name, dfg in grouped:
    dfm = dfg.copy()
    # Update the 'updated_rating' coming from last match_id
    if len(last_names) > 0:
        dfm.drop(columns=['updated_rating'], inplace=True)
        df_last = dfa.loc[dfa['match_id'].isin(last_names),['name', 'updated_rating']]
        df_last.drop_duplicates(subset=['name'], keep='last', inplace=True)
        dfm = dfm.merge(df_last, left_on='name', right_on='name', how='left')
        dfm['prev_rating'] = np.where(np.isnan(dfm['updated_rating']), dfm['rating'], dfm['updated_rating'])

    # Compute current 'updated_rating'
    win_prob = []
    for index, row in dfm.iterrows():
        prob = np.sum(1.0/(1+10**((row['prev_rating'] - dfm['prev_rating'])/400)))-0.5 #subtract 0.5 to account for self
        win_prob.append(prob)

    dfm['win_prob'] = win_prob
    dfm['normalized_win_prob'] = dfm['win_prob']/dfm['total_comb']
    dfm['updated_rating'] = round(dfm['prev_rating'] + (20 * (dfm['S'] - dfm['normalized_win_prob'])), 1)
    last_names.append(name)
    dfa = pd.concat([dfa, dfm], sort=True)

dfa   

输出:

N       S         match_id  name    normalized_win_prob prev_rating   rank  rating  total_comb  updated_rating  win_prob
4   0.333333333     m1      peter   0.25                    100         2   100     6               101.7       1.5
4   0.166666667     m1      mike    0.25                    100         3   100     6               98.3        1.5
4   0.5             m1      jeff    0.25                    100         1   100     6               105         1.5
4   0               m1      john    0.25                    100         4   100     6               95          1.5
4   0.166666667     m2      alex    0.251606926             100         3   100     6               98.3        1.509641559
4   0.5             m2      joe     0.251606926             100         1   100     6               105         1.509641559
4   0.333333333     m2      jeff    0.24681015              105         2   100     6               106.7       1.480860898
4   0               m2      peter   0.249975997             101.7       4   100     6               96.7        1.499855985
4   0               m3      alex    0.251630798             98.3        4   100     6               93.3        1.509784788
4   0.166666667     m3      peter   0.253165649             96.7        3   100     6               95          1.518993896
4   0.5             m3      joe     0.245203608             105         1   100     6               110.1       1.47122165
4   0.333333333     m3      tom     0.249999944             100         2   100     6               101.7       1.499999666
4   0.5             m4      mike    0.249232493             98.3        1   100     6               103.3       1.495394959
4   0.166666667     m4      john    0.252398303             95          3   100     6               93.3        1.514389819
4   0               m4      tom     0.2459709               101.7       4   100     6               96.8        1.475825403
4   0.333333333     m4      peter   0.252398303             95          2   100     6               96.6        1.514389819