Pandas percentrank based on groups within each index

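I have a DataFrame whose index contains dates (the same date appears many times). Each date has columns such as price, score, and category.

I want to add one new column to the DataFrame, called pctrank.

In the pctrank column, I want the percentile rank of the SCORE value within each category at each index level. For example, for 1/24/2017 in the data below, I would percent-rank all the SuperMarket scores together, separately percent-rank all the Resteraunt scores, and then move on to the next date.

Because the dataset is large, I'd like this to be reasonably efficient.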

** Sample data below **

Subset of df:

            Category    SCORE
1/24/2017   SuperMarket 12
1/24/2017   Resteraunt  21
1/24/2017   SuperMarket 13
1/24/2017   SuperMarket 22
1/24/2017   Resteraunt  27
1/24/2017   SuperMarket 30
1/24/2017   Resteraunt  34
1/24/2017   Resteraunt  32
1/24/2017   Resteraunt  21
1/24/2017   Resteraunt  12
1/24/2017   Bar         10
1/24/2017   Bar          3
1/24/2017   Bar         24
1/25/2017   Resteraunt  32
1/25/2017   Resteraunt  63
1/25/2017   Resteraunt  32
1/25/2017   Bar         12
1/25/2017   Bar         32
1/25/2017   Hospital    22
1/25/2017   Hospital    12
1/25/2017   Bar         10

Sample output:

            Category    SCORE   Percnt rank
1/24/2017   SuperMarket 12      0
1/24/2017   Resteraunt  21      0.2
1/24/2017   SuperMarket 13      0.333
1/24/2017   SuperMarket 22      0.666
1/24/2017   Resteraunt  27      0.6
1/24/2017   SuperMarket 30      1
1/24/2017   Resteraunt  34      1
1/24/2017   Resteraunt  32      0.8
1/24/2017   Resteraunt  21      0.2
1/24/2017   Resteraunt  12      0
1/24/2017   Bar         10      0.5
1/24/2017   Bar          3      0
1/24/2017   Bar         24      1
1/25/2017   Resteraunt  32      0
1/25/2017   Resteraunt  63      1
1/25/2017   Resteraunt  32      0
1/25/2017   Bar         12      0.5
1/25/2017   Bar         32      1
1/25/2017   Hospital    22      1
1/25/2017   Hospital    12      0
1/25/2017   Bar         10      0

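The real dataset contains a large number of dates and corresponding entries.

You can use groupby with rank and divide by nunique. To make the ranks start at 0, subtract 1 from both the rank and the number of unique values: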

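# transform (rather than apply) keeps the result in the original row order,
# so assigning the raw values back to df stays aligned
df['Percnt rank'] = df.reset_index() \
                      .groupby(['index','Category'])['SCORE'] \
                      .transform(lambda x: (x.rank(method='dense') - 1) / (x.nunique() - 1)) \
                      .values
print(df)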

              Category  SCORE  Percnt rank
1/24/2017  SuperMarket     12     0.000000
1/24/2017   Resteraunt     21     0.250000
1/24/2017  SuperMarket     13     0.333333
1/24/2017  SuperMarket     22     0.666667
1/24/2017   Resteraunt     27     0.500000
1/24/2017  SuperMarket     30     1.000000
1/24/2017   Resteraunt     34     1.000000
1/24/2017   Resteraunt     32     0.750000
1/24/2017   Resteraunt     21     0.250000
1/24/2017   Resteraunt     12     0.000000
1/24/2017          Bar     10     0.500000
1/24/2017          Bar      3     0.000000
1/24/2017          Bar     24     1.000000
1/25/2017   Resteraunt     32     0.000000
1/25/2017   Resteraunt     63     1.000000
1/25/2017   Resteraunt     32     0.000000
1/25/2017          Bar     12     0.500000
1/25/2017          Bar     32     1.000000
1/25/2017     Hospital     22     1.000000
1/25/2017     Hospital     12     0.000000
1/25/2017          Bar     10     0.000000
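
Note that the dense/nunique formula gives 0.25 for the Resteraunt score of 21 on 1/24/2017, while the sample output in the question shows 0.2. The question's numbers are consistent with counting the scores strictly below each value and dividing by the group size minus one. The following is a minimal sketch of that variant, assuming this is the intended definition (the question does not state it). It uses only built-in groupby operations, so no Python lambda is called per group, which may help on a large dataset:

```python
import pandas as pd

# One date from the question's sample, restricted to two categories.
df = pd.DataFrame(
    {
        "Category": ["SuperMarket", "Resteraunt", "SuperMarket", "SuperMarket",
                     "Resteraunt", "SuperMarket", "Resteraunt", "Resteraunt",
                     "Resteraunt", "Resteraunt"],
        "SCORE": [12, 21, 13, 22, 27, 30, 34, 32, 21, 12],
    },
    index=["1/24/2017"] * 10,
)

# reset_index() turns the unnamed date index into a column called 'index'.
g = df.reset_index().groupby(["index", "Category"])["SCORE"]

# method='min' gives tied scores the lowest rank of the tie, so rank - 1 is
# the number of scores strictly below. Dividing by (group size - 1) maps the
# minimum to 0 and the maximum to 1. A single-row group yields 0/0 = NaN.
pctrank = (g.rank(method="min") - 1) / (g.transform("count") - 1)

# Both pieces keep the original row order, so the raw values line up with df.
df["pctrank"] = pctrank.to_numpy()
print(df)
```

On this subset the Resteraunt scores 12, 21, 27, 32, 34 come out as 0, 0.2, 0.6, 0.8, 1, matching the sample output in the question.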

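The manual formula is needed because the built-in percentage rank (pct=True) gives different output. (The numbers below appear to come from an older pandas release; newer releases divide dense ranks by the number of distinct values in the group, so the exact values may differ.)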

df['Percnt rank'] = df.reset_index()\
                      .groupby(['index','Category'])['SCORE'].rank(method='dense', pct=True)\
                      .values
print (df)
              Category  SCORE  Percnt rank
1/24/2017  SuperMarket     12     0.250000
1/24/2017   Resteraunt     21     0.333333
1/24/2017  SuperMarket     13     0.500000
1/24/2017  SuperMarket     22     0.750000
1/24/2017   Resteraunt     27     0.500000
1/24/2017  SuperMarket     30     1.000000
1/24/2017   Resteraunt     34     0.833333
1/24/2017   Resteraunt     32     0.666667
1/24/2017   Resteraunt     21     0.333333
1/24/2017   Resteraunt     12     0.166667
1/24/2017          Bar     10     0.666667
1/24/2017          Bar      3     0.333333
1/24/2017          Bar     24     1.000000
1/25/2017   Resteraunt     32     0.333333
1/25/2017   Resteraunt     63     0.666667
1/25/2017   Resteraunt     32     0.333333
1/25/2017          Bar     12     0.666667
1/25/2017          Bar     32     1.000000
1/25/2017     Hospital     22     1.000000
1/25/2017     Hospital     12     0.500000
1/25/2017          Bar     10     0.333333
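
The reason pct=True never produces 0 is that it divides each rank by a total, and the smallest rank is 1. The following is a small sketch on a single group (the SuperMarket scores for 1/24/2017), using the default method='average' to keep the comparison simple:

```python
import pandas as pd

s = pd.Series([12, 13, 22, 30])  # SuperMarket scores on 1/24/2017

pct = s.rank(pct=True)                        # built-in percentage rank
manual = s.rank() / s.count()                 # the same numbers by hand: rank / size
rescaled = (s.rank() - 1) / (s.count() - 1)   # shifted so the minimum maps to 0

print(pd.DataFrame({"pct": pct, "manual": manual, "rescaled": rescaled}))
```

Here the built-in result starts at 0.25, while the rescaled column runs from 0 to 1.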

Use a custom function that computes rank(method='dense', pct=True) over everything except the group minimum, then fills the minimum back in with 0:

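import pandas as pd

def prank(s):
    # True for every value except the group minimum
    mask = s.values != s.values.min()
    # start at 0.0 so the minimum keeps rank 0 and the dtype is float
    r = pd.Series(0.0, index=s.index)
    r.loc[mask] = s.loc[mask].rank(method='dense', pct=True)
    return r


# transform keeps the result in the original row order
df.assign(**{'Percent rank': df.reset_index().groupby(['index', 'Category']).SCORE.transform(prank).values})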

              Category  SCORE  Percent rank
1/24/2017  SuperMarket     12      0.000000
1/24/2017   Resteraunt     21      0.200000
1/24/2017  SuperMarket     13      0.333333
1/24/2017  SuperMarket     22      0.666667
1/24/2017   Resteraunt     27      0.400000
1/24/2017  SuperMarket     30      1.000000
1/24/2017   Resteraunt     34      0.800000
1/24/2017   Resteraunt     32      0.600000
1/24/2017   Resteraunt     21      0.200000
1/24/2017   Resteraunt     12      0.000000
1/24/2017          Bar     10      0.500000
1/24/2017          Bar      3      0.000000
1/24/2017          Bar     24      1.000000
1/25/2017   Resteraunt     32      0.000000
1/25/2017   Resteraunt     63      1.000000
1/25/2017   Resteraunt     32      0.000000
1/25/2017          Bar     12      0.500000
1/25/2017          Bar     32      1.000000
1/25/2017     Hospital     22      1.000000
1/25/2017     Hospital     12      0.000000
1/25/2017          Bar     10      0.000000
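
To see how prank treats ties at the minimum, here is a small self-contained check on the 1/25/2017 Resteraunt scores. The function is restated with an explicit 0.0 fill (a small change made here so the result dtype is float on current pandas); treat it as a sketch rather than the exact code above:

```python
import pandas as pd

def prank(s):
    # Same idea as prank above: rank everything except the group minimum.
    mask = s.values != s.values.min()
    r = pd.Series(0.0, index=s.index)   # minimum values stay at 0
    r.loc[mask] = s.loc[mask].rank(method="dense", pct=True)
    return r

tied_min = pd.Series([32, 63, 32])  # Resteraunt scores on 1/25/2017
print(prank(tied_min).tolist())
```

Both 32s equal the group minimum, so both are masked out and get 0, and the single remaining score gets 1.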