Pandas percentrank based on groups within each index
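I have a DataFrame whose index contains dates (the same date appears many times). Each date has columns such as price, score, category, and so on.
I want to add one new column to the DataFrame, named pctrank.
In the pctrank column I want the percentile rank, based on the SCORE value, within each category at each index level. For example, for 1/24/2017 in the data below, I would percent-rank all the SuperMarket scores, separately percent-rank all the Resteraunt scores, and then move on to the next date.
Because the dataset is large, I would like this to be reasonably efficient.
Sample data (a subset of df):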
Category SCORE
1/24/2017 SuperMarket 12
1/24/2017 Resteraunt 21
1/24/2017 SuperMarket 13
1/24/2017 SuperMarket 22
1/24/2017 Resteraunt 27
1/24/2017 SuperMarket 30
1/24/2017 Resteraunt 34
1/24/2017 Resteraunt 32
1/24/2017 Resteraunt 21
1/24/2017 Resteraunt 12
1/24/2017 Bar 10
1/24/2017 Bar 3
1/24/2017 Bar 24
1/25/2017 Resteraunt 32
1/25/2017 Resteraunt 63
1/25/2017 Resteraunt 32
1/25/2017 Bar 12
1/25/2017 Bar 32
1/25/2017 Hospital 22
1/25/2017 Hospital 12
1/25/2017 Bar 10
Sample output:
Category SCORE Percnt rank
1/24/2017 SuperMarket 12 0
1/24/2017 Resteraunt 21 0.2
1/24/2017 SuperMarket 13 0.333
1/24/2017 SuperMarket 22 0.666
1/24/2017 Resteraunt 27 0.6
1/24/2017 SuperMarket 30 1
1/24/2017 Resteraunt 34 1
1/24/2017 Resteraunt 32 0.8
1/24/2017 Resteraunt 21 0.2
1/24/2017 Resteraunt 12 0
1/24/2017 Bar 10 0.5
1/24/2017 Bar 3 0
1/24/2017 Bar 24 1
1/25/2017 Resteraunt 32 0
1/25/2017 Resteraunt 63 1
1/25/2017 Resteraunt 32 0
1/25/2017 Bar 12 0.5
1/25/2017 Bar 32 1
1/25/2017 Hospital 22 1
1/25/2017 Hospital 12 0
1/25/2017 Bar 10 0
The real dataset contains many more dates, each with its own entries.
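To experiment with the approaches discussed here, the sample frame can be rebuilt roughly as follows. This is only a sketch: the dates are kept as plain strings in a non-unique index, which is an assumption about how the real data is stored, and `sizes` is a name introduced purely for illustration.

```python
import pandas as pd

# Rows copied from the sample data in the question.
dates = ["1/24/2017"] * 13 + ["1/25/2017"] * 8
categories = [
    "SuperMarket", "Resteraunt", "SuperMarket", "SuperMarket", "Resteraunt",
    "SuperMarket", "Resteraunt", "Resteraunt", "Resteraunt", "Resteraunt",
    "Bar", "Bar", "Bar",
    "Resteraunt", "Resteraunt", "Resteraunt", "Bar", "Bar",
    "Hospital", "Hospital", "Bar",
]
scores = [12, 21, 13, 22, 27, 30, 34, 32, 21, 12, 10, 3, 24,
          32, 63, 32, 12, 32, 22, 12, 10]

# The dates become a non-unique index, as described in the question.
df = pd.DataFrame({"Category": categories, "SCORE": scores}, index=dates)

# reset_index() exposes the unnamed index as a column called 'index',
# so each (date, category) pair can be used as a group key.
sizes = df.reset_index().groupby(["index", "Category"]).size()
print(sizes)
```

Each (date, category) pair listed in `sizes` is one group that gets ranked on its own.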
You can use groupby with rank and divide by nunique. To start from 0, subtract 1 from both:
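# transform (rather than apply) returns the result in the original row order,
# so .values lines up with df even on recent pandas versions
df['Percnt rank'] = df.reset_index() \
                      .groupby(['index','Category'])['SCORE'] \
                      .transform(lambda x: (x.rank(method='dense') - 1) / (x.nunique() - 1)) \
                      .values
print (df)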
Category SCORE Percnt rank
1/24/2017 SuperMarket 12 0.000000
1/24/2017 Resteraunt 21 0.250000
1/24/2017 SuperMarket 13 0.333333
1/24/2017 SuperMarket 22 0.666667
1/24/2017 Resteraunt 27 0.500000
1/24/2017 SuperMarket 30 1.000000
1/24/2017 Resteraunt 34 1.000000
1/24/2017 Resteraunt 32 0.750000
1/24/2017 Resteraunt 21 0.250000
1/24/2017 Resteraunt 12 0.000000
1/24/2017 Bar 10 0.500000
1/24/2017 Bar 3 0.000000
1/24/2017 Bar 24 1.000000
1/25/2017 Resteraunt 32 0.000000
1/25/2017 Resteraunt 63 1.000000
1/25/2017 Resteraunt 32 0.000000
1/25/2017 Bar 12 0.500000
1/25/2017 Bar 32 1.000000
1/25/2017 Hospital 22 1.000000
1/25/2017 Hospital 12 0.000000
1/25/2017 Bar 10 0.000000
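Since the question asks for reasonable efficiency on a large dataset, the same calculation can probably be written without a Python lambda per group, using the built-in groupby `rank` and `transform('nunique')`. The sketch below uses a small hypothetical frame (the categories A, B, C and dates d1, d2 are made up). Mapping groups with a single distinct score to 0 is an assumption added here, because dividing by `nunique() - 1` would otherwise give 0/0 for them.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group (d2, C) has a single row, group (d1, A) has a tie.
df = pd.DataFrame(
    {"Category": ["A", "A", "A", "B", "B", "C"],
     "SCORE": [10, 20, 20, 5, 7, 9]},
    index=["d1", "d1", "d1", "d1", "d1", "d2"],
)

grouped = df.reset_index().groupby(["index", "Category"])["SCORE"]
dense = grouped.rank(method="dense")      # 1, 2, 3, ... per group; ties share a rank
distinct = grouped.transform("nunique")   # number of distinct scores in each row's group

# Assumption: a group with one distinct value gets 0 instead of NaN (0/0).
denominator = (distinct - 1).replace(0, np.nan)
df["Percnt rank"] = ((dense - 1) / denominator).fillna(0).to_numpy()
print(df)
```

Both `rank` and `transform` return one value per original row in the original order, which is why the result can be assigned back positionally.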
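With rank(method='dense', pct=True) on its own the output is different, because the ranks no longer start at 0. (The figures below are what was printed at the time; recent pandas releases divide dense ranks by the number of distinct values in the group, so some values may differ.)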
df['Percnt rank'] = df.reset_index()\
.groupby(['index','Category'])['SCORE'].rank(method='dense', pct=True)\
.values
print (df)
Category SCORE Percnt rank
1/24/2017 SuperMarket 12 0.250000
1/24/2017 Resteraunt 21 0.333333
1/24/2017 SuperMarket 13 0.500000
1/24/2017 SuperMarket 22 0.750000
1/24/2017 Resteraunt 27 0.500000
1/24/2017 SuperMarket 30 1.000000
1/24/2017 Resteraunt 34 0.833333
1/24/2017 Resteraunt 32 0.666667
1/24/2017 Resteraunt 21 0.333333
1/24/2017 Resteraunt 12 0.166667
1/24/2017 Bar 10 0.666667
1/24/2017 Bar 3 0.333333
1/24/2017 Bar 24 1.000000
1/25/2017 Resteraunt 32 0.333333
1/25/2017 Resteraunt 63 0.666667
1/25/2017 Resteraunt 32 0.333333
1/25/2017 Bar 12 0.666667
1/25/2017 Bar 32 1.000000
1/25/2017 Hospital 22 1.000000
1/25/2017 Hospital 12 0.500000
1/25/2017 Bar 10 0.333333
Use a custom function that computes rank(method='dense', pct=True) over everything except the minimum value, then fills the minimum back in with 0:
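import pandas as pd

def prank(s):
    # rank every value except the group minimum; the minimum is filled with 0
    mask = s.values != s.values.min()
    r = pd.Series(index=s.index, dtype=float)
    r.loc[mask] = s.loc[mask].rank(method='dense', pct=True)
    return r.fillna(0)

# transform keeps the original row order, so .values lines up with df
df.assign(**{'Percent rank': df.reset_index().groupby(['index', 'Category']).SCORE.transform(prank).values})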
Category SCORE Percent rank
1/24/2017 SuperMarket 12 0.000000
1/24/2017 Resteraunt 21 0.200000
1/24/2017 SuperMarket 13 0.333333
1/24/2017 SuperMarket 22 0.666667
1/24/2017 Resteraunt 27 0.400000
1/24/2017 SuperMarket 30 1.000000
1/24/2017 Resteraunt 34 0.800000
1/24/2017 Resteraunt 32 0.600000
1/24/2017 Resteraunt 21 0.200000
1/24/2017 Resteraunt 12 0.000000
1/24/2017 Bar 10 0.500000
1/24/2017 Bar 3 0.000000
1/24/2017 Bar 24 1.000000
1/25/2017 Resteraunt 32 0.000000
1/25/2017 Resteraunt 63 1.000000
1/25/2017 Resteraunt 32 0.500000
1/25/2017 Bar 12 0.500000
1/25/2017 Bar 32 1.000000
1/25/2017 Hospital 22 1.000000
1/25/2017 Hospital 12 0.000000
1/25/2017 Bar 10 0.000000
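The sample output in the question appears to follow the spreadsheet-style definition of percent rank: the number of scores in the group that are strictly smaller, divided by the group size minus one. A minimal sketch of that variant, using only the Resteraunt rows from the sample data, could look like the following. The names `below` and `others` and the choice to return 0 for single-row groups are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Resteraunt rows from the question's sample data.
df = pd.DataFrame(
    {"Category": ["Resteraunt"] * 9,
     "SCORE": [21, 27, 34, 32, 21, 12, 32, 63, 32]},
    index=["1/24/2017"] * 6 + ["1/25/2017"] * 3,
)

grouped = df.reset_index().groupby(["index", "Category"])["SCORE"]
below = grouped.rank(method="min") - 1    # scores in the group strictly smaller than this one
others = grouped.transform("count") - 1   # group size minus the row itself

# Assumption: a single-row group gets 0 instead of NaN (0/0).
df["pctrank"] = (below / others.replace(0, np.nan)).fillna(0).to_numpy()
print(df)
```

For these rows the result is 0.2, 0.6, 1.0, 0.8, 0.2, 0.0 on 1/24/2017 and 0.0, 1.0, 0.0 on 1/25/2017, which matches the Resteraunt values in the question's sample output.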