Optimal way to acquire percentiles of DataFrame rows
Question
I have a pandas DataFrame df:
year  val0     val1      val2     ...  val98   val99
1983  -42.187  15.213    -32.185  ...  12.887  -33.821
1984  39.213   -142.344  23.221   ...  0.230   1.000
1985  -31.204  0.539     2.000    ...  -1.000  3.442
...
2007  4.239    5.648     -15.483  ...  3.794   -25.459
2008  6.431    0.831     -34.210  ...  0.000   24.527
2009  -0.160   2.639     -2.196   ...  52.628  71.291
The output I want, new_df, contains 9 different percentiles, including the median, and should have the following format:
year percentile_10 percentile_20 percentile_30 percentile_40 median percentile_60 percentile_70 percentile_80 percentile_90
1983 -40.382 -33.182 -25.483 -21.582 -14.424 -9.852 -3.852 6.247 10.528
...
2009 -3.248 0.412 6.672 10.536 12.428 20.582 46.248 52.837 78.991
Attempt
Here is my initial attempt:
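import numpy as np
import pandas as pd

def percentile(n):
    # Build an aggregation function that returns the n-th percentile.
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'percentile_%s' % n
    return percentile_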
new_df = df.groupby('year').agg([percentile(10), percentile(20), percentile(30), percentile(40), np.median, percentile(60), percentile(70), percentile(80), percentile(90)]).reset_index()
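However, instead of returning percentiles across all columns, it calculated these percentiles for each val column separately and therefore returned 1000 columns. And because it calculated the percentiles per val column, all of the percentiles came back with the same value.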
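The behaviour described above can be reproduced on a small scale. The sketch below uses a hypothetical three-column frame named toy (not the asker's data) and two hypothetical helpers, p10 and p90. It assumes, as in the sample shown in the question, that each year appears in exactly one row, so every group passed to an aggregation function holds a single value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per year, three value columns.
toy = pd.DataFrame({
    "year": [1983, 1984],
    "val0": [1.0, 10.0],
    "val1": [2.0, 20.0],
    "val2": [3.0, 30.0],
})

def p10(x):
    return np.percentile(x, 10)

def p90(x):
    return np.percentile(x, 90)

# groupby('year') aggregates each val column on its own. With one row per
# year, each group is a single number, so every percentile equals it.
per_column = toy.groupby("year").agg([p10, p90])

print(per_column.shape)  # 3 value columns x 2 aggregations = 6 columns
print(per_column)
```

With 9 aggregation functions instead of 2, the column count grows to 9 times the number of val columns, which is why a row-wise (axis=1) computation is needed rather than a groupby.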
I still managed to get the desired result by running the following:
list_1 = []
list_2 = []
list_3 = []
list_4 = []
mlist = []
list_6 = []
list_7 = []
list_8 = []
list_9 = []
for i in range(len(df)):
    list_1.append(np.percentile(df.iloc[i,1:],10))
    list_2.append(np.percentile(df.iloc[i,1:],20))
    list_3.append(np.percentile(df.iloc[i,1:],30))
    list_4.append(np.percentile(df.iloc[i,1:],40))
    mlist.append(np.median(df.iloc[i,1:]))
    list_6.append(np.percentile(df.iloc[i,1:],60))
    list_7.append(np.percentile(df.iloc[i,1:],70))
    list_8.append(np.percentile(df.iloc[i,1:],80))
    list_9.append(np.percentile(df.iloc[i,1:],90))
df['percentile_10'] = list_1
df['percentile_20'] = list_2
df['percentile_30'] = list_3
df['percentile_40'] = list_4
df['median'] = mlist
df['percentile_60'] = list_6
df['percentile_70'] = list_7
df['percentile_80'] = list_8
df['percentile_90'] = list_9
new_df= df[['year', 'percentile_10','percentile_20','percentile_30','percentile_40','median','percentile_60','percentile_70','percentile_80','percentile_90']]
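But this is obviously a laborious, manual, and one-dimensional way of accomplishing the task. What is the best way to find the percentiles of each row across multiple columns?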
You can use the .describe() function like this:
import numpy as np
import pandas as pd

# Create a DataFrame
df = pd.DataFrame(np.random.randn(5,3))
# Apply .describe() to each row by passing axis=1
df.apply(pd.DataFrame.describe, axis=1)
Output:
count mean std min 25% 50% 75% max
0 3.0 0.422915 1.440097 -0.940519 -0.330152 0.280215 1.104632 1.929049
1 3.0 1.615037 0.766079 0.799817 1.262538 1.725259 2.022647 2.320036
2 3.0 0.221560 0.700770 -0.585020 -0.008149 0.568721 0.624849 0.680978
3 3.0 -0.119638 0.182402 -0.274168 -0.220240 -0.166312 -0.042373 0.081565
4 3.0 -0.569942 0.807865 -1.085838 -1.035455 -0.985072 -0.311994 0.361084
If you want percentiles other than the defaults 0.25, 0.5, 0.75, you can write your own function that changes the percentiles argument, i.e. .describe(percentiles = [0.1, 0.2...., 0.9]).
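A minimal sketch of that suggestion follows. The frame toy, the list deciles, and the helper describe_deciles are hypothetical names introduced only for illustration. The sketch assumes year is moved to the index first so that it is not included in the statistics.

```python
import pandas as pd

# Hypothetical toy data standing in for the question's df.
toy = pd.DataFrame({
    "year": [1983, 1984],
    "val0": [1.0, 10.0],
    "val1": [2.0, 20.0],
    "val2": [3.0, 30.0],
})

deciles = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

def describe_deciles(row):
    # Each row arrives as a Series; describe it with custom percentiles.
    return row.describe(percentiles=deciles)

# Move year to the index so only the val columns are summarised.
summary = toy.set_index("year").apply(describe_deciles, axis=1)
print(summary)
```

Note that describe() also returns count, mean, std, min and max, and labels the percentile columns as '10%', '20%' and so on, so some column selection and renaming would still be needed to reach the layout requested in the question.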
Use DataFrame.quantile: convert year to the index first, transpose at the end, and rename the columns with a custom lambda function:
a = np.arange(1, 10) / 10
f = lambda x: f'percentile_{int(x * 100)}' if x != 0.5 else 'median'
new_df = df.set_index('year').quantile(a, axis=1).T.rename(columns=f)
print(new_df)
percentile_10 percentile_20 percentile_30 percentile_40 median \
year
1983 -38.8406 -35.4942 -33.4938 -32.8394 -32.185
1984 -85.3144 -28.2848 0.3840 0.6920 1.000
1985 -19.1224 -7.0408 -0.6922 -0.0766 0.539
2007 -21.4686 -17.4782 -11.6276 -3.9168 3.794
2008 -20.5260 -6.8420 0.1662 0.4986 0.831
2009 -1.3816 -0.5672 0.3998 1.5194 2.639
percentile_60 percentile_70 percentile_80 percentile_90
year
1983 -14.1562 3.8726 13.3522 14.2826
1984 9.8884 18.7768 26.4194 32.8162
1985 1.1234 1.7078 2.2884 2.8652
2007 3.9720 4.1500 4.5208 5.0844
2008 3.0710 5.3110 10.0502 17.2886
2009 22.6346 42.6302 56.3606 63.8258
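For comparison, the same row-wise result can be sketched directly with np.percentile and axis=1. This is an alternative illustration, not part of the answer above. The frame toy and the names qs, names, values and result are hypothetical. Building the column names from integer percentiles avoids matching on floating-point keys when renaming.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the question's df.
toy = pd.DataFrame({
    "year": [1983, 1984],
    "val0": [1.0, 10.0],
    "val1": [2.0, 20.0],
    "val2": [3.0, 30.0],
})

qs = np.arange(10, 100, 10)  # 10, 20, ..., 90
names = ["median" if q == 50 else f"percentile_{q}" for q in qs]

values = toy.set_index("year")

# With axis=1, np.percentile returns an array of shape (len(qs), n_rows),
# so transpose it to get one row per year.
result = pd.DataFrame(
    np.percentile(values.to_numpy(), qs, axis=1).T,
    index=values.index,
    columns=names,
)
print(result)
```

Both np.percentile and DataFrame.quantile use linear interpolation by default, so on this toy frame the values agree with values.quantile(qs / 100, axis=1).T.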