Optimal way to acquire percentiles of DataFrame rows

Question

I have a pandas DataFrame df:

year        val0        val1        val2         ...          val98         val99
1983        -42.187     15.213      -32.185                   12.887        -33.821
1984        39.213      -142.344    23.221                    0.230         1.000
1985        -31.204     0.539       2.000                     -1.000        3.442
...
2007        4.239       5.648       -15.483                   3.794         -25.459
2008        6.431       0.831       -34.210                   0.000         24.527
2009        -0.160      2.639       -2.196                    52.628        71.291

My desired output, new_df, contains 9 different percentiles (including the median) and should have the following format:

year    percentile_10    percentile_20    percentile_30    percentile_40    median    percentile_60    percentile_70    percentile_80    percentile_90
1983    -40.382          -33.182          -25.483          -21.582          -14.424   -9.852           -3.852           6.247            10.528
...
2009    -3.248           0.412            6.672            10.536           12.428    20.582           46.248           52.837           78.991

Attempt

Here is my initial attempt:

def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'percentile_%s' % n
    return percentile_

new_df = df.groupby('year').agg([percentile(10), percentile(20), percentile(30), percentile(40), np.median, percentile(60), percentile(70), percentile(80), percentile(90)]).reset_index()

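However, instead of returning percentiles across all columns, it calculated these percentiles for each val column separately and therefore returned 1000 columns. Because the percentiles were calculated per val column, all of them returned the same value.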
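A small reproduction may make this behaviour easier to see. The sketch below assumes a hypothetical three-row frame with one row per year and only two val columns. groupby('year').agg(...) applies every function to each val column within each group, and since each group then holds a single value per column, every percentile of that one value is the value itself.

```python
import numpy as np
import pandas as pd


def percentile(n):
    """Build a named aggregation function that returns the n-th percentile."""
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = f"percentile_{n}"
    return percentile_


# Hypothetical toy frame (not the real data): one row per year.
df = pd.DataFrame({
    "year": [1983, 1984, 1985],
    "val0": [-42.187, 39.213, -31.204],
    "val1": [15.213, -142.344, 0.539],
})

# Each function is applied to each val column inside each year group,
# so the result has (number of val columns) x (number of functions) columns.
per_column = df.groupby("year").agg([percentile(10), percentile(90)])
print(per_column)
```

With 100 val columns and 9 functions the same pattern produces one output column per (val, percentile) pair, which is why the result is so wide.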

I still managed to get the desired result with the following:

list_1 = []
list_2 = []
list_3 = []
list_4 = []
mlist = []
list_6 = []
list_7 = []
list_8 = []
list_9 = []

for i in range(len(df)):
  list_1.append(np.percentile(df.iloc[i,1:],10))
  list_2.append(np.percentile(df.iloc[i,1:],20))
  list_3.append(np.percentile(df.iloc[i,1:],30))
  list_4.append(np.percentile(df.iloc[i,1:],40))
  mlist.append(np.median(df.iloc[i,1:]))
  list_6.append(np.percentile(df.iloc[i,1:],60))
  list_7.append(np.percentile(df.iloc[i,1:],70))
  list_8.append(np.percentile(df.iloc[i,1:],80))
  list_9.append(np.percentile(df.iloc[i,1:],90))

df['percentile_10'] = list_1
df['percentile_20'] = list_2
df['percentile_30'] = list_3
df['percentile_40'] = list_4
df['median'] = mlist
df['percentile_60'] = list_6
df['percentile_70'] = list_7
df['percentile_80'] = list_8
df['percentile_90'] = list_9

new_df= df[['year', 'percentile_10','percentile_20','percentile_30','percentile_40','median','percentile_60','percentile_70','percentile_80','percentile_90']]

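But this is clearly a laborious, manual, one-dimensional way of accomplishing the task. What is the optimal way to find the percentiles of each row across multiple columns?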

You can use the .describe() function like this:

import numpy as np
import pandas as pd

# Create a DataFrame
df = pd.DataFrame(np.random.randn(5, 3))
# Apply .describe() to each row (axis=1); each row is passed as a Series
df.apply(pd.Series.describe, axis=1)

Output:

   count      mean       std       min       25%       50%       75%       max
0    3.0  0.422915  1.440097 -0.940519 -0.330152  0.280215  1.104632  1.929049
1    3.0  1.615037  0.766079  0.799817  1.262538  1.725259  2.022647  2.320036
2    3.0  0.221560  0.700770 -0.585020 -0.008149  0.568721  0.624849  0.680978
3    3.0 -0.119638  0.182402 -0.274168 -0.220240 -0.166312 -0.042373  0.081565
4    3.0 -0.569942  0.807865 -1.085838 -1.035455 -0.985072 -0.311994  0.361084

If you want percentiles other than the defaults (0.25, 0.5, 0.75), you can write your own function that changes the values passed to .describe(percentiles=[0.1, 0.2, ..., 0.9]).

Use DataFrame.quantile after converting year to the index, then transpose and rename the columns with a custom lambda function:

a = np.arange(1, 10) / 10
f = lambda x: f'percentile_{int(x * 100)}' if x != 0.5 else 'median'
new_df = df.set_index('year').quantile(a, axis=1).T.rename(columns=f)
print (new_df)
      percentile_10  percentile_20  percentile_30  percentile_40  median  \
year                                                                       
1983       -38.8406       -35.4942       -33.4938       -32.8394 -32.185   
1984       -85.3144       -28.2848         0.3840         0.6920   1.000   
1985       -19.1224        -7.0408        -0.6922        -0.0766   0.539   
2007       -21.4686       -17.4782       -11.6276        -3.9168   3.794   
2008       -20.5260        -6.8420         0.1662         0.4986   0.831   
2009        -1.3816        -0.5672         0.3998         1.5194   2.639   

      percentile_60  percentile_70  percentile_80  percentile_90  
year                                                              
1983       -14.1562         3.8726        13.3522        14.2826  
1984         9.8884        18.7768        26.4194        32.8162  
1985         1.1234         1.7078         2.2884         2.8652  
2007         3.9720         4.1500         4.5208         5.0844  
2008         3.0710         5.3110        10.0502        17.2886  
2009        22.6346        42.6302        56.3606        63.8258
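For anyone who wants to reproduce this, here is a self-contained sketch. It assumes a toy frame built from the first three rows and the five val columns visible in the question's sample (renamed val0 to val4 here), which appears to be the data behind the output above. The label helper and the NumPy cross-check are additions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Toy frame: the five values shown per row in the question, columns renamed.
df = pd.DataFrame({
    "year": [1983, 1984, 1985],
    "val0": [-42.187, 39.213, -31.204],
    "val1": [15.213, -142.344, 0.539],
    "val2": [-32.185, 23.221, 2.000],
    "val3": [12.887, 0.230, -1.000],
    "val4": [-33.821, 1.000, 3.442],
})

quantiles = np.arange(1, 10) / 10  # 0.1, 0.2, ..., 0.9


def label(q):
    """Map a quantile to its column name; rounding guards against float error."""
    pct = int(round(q * 100))
    return "median" if pct == 50 else f"percentile_{pct}"


# quantile(axis=1) works across the columns of each row; the result has one
# row per quantile, so transpose it to get one row per year.
new_df = df.set_index("year").quantile(quantiles, axis=1).T.rename(columns=label)
print(new_df)

# Cross-check with NumPy: np.percentile takes percentages (0 to 100) and,
# with axis=1, returns an array of shape (n_percentiles, n_rows).
values = df.drop(columns="year").to_numpy()
via_numpy = np.percentile(values, quantiles * 100, axis=1).T
print(np.allclose(new_df.to_numpy(), via_numpy))
```

Both routes use linear interpolation by default, so they should agree. The pandas version keeps the year index and column labels, while the NumPy version returns a bare array.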