Python - Creating 2 new columns with the 25th and 75th percentile of several row values

This is what my df looks like (with more rows and more columns):

Index WTG1 WTG2 WTG3
1.5 61.25 -7.57 7.18
2 19.69 25.95 28.67
2.5 59.51 81.22 78.22
3 131.81 154.07 142.92

My objective is to get:

Index WTG1 WTG2 WTG3 25th Percentile 75th Percentile Mean
1.5 61.25 -7.57 7.18 (25th Percentile of 61.25, -7.57, 7.18) (75th Percentile of 61.25, -7.57, 7.18) (Avg. of 61.25, -7.57, 7.18)
2 19.69 25.95 28.67 (25th Percentile of 19.69, 25.95, 28.67) (75th Percentile of 19.69, 25.95, 28.67) (Avg. of 19.69, 25.95, 28.67)
2.5 59.51 81.22 78.22 (25th Percentile of 59.51, 81.22, 78.22) (75th Percentile of 59.51, 81.22, 78.22) (Avg. of 59.51, 81.22, 78.22)
3 131.81 154.07 142.92 (25th Percentile of 131.81, 154.07, 142.92) (75th Percentile of 131.81, 154.07, 142.92) (Avg. of 131.81, 154.07, 142.92)

I have been searching for a long time, and this is the best I could come up with:

df['mean'] = df[['WTG1','WTG2','WTG3']].mean(axis=1)
df['25th Percentile'] = np.nanpercentile(df[['WTG1','WTG2','WTG3']],25)
df['75th Percentile'] = np.nanpercentile(df[['WTG1','WTG2','WTG3']],75)

The mean seems to work, although I have not checked the values yet.

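But the percentiles are the real problem... It seems the nanpercentile function only works on columns. It returns the same value in every row of both percentile columns (I guess these are the respective 25th and 75th percentile values, but over the whole df), which is not what I am trying to do.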
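A likely explanation for the identical values: when no axis is given, np.nanpercentile computes one percentile over all the values in the array, and that single scalar is then copied into every row. Passing axis=1 asks for one percentile per row instead. Here is a minimal sketch, assuming the four sample rows shown above; the names cols and overall_25 are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (the real df has more rows and columns).
df = pd.DataFrame({'WTG1': [61.25, 19.69, 59.51, 131.81],
                   'WTG2': [-7.57, 25.95, 81.22, 154.07],
                   'WTG3': [7.18, 28.67, 78.22, 142.92]},
                  index=[1.5, 2, 2.5, 3])
cols = ['WTG1', 'WTG2', 'WTG3']  # illustrative helper name

# No axis: a single percentile over all 12 values, i.e. one scalar.
overall_25 = np.nanpercentile(df[cols], 25)

# axis=1: one percentile per row, ignoring NaN values.
df['25th Percentile'] = np.nanpercentile(df[cols], 25, axis=1)
df['75th Percentile'] = np.nanpercentile(df[cols], 75, axis=1)
print(df)
```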

I found some alternatives but could not adapt them to my needs, for example:

perc75 = np.vectorize(lambda x: np.percentile(x, 75))
df['75th_percentile'] = perc75(df['WTG01'].values)

This works, but only for one column.

df['25th_percentile'] = df['WTG1','WTG2','WTG3'].apply(lambda x: np.percentile(x, 25))

This does not work...
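Two things probably stop the apply attempt from working. First, df['WTG1','WTG2','WTG3'] is read as a lookup of one column keyed by a tuple, so selecting several columns needs a list inside the brackets. Second, apply works column by column unless axis=1 is passed. A sketch under those assumptions, again using the sample data from the question:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({'WTG1': [61.25, 19.69, 59.51, 131.81],
                   'WTG2': [-7.57, 25.95, 81.22, 154.07],
                   'WTG3': [7.18, 28.67, 78.22, 142.92]},
                  index=[1.5, 2, 2.5, 3])
cols = ['WTG1', 'WTG2', 'WTG3']  # illustrative helper name

# A list of names selects several columns; axis=1 hands each row to the lambda.
df['25th_percentile'] = df[cols].apply(lambda row: np.percentile(row, 25), axis=1)
df['75th_percentile'] = df[cols].apply(lambda row: np.percentile(row, 75), axis=1)
print(df)
```

Note that np.percentile does not skip NaN values; if the rows may contain gaps, np.nanpercentile would be the safer choice inside the lambda.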

I think you can transpose the DataFrame and apply df.describe():

import pandas as pd
df = pd.DataFrame({'WTG1': [61.25, 19.69, 59.51, 131.81],
                   'WTG2': [-7.57, 25.95, 81.22, 154.07],
                   'WTG3': [7.18, 28.67, 78.22, 142.92]
                   })
print(df)
print(df.T)

Output

     WTG1    WTG2    WTG3
0   61.25   -7.57    7.18
1   19.69   25.95   28.67
2   59.51   81.22   78.22
3  131.81  154.07  142.92

          0      1      2       3
WTG1  61.25  19.69  59.51  131.81
WTG2  -7.57  25.95  81.22  154.07
WTG3   7.18  28.67  78.22  142.92

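In the question you are trying to get statistics for each row. After transposing the DataFrame you can look at the columns instead, which makes it easy to get summary statistics for every column: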

print(df.T.describe())

Output

               0          1          2           3
count   3.000000   3.000000   3.000000    3.000000
mean   20.286667  24.770000  72.983333  142.933333
std    36.233778   4.604824  11.764269   11.130006
min    -7.570000  19.690000  59.510000  131.810000
25%    -0.195000  22.820000  68.865000  137.365000
50%     7.180000  25.950000  78.220000  142.920000
75%    34.215000  27.310000  79.720000  148.495000
max    61.250000  28.670000  81.220000  154.070000
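To get from this summary to the layout asked for in the question, one option is to transpose the describe() result back and join the wanted statistics onto the original frame. This is only a sketch; the names stats and result and the new column labels are my own choices.

```python
import pandas as pd

df = pd.DataFrame({'WTG1': [61.25, 19.69, 59.51, 131.81],
                   'WTG2': [-7.57, 25.95, 81.22, 154.07],
                   'WTG3': [7.18, 28.67, 78.22, 142.92]
                   })

# describe() on the transposed frame gives one column of statistics per original row;
# transposing again puts those statistics back on the original row index.
stats = df.T.describe().T[['25%', '75%', 'mean']]
stats = stats.rename(columns={'25%': '25th Percentile',
                              '75%': '75th Percentile',
                              'mean': 'Mean'})

# Join on the shared index to append the three new columns.
result = df.join(stats)
print(result)
```

If only the percentiles are needed, df.quantile([0.25, 0.75], axis=1).T should give the same two columns without computing the rest of the summary.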