计算由另一列值分组的每个列的百分位值 - Pandas 数据框
calculating percentile values for each columns group by another column values - Pandas dataframe
我有一个如下所示的数据框 -
Year Salary Amount
0 2019 1200 53
1 2020 3443 455
2 2021 6777 123
3 2019 5466 313
4 2020 4656 545
5 2021 4565 775
6 2019 4654 567
7 2020 7867 657
8 2021 6766 567
Python 获取下面数据框的脚本 -
import pandas as pd
import numpy as np
d = pd.DataFrame({
'Year': [
2019,
2020,
2021,
] * 3,
'Salary': [
1200,
3443,
6777,
5466,
4656,
4565,
4654,
7867,
6766
],
'Amount': [
53,
455,
123,
313,
545,
775,
567,
657,
567
]
})
我想为按 'Year' 分组的所有列计算某些百分位数值。
所需的输出应类似于 -
我在 运行 下面 python 脚本执行计算以计算某些百分位值-
df_percentile = pd.DataFrame()
p_list = [0.05, 0.10, 0.25, 0.50, 0.75, 0.95, 0.99]
c_list = []
p_values = []
for cols in d.columns[1:]:
for p in p_list:
c_list.append(cols + '_' + str(p))
p_values.append(np.percentile(d[cols], p))
print(len(c_list), len(p_values))
df_percentile['Name'] = pd.Series(c_list)
df_percentile['Value'] = pd.Series(p_values)
print(df_percentile)
输出 -
Name Value
0 Salary_0.05 1208.9720
1 Salary_0.1 1217.9440
2 Salary_0.25 1244.8600
3 Salary_0.5 1289.7200
4 Salary_0.75 1334.5800
5 Salary_0.95 1370.4680
6 Salary_0.99 1377.6456
7 Amount_0.05 53.2800
8 Amount_0.1 53.5600
9 Amount_0.25 54.4000
10 Amount_0.5 55.8000
11 Amount_0.75 57.2000
12 Amount_0.95 58.3200
13 Amount_0.99 58.5440
如何才能获得所需格式的输出而无需执行额外数据 manipulation/formatting 或减少代码行数?
您可以尝试 pivot
然后 quantile
:
(df.pivot(columns='Year')
.quantile([0.01,0.05,0.75, 0.95, 0.99])
.stack('Year')
)
输出:
Salary Amount
Year
0.01 2019 1269.08 58.20
2020 3467.26 456.80
2021 4609.02 131.88
0.05 2019 1545.40 79.00
2020 3564.30 464.00
2021 4785.10 167.40
0.75 2019 5060.00 440.00
2020 6261.50 601.00
2021 6771.50 671.00
0.95 2019 5384.80 541.60
2020 7545.90 645.80
2021 6775.90 754.20
0.99 2019 5449.76 561.92
2020 7802.78 654.76
2021 6776.78 770.84
我有一个如下所示的数据框 -
Year Salary Amount
0 2019 1200 53
1 2020 3443 455
2 2021 6777 123
3 2019 5466 313
4 2020 4656 545
5 2021 4565 775
6 2019 4654 567
7 2020 7867 657
8 2021 6766 567
Python 获取下面数据框的脚本 -
import pandas as pd
import numpy as np
d = pd.DataFrame({
'Year': [
2019,
2020,
2021,
] * 3,
'Salary': [
1200,
3443,
6777,
5466,
4656,
4565,
4654,
7867,
6766
],
'Amount': [
53,
455,
123,
313,
545,
775,
567,
657,
567
]
})
我想为按 'Year' 分组的所有列计算某些百分位数值。 所需的输出应类似于 -
我在 运行 下面 python 脚本执行计算以计算某些百分位值-
df_percentile = pd.DataFrame()
p_list = [0.05, 0.10, 0.25, 0.50, 0.75, 0.95, 0.99]
c_list = []
p_values = []
for cols in d.columns[1:]:
for p in p_list:
c_list.append(cols + '_' + str(p))
p_values.append(np.percentile(d[cols], p))
print(len(c_list), len(p_values))
df_percentile['Name'] = pd.Series(c_list)
df_percentile['Value'] = pd.Series(p_values)
print(df_percentile)
输出 -
Name Value
0 Salary_0.05 1208.9720
1 Salary_0.1 1217.9440
2 Salary_0.25 1244.8600
3 Salary_0.5 1289.7200
4 Salary_0.75 1334.5800
5 Salary_0.95 1370.4680
6 Salary_0.99 1377.6456
7 Amount_0.05 53.2800
8 Amount_0.1 53.5600
9 Amount_0.25 54.4000
10 Amount_0.5 55.8000
11 Amount_0.75 57.2000
12 Amount_0.95 58.3200
13 Amount_0.99 58.5440
如何才能获得所需格式的输出而无需执行额外数据 manipulation/formatting 或减少代码行数?
您可以尝试 pivot
然后 quantile
:
(df.pivot(columns='Year')
.quantile([0.01,0.05,0.75, 0.95, 0.99])
.stack('Year')
)
输出:
Salary Amount
Year
0.01 2019 1269.08 58.20
2020 3467.26 456.80
2021 4609.02 131.88
0.05 2019 1545.40 79.00
2020 3564.30 464.00
2021 4785.10 167.40
0.75 2019 5060.00 440.00
2020 6261.50 601.00
2021 6771.50 671.00
0.95 2019 5384.80 541.60
2020 7545.90 645.80
2021 6775.90 754.20
0.99 2019 5449.76 561.92
2020 7802.78 654.76
2021 6776.78 770.84