Improving the performance of groupby with a custom apply function
I have sales data of the following form:
ID,year,sales,bonus
1,2015,1000,500
1,2015,1400,590
1,2016,1100,200
1,2017,1200,800
1,2017,1700,300
1,2017,1900,510
1,2018,2000,560
1,2018,1700,600
2,2015,2000,400
2,2015,1450,580
2,2015,2100,300
2,2017,1400,770
2,2017,2700,330
2,2018,3900,610
2,2018,2000,530
2,2018,1700,700
3,2015,2900,406
3,2015,1450,580
3,2015,2100,300
3,2017,1450,777
3,2018,3100,330
3,2018,3900,610
3,2019,2000,530
3,2019,1900,730
I want to include new columns capturing the 10th and 90th percentiles of the sales and bonus columns for each ID in each year (I appreciate this doesn't make much sense on this simplified dataset).
I used the following approach:
import pandas as pd

def aggs(df, names):
    for name in names:
        df[name + '90'] = df[name].quantile(.90)
        df[name + '10'] = df[name].quantile(.10)
    return df

sales = pd.read_csv("sales.csv")
names = ['sales', 'bonus']
sales2 = sales.groupby(['year', 'ID'], as_index=False)
sales2 = sales2.apply(aggs, names)
This produces:
ID year sales bonus sales90 sales10 bonus90 bonus10
0 1 2015 1000 500 1360.0 1040.0 581.0 509.0
1 1 2015 1400 590 1360.0 1040.0 581.0 509.0
2 1 2016 1100 200 1100.0 1100.0 200.0 200.0
3 1 2017 1200 800 1860.0 1300.0 742.0 342.0
4 1 2017 1700 300 1860.0 1300.0 742.0 342.0
5 1 2017 1900 510 1860.0 1300.0 742.0 342.0
6 1 2018 2000 560 1970.0 1730.0 596.0 564.0
7 1 2018 1700 600 1970.0 1730.0 596.0 564.0
8 2 2015 2000 400 2080.0 1560.0 544.0 320.0
9 2 2015 1450 580 2080.0 1560.0 544.0 320.0
10 2 2015 2100 300 2080.0 1560.0 544.0 320.0
11 2 2017 1400 770 2570.0 1530.0 726.0 374.0
12 2 2017 2700 330 2570.0 1530.0 726.0 374.0
13 2 2018 3900 610 3520.0 1760.0 682.0 546.0
14 2 2018 2000 530 3520.0 1760.0 682.0 546.0
15 2 2018 1700 700 3520.0 1760.0 682.0 546.0
16 3 2015 2900 406 2740.0 1580.0 545.2 321.2
17 3 2015 1450 580 2740.0 1580.0 545.2 321.2
18 3 2015 2100 300 2740.0 1580.0 545.2 321.2
19 3 2017 1450 777 1450.0 1450.0 777.0 777.0
20 3 2018 3100 330 3820.0 3180.0 582.0 358.0
21 3 2018 3900 610 3820.0 3180.0 582.0 358.0
22 3 2019 2000 530 1990.0 1910.0 710.0 550.0
Problem
The code runs as expected, but as the number of rows and columns grows, it has serious performance problems. I also get the warning:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
Can anyone suggest how I can improve this approach? Thanks!
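For reference, the pattern the warning recommends looks something like this (a sketch on made-up data, illustrating the single-`pd.concat` attach rather than fixing the groupby itself):

```python
import pandas as pd

# Hypothetical data; the point is the attach pattern, not the numbers.
df = pd.DataFrame({'sales': [1000, 1400], 'bonus': [500, 590]})

# Build all the new columns first...
new_cols = pd.DataFrame({
    'sales90': df['sales'].quantile(.90),
    'sales10': df['sales'].quantile(.10),
}, index=df.index)

# ...then attach them in one pd.concat instead of many frame.insert calls.
df = pd.concat([df, new_cols], axis=1)
```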
The df.quantile()
method can be passed a collection of quantiles to compute, and (by default) it returns them along the index: each quantile becomes a level of the index. So keep your groupby terms (year, ID) as the index, then unstack the quantile level to turn it into multi-level columns.
Then flatten the columns from multi-level to single-level and rename them.
merge()
joins easily on common columns, so use it with qs.reset_index()
to bring year and ID back as columns.
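As a minimal standalone illustration (not part of the solution itself), passing a list of quantiles puts the quantile values in the result's index:

```python
import pandas as pd

s = pd.Series([1000, 1400, 1900])
q = s.quantile([.90, .10])
# q is a Series indexed by the quantile values 0.9 and 0.1
```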
qs = sales.groupby(['year', 'ID']).quantile([.90, .10]).unstack()
qs.columns = [f'{c}{int(n*100)}' for c, n in qs.columns]
result = sales.merge(qs.reset_index())
The intermediate DataFrames look like this:
qs = sales.groupby(['year', 'ID']).quantile([.90, .10]).unstack()
# qs is:
sales bonus
0.9 0.1 0.9 0.1
year ID
2015 1 1360.0 1040.0 581.0 509.0
2 2080.0 1560.0 544.0 320.0
3 2740.0 1580.0 545.2 321.2
2016 1 1100.0 1100.0 200.0 200.0
2017 1 1860.0 1300.0 742.0 342.0
2 2570.0 1530.0 726.0 374.0
3 1450.0 1450.0 777.0 777.0
2018 1 1970.0 1730.0 596.0 564.0
2 3520.0 1760.0 682.0 546.0
3 3820.0 3180.0 582.0 358.0
2019 3 1990.0 1910.0 710.0 550.0
qs.columns = [f'{c}{int(n*100)}' for c, n in qs.columns]
# qs is now
sales90 sales10 bonus90 bonus10
year ID
2015 1 1360.0 1040.0 581.0 509.0
2 2080.0 1560.0 544.0 320.0
3 2740.0 1580.0 545.2 321.2
2016 1 1100.0 1100.0 200.0 200.0
2017 1 1860.0 1300.0 742.0 342.0
...
result = sales.merge(qs.reset_index())
# result:
year ID sales bonus sales90 sales10 bonus90 bonus10
0 2015 1 1000 500 1360.0 1040.0 581.0 509.0
1 2015 1 1400 590 1360.0 1040.0 581.0 509.0
2 2015 2 2000 400 2080.0 1560.0 544.0 320.0
3 2015 2 1450 580 2080.0 1560.0 544.0 320.0
4 2015 2 2100 300 2080.0 1560.0 544.0 320.0
5 2015 3 2900 406 2740.0 1580.0 545.2 321.2
6 2015 3 1450 580 2740.0 1580.0 545.2 321.2
7 2015 3 2100 300 2740.0 1580.0 545.2 321.2
8 2016 1 1100 200 1100.0 1100.0 200.0 200.0
9 2017 1 1200 800 1860.0 1300.0 742.0 342.0
10 2017 1 1700 300 1860.0 1300.0 742.0 342.0
11 2017 1 1900 510 1860.0 1300.0 742.0 342.0
12 2017 2 1400 770 2570.0 1530.0 726.0 374.0
13 2017 2 2700 330 2570.0 1530.0 726.0 374.0
14 2017 3 1450 777 1450.0 1450.0 777.0 777.0
15 2018 1 2000 560 1970.0 1730.0 596.0 564.0
16 2018 1 1700 600 1970.0 1730.0 596.0 564.0
17 2018 2 3900 610 3520.0 1760.0 682.0 546.0
18 2018 2 2000 530 3520.0 1760.0 682.0 546.0
19 2018 2 1700 700 3520.0 1760.0 682.0 546.0
20 2018 3 3100 330 3820.0 3180.0 582.0 358.0
21 2018 3 3900 610 3820.0 3180.0 582.0 358.0
22 2019 3 2000 530 1990.0 1910.0 710.0 550.0
23 2019 3 1900 730 1990.0 1910.0 710.0 550.0
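A related alternative (my sketch, not part of the answer above): groupby().transform broadcasts each group statistic back to the original rows, which avoids the merge step entirely:

```python
import pandas as pd

# Hypothetical subset of the sales data for illustration.
sales = pd.DataFrame({
    'ID':    [1, 1, 1],
    'year':  [2015, 2015, 2016],
    'sales': [1000, 1400, 1100],
    'bonus': [500, 590, 200],
})

g = sales.groupby(['year', 'ID'])
for col in ['sales', 'bonus']:
    for q in (.90, .10):
        # q=q binds the current quantile value inside the lambda
        sales[f'{col}{int(q * 100)}'] = g[col].transform(lambda s, q=q: s.quantile(q))
```

Note that this inserts columns one at a time, so with many columns it reintroduces the fragmentation warning; the merge-based version above scales better in that case.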