Aggregate multiple columns in a dataframe based on custom functions
Good afternoon all,
I have been trying to solve this for a while now, and any help would be appreciated.
Here is my dataframe:
Channel state rfq_qty
A Done 10
B Tied Done 10
C Done 10
C Done 10
C Done 10
C Tied Done 10
B Done 10
B Done 10
I would like to:
- Group by channel, then state
- Sum the rfq_qty for each channel
- Count the occurrences of each 'done' string in state ('Done' is treated the same as 'Tied Done', i.e. anything with 'done' in it)
- Display each channel's rfq_qty as a percentage of the total rfq_qty (80)
Channel state rfq_qty Percentage
A 1 10 0.125
B 3 30 0.375
C 4 40 0.5
I have attempted this with the following:
df_Done = df[
    (df['state'] == 'Done') | (df['state'] == 'Tied Done')
][['Channel', 'state', 'rfq_qty']].copy()
df_Done['Percentage_Qty'] = df_Done['rfq_qty'] / df_Done['rfq_qty'].sum()
# one done trade per row, so summing per channel gives the count
df_Done['Done_Trades'] = 1
display(
    df_Done[df_Done['Channel'] != 0]
    .groupby(['Channel'])[['Done_Trades', 'rfq_qty', 'Percentage_Qty']]
    .sum()
    .sort_values(['rfq_qty'], ascending=False)
)
Works but looks convoluted. Any improvements?
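I think you can use:
- first filter with isin and loc
- groupby and aggregate with agg, passing tuples of new column names and functions
- add Percentage by dividing with div and sum
- finally, if necessary, sort_values by rfq_qty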
df_Done = df.loc[df['state'].isin(['Done', 'Tied Done']), ['Channel', 'state', 'rfq_qty']]
# to keep every state that contains 'Done' instead:
# df_Done = df[df['state'].str.contains('Done')]
# to also filter out Channel == 0, if necessary:
# mask = (df['Channel'] != 0) & df['state'].isin(['Done', 'Tied Done'])
# df_Done = df.loc[mask, ['Channel', 'state', 'rfq_qty']]
# list of (new column name, function) tuples; a list keeps the column order fixed
d = [('rfq_qty', 'sum'), ('Done_Trades', 'size')]
df = df_Done.groupby('Channel')['rfq_qty'].agg(d).reset_index()
df['Percentage'] = df['rfq_qty'].div(df['rfq_qty'].sum())
df = df.sort_values('rfq_qty')
print(df)
  Channel  rfq_qty  Done_Trades  Percentage
0       A       10            1       0.125
1       B       30            3       0.375
2       C       40            4       0.500
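As a possible variation on the steps above, here is a minimal sketch that uses named aggregation (available in pandas 0.25 and later) and a case-insensitive match, since the question asks for "anything with 'done' in it". The names `done` and `summary` are only illustrative, and the sample frame is rebuilt from the data in the question.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    'Channel': ['A', 'B', 'C', 'C', 'C', 'C', 'B', 'B'],
    'state': ['Done', 'Tied Done', 'Done', 'Done', 'Done', 'Tied Done', 'Done', 'Done'],
    'rfq_qty': [10] * 8,
})

# Case-insensitive match, so 'Done', 'Tied Done', 'done', ... all qualify
done = df[df['state'].str.contains('done', case=False)]

# Named aggregation: keyword = output column, value = (input column, function)
summary = (
    done.groupby('Channel')
        .agg(Done_Trades=('state', 'size'), rfq_qty=('rfq_qty', 'sum'))
        .reset_index()
)

# Each channel's share of the total quantity
summary['Percentage'] = summary['rfq_qty'].div(summary['rfq_qty'].sum())
print(summary)
```

Named aggregation fixes the output column order to the order of the keywords, so the result does not depend on how a container of tuples happens to be iterated.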
One way is to use a single df.groupby.agg and rename the columns:
import pandas as pd
df = pd.DataFrame({'Channel': ['A', 'B', 'C', 'C', 'C', 'C', 'B', 'B'],
                   'state': ['Done', 'Tied Done', 'Done', 'Done', 'Done', 'Tied Done', 'Done', 'Done'],
                   'rfq_qty': [10, 10, 10, 10, 10, 10, 10, 10]})
agg_funcs = {'state': lambda x: x[x.str.contains('Done')].count(),
             'rfq_qty': ['sum', lambda x: x.sum() / df['rfq_qty'].sum()]}
res = df.groupby('Channel').agg(agg_funcs).reset_index()
res.columns = ['Channel', 'state', 'rfq_qty', 'Percentage']
# Channel state rfq_qty Percentage
# 0 A 1 10 0.125
# 1 B 3 30 0.375
# 2 C 4 40 0.500
This isn't the most efficient approach, since it relies on non-vectorised aggregations, but it may be a good option if it is performant enough for your use case.
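If the lambdas turn out to be too slow, one vectorised alternative might look like the sketch below. It is not part of the answer above: it assumes the same sample data, and the helper column `is_done` is a hypothetical name introduced only for this example. The idea is to compute the string match once for the whole column, then let the built-in `sum` do the counting.

```python
import pandas as pd

df = pd.DataFrame({
    'Channel': ['A', 'B', 'C', 'C', 'C', 'C', 'B', 'B'],
    'state': ['Done', 'Tied Done', 'Done', 'Done', 'Done', 'Tied Done', 'Done', 'Done'],
    'rfq_qty': [10, 10, 10, 10, 10, 10, 10, 10],
})

# Boolean flag computed once for the whole column (no per-group lambda)
is_done = df['state'].str.contains('Done')

res = (
    df.assign(is_done=is_done)      # hypothetical helper column
      .groupby('Channel')
      .agg(state=('is_done', 'sum'),    # summing booleans counts the True values
           rfq_qty=('rfq_qty', 'sum'))
      .reset_index()
)

# Share of the overall total, computed on the aggregated frame
res['Percentage'] = res['rfq_qty'] / res['rfq_qty'].sum()
print(res)
```

As in the lambda version, rfq_qty is summed over every row of a channel, and only the state count is restricted to the rows that contain 'Done'.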