Aggregate multiple columns in a dataframe based on custom functions

Good afternoon all,

I have been trying to solve this for a while now; any help would be appreciated.

Here is my dataframe:

Channel state       rfq_qty
A        Done       10
B        Tied Done  10
C        Done       10
C        Done       10
C        Done       10
C        Tied Done  10
B        Done       10
B        Done       10

I would like to:

  1. Group by channel, then state
  2. Sum the rfq_qty for each channel
  3. Count the occurrences of each 'done' string in state ('Done' is treated the same as 'Tied Done', i.e. anything with 'done' in it)
  4. Display each channel's rfq_qty as a percentage of the total rfq_qty (80)

Desired output:

Channel state   rfq_qty Percentage
A         1       10    0.125
B         3       30    0.375
C         4       40    0.5

I have attempted this with the following:

df_Done = df[
                (
                    df['state']=='Done'
                ) 
                | 
                (
                    df['state'] == 'Tied Done'
                )
            ][['Channel','state','rfq_qty']]

df_Done['Percentage_Qty']= df_Done['rfq_qty']/df_Done['rfq_qty'].sum()
df_Done['Done_Trades']= 1  # one done trade per row, summed per channel below

display(
        df_Done[
                (df_Done['Channel'] != 0)
               ].groupby(['Channel'])[['Done_Trades','rfq_qty','Percentage_Qty']].sum().sort_values(['rfq_qty'], ascending=False)
       )

Works but looks convoluted. Any improvements?

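I think you can use:

  • first filter with isin and loc
  • groupby and aggregate with agg, passing tuples of new column names and functions
  • add Percentage by dividing with div and sum
  • finally, if necessary, sort_values by rfq_qty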

df_Done = df.loc[df['state'].isin(['Done', 'Tied Done']), ['Channel','state','rfq_qty']]

# if you want to keep every row whose state contains 'Done'
#df_Done = df[df['state'].str.contains('Done')]

# if necessary, also filter out rows where Channel == 0
#mask = (df['Channel'] != 0) & df['state'].isin(['Done', 'Tied Done'])
#df_Done = df.loc[mask, ['Channel','state','rfq_qty']]

# list of (new column name, function) tuples; a list keeps the column order
d = [('Done_Trades', 'size'), ('rfq_qty', 'sum')]
df = df_Done.groupby('Channel')['rfq_qty'].agg(d).reset_index()
df['Percentage'] = df['rfq_qty'].div(df['rfq_qty'].sum())
df = df.sort_values('rfq_qty')
print(df)
  Channel  Done_Trades  rfq_qty  Percentage
0       A            1       10       0.125
1       B            3       30       0.375
2       C            4       40       0.500
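
On newer pandas versions (0.25 and later), the same steps can also be written with named aggregation, which spells out each output column and the function that produces it. This is only a sketch and not part of the answer above: the helper name summarise_done is made up for illustration, and the case-insensitive match on 'done' is an assumption based on the question's wording ("anything with 'done' in it").

```python
import pandas as pd


def summarise_done(df):
    """Per-channel trade count, total rfq_qty and share of the overall total."""
    # Assumption: keep any state containing 'done', ignoring case.
    done = df[df['state'].str.contains('done', case=False, na=False)]
    out = (done.groupby('Channel')
               .agg(Done_Trades=('state', 'size'),   # rows per channel
                    rfq_qty=('rfq_qty', 'sum'))      # total quantity per channel
               .reset_index())
    # Share of each channel in the total quantity of the filtered rows.
    out['Percentage'] = out['rfq_qty'].div(out['rfq_qty'].sum())
    return out.sort_values('rfq_qty', ascending=False, ignore_index=True)


df = pd.DataFrame({'Channel': ['A', 'B', 'C', 'C', 'C', 'C', 'B', 'B'],
                   'state': ['Done', 'Tied Done', 'Done', 'Done',
                             'Done', 'Tied Done', 'Done', 'Done'],
                   'rfq_qty': [10, 10, 10, 10, 10, 10, 10, 10]})

result = summarise_done(df)
print(result)
```

Sorting is descending here to match the question's attempt; drop ascending=False to get the order shown in the output above.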

One way is to use a single df.groupby.agg and rename the columns:

import pandas as pd

df = pd.DataFrame({'Channel': ['A', 'B', 'C', 'C', 'C', 'C', 'B', 'B'],
                   'state': ['Done', 'Tied Done', 'Done', 'Done', 'Done', 'Tied Done', 'Done', 'Done'],
                   'rfq_qty': [10, 10, 10, 10, 10, 10, 10, 10]})

agg_funcs = {'state': lambda x: x[x.str.contains('Done')].count(),
             'rfq_qty': ['sum', lambda x: x.sum() / df['rfq_qty'].sum()]}

res = df.groupby('Channel').agg(agg_funcs).reset_index()
res.columns = ['Channel', 'state', 'rfq_qty', 'Percentage']

#   Channel  state  rfq_qty  Percentage
# 0       A      1       10       0.125
# 1       B      3       30       0.375
# 2       C      4       40       0.500

This isn't the most efficient way, since it relies on non-vectorised aggregations, but it may be a good option if it is performant enough for your use case.
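
To illustrate the vectorisation point, one possible variant computes the 'Done' flag once on the whole column and leaves only built-in sums inside the groupby. It is a sketch under the same assumptions as the code above: rfq_qty is summed over all rows of each channel, the percentage is taken against the total of the full dataframe, and state has no missing values.

```python
import pandas as pd

df = pd.DataFrame({'Channel': ['A', 'B', 'C', 'C', 'C', 'C', 'B', 'B'],
                   'state': ['Done', 'Tied Done', 'Done', 'Done',
                             'Done', 'Tied Done', 'Done', 'Done'],
                   'rfq_qty': [10, 10, 10, 10, 10, 10, 10, 10]})

# Turn state into a 0/1 flag up front, so counting matches becomes a plain sum.
flagged = df.assign(state=df['state'].str.contains('Done').astype(int))

res = (flagged.groupby('Channel', as_index=False)
              .agg(state=('state', 'sum'),        # number of 'Done' rows
                   rfq_qty=('rfq_qty', 'sum')))   # total quantity per channel

# One column-wise division instead of a lambda evaluated per group.
res['Percentage'] = res['rfq_qty'] / df['rfq_qty'].sum()
print(res)
```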