Pandas groupby - grouping users and counting the type of subscription

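I am trying to use pandas to group members so that I can count how many of each subscription type a member has purchased and get the total amount each member has spent. Once loaded, the data looks like this: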

df = 

Member Nbr  Member Name-First   Member Name-Last        Date-Joined             Member Type         Amount  Addr-Formatted  Date-Birth              Gender      Status    
1           Aboud               Tordon                  2010-03-31 00:00:00     1 Year Membership   331.00  ADDRESS_1       1972-08-01 00:00:00     Male        Active  
1           Aboud               Tordon                  2011-04-16 00:00:00     1 Year Membership   334.70  ADDRESS_1       1972-08-01 00:00:00     Male        Active  
1           Aboud               Tordon                  2012-08-06 00:00:00     1 Year Membership   344.34  ADDRESS_1       1972-08-01 00:00:00     Male        Active  
1           Aboud               Tordon                  2013-08-21 00:00:00     1 Year Membership   362.53  ADDRESS_1       1972-08-01 00:00:00     Male        Active  
1           Aboud               Tordon                  2015-08-31 00:00:00     1 Year Membership   289.47  ADDRESS_1       1972-08-01 00:00:00     Male        Active  

2          Jean                 Manuel                  2012-12-10 00:00:00     4 Month Membership  148.79  ADDRESS_2       1984-08-01 00:00:00     Male        In-Active   
2          Jean                 Manuel                  2013-03-13 00:00:00     1 Year Membership   348.46  ADDRESS_2       1984-08-01 00:00:00     Male        In-Active
2          Jean                 Manuel                  2014-03-15 00:00:00     1 Year Membership   316.86  ADDRESS_2       1984-08-01 00:00:00     Male        In-Active   

3          Val                  Adams                   2010-02-09 00:00:00     1 Year Membership   333.25  ADDRESS_3       1934-10-26 00:00:00     Female      Active  
3          Val                  Adams                   2011-03-09 00:00:00     1 Year Membership   333.88  ADDRESS_3       1934-10-26 00:00:00     Female      Active
3          Val                  Adams                   2012-04-03 00:00:00     1 Year Membership   318.34  ADDRESS_3       1934-10-26 00:00:00     Female      Active
3          Val                  Adams                   2013-04-15 00:00:00     1 Year Membership   350.73  ADDRESS_3       1934-10-26 00:00:00     Female      Active  
3          Val                  Adams                   2014-04-19 00:00:00     1 Year Membership   291.63  ADDRESS_3       1934-10-26 00:00:00     Female      Active  
3          Val                  Adams                   2015-04-19 00:00:00     1 Year Membership   247.35  ADDRESS_3       1934-10-26 00:00:00     Female      Active

5          Michele              Younes                  2010-02-14 00:00:00     1 Year Membership   333.25  ADDRESS_4       1933-06-23 00:00:00     Female      In-Active   
5          Michele              Younes                  2011-05-23 00:00:00     1 Year Membership   317.77  ADDRESS_4       1933-06-23 00:00:00     Female      In-Active   
5          Michele              Younes                  2012-05-28 00:00:00     1 Year Membership   328.16  ADDRESS_4       1933-06-23 00:00:00     Female      In-Active   
5          Michele              Younes                  2013-05-31 00:00:00     1 Year Membership   360.02  ADDRESS_4       1933-06-23 00:00:00     Female      In-Active

7          Adam                 Herzburg                2010-07-11 00:00:00     1 Year Membership   335.48  ADDRESS_5       1987-08-30 00:00:00     Male        In-Active
...

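Because the most popular Member Type values are 1 Month, 3 Month, 4 Month, 6 Month and 1 Year, I would like to create one column for each of them that counts how many memberships of that type a given member has purchased.

There are also Member Type values that occur only rarely, such as 2 Month, 5 Month, 7 Month, 8 Month and Pool-Only. If a member has a contract of one of those types, I would like to count it as 'Misc'.

I am also trying to get a 'Total' column that sums the total amount a given member has spent.

Basically, I would like to transform my DataFrame above into something like: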

df1=
Member Nbr  Member Name-First   Member Name-Last    1_Month  3_Month  4_Month  6_Month  1_Year  Misc    Total    Addr-Formatted Date-Birth           Gender     Status
1           Aboud               Tordon              0        0        0        0        5       0       1662.04  ADDRESS_1      1972-08-01 00:00:00  Male       Active
2           Jean                Manuel              0        0        1        0        2       0       814.11   ADDRESS_2      1984-08-01 00:00:00  Male       In-Active
3           Val                 Adams               0        0        0        0        6       0       1875.18  ADDRESS_3      1934-10-26 00:00:00  Female     Active
5           Michele             Younes              0        0        0        0        4       0       1339.20  ADDRESS_4      1933-06-23 00:00:00  Female     In-Active
7           Adam                Herzburg            0        0        0        0        1       0       335.48   ADDRESS_5      1987-08-30 00:00:00  Male       In-Active

...

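The problem I am having is that whenever I use groupby, I can only either sum the amounts or count one particular contract type on its own; I can't get the result to look like df1.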

You can first map the values of the Member Type column with the dict d, then use fillna to replace every unmapped value with 'Misc':

d = {'1 Year Membership':'1_Year','1 Month Membership':'1_Month', '3 Month Membership':'3_Month', '4 Month Membership':'4_Month', '6 Month Membership':'6_Month'}
df['Type'] = df['Member Type'].map(d).fillna('Misc')
#print (df)
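
To see what the map and fillna step does in isolation, here is a minimal sketch. The four sample values are hypothetical (picked from the types named in the question), and the variable names member_type and mapped are mine:

```python
import pandas as pd

# Hypothetical sample of 'Member Type' values, two common and two rare.
member_type = pd.Series([
    '1 Year Membership',
    '4 Month Membership',
    'Pool-Only',
    '2 Month Membership',
])

d = {'1 Year Membership': '1_Year', '1 Month Membership': '1_Month',
     '3 Month Membership': '3_Month', '4 Month Membership': '4_Month',
     '6 Month Membership': '6_Month'}

# map() returns NaN for any value that is not a key of d;
# fillna() then turns those NaN entries into 'Misc'.
mapped = member_type.map(d).fillna('Misc')
print(mapped.tolist())  # ['1_Year', '4_Month', 'Misc', 'Misc']
```

Any type that is not listed in d ends up in the 'Misc' bucket, so new rare types need no extra handling.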

Then groupby and aggregate with sum:

df0 = df.groupby(['Member Nbr','Member Name-First','Member Name-Last','Addr-Formatted','Date-Birth','Gender','Status'])['Amount'].sum()
#print (df0)

Add the Type column to the list of grouping columns, aggregate with size, then reshape with unstack:

df1 = df.groupby(['Member Nbr','Member Name-First','Member Name-Last','Addr-Formatted','Date-Birth','Gender','Status', 'Type']).size().unstack(fill_value=0)
#print (df1)
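
The size and unstack combination may be easier to follow on a tiny frame. The sketch below uses a hypothetical two-column DataFrame (toy) rather than the real data, and groups by the member number only:

```python
import pandas as pd

# Hypothetical toy frame: one row per purchased membership.
toy = pd.DataFrame({
    'Member Nbr': [1, 1, 2, 2, 2],
    'Type': ['1_Year', '1_Year', '4_Month', '1_Year', '1_Year'],
})

# size() counts the rows in each (member, type) group and returns
# a Series indexed by a MultiIndex of ('Member Nbr', 'Type').
counts = toy.groupby(['Member Nbr', 'Type']).size()

# unstack() moves the innermost index level ('Type') into the columns;
# combinations that never occur get fill_value instead of NaN.
wide = counts.unstack(fill_value=0)
print(wide)
```

Member 1 never bought a 4_Month membership, so that cell is filled with 0 rather than NaN, which also keeps the counts as integers.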

Finally, concat the two DataFrames:

print (pd.concat([df0, df1], axis=1).reset_index())
   Member Nbr Member Name-First Member Name-Last Addr-Formatted  \
0           1             Aboud           Tordon      ADDRESS_1   
1           2              Jean           Manuel      ADDRESS_2   
2           3               Val            Adams      ADDRESS_3   
3           5           Michele           Younes      ADDRESS_4   
4           7              Adam         Herzburg      ADDRESS_5   

            Date-Birth  Gender     Status   Amount  1_Year  4_Month  
0  1972-08-01 00:00:00    Male     Active  1662.04       5        0  
1  1984-08-01 00:00:00    Male  In-Active   814.11       2        1  
2  1934-10-26 00:00:00  Female     Active  1875.18       6        0  
3  1933-06-23 00:00:00  Female  In-Active  1339.20       4        0  
4  1987-08-30 00:00:00    Male  In-Active   335.48       1        0  
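
For reference, the steps above can be put together into one runnable sketch. It uses only four rows from the question and a reduced set of grouping columns to stay short. The names keys, total, counts and result are mine, and renaming Amount to Total is an assumption made to match the layout the question asks for:

```python
import pandas as pd

# Four rows taken from the question (members 2 and 7), fewer columns.
df = pd.DataFrame({
    'Member Nbr': [2, 2, 2, 7],
    'Member Name-First': ['Jean', 'Jean', 'Jean', 'Adam'],
    'Member Type': ['4 Month Membership', '1 Year Membership',
                    '1 Year Membership', '1 Year Membership'],
    'Amount': [148.79, 348.46, 316.86, 335.48],
})

d = {'1 Year Membership': '1_Year', '1 Month Membership': '1_Month',
     '3 Month Membership': '3_Month', '4 Month Membership': '4_Month',
     '6 Month Membership': '6_Month'}
df['Type'] = df['Member Type'].map(d).fillna('Misc')

keys = ['Member Nbr', 'Member Name-First']

# Total spent per member, renamed to match the desired 'Total' column.
total = df.groupby(keys)['Amount'].sum().rename('Total')

# Number of memberships of each type per member.
counts = df.groupby(keys + ['Type']).size().unstack(fill_value=0)

# Both objects share the same index, so concat aligns them row by row.
result = pd.concat([counts, total], axis=1).reset_index()
print(result)
```

Putting counts before total in the concat call places the Total column after the type columns, as in the requested df1.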

EDIT:

If some of the values never occur in the Member Type column, their count columns will be missing, so you need to add reindex:

df1 = df.groupby(['Member Nbr','Member Name-First','Member Name-Last','Addr-Formatted','Date-Birth','Gender','Status', 'Type']).size().unstack(fill_value=0).reindex(columns=d.values(), fill_value=0)
#print (df1)

print (pd.concat([df0, df1], axis=1).reset_index())
   Member Nbr Member Name-First Member Name-Last Addr-Formatted  \
0           1             Aboud           Tordon      ADDRESS_1   
1           2              Jean           Manuel      ADDRESS_2   
2           3               Val            Adams      ADDRESS_3   
3           5           Michele           Younes      ADDRESS_4   
4           7              Adam         Herzburg      ADDRESS_5   

            Date-Birth  Gender     Status   Amount  6_Month  3_Month  4_Month  \
0  1972-08-01 00:00:00    Male     Active  1662.04        0        0        0   
1  1984-08-01 00:00:00    Male  In-Active   814.11        0        0        1   
2  1934-10-26 00:00:00  Female     Active  1875.18        0        0        0   
3  1933-06-23 00:00:00  Female  In-Active  1339.20        0        0        0   
4  1987-08-30 00:00:00    Male  In-Active   335.48        0        0        0   

   1_Year  1_Month  
0       5        0  
1       2        0  
2       6        0  
3       4        0  
4       1        0  
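
One detail worth checking: reindex keeps only the columns it is given, so reindexing with d.values() alone would drop a 'Misc' column if one exists. The sketch below shows this on a hypothetical counts table. The explicit wanted list (the column order from the question plus 'Misc') is my assumption, not part of the answer above:

```python
import pandas as pd

# Hypothetical counts table, shaped like the result of size().unstack().
counts = pd.DataFrame(
    {'1_Year': [5, 2], '4_Month': [0, 1], 'Misc': [1, 0]},
    index=pd.Index([1, 2], name='Member Nbr'),
)

d = {'1 Year Membership': '1_Year', '1 Month Membership': '1_Month',
     '3 Month Membership': '3_Month', '4 Month Membership': '4_Month',
     '6 Month Membership': '6_Month'}

# Reindexing with the mapped names only: 'Misc' is not in the list, so it is dropped.
only_mapped = counts.reindex(columns=list(d.values()), fill_value=0)

# Explicit column order from the question's desired layout, including 'Misc'.
wanted = ['1_Month', '3_Month', '4_Month', '6_Month', '1_Year', 'Misc']
full = counts.reindex(columns=wanted, fill_value=0)

print('Misc' in only_mapped.columns)  # False
print(full)
```

An explicit list also pins the column order, which otherwise follows the order of the dict.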

Instead of the second groupby (the fastest approach), you can use pivot_table:

df2 = df.pivot_table(index=['Member Nbr','Member Name-First','Member Name-Last','Addr-Formatted','Date-Birth','Gender','Status'], columns='Type', values='Amount', aggfunc=len, fill_value=0).reindex(columns=d.values(), fill_value=0)
print (pd.concat([df0, df2], axis=1).reset_index())
   Member Nbr Member Name-First Member Name-Last Addr-Formatted  \
0           1             Aboud           Tordon      ADDRESS_1   
1           2              Jean           Manuel      ADDRESS_2   
2           3               Val            Adams      ADDRESS_3   
3           5           Michele           Younes      ADDRESS_4   
4           7              Adam         Herzburg      ADDRESS_5   

            Date-Birth  Gender     Status   Amount  6_Month  3_Month  4_Month  \
0  1972-08-01 00:00:00    Male     Active  1662.04        0        0        0   
1  1984-08-01 00:00:00    Male  In-Active   814.11        0        0        1   
2  1934-10-26 00:00:00  Female     Active  1875.18        0        0        0   
3  1933-06-23 00:00:00  Female  In-Active  1339.20        0        0        0   
4  1987-08-30 00:00:00    Male  In-Active   335.48        0        0        0   

   1_Year  1_Month  
0       5        0  
1       2        0  
2       6        0  
3       4        0  
4       1        0
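
As a quick sanity check, the following sketch compares the two ways of counting on a hypothetical toy frame (the data and the names via_groupby and via_pivot are mine). It only checks that the counts agree and says nothing about speed:

```python
import pandas as pd

# Hypothetical toy frame: one row per purchased membership.
toy = pd.DataFrame({
    'Member Nbr': [1, 1, 2, 2, 2],
    'Type': ['1_Year', '1_Year', '4_Month', '1_Year', '1_Year'],
    'Amount': [331.00, 334.70, 148.79, 348.46, 316.86],
})

# Counting with groupby + size + unstack.
via_groupby = toy.groupby(['Member Nbr', 'Type']).size().unstack(fill_value=0)

# Counting with pivot_table: len counts the 'Amount' entries in each cell.
via_pivot = toy.pivot_table(index='Member Nbr', columns='Type',
                            values='Amount', aggfunc=len, fill_value=0)

# Same labels and the same counts in every cell.
same = (via_pivot.to_numpy() == via_groupby.to_numpy()).all()
print(same)
```

Note that pivot_table needs a values column to count, while size counts rows directly, so rows with a missing Amount could make the two results differ.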