dplyr groupby percentage and renaming the column

Question

I want to group my DataFrame by the promotion offered and calculate a percentage. The DataFrame looks like this:

Promotion name             days rented
nan                        577
first month half off       88
nan                        22
second month free          55
nan                        60
first month half off       20

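Suppose my DataFrame is called df. How would I group by promotion name, calculate the percentage of days, and rename the column? My first column would be "# Rentals < 1 month". In R, I would write:

df %>% group_by(`Promotion Name`) %>% 
summarise("# Rentals < 1 month" = sum(`Days rented` <= 30)/length(`Days rented`))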

Can someone help me do this in Python?

I would like the output in this format:

Promotion Name         # rentals < 1 month    # rentals < 2 month   # rentals < 3 months
None                   0.0023                 0.005                0.28
First month half off   0.78                   0.22                 0.76
2nd month free         0.44     etc

I tried

rented_df.groupby('Promotion Name').sum()

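But that does not give me what I want, because I want to count the rentals of 30 days or fewer, divide by the number of rows in the group, and finally rename the column. Thanks.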

Answer 1

I think you need groupby with a custom function that uses boolean indexing:

# Parentheses let the method chain span several lines
df = (rented_df.groupby('Promotion name')['days rented']
               .apply(lambda x: x[x<=30].sum()/len(x))
               .reset_index(name='# Rentals < 1 month'))
print (df)
         Promotion name  # Rentals < 1 month
0  first month half off            10.000000
1     second month free             0.000000
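
Note that the expression above sums the day values that are at most 30 and divides by the group size, whereas the R snippet in the question counts the rows that are at most 30 and divides by the group size. A minimal sketch of the share-of-rows variant follows, assuming the sample data from the question; the variable name share is only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; NaN marks rentals without a promotion.
rented_df = pd.DataFrame({
    'Promotion name': [np.nan, 'first month half off', np.nan,
                       'second month free', np.nan, 'first month half off'],
    'days rented': [577, 88, 22, 55, 60, 20],
})

# (x <= 30) is a boolean Series; its mean is the fraction of True values,
# which corresponds to sum(cond) / length(cond) in the R snippet.
share = (rented_df.groupby('Promotion name')['days rented']
                  .apply(lambda x: (x <= 30).mean())
                  .reset_index(name='# Rentals < 1 month'))
print(share)
```

Rows whose promotion is NaN are left out here, because groupby drops NaN keys by default.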

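But groupby drops NaN group keys by default, so if you need them, first use fillna to replace the NaNs with some string that does not already appear in the column: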

# Give the missing promotions a placeholder label so they form their own group
rented_df['Promotion name'] = rented_df['Promotion name'].fillna('NANS strings')
df = (rented_df.groupby('Promotion name')['days rented']
               .apply(lambda x: x[x<=30].sum()/len(x))
               .reset_index(name='# Rentals < 1 month'))
print (df)
         Promotion name  # Rentals < 1 month
0          NANS strings             7.333333
1  first month half off            10.000000
2     second month free             0.000000
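
As an alternative to a placeholder string, pandas 1.1 and later accept dropna=False in groupby, which keeps NaN as its own group. The sketch below assumes such a version and the sample data from the question; the helper column within_30 is a hypothetical name, and the result is the share of rows at or under 30 days rather than the sum used above.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; NaN marks rentals without a promotion.
rented_df = pd.DataFrame({
    'Promotion name': [np.nan, 'first month half off', np.nan,
                       'second month free', np.nan, 'first month half off'],
    'days rented': [577, 88, 22, 55, 60, 20],
})

# within_30 is a hypothetical helper column: True where the rental is <= 30 days.
# dropna=False (pandas >= 1.1) keeps the NaN promotion as a separate group.
share = (rented_df.assign(within_30=rented_df['days rented'] <= 30)
                  .groupby('Promotion name', dropna=False)['within_30']
                  .mean()
                  .reset_index(name='# Rentals < 1 month'))
print(share)
```

The NaN group keeps its missing-value key, so it can still be relabelled afterwards with fillna if a readable name such as None is wanted in the final table.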

If you want the result as a new column in the original DataFrame, use transform:

rented_df['Promotion name'] = rented_df['Promotion name'].fillna('NANS strings')
# transform returns one value per original row, aligned to the original index
rented_df['# Rentals < 1 month'] = (rented_df.groupby('Promotion name')['days rented']
                                             .transform(lambda x: x[x<=30].sum()/len(x)))
print (rented_df)
         Promotion name  days rented  # Rentals < 1 month
0          NANS strings          577             7.333333
1  first month half off           88            10.000000
2          NANS strings           22             7.333333
3     second month free           55             0.000000
4          NANS strings           60             7.333333
5  first month half off           20            10.000000
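
If the share of rows is wanted instead of the sum, transform can also be called with the built-in 'mean' on a boolean Series. This is only a sketch under the assumption that the sample data from the question is used; the label 'None' for missing promotions is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; NaN marks rentals without a promotion.
rented_df = pd.DataFrame({
    'Promotion name': [np.nan, 'first month half off', np.nan,
                       'second month free', np.nan, 'first month half off'],
    'days rented': [577, 88, 22, 55, 60, 20],
})

# 'None' is a hypothetical label for rentals without a promotion.
rented_df['Promotion name'] = rented_df['Promotion name'].fillna('None')

# Boolean flag per row, grouped by promotion; transform('mean') broadcasts
# each group's share of True values back to every row of that group.
rented_df['# Rentals < 1 month'] = (
    rented_df['days rented'].le(30)
             .groupby(rented_df['Promotion name'])
             .transform('mean')
)
print(rented_df)
```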

EDIT:

import pandas as pd

rented_df['Promotion name'] = rented_df['Promotion name'].fillna('NANS strings')
# Reuse one grouped object for all three thresholds
g = rented_df.groupby('Promotion name')['days rented']
s1 = g.apply(lambda x: x[x<=30].sum()/len(x)).rename('# Rentals < 1 month')
s2 = g.apply(lambda x: x[x<=60].sum()/len(x)).rename('# Rentals < 2 month')
s3 = g.apply(lambda x: x[x<=90].sum()/len(x)).rename('# Rentals < 3 month')
# Combine the three Series side by side, then turn the group key into a column
df = pd.concat([s1,s2,s3], axis=1).reset_index()
print (df)
         Promotion name  # Rentals < 1 month  # Rentals < 2 month  \
0          NANS strings             7.333333            27.333333   
1  first month half off            10.000000            10.000000   
2     second month free             0.000000            55.000000   

   # Rentals < 3 month  
0            27.333333  
1            54.000000  
2            55.000000
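
The three columns can also be computed in one pass by building a boolean flag per threshold and taking the group mean. The sketch below computes shares of rows (as in the R snippet) rather than sums, and assumes the sample data from the question; the names thresholds, flags, key and result, and the 'None' label, are hypothetical.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; NaN marks rentals without a promotion.
rented_df = pd.DataFrame({
    'Promotion name': [np.nan, 'first month half off', np.nan,
                       'second month free', np.nan, 'first month half off'],
    'days rented': [577, 88, 22, 55, 60, 20],
})

# 'None' is a hypothetical label for rentals without a promotion.
key = rented_df['Promotion name'].fillna('None')

# Output column name -> upper limit in days.
thresholds = {
    '# Rentals < 1 month': 30,
    '# Rentals < 2 month': 60,
    '# Rentals < 3 month': 90,
}

# One boolean column per threshold: True where the rental is within the limit.
flags = pd.DataFrame({name: rented_df['days rented'] <= limit
                      for name, limit in thresholds.items()})

# The mean of a boolean column per group is the share of rows within the limit.
result = flags.groupby(key).mean().reset_index()
print(result)
```

Adding another threshold then only needs one more entry in the dictionary.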

Tags: python, python-2.7, pandas, pandas-groupby