Groupby Sum returns the wrong sum value as it has been multiplied in Pandas

Question

Here is some sample code:

import pandas as pd

data = {'Date': ['10/10/21', '10/10/21', '13/10/21', '11/10/21', '11/10/21', '11/10/21', '11/10/21', '11/10/21', '13/10/21', '13/10/21', '13/10/21', '10/10/21', '10/10/21'],
      'ID': [1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
      'TotalTimeSpentInMinutes': [19, 6, 14, 17, 51, 53, 66, 19, 14, 28, 44, 22, 41],
      'Vehicle': ['V3', 'V1', 'V3', 'V1','V1','V1','V1','V1','V1','V1','V1','V1','V1']
      }

df = pd.DataFrame(data)

prices = {
    'V1': 9.99,
    'V2': 9.99,
    'V3': 14.00,
}

default_price = 9.99

df = df.sort_values('ID')

df['OrdersPD'] = df.groupby(['ID', 'Date', 'Vehicle'])['ID'].transform('count')

df['MinutesPD'] = df.groupby(['ID', 'Date', 'Vehicle'])['TotalTimeSpentInMinutes'].transform(sum)

df['HoursPD'] = df['MinutesPD'] / 60

df['Pay excl extra'] = df.apply(lambda x: prices.get(x['Vehicle'], default_price)*x['HoursPD'], axis=1).round(2)

extra = 1.20

df['Extra Pay'] = df.apply(lambda x: extra*x['OrdersPD'], axis=1)

df['Total_pay'] = df['Pay excl extra'] + df['Extra Pay'].round(2)

df['Total Pay PD'] = df.groupby(['ID'])['Total_pay'].transform(sum)
#Returns wrong sum

df['Total Courier Hours'] = df.groupby(['ID'])['HoursPD'].transform(sum)
#Returns wrong sum

df['ABS Final Pay'] = df.groupby(['ID'])['Total Pay PD'].transform(sum)
#Returns wrong sum

df.drop_duplicates((['ID','Date','Vehicle']), inplace=True)

print(df)

I am trying to find two totals per ID: hours and pay.

This is my code for finding the total hours and the total pay.

Hours:

df['Total Courier Hours'] = df.groupby(['ID'])['HoursPD'].transform(sum)
#I've also tried with just .sum() but it returns an empty column

Pay:

df['ABS Final Pay'] = df.groupby(['ID'])['Total Pay PD'].transform(sum)

Example output for ID 1 - ABS Final Pay:

Date      ID   Vehicle  OrdersPD  HoursPD  PayExclExtra  ExtraPay
10/10/21   1      V1       1      0.1          1           1.20
10/10/21   1      V3       1      0.3166      4.43         1.20 
13/10/21   1      V3       1      0.2333      3.27         1.20


Total_pay  Total Pay PD   Total Courier Hours    ABS Final Pay  
   2.20        12.30               0.65                  36.90
   5.63        12.30               0.65                  36.90  
   4.47        12.30               0.65                  36.90

The two columns Total Courier Hours and ABS Final Pay are wrong, because right now the code calculates the totals like this:

ABS Final Pay = Total Pay PD * OrdersPD per count of ID 

Example: for 10/10/21 - it does 12.30 * 2 = 24.60
         for 13/10/21 - it does 12.30 * 1 = 12.30

ABS Final Pay returns 36.90 when it should be 12.30 (7.83 + 4.47 from the 2 days).
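To make the arithmetic above concrete, here is a minimal sketch that uses only the three Total_pay values quoted for ID 1. It is an illustration of the effect, not the original script: the first transform already writes the ID-level total onto every row, so a second transform adds that total once per row.

```python
import pandas as pd

# Total_pay values for ID 1 as quoted in the question,
# one row per (ID, Date, Vehicle) group.
df = pd.DataFrame({
    "ID": [1, 1, 1],
    "Date": ["10/10/21", "10/10/21", "13/10/21"],
    "Total_pay": [2.20, 5.63, 4.47],
})

# First transform: every row of the ID receives the ID-level total.
df["Total Pay PD"] = df.groupby("ID")["Total_pay"].transform("sum")

# Second transform: the already-broadcast total is summed once per row,
# so with 3 rows the result is 3 times the real total.
df["ABS Final Pay"] = df.groupby("ID")["Total Pay PD"].transform("sum")

total_pay_pd = round(df["Total Pay PD"].iloc[0], 2)
abs_final_pay = round(df["ABS Final Pay"].iloc[0], 2)
print(total_pay_pd, abs_final_pay)
```

With these numbers the first transform already holds the per-ID figure, so the second one is not needed for that purpose.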

Total Pay PD for ID 1 is also wrong, as it should show the sum of pay for each date. Example of the expected output:

Date      ID   Vehicle OrdersPD  Total PD
10/10/21   1     V1      1         7.83 
10/10/21   1     V3      1         7.83 
13/10/21   1     V3      1         4.47
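One way to get a per-date figure like the expected output above is to group by both ID and Date instead of ID alone. The sketch below assumes the frame already holds one row per (ID, Date, Vehicle) group and reuses the Total_pay values quoted for ID 1; the column name "Total PD" is taken from the expected-output table.

```python
import pandas as pd

# One row per (ID, Date, Vehicle) group, Total_pay as quoted for ID 1.
rows = pd.DataFrame({
    "ID": [1, 1, 1],
    "Date": ["10/10/21", "10/10/21", "13/10/21"],
    "Total_pay": [2.20, 5.63, 4.47],
})

# Grouping by ID and Date gives each row the total for its own day.
rows["Total PD"] = (
    rows.groupby(["ID", "Date"])["Total_pay"].transform("sum").round(2)
)

total_pd = rows["Total PD"].tolist()
print(rows)
```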

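Total Courier Hours seems fine for ID 1, which is split into 3 rows with 1 order each, but as soon as a group has more than 1 order the total is wrong, because the hours get multiplied by the number of orders.

Example for ID 2 - Total Courier Hours

It calculates the sum like this: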

Total Courier Hours = HoursPD * OrdersPD per count of ID 

Example: 11/10/21 - ID 2 had 5 orders, 2.85 * 5 = 14.25
         13/10/21 - 3 orders, 2.01 * 3 = 6.03
         10/10/21 - 2 orders, 1.05 * 2 = 2.1

Total Courier Hours returns 22.38 when it should be 5.91 (2.85 + 2.01 + 1.05 from the 3 days).
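The figures above can be reproduced with a small frame that repeats each daily HoursPD value once per order, which is what the transform-based column looks like before drop_duplicates runs. This is only a sketch built from the numbers quoted for ID 2 (they are not recomputed from the sample data); it compares summing every order row with summing one row per (ID, Date, Vehicle).

```python
import pandas as pd

# HoursPD for ID 2 as quoted in the question: every order row repeats
# the total of its (ID, Date, Vehicle) group.
df = pd.DataFrame({
    "ID": [2] * 10,
    "Date": ["11/10/21"] * 5 + ["13/10/21"] * 3 + ["10/10/21"] * 2,
    "Vehicle": ["V1"] * 10,
    "HoursPD": [2.85] * 5 + [2.01] * 3 + [1.05] * 2,
})

# Summing over every order row counts each daily total once per order.
inflated = round(df.groupby("ID")["HoursPD"].sum().loc[2], 2)

# Keeping one row per (ID, Date, Vehicle) first counts each day once.
per_day = df.drop_duplicates(["ID", "Date", "Vehicle"])
correct = round(per_day.groupby("ID")["HoursPD"].sum().loc[2], 2)

print(inflated, correct)
```

In other words, the order of operations matters: the per-ID sum has to be taken after the repeated rows are collapsed, not before.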

Sorry for the long post. I hope it makes sense, and thanks in advance.

Answer 1

The drop_duplicates line was probably the problem. Once I removed this line:

df.drop_duplicates((['ID','Date','Vehicle']), inplace=True)

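I was able to calculate the totals more accurately row by row, rather than having to compute them as columns in the code.

To keep things neatly separated, I printed the grouped columns to a different Excel sheet.

Example: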

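# Total pay per courier ID; 'Total Pay' is the column name used in this answer
per_courier = (
    df.groupby(['ID'])['Total Pay']
    .agg('sum')  # the string 'sum' avoids the warning newer pandas raises for the builtin
)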
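As a fuller illustration of the same idea, the sketch below aggregates the question's sample data down to one row per (ID, Date, Vehicle) first and only then sums per ID, so nothing is counted twice. It is one possible arrangement, not the exact code behind this answer; the names per_day and rate are made up for the example, and the Excel export is left out.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Date": ["10/10/21", "10/10/21", "13/10/21", "11/10/21", "11/10/21",
             "11/10/21", "11/10/21", "11/10/21", "13/10/21", "13/10/21",
             "13/10/21", "10/10/21", "10/10/21"],
    "ID": [1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    "TotalTimeSpentInMinutes": [19, 6, 14, 17, 51, 53, 66, 19, 14, 28, 44, 22, 41],
    "Vehicle": ["V3", "V1", "V3"] + ["V1"] * 10,
})
prices = {"V1": 9.99, "V2": 9.99, "V3": 14.00}
default_price = 9.99
extra = 1.20

# One row per (ID, Date, Vehicle): orders and minutes are aggregated once.
per_day = df.groupby(["ID", "Date", "Vehicle"], as_index=False).agg(
    OrdersPD=("TotalTimeSpentInMinutes", "count"),
    MinutesPD=("TotalTimeSpentInMinutes", "sum"),
)
per_day["HoursPD"] = per_day["MinutesPD"] / 60

# Hourly rate per vehicle, falling back to the default for unknown vehicles.
rate = per_day["Vehicle"].map(prices).fillna(default_price)
per_day["Total_pay"] = (rate * per_day["HoursPD"]).round(2) + extra * per_day["OrdersPD"]

# One row per ID: each daily figure is counted exactly once.
per_courier = per_day.groupby("ID")[["HoursPD", "Total_pay"]].sum().round(2)
print(per_courier)
```

For ID 1 this gives 0.65 hours and 12.30 total pay, which matches the figures the question expects.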


python

group-by

pandas

pandas-groupby