运行 在 pandas 中求和并指定行

Running sums in pandas with row specification

我有一些数据,我试图计算所有计数的总测量值以及计数 2、3 和 4 的测量值之和,对于每个批次的每个批次项目编号。理想情况下,我会在原始数据上有 2 个额外的列,其中包含总测量值和计数 2、3 和 4 的测量值——即使这些值会重复,它们也会用每条记录表示。这是数据集的示例:

Date    Sample Type Lot #   Lot item #  Count   Measurement
0   2021-12-05  G   ABS123-G    1   1   5.0
1   2021-12-05  G   ABS123-G    1   2   3.0
2   2021-12-05  G   ABS123-G    1   3   7.0
3   2021-12-05  G   ABS123-G    1   4   25.1
4   2021-12-05  G   ABS123-G    1   5   66.0
5   2021-12-05  G   ABS123-G    1   6   54.0
6   2021-12-05  G   ABS123-G    1   7   12.0
7   2021-12-05  G   ABS123-G    1   8   0.0
8   2021-12-05  G   ABS123-G    1   9   1.0
9   2021-12-05  G   ABS123-G    1   10  5.0
10  2021-12-05  G   ABS123-G    2   1   2.0
11  2021-12-05  G   ABS123-G    2   2   4.0
12  2021-12-05  G   ABS123-G    2   3   889.0
13  2021-12-05  G   ABS123-G    2   4   12.4
14  2021-12-05  G   ABS123-G    2   5   51.4
15  2021-12-05  G   ABS123-G    2   6   12.0
16  2021-12-05  G   ABS123-G    2   7   14.0
17  2021-12-05  G   ABS123-G    2   8   2.0
18  2021-12-05  G   ABS123-G    2   9   1.0
19  2021-12-05  G   ABS123-G    2   10  0.1
20  2021-12-05  B   ABS123-B    1   1   4.0
21  2021-12-05  B   ABS123-B    1   2   58.0
22  2021-12-05  B   ABS123-B    1   3   123.0
23  2021-12-05  B   ABS123-B    1   4   12.5
24  2021-12-05  B   ABS123-B    1   5   11.0
25  2021-12-05  B   ABS123-B    1   6   135.5
26  2021-12-05  B   ABS123-B    1   7   17.0
27  2021-12-05  B   ABS123-B    1   8   1.0
28  2021-12-05  B   ABS123-B    1   9   5.0
29  2021-12-05  B   ABS123-B    1   10  0.3

我的方法是尝试将计数过滤为 2、3、4,计算总和,然后根据批次和批次项目 # 将 df 连接到原始值,然后对总数做类似的事情。但是,当我尝试求和时 运行 出错了。

df2 = df.loc[(df['Count'] == 2) | (df['Count'] == 3) | (df['Count'] == 4)]
df2['Counts 2,3,4'] = df2.grouby(['Lot #, 'Lot item #'])['Measurement'].sum()
df2

TypeError: incompatible index of inserted column with frame index

过滤器有效,但第二部分无效。首先,我不知道是什么原因导致的错误,是否需要重新设置索引?另外,这是正确的方法吗?欢迎任何建议。

我能弄清楚是因为索引有问题。当我刚刚删除新的列名和 运行 groupby 时,它起作用了。然后我将索引重置为 groupby 并且我能够毫无问题地合并到原始 df 。与总数相同。

df2 = df.loc[(df['Count'] == 2) | (df['Count'] == 3) | (df['Count'] == 4)]
df3 = df2.groupby(['Lot #', 'Lot item #'])['Measurement'].sum()
df3 = df3.reset_index()

joined = pd.merge(df, df3, how='left', left_on=['Lot #', 'Lot item #'], right_on=['Lot #', 'Lot item #'])

我只是觉得一定有比这更优雅的解决方案?但也许不是?

我们可以使用isin to simplify the equality checks by defining a list of integer values. We can then use join after groupby sum and specify the columns to join on. Lastly rename这个Series来新的列名:

# Specify columns to groupby and join back on
grp_cols = ['Lot #', 'Lot item #']
joined = df.join(
    df[df['Count'].isin([2, 3, 4])]  # Values to include
        .groupby(grp_cols)['Measurement'].sum()  # Take sum per group
        .rename('Counts 2,3,4'),  # Specify new column name
    on=grp_cols,
)

joined:

          Date Sample Type     Lot #  Lot item #  Count  Measurement  Counts 2,3,4
0   2021-12-05           G  ABS123-G           1      1          5.0          35.1
1   2021-12-05           G  ABS123-G           1      2          3.0          35.1
2   2021-12-05           G  ABS123-G           1      3          7.0          35.1
3   2021-12-05           G  ABS123-G           1      4         25.1          35.1
4   2021-12-05           G  ABS123-G           1      5         66.0          35.1
5   2021-12-05           G  ABS123-G           1      6         54.0          35.1
6   2021-12-05           G  ABS123-G           1      7         12.0          35.1
7   2021-12-05           G  ABS123-G           1      8          0.0          35.1
8   2021-12-05           G  ABS123-G           1      9          1.0          35.1
9   2021-12-05           G  ABS123-G           1     10          5.0          35.1
10  2021-12-05           G  ABS123-G           2      1          2.0         905.4
11  2021-12-05           G  ABS123-G           2      2          4.0         905.4
12  2021-12-05           G  ABS123-G           2      3        889.0         905.4
13  2021-12-05           G  ABS123-G           2      4         12.4         905.4
14  2021-12-05           G  ABS123-G           2      5         51.4         905.4
15  2021-12-05           G  ABS123-G           2      6         12.0         905.4
16  2021-12-05           G  ABS123-G           2      7         14.0         905.4
17  2021-12-05           G  ABS123-G           2      8          2.0         905.4
18  2021-12-05           G  ABS123-G           2      9          1.0         905.4
19  2021-12-05           G  ABS123-G           2     10          0.1         905.4
20  2021-12-05           B  ABS123-B           1      1          4.0         193.5
21  2021-12-05           B  ABS123-B           1      2         58.0         193.5
22  2021-12-05           B  ABS123-B           1      3        123.0         193.5
23  2021-12-05           B  ABS123-B           1      4         12.5         193.5
24  2021-12-05           B  ABS123-B           1      5         11.0         193.5
25  2021-12-05           B  ABS123-B           1      6        135.5         193.5
26  2021-12-05           B  ABS123-B           1      7         17.0         193.5
27  2021-12-05           B  ABS123-B           1      8          1.0         193.5
28  2021-12-05           B  ABS123-B           1      9          5.0         193.5
29  2021-12-05           B  ABS123-B           1     10          0.3         193.5

示例 DataFrame 构造函数:

import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(
        ['2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05',
         '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05',
         '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05',
         '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05',
         '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05',
         '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05']),
    'Sample Type': ['G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G',
                    'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'B', 'B', 'B', 'B',
                    'B', 'B', 'B', 'B', 'B', 'B'],
    'Lot #': ['ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G',
              'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G',
              'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G',
              'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G',
              'ABS123-B', 'ABS123-B', 'ABS123-B', 'ABS123-B', 'ABS123-B',
              'ABS123-B', 'ABS123-B', 'ABS123-B', 'ABS123-B', 'ABS123-B'],
    'Lot item #': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
                   1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'Count': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1,
              2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Measurement': [5.0, 3.0, 7.0, 25.1, 66.0, 54.0, 12.0, 0.0, 1.0, 5.0, 2.0,
                    4.0, 889.0, 12.4, 51.4, 12.0, 14.0, 2.0, 1.0, 0.1, 4.0,
                    58.0, 123.0, 12.5, 11.0, 135.5, 17.0, 1.0, 5.0, 0.3]
})