Running sums in pandas with row specification
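I have some data and I am trying to calculate, for each lot item number of each lot, the total measurement across all counts as well as the sum of the measurements for counts 2, 3, and 4. Ideally I would end up with 2 extra columns on the original data holding the total measurement and the counts 2, 3, and 4 measurement. The values would repeat, but they would be present on every record. Here is a sample of the dataset: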
Date Sample Type Lot # Lot item # Count Measurement
0 2021-12-05 G ABS123-G 1 1 5.0
1 2021-12-05 G ABS123-G 1 2 3.0
2 2021-12-05 G ABS123-G 1 3 7.0
3 2021-12-05 G ABS123-G 1 4 25.1
4 2021-12-05 G ABS123-G 1 5 66.0
5 2021-12-05 G ABS123-G 1 6 54.0
6 2021-12-05 G ABS123-G 1 7 12.0
7 2021-12-05 G ABS123-G 1 8 0.0
8 2021-12-05 G ABS123-G 1 9 1.0
9 2021-12-05 G ABS123-G 1 10 5.0
10 2021-12-05 G ABS123-G 2 1 2.0
11 2021-12-05 G ABS123-G 2 2 4.0
12 2021-12-05 G ABS123-G 2 3 889.0
13 2021-12-05 G ABS123-G 2 4 12.4
14 2021-12-05 G ABS123-G 2 5 51.4
15 2021-12-05 G ABS123-G 2 6 12.0
16 2021-12-05 G ABS123-G 2 7 14.0
17 2021-12-05 G ABS123-G 2 8 2.0
18 2021-12-05 G ABS123-G 2 9 1.0
19 2021-12-05 G ABS123-G 2 10 0.1
20 2021-12-05 B ABS123-B 1 1 4.0
21 2021-12-05 B ABS123-B 1 2 58.0
22 2021-12-05 B ABS123-B 1 3 123.0
23 2021-12-05 B ABS123-B 1 4 12.5
24 2021-12-05 B ABS123-B 1 5 11.0
25 2021-12-05 B ABS123-B 1 6 135.5
26 2021-12-05 B ABS123-B 1 7 17.0
27 2021-12-05 B ABS123-B 1 8 1.0
28 2021-12-05 B ABS123-B 1 9 5.0
29 2021-12-05 B ABS123-B 1 10 0.3
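My approach was to filter the counts to 2, 3, 4, compute the sum, then join that df back to the original on Lot # and Lot item #, and then do something similar for the total. However, I ran into an error when trying to sum.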
df2 = df.loc[(df['Count'] == 2) | (df['Count'] == 3) | (df['Count'] == 4)]
df2['Counts 2,3,4'] = df2.groupby(['Lot #', 'Lot item #'])['Measurement'].sum()
df2
TypeError: incompatible index of inserted column with frame index
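The filter works, but the second part does not. First, I don't know what is causing the error; do I need to reset the index? Also, is this the right approach? Any suggestions are welcome.
I was able to figure out that it was an index problem. When I just removed the new column name and ran the groupby, it worked. Then I reset the index on the groupby result and was able to merge it onto the original df without issue. Same for the total.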
df2 = df.loc[(df['Count'] == 2) | (df['Count'] == 3) | (df['Count'] == 4)]
df3 = df2.groupby(['Lot #', 'Lot item #'])['Measurement'].sum()
df3 = df3.reset_index()
joined = pd.merge(df, df3, how='left', left_on=['Lot #', 'Lot item #'], right_on=['Lot #', 'Lot item #'])
I just feel there must be a more elegant solution than this? But maybe not?
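To illustrate the index problem described above, here is a minimal sketch on a small made-up frame (the lot names 'A' and 'B' and the shortened data are assumptions for illustration, not the original dataset). It shows that the filtered frame keeps its original row labels, while the groupby sum is labelled by (Lot #, Lot item #) pairs, so the two cannot be aligned by a plain column assignment:

```python
import pandas as pd

# Small hypothetical frame reusing the question's column names
df = pd.DataFrame({
    'Lot #': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Lot item #': [1, 1, 1, 1, 1, 1],
    'Count': [1, 2, 3, 1, 2, 3],
    'Measurement': [5.0, 3.0, 7.0, 4.0, 58.0, 123.0],
})

# Keep only the counts of interest, then sum per group
subset = df[df['Count'].isin([2, 3])]
sums = subset.groupby(['Lot #', 'Lot item #'])['Measurement'].sum()

# The filtered frame is still labelled by the original row numbers ...
print(subset.index.tolist())
# ... while the sums are labelled by (Lot #, Lot item #) tuples
print(sums.index.tolist())
```

Because the labels differ, the result has to be brought back by key (merge or join on the grouping columns) rather than assigned directly as a new column.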
We can use isin to simplify the equality checks by defining a list of integer values. We can then use join after groupby sum and specify the columns to join on. Lastly, rename the Series to the new column name:
# Specify columns to groupby and join back on
grp_cols = ['Lot #', 'Lot item #']
joined = df.join(
    df[df['Count'].isin([2, 3, 4])]  # Values to include
    .groupby(grp_cols)['Measurement'].sum()  # Take sum per group
    .rename('Counts 2,3,4'),  # Specify new column name
    on=grp_cols,
)
joined:
Date Sample Type Lot # Lot item # Count Measurement Counts 2,3,4
0 2021-12-05 G ABS123-G 1 1 5.0 35.1
1 2021-12-05 G ABS123-G 1 2 3.0 35.1
2 2021-12-05 G ABS123-G 1 3 7.0 35.1
3 2021-12-05 G ABS123-G 1 4 25.1 35.1
4 2021-12-05 G ABS123-G 1 5 66.0 35.1
5 2021-12-05 G ABS123-G 1 6 54.0 35.1
6 2021-12-05 G ABS123-G 1 7 12.0 35.1
7 2021-12-05 G ABS123-G 1 8 0.0 35.1
8 2021-12-05 G ABS123-G 1 9 1.0 35.1
9 2021-12-05 G ABS123-G 1 10 5.0 35.1
10 2021-12-05 G ABS123-G 2 1 2.0 905.4
11 2021-12-05 G ABS123-G 2 2 4.0 905.4
12 2021-12-05 G ABS123-G 2 3 889.0 905.4
13 2021-12-05 G ABS123-G 2 4 12.4 905.4
14 2021-12-05 G ABS123-G 2 5 51.4 905.4
15 2021-12-05 G ABS123-G 2 6 12.0 905.4
16 2021-12-05 G ABS123-G 2 7 14.0 905.4
17 2021-12-05 G ABS123-G 2 8 2.0 905.4
18 2021-12-05 G ABS123-G 2 9 1.0 905.4
19 2021-12-05 G ABS123-G 2 10 0.1 905.4
20 2021-12-05 B ABS123-B 1 1 4.0 193.5
21 2021-12-05 B ABS123-B 1 2 58.0 193.5
22 2021-12-05 B ABS123-B 1 3 123.0 193.5
23 2021-12-05 B ABS123-B 1 4 12.5 193.5
24 2021-12-05 B ABS123-B 1 5 11.0 193.5
25 2021-12-05 B ABS123-B 1 6 135.5 193.5
26 2021-12-05 B ABS123-B 1 7 17.0 193.5
27 2021-12-05 B ABS123-B 1 8 1.0 193.5
28 2021-12-05 B ABS123-B 1 9 5.0 193.5
29 2021-12-05 B ABS123-B 1 10 0.3 193.5
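The question also asks for a total across all counts. One possible way to get both columns without a join is groupby with transform, which broadcasts each group's sum back onto every row. This is only a sketch on a shortened copy of the sample data (counts 1 to 5 for two lots), and the column name 'Total' is an assumption, not something from the question:

```python
import pandas as pd

# Shortened copy of the sample data: counts 1-5 for two lots
df = pd.DataFrame({
    'Lot #': ['ABS123-G'] * 5 + ['ABS123-B'] * 5,
    'Lot item #': [1] * 10,
    'Count': [1, 2, 3, 4, 5] * 2,
    'Measurement': [5.0, 3.0, 7.0, 25.1, 66.0,
                    4.0, 58.0, 123.0, 12.5, 11.0],
})
grp_cols = ['Lot #', 'Lot item #']

# Sum of all counts per group, repeated on every row of the group
# ('Total' is a hypothetical column name)
df['Total'] = df.groupby(grp_cols)['Measurement'].transform('sum')

# Blank out measurements outside counts 2-4 (they become NaN),
# then sum what remains per group; NaN values are skipped by sum
masked = df['Measurement'].where(df['Count'].isin([2, 3, 4]))
df['Counts 2,3,4'] = (
    masked.groupby([df[c] for c in grp_cols]).transform('sum')
)

print(df)
```

One difference from the join approach: a group that has no rows with counts 2, 3, or 4 would get 0.0 here, whereas the join would leave NaN for that group.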
Sample DataFrame constructor:
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(
        ['2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05',
         '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05',
         '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05',
         '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05',
         '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05',
         '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05', '2021-12-05']),
    'Sample Type': ['G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G',
                    'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'B', 'B', 'B', 'B',
                    'B', 'B', 'B', 'B', 'B', 'B'],
    'Lot #': ['ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G',
              'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G',
              'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G',
              'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G', 'ABS123-G',
              'ABS123-B', 'ABS123-B', 'ABS123-B', 'ABS123-B', 'ABS123-B',
              'ABS123-B', 'ABS123-B', 'ABS123-B', 'ABS123-B', 'ABS123-B'],
    'Lot item #': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
                   1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'Count': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1,
              2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Measurement': [5.0, 3.0, 7.0, 25.1, 66.0, 54.0, 12.0, 0.0, 1.0, 5.0, 2.0,
                    4.0, 889.0, 12.4, 51.4, 12.0, 14.0, 2.0, 1.0, 0.1, 4.0,
                    58.0, 123.0, 12.5, 11.0, 135.5, 17.0, 1.0, 5.0, 0.3]
})