Different aggregated sums of the same column based on categorical values in other columns
I have a dataframe that records the different LEGO pieces contained in each of my LEGO set boxes.
Every set box always contains many different regular parts, but sometimes the box also includes a few extra spare parts.
So the dataframe has a boolean column to distinguish that condition.
Now I want to summarize the dataset so that I get one row per LEGO set (groupby set_id) and a new column with the total number of pieces in that set box (the sum of "quantity").
My problem is that I also need two extra columns that count, based on the True/False column, how many of those parts are "regular" and how many are "spare".
Is there any way to compute these three sum columns by creating just one extra dataframe and a single .agg() call?
My current approach instead creates 3 dataframes and merges their columns:
import pandas as pd
import random
random.seed(1)
# creating sample data:
nrows=15
df = pd.DataFrame([], columns=["set_id","part_id","quantity","is_spare"])
df["set_id"]=["ABC"[random.randint(0,2)] for r in range(0,nrows)]
df["part_id"] = [random.randint(1000,8000) for n in range(0,nrows)]
df["quantity"] = [random.randint(1,10) for n in range(0,nrows)]
df["is_spare"]=[random.random()>0.75 for r in range(0,nrows)]
print(df)
# grouping into a new dfsummary dataframe: HOW TO DO IT IN JUST ONE STEP ?
# aggregate sum of ALL pieces:
dfsummary = df.groupby("set_id", as_index=False) \
.agg(num_pieces=("quantity","sum"))
# aggregate sum of "normal" pieces:
dfsummary2 = df.loc[df["is_spare"]==False].groupby("set_id", as_index=False) \
.agg(normal_pieces=("quantity","sum"))
# aggregate sum of "spare" pieces:
dfsummary3 = df.loc[df["is_spare"]==True].groupby("set_id", as_index=False) \
.agg(spare_pieces=("quantity","sum"))
# Putting all aggregate columns together:
dfsummary = dfsummary \
.merge(dfsummary2,on="set_id",how="left") \
.merge(dfsummary3,on="set_id",how="left")
print(dfsummary)
Original data:
set_id part_id quantity is_spare
0 A 4545 1 False
1 C 5976 1 False
2 A 7244 9 False
3 B 7284 1 False
4 A 1017 7 False
5 B 6700 4 True
6 B 4648 7 False
7 B 3181 1 False
8 C 6910 9 False
9 B 7568 4 True
10 A 2874 8 True
11 A 5842 8 False
12 B 1837 9 False
13 A 3600 4 False
14 B 1250 6 False
Summary data:
set_id num_pieces normal_pieces spare_pieces
0 A 37 29 8.0
1 B 32 24 8.0
2 C 10 10 NaN
I saw this, but my case is somewhat different, because the sum() function must only be applied to certain rows of the target column, depending on the True/False value in the other column.
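For reference, one way to get all three sums from a single groupby/.agg() call is to pre-mask the quantity column before grouping. This is a sketch on toy data (the tiny frame below is invented for illustration, not the sample data above):

```python
import pandas as pd

df = pd.DataFrame({
    "set_id":   ["A", "A", "B", "B", "C"],
    "quantity": [5, 3, 2, 4, 7],
    "is_spare": [False, True, False, True, False],
})

# Zero out the quantity wherever the condition does not hold, then a single
# groupby/agg call produces all three sums at once.
out = (df.assign(normal=df["quantity"].where(~df["is_spare"], 0),
                 spare=df["quantity"].where(df["is_spare"], 0))
         .groupby("set_id", as_index=False)
         .agg(num_pieces=("quantity", "sum"),
              normal_pieces=("normal", "sum"),
              spare_pieces=("spare", "sum")))
```

Sets with no spares get a 0 here rather than the NaN produced by the left merges above, because the masked column exists for every row.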
Edit:
I extended the original dataset with one more column (color) to check whether @sammywemmy's answer scales to grouping and splitting over several columns at once:
df["color"]=[["black","grey","white","red"][random.randint(0,3)] \
for r in range(0,nrows)]
set_id part_id quantity is_spare color
0 A 4545 1 False red
1 C 5976 1 False grey
2 A 7244 9 False white
3 B 7284 1 False white
4 A 1017 7 False red
5 B 6700 4 True red
6 B 4648 7 False black
7 B 3181 1 False red
8 C 6910 9 False grey
9 B 7568 4 True red
10 A 2874 8 True red
11 A 5842 8 False grey
12 B 1837 9 False white
13 A 3600 4 False white
14 B 1250 6 False black
Summary data:
set_id num_pieces normal_pieces spare_pieces black grey red white
0 A 37 29 8.0 NaN 8.0 16.0 13.0
1 B 32 24 8.0 13.0 NaN 9.0 10.0
2 C 10 10 NaN NaN 10.0 NaN NaN
It can be done in one line. The trick is to build a temporary column where spare quantities are negative and normal quantities are positive:
out = df.assign(qty=df['is_spare'].replace({True: -1, False: 1}) * df['quantity']) \
        .groupby('set_id')['qty'] \
        .agg(num_pieces=lambda x: sum(abs(x)),
             normal_pieces=lambda x: sum(x[x > 0]),
             spare_pieces=lambda x: abs(sum(x[x < 0]))) \
        .reset_index()
Output:
>>> out
  set_id  num_pieces  normal_pieces  spare_pieces
0      A          37             29             8
1      B          32             24             8
2      C          10             10             0
>>> df['is_spare'].replace({True: -1, False: 1}) * df['quantity']
0 1 # normal_pieces
1 1
2 9
3 1
4 7
5 -4 # spare_pieces
6 7
7 1
8 9
9 -4
10 -8
11 8
12 9
13 4
14 6
dtype: int64
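The three aggregation lambdas can be checked in isolation on a small signed series (the values below are toy numbers, not tied to the sample data):

```python
import pandas as pd

# Signed quantities for one set: positives are normal parts, negatives are spares.
s = pd.Series([1, -4, 7, -4])

num_pieces    = s.abs().sum()    # every part counts, sign stripped
normal_pieces = s[s > 0].sum()   # positives only
spare_pieces  = -s[s < 0].sum()  # negatives only, sign flipped back
```

For a set with no spares, `s[s < 0]` is empty and its sum is 0, which is why the C row shows 0 instead of NaN.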
One option is to group and then unstack:
(df
.groupby(['set_id', 'is_spare'])
.quantity
.sum()
.unstack('is_spare')
.rename(columns={False:'normal_pieces', True:'spare_pieces'})
.assign(num_pieces = lambda df: df.sum(axis = 'columns'))
.rename_axis(columns=None)
.reset_index()
)
set_id normal_pieces spare_pieces num_pieces
0 A 29.0 8.0 37.0
1 B 24.0 8.0 32.0
2 C 10.0 NaN 10.0
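If the NaN for set C is unwanted, `Series.unstack` accepts a `fill_value` argument. A minimal sketch, assuming a toy frame with the same column names:

```python
import pandas as pd

df = pd.DataFrame({
    "set_id":   ["A", "A", "C"],
    "quantity": [5, 3, 7],
    "is_spare": [False, True, False],
})

out = (df.groupby(["set_id", "is_spare"])["quantity"]
         .sum()
         .unstack("is_spare", fill_value=0)  # missing combinations become 0, not NaN
         .rename(columns={False: "normal_pieces", True: "spare_pieces"}))
```

With `fill_value=0` the result also keeps an integer dtype instead of being upcast to float by the NaN.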
For the updated dataset you could still use groupby and unstack - I'll jump straight to pivot_table, which is a wrapper around groupby and pivot:
temp = df.pivot_table(index='set_id',
columns=['is_spare', 'color'],
values='quantity',
aggfunc='sum')
# get the sum of `red`, `blue`, ...
colors = temp.groupby(level='color', axis=1).sum()
#pandas MultiIndex works nicely here
# where we can select the top columns and sum
# in this case, `False`, and `True`
(temp.assign(num_pieces = temp.sum(1),
normal_pieces = temp[False].sum(1),
spare_pieces = temp[True].sum(1),
# assign is basically an expansion of a dictionary
# and here we take advantage of that
**colors)
.drop(columns=[False, True])
.reset_index()
.rename_axis(columns=[None, None], index=None)
)
set_id num_pieces normal_pieces spare_pieces black grey red white
0 A 37.0 29.0 8.0 0.0 8.0 16.0 13.0
1 B 32.0 24.0 8.0 13.0 0.0 9.0 10.0
2 C 10.0 10.0 0.0 0.0 10.0 0.0 0.0
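If the NaNs in the intermediate `temp` are a concern, `pivot_table` also takes a `fill_value` argument. A hedged sketch on a minimal invented frame with the same column names:

```python
import pandas as pd

df = pd.DataFrame({
    "set_id":   ["A", "A", "B"],
    "is_spare": [False, True, False],
    "color":    ["red", "red", "black"],
    "quantity": [5, 3, 2],
})

temp = df.pivot_table(index="set_id",
                      columns=["is_spare", "color"],
                      values="quantity",
                      aggfunc="sum",
                      fill_value=0)  # absent (is_spare, color) combinations become 0
```

The MultiIndex columns are then addressed with tuples, e.g. `temp[(False, "red")]`.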
Another option, possibly faster (only one groupby call), is to use get_dummies before grouping:
temp = df.set_index('set_id').loc[:, ['is_spare', 'color', 'quantity']]
# get_dummies returns 0 and 1, depending on if the value exists
# so if `blue` exists for a row, 1 is assigned, else 0
(pd.get_dummies(temp.drop(columns='quantity'),
columns = ['is_spare', 'color'],
prefix='',
prefix_sep='')
# here we do a conditional replacement
# similar to python's if-else statement
# replacing the 1s with quantity
.where(lambda df: df == 0, temp.quantity, axis = 0)
# from here on it is grouping
# with some renaming
.groupby('set_id')
.sum()
.assign(num_pieces = lambda df: df[['False', 'True']].sum(1))
.rename(columns={'False':'normal_pieces', 'True':'spare_pieces'})
)
normal_pieces spare_pieces black grey red white num_pieces
set_id
A 29 8 0 8 16 13 37
B 24 8 13 0 9 10 32
C 10 0 0 10 0 0 10
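What get_dummies does to the color column can be seen on a three-row toy frame. Note that on newer pandas versions get_dummies defaults to boolean indicators, so `dtype=int` is passed here to force 0/1 integers:

```python
import pandas as pd

colors = pd.DataFrame({"color": ["red", "black", "red"]})

# One indicator column per category; 1 where the row has that value, else 0.
# Empty prefix/prefix_sep keeps the bare category names as column names.
d = pd.get_dummies(colors, columns=["color"], prefix="", prefix_sep="", dtype=int)
```

The `.where(...)` step in the answer above then swaps each 1 for that row's quantity, so the subsequent groupby sum yields per-category quantity totals.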