Pandas DataFrame merging with aggregation
Let's create two DataFrames, df1 and df2:
import pandas as pd
df1 = pd.DataFrame([["A", "A", "B", "B", "C"], ["a1", "a2", "b1", "b2", "c1"], [10, 10, 20, 20, 30], [1, 5, 6, 3, 4]]).T
df2 = pd.DataFrame([["B", "B", "C", "F"], ["b1", "b3", "c1", "f2"], [30, 30, 40, 40], [8, 3, 5, 2]]).T
df1.columns = df2.columns = ["label1", "label2", "total", "count"]
Note that within each DataFrame, "total" must be identical for all rows that share a "label1".
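That precondition can be checked before merging. The following is only a minimal sketch: has_constant_total is a hypothetical helper name, and the frame is rebuilt from a dict (an assumption, so that the columns are numeric rather than object dtype).

```python
import pandas as pd


def has_constant_total(df: pd.DataFrame) -> bool:
    """Return True if every label1 group holds exactly one distinct total."""
    return bool(df.groupby("label1")["total"].nunique().eq(1).all())


df1 = pd.DataFrame({
    "label1": ["A", "A", "B", "B", "C"],
    "label2": ["a1", "a2", "b1", "b2", "c1"],
    "total": [10, 10, 20, 20, 30],
    "count": [1, 5, 6, 3, 4],
})

ok = has_constant_total(df1)
# A deliberately broken copy: the two "A" rows now disagree on total.
bad = has_constant_total(df1.assign(total=[10, 11, 20, 20, 30]))
```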
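I need to merge these two DataFrames according to these rules:
- The "count" values of all rows with the same "label2" are simply added. For example, b1=6 in df1 and b1=8 in df2, so b1=14 after merging.
- The "total" values for the same "label1" are added (and must stay identical across each "label1"). For example, every B has 20 in df1 and 30 in df2, so every B has 50 after merging.
df3 should hold the merged result.
I think I should probably use pandas.DataFrame.merge once or twice, but I don't even know how to aggregate the data. Any pointers are appreciated.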
Try this:
import pandas as pd
df1 = pd.DataFrame([["A", "A", "B", "B", "C"], ["a1", "a2", "b1", "b2", "c1"], [10, 10, 20, 20, 30], [1, 5, 6, 3, 4]]).T
df2 = pd.DataFrame([["B", "B", "C", "F"], ["b1", "b3", "c1", "f2"], [30, 30, 40, 40], [8, 3, 5, 2]]).T
df1.columns = df2.columns = ["label1", "label2", "total", "count"]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job here.
df_both = pd.concat([df1.assign(df=1), df2.assign(df=2)])
df_lable1 = df_both.drop_duplicates(['label1', 'df']).groupby('label1')['total'].sum().reset_index()
df_lable2 = df_both.groupby('label2')['count'].sum().reset_index()
df3 = df_both[['label1', 'label2']].drop_duplicates(["label1", "label2"]).sort_values(["label1", "label2"])
df3 = df3.merge(df_lable1, on=['label1'], how='left')
df3 = df3.merge(df_lable2, on=['label2'], how='left')
# label1 label2 total count
# 0 A a1 10 1
# 1 A a2 10 5
# 2 B b1 50 14
# 3 B b2 50 3
# 4 B b3 50 3
# 5 C c1 70 9
# 6 F f2 40 2
Merge first, then aggregate:
cols = ['label1', 'label2']
prefix = lambda x: x.split('_')[0]
# groupby(axis=1) is deprecated since pandas 2.1, so transpose, group the rows, and transpose back.
out = df1.merge(df2, on=cols, how='outer').set_index(cols) \
    .T.groupby(by=prefix, sort=False).sum().T.reset_index()
print(out)
# Output
label1 label2 total count
0 A a1 10 1
1 A a2 10 5
2 B b1 50 14
3 B b2 20 3
4 C c1 70 9
5 B b3 30 3
6 F f2 40 2
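In the output above, b2 and b3 keep totals of 20 and 30, because the sum runs per (label1, label2) pair rather than per label1. One way to get 50 for every B is to add the per-frame totals separately and map them back. This is only a sketch under assumptions: the frames are rebuilt from dicts so the columns are numeric, and the intermediate name totals is introduced for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "label1": ["A", "A", "B", "B", "C"],
    "label2": ["a1", "a2", "b1", "b2", "c1"],
    "total": [10, 10, 20, 20, 30],
    "count": [1, 5, 6, 3, 4],
})
df2 = pd.DataFrame({
    "label1": ["B", "B", "C", "F"],
    "label2": ["b1", "b3", "c1", "f2"],
    "total": [30, 30, 40, 40],
    "count": [8, 3, 5, 2],
})

# One total per label1 and frame (constant within a frame, so "first" suffices),
# then add the two frames; a label1 missing from one frame contributes 0.
totals = (
    df1.groupby("label1")["total"].first()
    .add(df2.groupby("label1")["total"].first(), fill_value=0)
    .astype(int)
)

# Counts are summed per (label1, label2) pair across both frames.
df3 = (
    pd.concat([df1, df2])
    .groupby(["label1", "label2"], as_index=False)["count"].sum()
)

# Broadcast the combined total to every row of the same label1.
df3["total"] = df3["label1"].map(totals)
df3 = df3[["label1", "label2", "total", "count"]]
```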
Try this:
df3 = (
    pd.concat([df1, df2])
    .pipe(lambda x: (x.assign(total=x['label1'].map(x.groupby('label1')['total'].unique().apply(sum)))))
    .groupby(['label1', 'label2'])
    .agg(total=('total', 'first'), count=('count', 'sum'))
    .reset_index()
)
Output:
>>> df3
label1 label2 total count
0 A a1 10 1
1 A a2 10 5
2 B b1 50 14
3 B b2 50 3
4 B b3 50 3
5 C c1 70 9
6 F f2 40 2
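One caveat with summing unique() totals: if both frames happen to carry the same total for a label1, the duplicate collapses and is counted once. The sketch below uses two hypothetical one-row frames, a and b, and an illustrative src column to show the difference; tagging each row with its source frame is one possible way around it.

```python
import pandas as pd

# Hypothetical frames where label1 "B" has the same total (20) in both.
a = pd.DataFrame({"label1": ["B"], "label2": ["b1"], "total": [20], "count": [1]})
b = pd.DataFrame({"label1": ["B"], "label2": ["b2"], "total": [20], "count": [2]})

# Tag each row with the frame it came from.
both = pd.concat([a.assign(src="a"), b.assign(src="b")])

# unique() collapses the two equal totals into a single 20.
via_unique = both.groupby("label1")["total"].unique().apply(sum)

# Keeping one row per (frame, label1) preserves both contributions: 20 + 20.
per_frame = (
    both.drop_duplicates(["src", "label1"])
    .groupby("label1")["total"].sum()
)
```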