Merging two data frames with aggregated column values as the result
Data frame 1
{'id': [1, 2, 3], 'dept': [101, 102, 103]}
id dept ....
1 101 ....
2 102 ....
3 103 ....
Data frame 2
{'id': [1, 1, 5], 'region1': ['CUD', 'DAS', 'ITF'], 'region2': ['IOP', 'POL', 'IJK']}
id region1 region2 ...
1 CUD IOP ...
1 DAS POL ...
5 ITF IJK ...
The resulting data frame should look like this:
id dept concatinated
1 101 [{region1: 'CUD', region2: 'IOP'},{region1: 'DAS', region2: 'POL', ...}]
2 102 []
3 103 []
null null [{region1: 'ITF', region2: 'IJK', ...}]
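Note: the columns of data frames 1 and 2 are dynamic, except for id (there can be N columns).
Is there a way to achieve this result with pandas or NumPy? (An optimized solution would be appreciated.)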
My solution seems a bit complicated, and I am not sure whether there is a simpler way to do this.
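import pandas as pd
import numpy as np

df1 = pd.DataFrame({'id': [1, 2, 3, 2, 6], 'dept': [101, 102, 103, 104, 106]})
df2 = pd.DataFrame({'id': [1, 1, 5, 7], 'region1': ['CUD', 'DAS', 'ITF', 'CUD'], 'region2': ['IOP', 'POL', 'IJK', 'IOP']})

# Outer merge on the shared "id" column keeps rows from both frames
df = df1.merge(df2, how="outer")
# Pack the region columns of every row into a dict
df["concatinated"] = df.apply(lambda x: {"region1": x.region1, "region2": x.region2}, axis=1)
# Collect the dicts per (id, dept), skipping rows that had no match in df2
df = df.groupby(["id", "dept"], dropna=False).apply(lambda x: [i for i in x.concatinated if pd.notna(i["region1"])]).reset_index()
# Keep only the first row for each id (rows with a missing id are always kept)
df = df[(~df.id.duplicated()) | (df['id'].isnull())]
# Blank out ids that exist only in df2
df.loc[~df.id.isin(df1.id), "id"] = np.nan
df = df.rename(columns={0: "concatinated"})
df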
id dept concatinated
0 1.0 101.0 [{'region1': 'CUD', 'region2': 'IOP'}, {'regio...
1 2.0 102.0 []
3 3.0 103.0 []
4 NaN NaN [{'region1': 'ITF', 'region2': 'IJK'}]
5 6.0 106.0 []
6 NaN NaN [{'region1': 'CUD', 'region2': 'IOP'}]
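The step that blanks out ids found only in df2 can also be driven by the merge itself. The following is a minimal sketch, not part of the answer above, assuming the two small frames from the question; it uses the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column telling where each row came from. The name `only_in_df2` is made up for illustration.

```python
import pandas as pd

# Sample frames from the question
df1 = pd.DataFrame({'id': [1, 2, 3], 'dept': [101, 102, 103]})
df2 = pd.DataFrame({
    'id': [1, 1, 5],
    'region1': ['CUD', 'DAS', 'ITF'],
    'region2': ['IOP', 'POL', 'IJK'],
})

# indicator=True adds a "_merge" column with the values
# "left_only", "right_only" or "both" for every merged row
merged = df1.merge(df2, on='id', how='outer', indicator=True)

# Rows whose id was found only in df2 (hypothetical helper name)
only_in_df2 = merged['_merge'] == 'right_only'

# where() keeps the id where the condition holds and puts NaN elsewhere;
# the column is upcast to float in the process
merged['id'] = merged['id'].where(~only_in_df2)
```

Replacing the whole column with the result of `where` avoids writing NaN into an integer column through `.loc`, which recent pandas versions warn about.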
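# Pack every column of each df2 row (including id) into a dict
df2['region_comb'] = df2.apply(lambda row: {col: row[col] for col in df2.columns}, axis=1, result_type='reduce')
# Collect the dicts into one list per id
df2 = df2.groupby('id')['region_comb'].apply(list).reset_index(name='merged')
# Outer merge so that ids present in only one frame are kept
result_df = pd.merge(df2, df1, on='id', how='outer')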
The solution works!!!
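For reference, here is a self-contained sketch of the same aggregate-then-merge idea applied to the sample frames from the question. It is only one possible arrangement, under a few assumptions: the shared key column is called `id`, every other column of df2 should end up in the dicts, and the names `key`, `value_cols`, `rows` and `lists` are made up for illustration. Row dicts are built with `DataFrame.to_dict('records')`, so no column names are hard-coded.

```python
import pandas as pd

# Sample frames from the question
df1 = pd.DataFrame({'id': [1, 2, 3], 'dept': [101, 102, 103]})
df2 = pd.DataFrame({
    'id': [1, 1, 5],
    'region1': ['CUD', 'DAS', 'ITF'],
    'region2': ['IOP', 'POL', 'IJK'],
})

key = 'id'  # assumed name of the shared key column
value_cols = [c for c in df2.columns if c != key]  # all other df2 columns

# One dict per df2 row, built from whatever non-key columns exist
rows = df2[[key]].assign(concatinated=df2[value_cols].to_dict('records'))

# One list of dicts per key value
lists = rows.groupby(key)['concatinated'].apply(list).reset_index()

# Outer merge keeps keys that appear in only one frame
result = df1.merge(lists, on=key, how='outer')

# Keys missing from df2 get an empty list instead of NaN
result['concatinated'] = result['concatinated'].apply(
    lambda v: v if isinstance(v, list) else []
)

# Blank the key for rows that exist only in df2
result[key] = result[key].where(result[key].isin(df1[key]))
```

Aggregating df2 before the merge means the merge sees at most one row per key from df2, so rows of df1 are never multiplied and no de-duplication step is needed afterwards.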