Use a dataframe as lookup for another dataframe
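I have two dataframes, df_1 and df_2. df_1 is my main dataframe and df_2 is a lookup dataframe.
I want to test whether the value in df_1['col_c1'] contains any of the values from df_2['col_a2'].
If it does (there can be multiple matches!):
- add the value from df_2['col_b2'] to df_1['col_d1']
- add the value from df_2['col_c2'] to df_1['col_e1']
How can I achieve this?
I really have no idea where to start, so I can't share any code for this.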
Sample df_1
col_a1 | col_b1 | col_c1 | col_d1 | col_e1
----------------------------------------------------
1_001 | aaaaaa | bbbbccccdddd | |
1_002 | zzzzz | ggggjjjjjkkkkk | |
1_003 | pppp | qqqqffffgggg | |
1_004 | sss | wwwcccyyy | |
1_005 | eeeeee | eecccffffll | |
1_006 | tttt | hhggeeuuuuu | |
Sample df_2
col_a2 | col_b2 | col_c2
------------------------------
ccc | 2_001 | some_data_c
jjj | 2_002 | some_data_j
fff | 2_003 | some_data_f
Desired output df_1
col_a1 | col_b1 | col_c1 | col_d1 | col_e1
------------------------------------------------------------------------------
1_001 | aaaaaa | bbbbccccdddd | 2_001 | some_data_c
1_002 | zzzzz | ggggjjjjjkkkkk | 2_002 | some_data_j
1_003 | pppp | qqqqffffgggg | 2_003 | some_data_f
1_004 | sss | wwwcccyyy | 2_001 | some_data_c
1_005 | eeeeee | eecccffffll | 2_001;2_003 | some_data_c; some_data_f
1_006 | tttt | hhggeeuuuuu | |
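df_1 has approx. 45,000 rows and df_2 approx. 16,000 rows. (I also added a non-matching row.)
I have been struggling with this for hours, but I really have no idea.
I don't think merging is an option, because there is no exact match.
Your help is greatly appreciated.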
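For anyone who wants to reproduce the problem, the sample tables above could be built roughly like this. It is only a sketch: filling col_d1 and col_e1 with empty strings is an assumption, since the question does not say what the blank cells actually hold.

```python
import pandas as pd

# Main dataframe; col_d1 / col_e1 start out blank (assumed to be empty strings).
df_1 = pd.DataFrame({
    "col_a1": ["1_001", "1_002", "1_003", "1_004", "1_005", "1_006"],
    "col_b1": ["aaaaaa", "zzzzz", "pppp", "sss", "eeeeee", "tttt"],
    "col_c1": ["bbbbccccdddd", "ggggjjjjjkkkkk", "qqqqffffgggg",
               "wwwcccyyy", "eecccffffll", "hhggeeuuuuu"],
    "col_d1": "",
    "col_e1": "",
})

# Lookup dataframe: col_a2 is the substring to search for.
df_2 = pd.DataFrame({
    "col_a2": ["ccc", "jjj", "fff"],
    "col_b2": ["2_001", "2_002", "2_003"],
    "col_c2": ["some_data_c", "some_data_j", "some_data_f"],
})

print(df_1)
print(df_2)
```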
This will solve it:
# for each row of df_1, collect the lookup values whose key occurs in col_c1
df_1['col_d1'] = df_1.apply(lambda x: ','.join([df_2['col_b2'][i] for i in range(len(df_2)) if df_2['col_a2'][i] in x.col_c1]), axis=1)
df_1['col_e1'] = df_1.apply(lambda x: ','.join([df_2['col_c2'][i] for i in range(len(df_2)) if df_2['col_a2'][i] in x.col_c1]), axis=1)
Output
col_a1 col_b1 col_c1 col_d1 \
0 1_001 aaaaaa bbbbccccdddd 2_001
1 1_002 zzzzz ggggjjjjjkkkkk 2_002
2 1_003 pppp qqqqffffgggg 2_003
3 1_004 sss wwwcccyyy 2_001
4 1_005 eeeeee eecccffffll 2_001,2_003
col_e1
0 some_data_c
1 some_data_j
2 some_data_f
3 some_data_c
4 some_data_c,some_data_f
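The two apply calls above scan df_2 once per output column for every row of df_1. A minimal sketch of the same substring test that scans only once per row and fills both columns from the result is shown here. The helper name `find_matches` is hypothetical, the sample frames are rebuilt from the question, and the ";" separator follows the desired output rather than the "," used above.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "col_a1": ["1_001", "1_002", "1_003", "1_004", "1_005", "1_006"],
    "col_b1": ["aaaaaa", "zzzzz", "pppp", "sss", "eeeeee", "tttt"],
    "col_c1": ["bbbbccccdddd", "ggggjjjjjkkkkk", "qqqqffffgggg",
               "wwwcccyyy", "eecccffffll", "hhggeeuuuuu"],
})
df_2 = pd.DataFrame({
    "col_a2": ["ccc", "jjj", "fff"],
    "col_b2": ["2_001", "2_002", "2_003"],
    "col_c2": ["some_data_c", "some_data_j", "some_data_f"],
})

# Pull the lookup columns into plain Python lists once, outside the row loop.
keys = df_2["col_a2"].tolist()
vals_b = df_2["col_b2"].tolist()
vals_c = df_2["col_c2"].tolist()

def find_matches(text):
    """Return the positions of every lookup key that occurs in text (hypothetical helper)."""
    return [i for i, key in enumerate(keys) if key in text]

# One pass over df_1: each cell becomes a list of matching lookup positions.
hits = df_1["col_c1"].map(find_matches)

# Build both output columns from the same match positions.
df_1["col_d1"] = hits.map(lambda idx: ";".join(vals_b[i] for i in idx))
df_1["col_e1"] = hits.map(lambda idx: ";".join(vals_c[i] for i in idx))
print(df_1)
```

This is still a nested loop (every row of df_1 against every key of df_2), so with the sizes mentioned in the question it may take a while. Rows without any match simply get an empty string.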
Use:
import re

# extract every df_2["col_a2"] value found in col_c1 into a new Series
# (one row per match); re.escape keeps regex metacharacters literal
pat = "|".join(map(re.escape, df_2["col_a2"]))
s = (df_1['col_c1'].str.extractall(f'({pat})')[0].rename('new')
.reset_index(level=1, drop=True))
# repeat rows that have more than one match
df_1 = df_1.join(s)
# add the new columns by mapping the matched key to the lookup values
df_1['col_d1'] = df_1['new'].map(df_2.set_index('col_a2')['col_b2'])
df_1['col_e1'] = df_1['new'].map(df_2.set_index('col_a2')['col_c2'])
# aggregate back to one row per original row, joining multiple matches
cols = df_1.columns.difference(['new','col_d1','col_e1']).tolist()
df = df_1.drop('new', axis=1).groupby(cols).agg(','.join).reset_index()
print (df)
col_a1 col_b1 col_c1 col_d1 col_e1
0 1_001 aaaaaa bbbbccccdddd 2_001 some_data_c
1 1_002 zzzzz ggggjjjjjkkkkk 2_002 some_data_j
2 1_003 pppp qqqqffffgggg 2_003 some_data_f
3 1_004 sss wwwcccyyy 2_001 some_data_c
4 1_005 eeeeee eecccffffll 2_001,2_003 some_data_c,some_data_f
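The desired output in the question also keeps the non-matching row 1_006, which does not appear in the output above. A possible variant of the same extractall idea that keeps unmatched rows is sketched here. It assumes the values in col_a2 are unique (Series.map needs a unique index on the lookup side) and uses ";" as the separator; the names `found`, `matched`, `joined` and `result` are illustrative only.

```python
import re
import pandas as pd

df_1 = pd.DataFrame({
    "col_a1": ["1_001", "1_002", "1_003", "1_004", "1_005", "1_006"],
    "col_b1": ["aaaaaa", "zzzzz", "pppp", "sss", "eeeeee", "tttt"],
    "col_c1": ["bbbbccccdddd", "ggggjjjjjkkkkk", "qqqqffffgggg",
               "wwwcccyyy", "eecccffffll", "hhggeeuuuuu"],
    "col_d1": "",
    "col_e1": "",
})
df_2 = pd.DataFrame({
    "col_a2": ["ccc", "jjj", "fff"],
    "col_b2": ["2_001", "2_002", "2_003"],
    "col_c2": ["some_data_c", "some_data_j", "some_data_f"],
})

# Escape each key so regex metacharacters in the lookup values match literally.
pattern = "(" + "|".join(re.escape(k) for k in df_2["col_a2"]) + ")"

# One row per match, indexed by the df_1 row label it came from.
found = (df_1["col_c1"].str.extractall(pattern)[0]
         .rename("key")
         .reset_index(level=1, drop=True))

# Translate each matched key into the two lookup values.
lookup = df_2.set_index("col_a2")
matched = found.to_frame()
matched["col_d1"] = matched["key"].map(lookup["col_b2"])
matched["col_e1"] = matched["key"].map(lookup["col_c2"])

# Collapse multiple matches per df_1 row into one ";"-separated string.
joined = matched.groupby(level=0)[["col_d1", "col_e1"]].agg(";".join)

# Rows without a match are missing from `joined`; reindex fills them with "".
result = (df_1.drop(columns=["col_d1", "col_e1"], errors="ignore")
          .join(joined.reindex(df_1.index, fill_value="")))
print(result)
```

One difference from a plain substring test is worth keeping in mind: a regex search returns non-overlapping matches, so if two lookup keys overlap inside the same string, only the first one found is reported.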