Check if a substring is in a string in a different DataFrame, and if so return a value from the matching row
I want to check whether a substring from DF1 appears in DF2. If it does, I want to return a value from the corresponding row.
DF1
Name | ID | Region |
---|---|---|
John | AAA | A |
John | AAA | B |
Pat | CCC | C |
Sandra | CCC | D |
Paul | DD | E |
Sandra | R9D | F |
Mia | dfg4 | G |
Kim | asfdh5 | H |
Louise | 45gh | I |
DF2
Name | ID | Company |
---|---|---|
John | AAAxx1 | Microsoft |
John | AAAxxREG1 | Microsoft |
Michael | BBBER4 | Microsoft |
Pat | CCCERG | Dell |
Pat | CCCERGG | Dell |
Paul | DFHDHF | Facebook |
Expected output
Where an ID from DF1 is found within the ID column of DF2, I want to create a new column in DF1 holding the matching Company.
Name | ID | Region | Company |
---|---|---|---|
John | AAA | A | Microsoft |
John | AAA | B | Microsoft |
Pat | CCC | C | Dell |
Sandra | CCC | D | |
Paul | DD | E | |
Sandra | R9D | F | |
Mia | dfg4 | G | |
Kim | asfdh5 | H | |
Louise | 45gh | I | |
I have the code below to determine whether an ID from DF1 is in DF2, but I am not sure how to bring in the company name.
DF1['Get company'] = np.in1d(DF1['ID'], DF2['ID'])
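One reason the line above cannot flag partial matches is that `np.in1d` (like its newer equivalent `np.isin`) tests exact equality, not substring containment. A minimal sketch of the difference, using three IDs from each table in the question; the variable names `exact` and `partial` are only illustrative:

```python
import numpy as np
import pandas as pd

# A few IDs taken from the question's DF1 and DF2.
df1 = pd.DataFrame({"ID": ["AAA", "CCC", "DD"]})
df2 = pd.DataFrame({"ID": ["AAAxx1", "CCCERG", "DFHDHF"]})

# Exact membership: no DF1 ID is equal to a whole DF2 ID.
exact = np.isin(df1["ID"], df2["ID"])

# Substring membership: is each DF1 ID contained in any DF2 ID?
# regex=False treats the ID as literal text.
partial = df1["ID"].apply(
    lambda s: df2["ID"].str.contains(s, regex=False).any()
)

print(exact)
print(partial.tolist())
```

This only says whether a match exists. Bringing in the Company still needs a join key, which is what the answer below builds.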
Try to find the ID strings from df1 inside df2, then merge on that column:
key = df2['ID'].str.extract(fr"({'|'.join(df1['ID'].values)})", expand=False)
df1 = df1.merge(df2['Company'], left_on='ID', right_on=key, how='left').fillna('')
print(df1)
# Output:
Name ID Company
0 John AAA
1 Peter BAB Microsoft
2 Paul CCHF Google
3 Rosie R9D
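To see how the single-key merge behaves on the tables from the question, here is a hedged sketch. It is not the exact code above: it stores the extracted key in a helper column (named `key` purely for illustration) instead of passing it as an array, and it escapes the IDs with `re.escape`, which is an added precaution.

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["John", "John", "Pat", "Sandra", "Paul", "Sandra", "Mia", "Kim", "Louise"],
    "ID": ["AAA", "AAA", "CCC", "CCC", "DD", "R9D", "dfg4", "asfdh5", "45gh"],
    "Region": list("ABCDEFGHI"),
})
df2 = pd.DataFrame({
    "Name": ["John", "John", "Michael", "Pat", "Pat", "Paul"],
    "ID": ["AAAxx1", "AAAxxREG1", "BBBER4", "CCCERG", "CCCERGG", "DFHDHF"],
    "Company": ["Microsoft", "Microsoft", "Microsoft", "Dell", "Dell", "Facebook"],
})

# Alternation of the distinct DF1 IDs, escaped so they are matched literally.
pattern = "(" + "|".join(re.escape(s) for s in df1["ID"].unique()) + ")"

# Helper column: the DF1 ID found inside each DF2 ID (NaN if none).
right = df2.assign(key=df2["ID"].str.extract(pattern, expand=False))[["key", "Company"]]

out = (
    df1.merge(right, left_on="ID", right_on="key", how="left")
       .drop(columns="key")
       .fillna("")
)
print(out)
```

On this data the result has more rows than DF1, because each matching ID appears twice in DF2, and Sandra's `CCC` row picks up Dell even though that ID belongs to Pat in DF2. Matching on ID alone ignores Name.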
Details: build a regex from df1['ID'] to extract the partial string from df2['ID']:
# Regex pattern: try to extract the following pattern
>>> fr"({'|'.join(df1['ID'].values)})"
'(AAA|BAB|CCHF|R9D)'
# After extraction
>>> pd.concat([df2['ID'], key], axis=1)
ID ID
0 AEDSV NaN # Nothing was found
1 123BAB BAB # Found partial string BAB
2 CCHF-RB CCHF # Found partial string CCHF
3 YYYY NaN # Nothing was found
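The pattern above is built by joining the raw IDs, so an ID containing regex metacharacters (for example `.` or `+`) would not be matched literally. A slightly more defensive variant, offered as a sketch rather than as part of the original answer, escapes each ID, removes duplicates, and puts longer IDs first so that a short ID does not shadow a longer one that starts the same way:

```python
import re
import pandas as pd

# Sample IDs from the walkthrough above.
df1_ids = pd.Series(["AAA", "BAB", "CCHF", "R9D"], name="ID")
df2_ids = pd.Series(["AEDSV", "123BAB", "CCHF-RB", "YYYY"], name="ID")

# Distinct IDs, longest first (sorted is stable, so ties keep their order).
unique_ids = sorted(df1_ids.dropna().unique(), key=len, reverse=True)

# re.escape makes every ID match as literal text.
pattern = "(" + "|".join(re.escape(s) for s in unique_ids) + ")"

key = df2_ids.str.extract(pattern, expand=False)
print(pattern)
print(pd.concat([df2_ids, key.rename("key")], axis=1))
```

The ordering matters because regex alternation takes the first alternative that matches at a given position, not the longest one.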
Update:
To solve this, I wonder whether it is possible to merge on two columns, e.g. merge on Name and ID?
key = df2['ID'].str.extract(fr"({'|'.join(df1['ID'].values)})", expand=False)
df1 = pd.merge(df1, df2[['Name', 'Company']], left_on=['Name', 'ID'],
right_on=['Name', key], how='left').drop_duplicates().fillna('')
print(df1)
# Output:
Name ID Region Company
0 John AAA A Microsoft
2 John AAA B Microsoft
4 Pat CCC C Dell
6 Sandra CCC D
7 Paul DD E
8 Sandra R9D F
9 Mia dfg4 G
10 Kim asfdh5 H
11 Louise 45gh I
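An alternative sketch of the two-column merge, under the assumption that one Company per (Name, ID) pair is wanted: deduplicate the lookup table before merging rather than calling `drop_duplicates()` on the result. This keeps DF1's row count and index intact and would not discard rows that are genuinely repeated in DF1. The helper column name `key` is illustrative, and `re.escape` is an added precaution.

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["John", "John", "Pat", "Sandra", "Paul", "Sandra", "Mia", "Kim", "Louise"],
    "ID": ["AAA", "AAA", "CCC", "CCC", "DD", "R9D", "dfg4", "asfdh5", "45gh"],
    "Region": list("ABCDEFGHI"),
})
df2 = pd.DataFrame({
    "Name": ["John", "John", "Michael", "Pat", "Pat", "Paul"],
    "ID": ["AAAxx1", "AAAxxREG1", "BBBER4", "CCCERG", "CCCERGG", "DFHDHF"],
    "Company": ["Microsoft", "Microsoft", "Microsoft", "Dell", "Dell", "Facebook"],
})

pattern = "(" + "|".join(re.escape(s) for s in df1["ID"].unique()) + ")"

# Lookup table: one row per (Name, extracted ID, Company) combination.
lookup = (
    df2.assign(key=df2["ID"].str.extract(pattern, expand=False))
       [["Name", "key", "Company"]]
       .dropna(subset=["key"])   # DF2 rows with no DF1 ID inside them
       .drop_duplicates()
)

# Match on both Name and the extracted ID; unmatched rows get an empty Company.
out = (
    df1.merge(lookup, left_on=["Name", "ID"], right_on=["Name", "key"], how="left")
       .drop(columns="key")
       .fillna("")
)
print(out)
```

If a (Name, ID) pair mapped to more than one Company in DF2, the lookup would keep both and the merge would still duplicate that DF1 row, so that case would need its own rule.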