Check if a substring is in a string in a different DF, and if so return the value from another row

Question

I want to check whether a substring from DF1 appears in DF2. If it does, I want to return the value from the corresponding row.

DF1

Name	ID	Region
John	AAA	A
John	AAA	B
Pat	CCC	C
Sandra	CCC	D
Paul	DD	E
Sandra	R9D	F
Mia	dfg4	G
Kim	asfdh5	H
Louise	45gh	I

DF2

Name	ID	Company
John	AAAxx1	Microsoft
John	AAAxxREG1	Microsoft
Michael	BBBER4	Microsoft
Pat	CCCERG	Dell
Pat	CCCERGG	Dell
Paul	DFHDHF	Facebook

Expected output

Where an ID from DF1 appears within the ID column of DF2, I want to create a new column in DF1 with the matching company.

Name	ID	Region	Company
John	AAA	A	Microsoft
John	AAA	B	Microsoft
Pat	CCC	C	Dell
Sandra	CCC	D
Paul	DD	E
Sandra	R9D	F
Mia	dfg4	G
Kim	asfdh5	H
Louise	45gh	I

I have the code below to determine whether an ID from DF1 is in DF2, but I am not sure how to bring in the company name.

DF1['Get company'] = np.in1d(DF1['ID'], DF2['ID'])

Answer 1

Try to find each ID string from df1 in df2, then merge on that column:

key = df2['ID'].str.extract(fr"({'|'.join(df1['ID'].values)})", expand=False)
df1 = df1.merge(df2['Company'], left_on='ID', right_on=key, how='left').fillna('')
print(df1)

# Output (from a different sample dataset than the DF1/DF2 shown in the question):
    Name    ID    Company
0   John   AAA           
1  Peter   BAB  Microsoft
2   Paul  CCHF     Google
3  Rosie   R9D

Details: create a regex from df1['ID'] to extract the partial string from df2['ID']:

# Regex pattern: try to extract the following pattern
>>> fr"({'|'.join(df1['ID'].values)})"
'(AAA|BAB|CCHF|R9D)'

# After extraction
>>> pd.concat([df2['ID'], key], axis=1)
        ID    ID
0    AEDSV   NaN  # Nothing was found
1   123BAB   BAB  # Found partial string BAB
2  CCHF-RB  CCHF  # Found partial string CCHF
3     YYYY   NaN  # Nothing was found
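The extraction step above can be reproduced on its own. The following is a minimal sketch that reuses the four IDs from the example; the variable names `df1_ids` and `df2_ids` are illustrative only. It also wraps each ID in `re.escape`, which is not part of the original snippet but is a reasonable precaution if an ID could contain regex metacharacters such as `.` or `+`.

```python
import re
import pandas as pd

# IDs copied from the example above; the Series names are illustrative.
df1_ids = pd.Series(["AAA", "BAB", "CCHF", "R9D"], name="ID")
df2_ids = pd.Series(["AEDSV", "123BAB", "CCHF-RB", "YYYY"], name="ID")

# Escape each ID so it is matched literally, then join into one alternation.
pattern = "(" + "|".join(re.escape(s) for s in df1_ids) + ")"

# With a single capture group and expand=False, str.extract returns a Series
# holding the first matching ID per row, or NaN when nothing matches.
key = df2_ids.str.extract(pattern, expand=False)
print(pd.concat([df2_ids, key.rename("key")], axis=1))
```

If one ID can be a prefix of another, the order of the alternatives matters, since the regex engine takes the first alternative that matches at a given position; sorting the IDs by length, longest first, is one way to handle that.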

Update:

To solve this, I wonder whether it is possible to merge on two columns, e.g. merge on Name and ID?

key = df2['ID'].str.extract(fr"({'|'.join(df1['ID'].values)})", expand=False)
df1 = pd.merge(df1, df2[['Name', 'Company']], left_on=['Name', 'ID'], 
               right_on=['Name', key], how='left').drop_duplicates().fillna('')
print(df1)

# Output:
      Name      ID Region    Company
0     John     AAA      A  Microsoft
2     John     AAA      B  Microsoft
4      Pat     CCC      C       Dell
6   Sandra     CCC      D           
7     Paul      DD      E           
8   Sandra     R9D      F           
9      Mia    dfg4      G           
10     Kim  asfdh5      H           
11  Louise    45gh      I
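For a self-contained run against the question's data, the same idea can be sketched as below. This is a variant under stated assumptions rather than the exact code above: it stores the extracted key in a hypothetical `lookup` frame (overwriting its `ID` column with the extracted substring) and de-duplicates that frame before merging, so the result keeps df1's original row index instead of the 0, 2, 4, ... index shown above. `re.escape` is again an addition.

```python
import re
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "Name": ["John", "John", "Pat", "Sandra", "Paul", "Sandra", "Mia", "Kim", "Louise"],
    "ID": ["AAA", "AAA", "CCC", "CCC", "DD", "R9D", "dfg4", "asfdh5", "45gh"],
    "Region": list("ABCDEFGHI"),
})
df2 = pd.DataFrame({
    "Name": ["John", "John", "Michael", "Pat", "Pat", "Paul"],
    "ID": ["AAAxx1", "AAAxxREG1", "BBBER4", "CCCERG", "CCCERGG", "DFHDHF"],
    "Company": ["Microsoft", "Microsoft", "Microsoft", "Dell", "Dell", "Facebook"],
})

# One alternation over the distinct df1 IDs, each matched literally.
pattern = "(" + "|".join(re.escape(s) for s in df1["ID"].unique()) + ")"

# Replace df2's full ID with the df1 ID found inside it, drop rows with no
# match, and keep one row per (Name, ID, Company) so the merge cannot
# multiply df1 rows.
lookup = (
    df2.assign(ID=df2["ID"].str.extract(pattern, expand=False))
       .dropna(subset=["ID"])
       [["Name", "ID", "Company"]]
       .drop_duplicates()
)

# Left merge on both columns keeps every df1 row in its original order.
result = df1.merge(lookup, on=["Name", "ID"], how="left")
result["Company"] = result["Company"].fillna("")
print(result)
```

Merging on both Name and ID is what keeps Sandra's CCC row empty: her ID matches Pat's Dell entries, but her name does not.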
