Compare two Python pandas DataFrame string columns to identify common strings and add the common string to a new column

Question

I have the following two pandas DataFrames:

df1:             df2:

item_name        item_cleaned

abc xyz          Def
xuy DEF          Ghi
s GHI lsoe       Abc
p ABc ois

I need to write a function that compares df2.item_cleaned with df1.item_name to check whether any string from df2.item_cleaned occurs inside df1.item_name (case-insensitively).

If a string does occur (ignoring case), I want to create a new column, df1.item_final, and fill each row of it with the matching df2.item_cleaned value.

The output should look like this:

df1:                                 df2:

item_name        item_final          item_cleaned

abc xyz          Abc                 Def
xuy DEF          Def                 Ghi
s GHI lsoe       Ghi                 Abc
p ABc ois        Abc

For reference, the df1 I am cleaning has 12 columns and roughly 400,000 rows.
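
To make the example reproducible, the sample frames and the desired result can be built as follows. This is only a sketch of the data shown above; the variable name expected is an assumption introduced for illustration, and the real df1 would carry its other 11 columns as well.

```python
import pandas as pd

# Sample input frames, copied from the tables in the question
df1 = pd.DataFrame({"item_name": ["abc xyz", "xuy DEF", "s GHI lsoe", "p ABc ois"]})
df2 = pd.DataFrame({"item_cleaned": ["Def", "Ghi", "Abc"]})

# Desired result: df1 plus an item_final column that holds the
# df2.item_cleaned spelling of whichever string was found in item_name
expected = df1.assign(item_final=["Abc", "Def", "Ghi", "Abc"])
print(expected)
```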

Answer 1

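Build a mapping, obj_map, whose keys are the lowercased item_cleaned values and whose values are the original item_cleaned strings.
Use a regular expression with the re.IGNORECASE flag to extract the item_cleaned string from item_name.
Then lowercase the extracted text and map it through obj_map to get item_final.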

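import re
import pandas as pd

# unique, non-null target strings from df2
item_cleaned = df2['item_cleaned'].dropna().unique()
# lowercase key -> original spelling
obj_map = pd.Series(dict(zip(map(str.lower, item_cleaned), item_cleaned)))

# escape the special characters
re_pat = '(%s)' % '|'.join([re.escape(i) for i in item_cleaned])

# expand=False makes extract return a Series for the single capture group
df1['item_final'] = df1['item_name'].str.extract(re_pat, flags=re.IGNORECASE, expand=False)
# lowercase the match, then look up the df2 spelling
df1['item_final'] = df1['item_final'].str.lower().map(obj_map)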

obj_map

def    Def
ghi    Ghi
abc    Abc
dtype: object

df1

    item_name item_final
0     abc xyz        Abc
1     xuy DEF        Def
2  s GHI lsoe        Ghi
3   p ABc ois        Abc
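
Since the question asks for a function, the steps above could be wrapped roughly as follows. This is a sketch rather than a definitive implementation: the function name add_item_final and its keyword arguments are hypothetical, and trying longer alternatives first is an extra assumption (it only matters when one cleaned string is a prefix of another).

```python
import re
import pandas as pd


def add_item_final(df1, df2, name_col="item_name",
                   clean_col="item_cleaned", out_col="item_final"):
    """Return a copy of df1 with out_col set to the first df2 string found in name_col."""
    cleaned = df2[clean_col].dropna().astype(str).unique()
    # lowercase key -> original spelling; if two values differ only by case, the last one wins
    lookup = {s.lower(): s for s in cleaned}
    # Assumption: try longer strings first so "abcd" is preferred over "abc"
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(s) for s in alternatives) + ")"
    # One vectorised regex pass over the column; rows without a match become NaN
    found = df1[name_col].str.extract(pattern, flags=re.IGNORECASE, expand=False)
    out = df1.copy()
    out[out_col] = found.str.lower().map(lookup)
    return out


df1 = pd.DataFrame({"item_name": ["abc xyz", "xuy DEF", "s GHI lsoe", "p ABc ois"]})
df2 = pd.DataFrame({"item_cleaned": ["Def", "Ghi", "Abc"]})
result = add_item_final(df1, df2)
print(result)
```

The function returns a copy, so the original df1 is left untouched. Note that the pattern matches substrings anywhere in item_name; if only whole words should count, the pattern would need word boundaries (\b) around the group.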
