Match two data frames by substring in python
I have two large data frames (1000 rows) that I need to match by substring, for example:
df1:
Id Title
1 The house of pump
2 Where is Andijan
3 The Joker
4 Good bars in Andijan
5 What a beautiful house
df2:
Keyword
house
andijan
joker
The expected output is:
Id  Title                   Keyword
1   The house of pump       house
2   Where is Andijan        andijan
3   The Joker               joker
4   Good bars in Andijan    andijan
5   What a beautiful house  house
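Right now I have written a very inefficient way to match them, but at the actual size of the data frames it runs for an extremely long time:
for keyword in df2.to_dict(orient='records'):
    df1['keyword'] = np.where(creative_df['title'].str.contains(keyword['keyword']), keyword['keyword'], df1['keyword'])
I'm sure there is a more pandas-friendly, more efficient way to do the same thing that finishes in a reasonable time.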
Let's try findall:
import re
# Join the keywords into one alternation pattern and keep the first match per title
df1['new'] = df1.Title.str.findall('|'.join(df2.Keyword.tolist()), flags=re.IGNORECASE).str[0]
df1
   Id                   Title      new
0   1       The house of pump    house
1   2        Where is Andijan  Andijan
2   3               The Joker    Joker
3   4    Good bars in Andijan  Andijan
4   5  What a beautiful house    house
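Two caveats with the pattern above: the keywords are joined as-is, so a keyword containing a regex metacharacter (such as `.` or `+`) would be read as regex syntax, and the result keeps the casing found in the title (`Andijan`) rather than the keyword's (`andijan`). Here is a minimal sketch that escapes each keyword and lowercases the match. It assumes the keywords are lowercase, as in the sample data, and the target column name `Keyword` is chosen only to mirror the expected output:

```python
import re
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Title": ["The house of pump", "Where is Andijan", "The Joker",
              "Good bars in Andijan", "What a beautiful house"],
})
df2 = pd.DataFrame({"Keyword": ["house", "andijan", "joker"]})

# Escape each keyword so it is matched literally, then wrap the
# alternation in one capture group for str.extract.
pattern = "(" + "|".join(re.escape(k) for k in df2["Keyword"]) + ")"

# extract() yields the first match per row (NaN when nothing matches);
# expand=False returns a Series instead of a one-column DataFrame.
df1["Keyword"] = (
    df1["Title"]
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)
    .str.lower()
)
print(df1)
```

Lowercasing only recovers the original keyword when the keywords themselves are lowercase. With mixed-case keywords, a lookup from the lowercased match back to the keyword would probably be needed.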
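Building on @BENY's solution, to be able to get multiple keywords per title:
import re
import pandas as pd

regex = '|'.join(df2['Keyword'])
# findall gives one list of matches per title; explode turns it into one row per match
matches = df1['Title'].str.findall(regex, flags=re.IGNORECASE)
matches_exploded = matches.explode().dropna().rename('Keyword')
df1.merge(matches_exploded, left_index=True, right_index=True)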
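To illustrate how this behaves when a title contains several keywords or none at all, here is a small self-contained sketch. The three sample titles are made up for illustration and are not from the question. It uses a left `join` instead of the inner merge, which is an assumption about the desired behaviour, so that titles without any keyword are kept with a missing value:

```python
import re
import pandas as pd

# Hypothetical titles: one with two keywords, one with one, one with none
df1 = pd.DataFrame({
    "Id": [1, 2, 3],
    "Title": ["The house of the Joker", "Where is Andijan", "No match here"],
})
df2 = pd.DataFrame({"Keyword": ["house", "andijan", "joker"]})

# Escape keywords so they are matched literally
regex = "|".join(re.escape(k) for k in df2["Keyword"])

# One row per (title, matched keyword); an empty match list becomes NaN
matches = (
    df1["Title"]
    .str.findall(regex, flags=re.IGNORECASE)
    .explode()
    .str.lower()
    .rename("Keyword")
)

# Left join on the index: titles with several matches are repeated,
# titles with no match stay in the result with NaN in Keyword
result = df1.join(matches)
print(result)
```

If unmatched titles should be dropped instead, calling `dropna(subset=["Keyword"])` on the result gives the same rows as the inner merge.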