How to do keyword matching across different dataframes in Pandas?
I have two dataframes and need to map keywords between them.
The input data (df1) looks like this:
keyword            subtopic
post office        Brand
uspshelp uspshelp  Help
package delivery   Shipping
fed ex             Brand
ups fedex          Brand
delivery done      Shipping
united states      location
rt ups             retweet
Here is the other dataframe (df2), which the keywords are matched against:
Key      Media_type  cleaned_text
910040   facebook    will take post office
409535   twitter     need help with upshelp upshelp
218658   facebook    there no section post office alabama ups fedex
218658   facebook    there no section post office alabama ups fedex
518903   twitter     cant wait see exactly ups fedex truck package
2423281  twitter     fed ex messed seedless
763587   twitter     crazy package delivery rammed car
827572   twitter     formatting idead delivery done
2404106  facebook    supoused mexico united states america
1077739  twitter     rt ups
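For anyone who wants to reproduce this, the two frames can be built roughly as follows. This is only a sketch: the column boundaries are inferred from the listings (the last word of each df1 row is taken as the subtopic), and storing Key as an integer is an assumption.

```python
import pandas as pd

# df1: keyword phrases and their subtopics (column split inferred from the listing).
df1 = pd.DataFrame({
    "keyword": [
        "post office", "uspshelp uspshelp", "package delivery", "fed ex",
        "ups fedex", "delivery done", "united states", "rt ups",
    ],
    "subtopic": [
        "Brand", "Help", "Shipping", "Brand",
        "Brand", "Shipping", "location", "retweet",
    ],
})

# df2: posts to search; Key stored as int (an assumption, it could equally be a string).
df2 = pd.DataFrame({
    "Key": [910040, 409535, 218658, 218658, 518903,
            2423281, 763587, 827572, 2404106, 1077739],
    "Media_type": ["facebook", "twitter", "facebook", "facebook", "twitter",
                   "twitter", "twitter", "twitter", "facebook", "twitter"],
    "cleaned_text": [
        "will take post office",
        "need help with upshelp upshelp",
        "there no section post office alabama ups fedex",
        "there no section post office alabama ups fedex",
        "cant wait see exactly ups fedex truck package",
        "fed ex messed seedless",
        "crazy package delivery rammed car",
        "formatting idead delivery done",
        "supoused mexico united states america",
        "rt ups",
    ],
})

print(df1.shape, df2.shape)
```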
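I want to map the 'keyword' column in df1 to the 'cleaned_text' column in df2 based on a few conditions:
- One row in 'keyword' can map to multiple rows in 'cleaned_text' (a one-to-many relationship).
- It should match the whole keyword together, not just the individual words.
- If a 'keyword' matches multiple rows in 'cleaned_text', it should create new records in the output dataframe (df3).
This is what the output dataframe (df3) should look like: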
Key      Media_type  cleaned_text                                    keyword                subtopic
910040   facebook    will take post office                           post office            Brand
409535   twitter     need help with upshelp upshelp                  uspshelp uspshelp      Help
218658   facebook    there no section post office alabama ups fedex  post office            Brand
218658   facebook    there no section post office alabama ups fedex  ups fedex              Brand
518903   twitter     cant wait see exactly ups fedex truck package   ups fedex              Brand
2423281  twitter     fed ex messed seedless                          fed ex messed          Brand
763587   twitter     crazy package delivery rammed car               package delivery       Shipping
827572   twitter     formatting idead delivery done                  delivery done          Shipping
2404106  facebook    supoused mexico united states america           united states america  location
1077739  twitter     rt ups                                          rt ups                 retweet
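How about converting your df1 into a dictionary? Then loop over df2 and search for matches. It may not be the most efficient approach, but it is very readable: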
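import pandas as pd

# Map each keyword phrase to its subtopic for quick lookup.
keyword_dict = {row.keyword: row.subtopic for row in df1.itertuples()}

df3_data = []
for row in df2.itertuples():
    text = row.cleaned_text
    for keyword in keyword_dict:
        # Plain substring test: the whole keyword phrase must appear in the text.
        if keyword in text:
            df3_row = [row.Key, row.Media_type, row.cleaned_text, keyword, keyword_dict[keyword]]
            df3_data.append(df3_row)

# Output columns: df2's columns followed by df1's.
df3_columns = list(df2.columns) + list(df1.columns)
df3 = pd.DataFrame(df3_data, columns=df3_columns)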
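One caveat: `keyword in text` is a substring test, so a keyword such as "rt ups" would also match inside longer words (for example "art upset"). If that matters for your data, a possible variant is to compile each keyword into a regular expression with lookarounds so that it only matches as a whole phrase. The sketch below uses a trimmed copy of the sample data; the helper name `match_keywords` and the "art upset" row are made up for illustration and are not part of the question.

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    "keyword": ["post office", "ups fedex", "rt ups"],
    "subtopic": ["Brand", "Brand", "retweet"],
})
df2 = pd.DataFrame({
    "Key": [910040, 518903, 1077739, 1],
    "Media_type": ["facebook", "twitter", "twitter", "twitter"],
    "cleaned_text": [
        "will take post office",
        "cant wait see exactly ups fedex truck package",
        "rt ups",
        "art upset",  # hypothetical row: contains "rt ups" only as a substring
    ],
})


def match_keywords(texts: pd.DataFrame, keywords: pd.DataFrame) -> pd.DataFrame:
    """Return one output row per (text row, keyword) pair that matches as a whole phrase."""
    # Compile each keyword once; the lookarounds reject matches that start or
    # end in the middle of a longer word.
    patterns = [
        (kw, sub, re.compile(r"(?<!\w)" + re.escape(kw) + r"(?!\w)"))
        for kw, sub in zip(keywords["keyword"], keywords["subtopic"])
    ]
    rows = []
    for row in texts.itertuples(index=False):
        for kw, sub, pattern in patterns:
            if pattern.search(row.cleaned_text):
                rows.append([row.Key, row.Media_type, row.cleaned_text, kw, sub])
    return pd.DataFrame(rows, columns=list(texts.columns) + list(keywords.columns))


df3 = match_keywords(df2, df1)
print(df3)
```

With this variant the "art upset" row produces no match, whereas the plain `in` test would have paired it with "rt ups". The loop is still one regex search per text and keyword pair, so it is no faster than the dictionary version, only stricter.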