List of words matched with text column in dataframe
I have 2 dataframes: the first has a column of text data (more than 10k rows), and the second holds keywords (nearly 100 lists).
Dataframe 1:
Text
a white house cat plays in garden
cat is a domestic species of small carnivorous mammal
cat is walking in behind white house
yellow banana is healthy
Dataframe 2:
ID Keywords
1 ['cat','white']
2 ['garden','white','cat']
3 ['domestic','mammal']
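I want to add an ID column to dataframe 1 that holds the ID from dataframe 2 whose keyword list matches the most words in the text. If two or more IDs tie for the most matches, join those IDs together. In some cases no words match at all; in that case add 'No Match'.
Output: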
Text ID
a white house cat plays in garden 2
cat is a domestic species of small carnivorous mammal 3
cat is walking in behind white house 1,2
yellow banana is healthy 'No Match'
This will work. It builds a list of match counts, one per keyword list, and then finds the IDs whose count equals the maximum of that list.
import pandas as pd
import ast

df1 = pd.DataFrame(['a white house cat plays in garden', 'cat is a domestic species of small carnivorous mammal', 'cat is walking in behind white house', 'yellow banana is healthy'], columns=['Text'])
df2 = pd.DataFrame([ { "ID": 1, "Keywords": "['cat','white']" }, { "ID": 2, "Keywords": "['garden','white','cat']" }, { "ID": 3, "Keywords": "['domestic','mammal']" } ])
# The keywords are stored as strings here; parse them into real Python lists.
df2['Keywords'] = df2['Keywords'].apply(ast.literal_eval)

def get_ids(text):
    # Count how many distinct words of the text appear in each keyword list.
    words = set(text.split(" "))
    matches = [len(words & set(keywords)) for keywords in df2['Keywords']]
    best = max(matches)
    # Keep every ID that ties for the highest count, unless nothing matched.
    matches_ids = [id_ for id_, val in zip(df2['ID'], matches) if best > 0 and val == best]
    return ", ".join(str(x) for x in matches_ids) if matches_ids else "No Match"

df1['ID'] = df1['Text'].apply(get_ids)
Result:
| | Text | ID |
|---|---|---|
| 0 | a white house cat plays in garden | 2 |
| 1 | cat is a domestic species of small carnivorous mammal | 3 |
| 2 | cat is walking in behind white house | 1, 2 |
| 3 | yellow banana is healthy | No Match |
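To make the counting step concrete, here is a minimal pandas-free sketch that traces the tied sentence through the same logic. The names `keyword_lists`, `match_counts` and `tied_ids` are hypothetical and used only for illustration; the data is copied from the example above.

```python
# Keyword lists keyed by ID, copied from the example data above.
keyword_lists = {
    1: ['cat', 'white'],
    2: ['garden', 'white', 'cat'],
    3: ['domestic', 'mammal'],
}

text = 'cat is walking in behind white house'
words = set(text.split(" "))

# Number of distinct words shared between the text and each keyword list.
match_counts = {id_: len(words & set(kw)) for id_, kw in keyword_lists.items()}
print(match_counts)  # {1: 2, 2: 2, 3: 0}

# IDs 1 and 2 tie for the highest count, so both are kept.
best = max(match_counts.values())
tied_ids = [id_ for id_, count in match_counts.items() if best > 0 and count == best]
print(", ".join(str(i) for i in tied_ids))  # 1, 2
```

ID 2 does not win outright here because 'garden' is not in this sentence, so it only matches 'cat' and 'white', the same as ID 1.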
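Since the question mentions more than 10k rows and nearly 100 keyword lists, one possible refinement is to convert each keyword list to a set once instead of on every call. The sketch below is a variant under stated assumptions, not the code from the answer: `make_matcher` is a hypothetical helper, and the lowercasing and whitespace splitting are extra assumptions that may not suit your data.

```python
def make_matcher(ids, keyword_lists, no_match="No Match"):
    """Return a function that maps a text to its best-matching ID(s).

    Hypothetical helper: the keyword sets are built once, up front.
    """
    keyword_sets = [
        (id_, {k.lower() for k in kws}) for id_, kws in zip(ids, keyword_lists)
    ]

    def get_ids(text):
        # Assumption: case-insensitive matching on whitespace-separated words.
        words = set(text.lower().split())
        counts = [(id_, len(words & kws)) for id_, kws in keyword_sets]
        best = max((count for _, count in counts), default=0)
        if best == 0:
            return no_match
        # Every ID that ties for the highest count is kept.
        return ", ".join(str(id_) for id_, count in counts if count == best)

    return get_ids


matcher = make_matcher(
    [1, 2, 3],
    [['cat', 'white'], ['garden', 'white', 'cat'], ['domestic', 'mammal']],
)
print(matcher('a white house cat plays in garden'))  # 2
print(matcher('yellow banana is healthy'))           # No Match
```

With the dataframes from the answer, this could presumably be applied as `df1['ID'] = df1['Text'].apply(make_matcher(df2['ID'], df2['Keywords']))`, once the keywords have been parsed into lists.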