从列表中删除接近匹配项/相似短语
Remove close matches / similar phrases from list
我正在努力删除列表中的相似短语,但我遇到了一个小障碍。
我有句子和短语,短语与句子相关。一个句子的所有短语都在一个列表中。
设短语列表为:p=[['This is great','is great','place for drinks','for drinks'],['Tonight is a good','good night','is a good','for movies']]
我希望我的输出是 [['This is great','place for drinks'],['Tonight is a good','for movies']]
基本上,我想获取列表中所有最长的唯一短语。
我查看了 fuzzywuzzy 库,但找不到好的解决方案。
这是我的代码:
def remove_dup(arr, threshold=80):
ret_arr =[]
for item in arr:
if item[1]<threshold:
ret_arr.append(item[0])
return ret_arr
def find_important(sents=sents, phrase=phrase):
import os, random
from fuzzywuzzy import process, fuzz
all_processed = [] #final array to be returned
for i in range(len(sents)):
new_arr = [] #reshaped phrases for a single sentence
for item in phrase[i]:
new_arr.append(item)
new_arr.sort(reverse=True, key=lambda x : len(x)) #sort with highest length
important = [] #array to store terms
important = process.extractBests(new_arr[0], new_arr) #to get levenshtein distance matches
to_proc = remove_dup(important) #remove_dup removes all relatively matching terms.
to_proc.append(important[0][0]) #the term with highest match is obviously the important term.
all_processed.append(to_proc) #add non duplicates to all_processed[]
return all_processed
有人可以指出我遗漏了什么,或者更好的方法是什么?
提前致谢!
我会使用每个短语和所有其他短语之间的区别。
如果一个短语与所有其他短语相比至少有一个不同的词,那么它是独一无二的,应该保留。
我还使其对精确匹配和添加的空格具有鲁棒性
sentences = [['This is great','is great','place for drinks','for drinks'],
['Tonight is a good','good night','is a good','for movies'],
['Axe far his favorite brand for deodorant body spray',' Axe far his favorite brand for deodorant spray','Axe is']]
new_sentences = []
s = " "
for phrases in sentences :
new_phrases = []
phrases = [phrase.split() for phrase in phrases]
for i in range(len(phrases)) :
phrase = phrases[i]
if all([len(set(phrase).difference(phrases[j])) > 0 or i == j for j in range(len(phrases))]) :
new_phrases.append(phrase)
new_phrases = [s.join(phrase) for phrase in new_phrases]
new_sentences.append(new_phrases)
print(new_sentences)
输出:
[['This is great', 'place for drinks'],
['Tonight is a good', 'good night', 'for movies'],
['Axe far his favorite brand for deodorant body spray', 'Axe is']]
我正在努力删除列表中的相似短语,但我遇到了一个小障碍。
我有句子和短语,短语与句子相关。一个句子的所有短语都在一个列表中。
设短语列表为:p=[['This is great','is great','place for drinks','for drinks'],['Tonight is a good','good night','is a good','for movies']]
我希望我的输出是 [['This is great','place for drinks'],['Tonight is a good','for movies']]
基本上,我想获取列表中所有最长的唯一短语。
我查看了 fuzzywuzzy 库,但找不到好的解决方案。
这是我的代码:
def remove_dup(arr, threshold=80):
ret_arr =[]
for item in arr:
if item[1]<threshold:
ret_arr.append(item[0])
return ret_arr
def find_important(sents=sents, phrase=phrase):
import os, random
from fuzzywuzzy import process, fuzz
all_processed = [] #final array to be returned
for i in range(len(sents)):
new_arr = [] #reshaped phrases for a single sentence
for item in phrase[i]:
new_arr.append(item)
new_arr.sort(reverse=True, key=lambda x : len(x)) #sort with highest length
important = [] #array to store terms
important = process.extractBests(new_arr[0], new_arr) #to get levenshtein distance matches
to_proc = remove_dup(important) #remove_dup removes all relatively matching terms.
to_proc.append(important[0][0]) #the term with highest match is obviously the important term.
all_processed.append(to_proc) #add non duplicates to all_processed[]
return all_processed
有人可以指出我遗漏了什么,或者更好的方法是什么? 提前致谢!
我会使用每个短语和所有其他短语之间的区别。 如果一个短语与所有其他短语相比至少有一个不同的词,那么它是独一无二的,应该保留。
我还使其对精确匹配和添加的空格具有鲁棒性
sentences = [['This is great','is great','place for drinks','for drinks'],
['Tonight is a good','good night','is a good','for movies'],
['Axe far his favorite brand for deodorant body spray',' Axe far his favorite brand for deodorant spray','Axe is']]
new_sentences = []
s = " "
for phrases in sentences :
new_phrases = []
phrases = [phrase.split() for phrase in phrases]
for i in range(len(phrases)) :
phrase = phrases[i]
if all([len(set(phrase).difference(phrases[j])) > 0 or i == j for j in range(len(phrases))]) :
new_phrases.append(phrase)
new_phrases = [s.join(phrase) for phrase in new_phrases]
new_sentences.append(new_phrases)
print(new_sentences)
输出:
[['This is great', 'place for drinks'],
['Tonight is a good', 'good night', 'for movies'],
['Axe far his favorite brand for deodorant body spray', 'Axe is']]