Find longest sequence of common words from list of words in python
I searched a lot for solutions and did find similar questions. This answer gives back the longest sequence of CHARACTERS that might NOT belong in all of the strings in the input list. Another gives back the longest common sequence of WORDS that must belong to all of the strings in the input list.
I am looking for a combination of the above: the longest common sequence of WORDS, which might NOT appear in all of the words/phrases in the input list.
Here are some examples of the expected output:
['exterior lighting', 'interior lighting']
--> 'lighting'
['ambient lighting', 'ambient light']
--> 'ambient'
['led turn signal lamp', 'turn signal lamp', 'signal and ambient lamp', 'turn signal light']
--> 'turn signal lamp'
['ambient lighting', 'infrared light']
--> ''
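For reference, the examples above can be collected into a small checker so that any candidate solution can be verified against them. This is just an illustrative sketch; check and solver are placeholder names, not part of the original post.
cases = [
    (['exterior lighting', 'interior lighting'], 'lighting'),
    (['ambient lighting', 'ambient light'], 'ambient'),
    (['led turn signal lamp', 'turn signal lamp',
      'signal and ambient lamp', 'turn signal light'], 'turn signal lamp'),
    (['ambient lighting', 'infrared light'], ''),
]

def check(solver):
    # Run a candidate function over every case and report mismatches.
    for phrases, expected in cases:
        got = solver(phrases)
        status = 'ok' if got == expected else 'expected %r' % expected
        print('%r -> %r (%s)' % (phrases, got, status))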
Thanks
This code will also sort the resulting list by the most frequently used words in the input.
It counts how many times each word appears in the list, then drops the words that appear only once and sorts the rest.
lst = ['led turn signal lamp', 'turn signal lamp', 'signal and ambient lamp', 'turn signal light']
d = {}
d_words = {}
# count how many times each word appears across all phrases
for i in lst:
    for j in i.split():
        if j in d:
            d[j] = d[j] + 1
        else:
            d[j] = 1
# keep only the words that appear more than once
for k, v in d.items():
    if v != 1:
        d_words[k] = v
# sort the remaining words by frequency, most common first
sorted_words = sorted(d_words, key=d_words.get, reverse=True)
print(sorted_words)
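For reference, the same counting logic can be written more compactly with collections.Counter; this is a sketch of an equivalent approach, not part of the original answer.
from collections import Counter

lst = ['led turn signal lamp', 'turn signal lamp',
       'signal and ambient lamp', 'turn signal light']

# Count every word across all phrases in one pass.
counts = Counter(word for phrase in lst for word in phrase.split())

# Keep only words that occur more than once, ordered by frequency.
sorted_words = [word for word, count in counts.most_common() if count > 1]
print(sorted_words)
Both versions rank individual words by frequency rather than returning a multi-word sequence.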
A fairly crude solution, but I think it works:
from nltk.util import everygrams
import pandas as pd

def get_word_sequence(phrases):
    # collect every n-gram (of every length) from every phrase
    ngrams = []
    for phrase in phrases:
        phrase_split = [token for token in phrase.split()]
        ngrams.append(list(everygrams(phrase_split)))
    ngrams = [i for j in ngrams for i in j]  # flatten the nested lists

    # count how often each n-gram occurs across all phrases
    counts_per_ngram_series = pd.Series(ngrams).value_counts()
    counts_per_ngram_df = pd.DataFrame({'ngram': counts_per_ngram_series.index,
                                        'count': counts_per_ngram_series.values})
    # discard the pandas Series
    del counts_per_ngram_series

    # filter out the ngrams that appear only once
    counts_per_ngram_df = counts_per_ngram_df[counts_per_ngram_df['count'] > 1]

    if not counts_per_ngram_df.empty:
        # populate the ngramsize column (number of words in the n-gram)
        counts_per_ngram_df['ngramsize'] = counts_per_ngram_df['ngram'].str.len()
        # sort by ngramsize and then by count, both descending
        counts_per_ngram_df.sort_values(['ngramsize', 'count'],
                                        inplace=True, ascending=[False, False])
        # get the top ngram
        top_ngram = " ".join(*counts_per_ngram_df.head(1).ngram.values)
        return top_ngram
    return ''
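Calling it on the examples from the question (assuming nltk and pandas are installed) should produce the expected results:
examples = [
    ['exterior lighting', 'interior lighting'],
    ['ambient lighting', 'ambient light'],
    ['led turn signal lamp', 'turn signal lamp',
     'signal and ambient lamp', 'turn signal light'],
    ['ambient lighting', 'infrared light'],
]

for phrases in examples:
    print(phrases, '->', repr(get_word_sequence(phrases)))
# Expected output: 'lighting', 'ambient', 'turn signal lamp', and '' respectively.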