Comparing sub-elements in a list with another

Question

I have a list of sentences, listOfSentences, that looks like this:

listOfSentences = ['mary had a little lamb.', 
                   'she also had a little pram.',
                   'bam bam bam she also loves ham.', 
                   'she ate the lamb.']

I also have a set of keywords with counts, keyWords, that looks like this:

keyWords= {('bam', 3), ('lamb', 2), ('ate', 1)}

Words with a higher frequency have a smaller key in keyWords. The output I am after is:

>>> print(keySentences)
['bam bam bam she also loves ham.', 'she ate the lamb.']

My question is: how do I compare the elements in keyWords with the elements in listOfSentences so that I can output the list keySentences?

Answer 1

Try this:

>>> [x for x in listOfSentences for i in keyWords if x.count(i[0])==i[1]]
['bam bam bam she also loves ham.', 'she ate the lamb.']
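Two caveats are worth keeping in mind with this one-liner: str.count() counts substring occurrences rather than whole words, and a sentence that satisfies more than one keyword would appear more than once in the result. Below is a minimal sketch of a variant that lists each sentence at most once, assuming the same "exact count" rule is what you want. The helper name matches_any_keyword is hypothetical, not part of the answer above.

```python
listOfSentences = ['mary had a little lamb.',
                   'she also had a little pram.',
                   'bam bam bam she also loves ham.',
                   'she ate the lamb.']

keyWords = {('bam', 3), ('lamb', 2), ('ate', 1)}


def matches_any_keyword(sentence, keywords):
    """Return True if some keyword occurs in sentence exactly `count` times.

    Note: str.count() matches substrings, so 'ate' is also found in 'later'.
    """
    return any(sentence.count(word) == count for word, count in keywords)


# Each sentence is tested once, so it cannot be emitted twice.
keySentences = [s for s in listOfSentences if matches_any_keyword(s, keyWords)]
print(keySentences)
```

With the sample data this keeps the same two sentences as the original comprehension, in their original order.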

Answer 2

keyWords would be more useful as a dictionary; then getting the score for each word is a simple dictionary lookup. You can use split() to extract each word.

Here is some code that does that. It assumes that punctuation is part of the word (as your example result list keySentences implies):

listOfSentences = ['mary had a little lamb.', 
                   'she also had a little pram.',
                   'bam bam bam she also loves ham.', 
                   'she ate the lamb.']

keyWords= [('bam', 3), ('lamb', 2), ('ate', 1)]
keyWords = dict(keyWords)

keySentences = []
for sentence in listOfSentences:
    score = sum(keyWords.get(word, 0) for word in sentence.split())
    if score > 0:
        keySentences.append((score, sentence))

keySentences = [sentence for score, sentence in sorted(keySentences, reverse=True)]
print(keySentences)

Output

['bam bam bam she also loves ham.', 'she ate the lamb.']

If you want to ignore punctuation, you can remove it from each sentence before processing:

import string

# mapping to remove punctuation with str.translate()
remove_punctuation = {ord(c): None for c in string.punctuation}

listOfSentences = ['mary had a little lamb.', 
                   'she also had a little pram.',
                   'bam bam bam she also loves ham.', 
                   'she ate the lamb.']

keyWords= [('bam', 3), ('lamb', 2), ('ate', 1)]
keyWords = dict(keyWords)

keySentences = []
for sentence in listOfSentences:
    score = sum(keyWords.get(word, 0) for word in sentence.translate(remove_punctuation).split())
    if score > 0:
        keySentences.append((score, sentence))

keySentences = [sentence for score, sentence in sorted(keySentences, reverse=True)]
print(keySentences)

Output

['bam bam bam she also loves ham.', 'she ate the lamb.', 'mary had a little lamb.']

The resulting list now also contains "mary had a little lamb." because the period trailing "lamb" has been removed by str.translate().
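To see why the sentences come out in this order, it can help to look at the score each one receives. The sketch below uses the same data; the helper name score_sentence is hypothetical, and str.maketrans() is used here only as an alternative way of building the translation table.

```python
import string

# Translation table that deletes every ASCII punctuation character.
remove_punctuation = str.maketrans('', '', string.punctuation)

listOfSentences = ['mary had a little lamb.',
                   'she also had a little pram.',
                   'bam bam bam she also loves ham.',
                   'she ate the lamb.']

keyWords = {'bam': 3, 'lamb': 2, 'ate': 1}


def score_sentence(sentence, keywords):
    """Sum the keyword scores of all words in sentence, ignoring punctuation."""
    words = sentence.translate(remove_punctuation).split()
    return sum(keywords.get(word, 0) for word in words)


scores = {s: score_sentence(s, keyWords) for s in listOfSentences}
for sentence, score in scores.items():
    print(score, sentence)
```

Each occurrence of a keyword adds its score, so the three "bam"s contribute 9, while "she ate the lamb." gets 1 + 2 = 3 and "mary had a little lamb." gets 2.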

Answer 3

The following scores your sentences by the number of matching words:

import re

keyWords = [('bam', 3), ('lamb', 2), ('ate', 1)]
keyWords = [w for w, c in keyWords]     # only need the words

listOfSentences = [
    'mary had a little lamb.', 
    'she also had a little pram.',
    'bam bam bam she also loves ham.', 
    'she ate the lamb.']    

words = [re.findall(r'(\w+)', s) for s in listOfSentences]
keySentences = []

for word_list, sentence in zip(words, listOfSentences):
    keySentences.append((len([word for word in word_list if word in keyWords]), sentence))

for count, sentence in sorted(keySentences, reverse=True):
    print('{:2}  {}'.format(count, sentence))

Giving you the following output:

 3  bam bam bam she also loves ham.
 2  she ate the lamb.
 1  mary had a little lamb.
 0  she also had a little pram.
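If only the ranked sentences are needed, the same counting idea can be written with a set for the keyword lookup and a key function for the sort. This is a sketch rather than a replacement for the code above; the helper name count_matches is hypothetical. Because Python's sort is stable, sentences with equal counts keep their original relative order.

```python
import re

keyWords = {'bam', 'lamb', 'ate'}  # a set gives fast membership tests

listOfSentences = [
    'mary had a little lamb.',
    'she also had a little pram.',
    'bam bam bam she also loves ham.',
    'she ate the lamb.']


def count_matches(sentence, keywords):
    """Count how many words in sentence are keywords (punctuation ignored)."""
    return sum(1 for word in re.findall(r'\w+', sentence) if word in keywords)


# Sort by match count only, highest first.
ranked = sorted(listOfSentences,
                key=lambda s: count_matches(s, keyWords),
                reverse=True)
print(ranked)
```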
