Python: given a list of special tokens, show the distribution of said tokens across sublists of sentences

Question

My question has two parts; here is the first:

I have a list of lists like this:

big_list = [['i have an apple','i have a pear','i am a monkey', 'tell me about apples'],
['tell me about cars','tell me about trucks','tell me about planes']]

I also have a list of words like this:

words = ['i','have','monkey','tell','me','about']

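I want to iterate over big_list and check whether each sublist contains more than one element from words in sequence. For example, big_list[0] would contain 'i have' from its first and second elements and 'tell me about' in its last element.

Currently I'm attempting this at the sublist level. I first tokenize all the strings in the sublist so that I can iterate over their elements and see where elements from words occur: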

import nltk

example = big_list[0]
example_sentences_tokens = []
for sentence in example:
    example_sentences_tokens.append([token.lower() for token in nltk.tokenize.word_tokenize(sentence)])

With access to both the original strings and the tokenized strings, I then check where elements from words occur:

tuples = []
for sentence, tokenized_sentence in zip(example, example_sentences_tokens):
    tuples.append(tuple((sentence, [token for token in example_sentences_tokens if token in words])))

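Now, tuples is a list of tuples, each containing a sentence from big_list[0] and all the tokens of that sentence that are present in words.

However, I only want to include tokens that are present in words if they appear in sequence, not on their own. How can I do this?

Second part of the question: finally, once I have identified every instance where a sequence of elements from words appears together in big_list, I'd like to show the frequency of those sequences across all the sublists. So 'tell me' appears in 100% of big_list[1] and 33% of big_list[0]. Is there a simple way to show this distribution?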

Answer 1

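First question

First of all, when testing your code I had to change the contents of your tuples to actually collect the elements common to words and tokenized_sentence (otherwise all I got were tuples like (sentence, [])):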

tuples.append((sentence, [token for token in words if token in tokenized_sentence]))

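To check whether we have 2 or more "matches" in sequence, the solution depends on what you mean by words: does the order of its tokens matter?

That is: if words = ['i','have','monkey','tell','about', 'me'] ('about' before 'me', rather than 'me', 'about'), should 'tell me about apples' still match? My guess is that it should, but I'll give you a solution for both cases.

If the order of the tokens in words matters, you can simply check whether the matched tokens, joined by a space, occur in the sentence: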

tuples = []
for sentence, tokenized_sentence in zip(example, example_sentences_tokens):
    matches = [token for token in words if token in tokenized_sentence]
    sequence = ' '.join(matches) # order of matches matters
    if sequence in sentence:
        tuples.append((sentence, matches))

print(tuples)

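Output (using the reordered words list above, where 'about' comes before 'me'; with the original order, 'tell me about apples' would match as well):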

[('i have an apple', ['i', 'have']),
  ('i have a pear', ['i', 'have'])]
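
Note that `sequence in sentence` is a character-level substring test, so 'i have' would also be found inside a sentence such as 'hi have fun'. A minimal sketch of a token-level alternative follows. The helper name `contains_run` is hypothetical, and `str.split()` is used only as a stand-in for `nltk.tokenize.word_tokenize` so that the snippet runs without NLTK data.

```python
def contains_run(tokens, run):
    """Return True if `run` occurs as a contiguous slice of `tokens`."""
    n = len(run)
    if n == 0:
        return False
    return any(tokens[i:i + n] == run for i in range(len(tokens) - n + 1))


words = ['i', 'have', 'monkey', 'tell', 'me', 'about']
sentence = 'tell me about apples'

# Assumption: whitespace splitting stands in for nltk.tokenize.word_tokenize.
tokens = sentence.lower().split()

# Same matching step as in the answer: matches keep the order of `words`.
matches = [token for token in words if token in tokens]

# Compare whole tokens instead of characters.
found = contains_run(tokens, matches)
```

Because whole tokens are compared, `contains_run('hi have fun'.split(), ['i', 'have'])` is False, whereas the substring test would report a match.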

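If the order of the tokens in words does not matter, you can take the index of the first matched token and check whether the next token in the tokenized sentence is also part of words: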

tuples = []
for sentence, tokenized_sentence in zip(example, example_sentences_tokens):
    matches = [token for token in words if token in tokenized_sentence]
    if not matches:
        continue  # no token from words in this sentence
    i = tokenized_sentence.index(matches[0])
    # keep the sentence if the token right after the first match is also a match
    if i + 1 < len(tokenized_sentence) and tokenized_sentence[i + 1] in matches:
        tuples.append((sentence, matches))

print(tuples)

Output:

[('i have an apple', ['i', 'have']),
 ('i have a pear', ['i', 'have']),
 ('tell me about apples', ['tell', 'about', 'me'])]
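
Both variants above return every matched token of a sentence. If only the tokens that actually sit next to each other should be kept, one possible approach is to group consecutive tokens by membership in words with `itertools.groupby`. This is a sketch, not the answer's method: `word_runs` and `min_len` are hypothetical names, and `str.split()` again stands in for the NLTK tokenizer.

```python
from itertools import groupby


def word_runs(tokens, vocabulary, min_len=2):
    """Return the runs of at least `min_len` consecutive tokens found in `vocabulary`."""
    vocab = set(vocabulary)
    runs = []
    # groupby splits the token list into alternating in-vocabulary / out-of-vocabulary blocks.
    for in_vocab, group in groupby(tokens, key=lambda token: token in vocab):
        group = list(group)
        if in_vocab and len(group) >= min_len:
            runs.append(' '.join(group))
    return runs


big_list = [['i have an apple', 'i have a pear', 'i am a monkey', 'tell me about apples'],
            ['tell me about cars', 'tell me about trucks', 'tell me about planes']]
words = ['i', 'have', 'monkey', 'tell', 'me', 'about']

# Assumption: whitespace splitting stands in for nltk.tokenize.word_tokenize.
runs_per_sentence = [(sentence, word_runs(sentence.lower().split(), words))
                     for sentence in big_list[0]]
```

Here the order of words is irrelevant, and 'i am a monkey' yields no run because 'i' and 'monkey' are not adjacent.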

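Second question

I assume you will apply the procedure above to each group of sentences in big_list. My suggestion is to keep, for each round, the resulting tuples list together with the index of the sentence list in big_list that was checked: that way you can keep track of all the matching combinations and compute the percentage of occurrences per index.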
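
A minimal sketch of that bookkeeping, under a few assumptions: a "sequence" means a run of two or more adjacent tokens from words, each run is counted at most once per sentence, and `str.split()` stands in for the NLTK tokenizer. The names `word_runs` and `run_distribution` are hypothetical.

```python
from collections import Counter
from itertools import groupby


def word_runs(tokens, vocabulary, min_len=2):
    """Return the runs of at least `min_len` consecutive tokens found in `vocabulary`."""
    vocab = set(vocabulary)
    runs = []
    for in_vocab, group in groupby(tokens, key=lambda token: token in vocab):
        group = list(group)
        if in_vocab and len(group) >= min_len:
            runs.append(' '.join(group))
    return runs


def run_distribution(big_list, vocabulary):
    """For each sublist, map every run to the percentage of sentences containing it."""
    distribution = []
    for sentences in big_list:
        counts = Counter()
        for sentence in sentences:
            # set() so that a run repeated inside one sentence is counted once.
            counts.update(set(word_runs(sentence.lower().split(), vocabulary)))
        distribution.append({run: 100 * n / len(sentences) for run, n in counts.items()})
    return distribution


big_list = [['i have an apple', 'i have a pear', 'i am a monkey', 'tell me about apples'],
            ['tell me about cars', 'tell me about trucks', 'tell me about planes']]
words = ['i', 'have', 'monkey', 'tell', 'me', 'about']

distribution = run_distribution(big_list, words)
for index, percentages in enumerate(distribution):
    print(index, percentages)
```

The list index of each dictionary corresponds to the sublist index in big_list. With the sample data, 'tell me about' comes out at 100% for big_list[1] and 25% for big_list[0], since that sublist holds four sentences.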
