What am I doing wrong using Gensim.Phrases to extract repeating multiple-word terms from within a single sentence?

Question

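I would like to first extract repeating n-grams from within a single sentence using Gensim's Phrases, then use those to get rid of duplicates within the sentence. Like so: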

Input: "Testing test this test this testing again here testing again here"

Desired output: "Testing test this testing again here"

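My code seems to work for generating up to 5-grams from multiple sentences, but whenever I pass it a single sentence (or even a list full of the same sentence) it doesn't work. If I pass a single sentence, it splits the words into characters. If I pass a list full of the same sentence, it detects nonsense such as non-repeating words, while failing to detect the repeating ones.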

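I thought my code was working because I ran it on about 30MB of text and it produced very intelligible n-grams up to n=5, which seemed to match what I expected. I have no idea how to measure its precision and recall, though. Here is the full function, which recursively generates all n-grams from 2 to n: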

def extract_n_grams(documents, maximum_number_of_words_per_group=2, threshold=10, minimum_count=6, should_print=False, should_use_keywords=False):
    from gensim.models import Phrases
    from gensim.models.phrases import Phraser

    tokens = [doc.split(" ") for doc in documents] if type(documents) == list else [documents.split(" ") for _ in range(100)] # this is what I tried

    final_n_grams = []
    for current_n in range(maximum_number_of_words_per_group - 1):
        n_gram = Phrases(tokens, min_count=minimum_count, threshold=threshold, connector_words=connecting_words)

        n_gram_phraser = Phraser(n_gram)

        resulting_tokens = []
        for token in tokens:
            resulting_tokens.append(n_gram_phraser[token])

        current_n_gram_final = []
        for token in resulting_tokens:
            for word in token:
                if '_' in word:
                    # no n_gram should have a comma between words
                    if ',' not in word:
                        word = word.replace('_', ' ')

                        if word not in current_n_gram_final and all([word not in gram for gram in final_n_grams]):
                            current_n_gram_final.append(word)

        tokens = n_gram[tokens]

        final_n_grams.append(current_n_gram_final)

    return final_n_grams

Besides repeating the sentence in the list, I also tried using NLTK's word_tokenize, as suggested. What am I doing wrong? Is there an easier approach?

Answer 1

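The Gensim Phrases class is designed to statistically detect when certain pairs of words appear together so often, compared to how often they appear independently, that it might be useful to combine them into a single token.

As such, it's unlikely to be helpful for your example task of eliminating the duplicate 3-word ['testing', 'again', 'here'] run of tokens.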

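First, it never eliminates tokens; it only combines them. So, if it saw the couplet ['again', 'here'] appearing together very often, rather than as separate 'again' and 'here', it would turn it into 'again_here', not eliminate it.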
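To make "combines, never eliminates" concrete, here is a minimal pure-Python sketch. The helper `join_pair` is hypothetical and hand-written for illustration; it only imitates the kind of output a trained Phrases model produces once it has decided that a pair belongs together, and it is not Gensim's implementation.

```python
def join_pair(tokens, pair, delimiter="_"):
    """Merge each adjacent occurrence of `pair` into one delimiter-joined token.

    Hypothetical helper: a stand-in for applying an already-trained phrase model.
    """
    merged = []
    i = 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) == pair:
            # the two words become one token; nothing is dropped
            merged.append(delimiter.join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = "testing again here testing again here".split()
print(join_pair(tokens, ("again", "here")))
```

Both copies of the repeated run are still present afterwards, just with fewer, longer tokens, so a separate de-duplication step would still be needed.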

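But second, it doesn't make these combinations for every repeated n-token grouping. It does so only if a large amount of training data implies, based on the configured threshold, that certain pairs stand out. (And it only goes beyond pairs if run repeatedly.) Your example 3-word grouping, ['testing', 'again', 'here'], doesn't seem likely to stand out as a composition of extra-likely pairings.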
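A rough way to see why repeating the same sentence 100 times doesn't help is to compute a pair score by hand. The sketch below is a simplification, assuming a score of the form (pair_count - min_count) * vocab_size / (count_a * count_b), which is modeled on the default scoring described in the Gensim documentation. The function name `pair_score` is hypothetical, and counting only single words toward the vocabulary size is a simplifying assumption, so don't expect it to reproduce Gensim's exact numbers.

```python
from collections import Counter


def pair_score(sentences, word_a, word_b, min_count):
    """Simplified pair score: (pair_count - min_count) * vocab_size / (count_a * count_b).

    Assumption: vocabulary size is the number of distinct single words only.
    """
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        unigram_counts.update(sentence)
        # adjacent pairs are counted within a sentence, never across sentences
        bigram_counts.update(zip(sentence, sentence[1:]))
    vocab_size = len(unigram_counts)
    pair_count = bigram_counts[(word_a, word_b)]
    return (pair_count - min_count) * vocab_size / (
        unigram_counts[word_a] * unigram_counts[word_b]
    )


sentence = "testing test this test this testing again here testing again here".split()
single = pair_score([sentence], "again", "here", min_count=6)
repeated = pair_score([sentence] * 100, "again", "here", min_count=6)
print(single, repeated)
```

With the question's defaults (min_count=6, threshold=10), the single sentence gives a negative score, and the 100 copies give a score far below 10: duplicating the text multiplies the pair count and both word counts by the same factor, so the denominator grows faster than the numerator.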

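If you have a more rigorous definition of which tokens or runs of tokens need to be eliminated, you'd probably want to run other Python code over the lists of tokens to enforce that de-duplication. Can you describe in more detail, perhaps with more examples, the kinds of n-grams you want removed? (Will they only be at the beginning or end of a text, or also in the middle? Do they have to be next to each other, or can they be spread throughout the text? Why are such duplicates present in the data, and why is it important to remove them?)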

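Update: Based on the comments about the real goal, a few lines of Python should do the trick: at each position in the token list, check whether the next N tokens match the N tokens just before them (and thus can be skipped). For example: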

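def elide_repeated_ngrams(list_of_tokens):
    """Drop any run of tokens that immediately repeats the tokens just kept."""
    return_tokens = []
    i = 0
    while i < len(list_of_tokens):
        # try repeat lengths from 1 up to the number of tokens kept so far
        # (starting at 1 and including the full length also catches "a a")
        for candidate_len in range(1, len(return_tokens) + 1):
            if list_of_tokens[i:i+candidate_len] == return_tokens[-candidate_len:]:
                i = i + candidate_len  # skip the repeat
                break  # begin fresh forward repeat-check
        else:
            # this token not part of any repeat; include & proceed
            return_tokens.append(list_of_tokens[i])
            i += 1
    return return_tokens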

On your test case:

>>> elide_repeated_ngrams("Testing test this test this testing again here testing again here".split())
['Testing', 'test', 'this', 'testing', 'again', 'here']
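If you want the result back as a string, as in the desired output from the question, a thin wrapper is enough. This is only a sketch: `dedupe_sentence` is a hypothetical helper name, plain whitespace splitting is assumed as the tokenizer, and the function is restated here with candidate lengths running from 1 through the number of tokens kept so far, so that an immediately repeated single token is caught too.

```python
def elide_repeated_ngrams(list_of_tokens):
    """Drop any run of tokens that immediately repeats the tokens just kept."""
    return_tokens = []
    i = 0
    while i < len(list_of_tokens):
        for candidate_len in range(1, len(return_tokens) + 1):
            if list_of_tokens[i:i + candidate_len] == return_tokens[-candidate_len:]:
                i += candidate_len  # skip the repeat
                break
        else:
            # no repeat starts here; keep the token
            return_tokens.append(list_of_tokens[i])
            i += 1
    return return_tokens


def dedupe_sentence(sentence):
    """Hypothetical wrapper: whitespace-tokenize, elide repeats, re-join."""
    return " ".join(elide_repeated_ngrams(sentence.split()))


print(dedupe_sentence(
    "Testing test this test this testing again here testing again here"
))
```

Matching is case-sensitive, which is why 'Testing' and 'testing' are both kept. Lowercase the tokens before comparing if that isn't what you want.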

python

gensim