在 python 中查找字符串列表的 *modal* 子字符串

Finding *modal* substring of a list of strings in python

寻找一个共同的子串已经在很多问题中得到了解答,即给定一个字符串列表找到[=]共同的(通常是最长或最大数量的单词)子串43=]全部。看这里:

Longest common substring from more than two strings - Python

我的问题是,如何在字符串列表中找到最长的 modal 子字符串?这里重要的规定是这个子字符串不一定要出现在列表中的所有字符串中。

这里的科学有点艺术,因为明显的权衡是 1) 您希望子字符串出现在多少个字符串中?和 2) 您希望子串有多长?为了修正想法,让我们假设我们希望所需的子字符串包含三个单词(如果在这里出现平局,则取最长的字符串,然后是第一个实例)。

所以给定列表,

mylist = ["hey how's it going?", "I'm fine  thanks.", "Did you get that thing?", "Of course I got that thing, that's why I asked you: how's it going?"]

期望的输出是,

"how's it going"

如果规定是两个字长,那么所需的输出将是,

"that thing"

因为 "that thing" 是比 "how's it" 或 "it going"

更长的字符串

以上代码中的答案分别是三词长和二词长的模态子串。


编辑:

由于有赏金,我将更具体地说明模态子字符串是什么。

模态子串:对于子串中给定长度的词(这是唯一标识模态子串所需要的),模态子串是与列表中最大数量的字符串。如果出现平局(即对于子串中给定长度的三个单词,有两个候选子串均出现在 80% 的字符串中),则应使用字符长度最长的子串。如果在那之后仍然是平手(这应该不太可能,但很好解释),那么就拿第一个或随机选择。


A good answer 将具有一个函数,即 returns 子字符串中给定数量的单词的模态子字符串(其中单词的数量可以是任意的数)。

一个 incredible 答案将免除 'given number of words' 限制,而是包含一个标量(比如 \alpha)来管理子字符串长度(以单词为单位)和它出现在列表中的次数。接近 1 的 Alpha 会选择一个很长(单词)但不一定在列表中出现多次的模态子串。接近 0 的 Alpha 会选择在列表中出现次数尽可能多且不关心子串长度的模态子串。我并不是真的期待这一点,并且会接受回答原始问题的答案。

寻找给定长度的模态子串

我们的输入如下:

mylist = ["hey how's it going?", "I'm fine  thanks.", "Did you get that thing?", "Of course I got that thing, that's why I asked you: how's it going?"]

首先,我们将定义一个预处理函数,以清理多余的空格和标点符号。这使得从句子中提取单词变得更加容易:

def preprocess(strings):
    # used this answer to get rid of extra whitespaces: 
    return [" ".join(string.split()).replace(",", "").replace(".", "").replace("?", "") for string in strings]

我们还将定义一个辅助函数来查找 n-grams(句子中 n 个单词的子串):

def find_n_grams(string, n):
    words = string.split(" ")
    n_grams = []

    for i in range(len(words) - n + 1):
        n_grams.append(" ".join([words[i + idx] for idx in range(n)]))

    return n_grams

然后,我们可以使用以下函数为给定数量的单词找到问题中定义的 "modal substring"。我不确定这是否是最 optimal/efficient 的计算方式,但它确实可以:

def find_modal_substring(strings, num_words):
    n_grams_per_string = [find_n_grams(string, num_words) for string in strings]
    max_num_occurences = 0
    modal_substring = None

    for i in range(len(strings)):
        n_grams = n_grams_per_string[i]

        for n_gram in n_grams:
            num_occurences = 1

            for j in range(i + 1, len(strings)):
                if n_gram in n_grams_per_string[j]:
                    num_occurences += 1

            if num_occurences > max_num_occurences:
                max_num_occurences = num_occurences
                modal_substring = n_gram
            elif num_occurences == max_num_occurences and len(modal_substring) < len(n_gram):
                max_num_occurences = num_occurences
                modal_substring = n_gram

    return modal_substring

一些input/output对:

> print(find_modal_substring(preprocess(mylist), 3))

how's it going

> print(find_modal_substring(preprocess(mylist), 2))

that thing


查找带 alpha 参数的模态子串

这部分题中最难的是如何定义给定alpha参数的得分函数。我们知道:

Alpha close to 1 would choose a modal substring that is very long (in words) but doesn't necessarily appear many times in the list. Alpha close to 0 would choose a modal substring that appears in as many times in the list as possible and doesn't care about substring length.

下面是一个满足此要求的得分函数,但我们可能还可以想到许多其他函数。通过将它隔离在一个函数中,任何人都可以根据自己的需要轻松修改它:

def compute_score(n_gram, occurences, alpha):
    return alpha * len(n_gram.split(" ")) + (1.0 - alpha) * occurences

鉴于此,查找 alpha-based 模态子串的函数相当长,因为它首先需要为所有 [=52= 找到所有可能的 n-gram ]n,但我们开始了:

def find_modal_substring_alpha(strings, alpha):
    n_grams_per_n_per_string = []
    n = 1

    while True:
        n_grams_per_string = [find_n_grams(string, n) for string in strings]

        if all(n_grams == [] for n_grams in n_grams_per_string):
            break
        else:
            n_grams_per_n_per_string.append(n_grams_per_string)

        n += 1

    best_modal_substring = None
    best_score = 0.0

    for n_grams_per_string in n_grams_per_n_per_string:
        for i in range(len(strings)):
            n_grams = n_grams_per_string[i]

            for n_gram in n_grams:
                num_occurences = 1

                for j in range(i + 1, len(strings)):
                    if n_gram in n_grams_per_string[j]:
                        num_occurences += 1

                score = compute_score(n_gram, num_occurences, alpha)

                if score > best_score:
                    best_score = score
                    best_modal_substring = n_gram
                elif score == best_score and len(best_modal_substring) < len(n_gram):
                    best_score = score
                    best_modal_substring = n_gram

    return best_modal_substring

原来这里alpha有点难调。即使对于相对较低的 alpha (0.3),我们得到的结果与 1.0:

的结果完全相同
> print(find_modal_substring_alpha(preprocess(mylist), 0.3))

Of course I got that thing that's why I asked you: how's it going

例如,如果我们将 alpha 一直向下移动到 0.01,我们会得到:

> print(find_modal_substring_alpha(preprocess(mylist), 0.01))

how's it going