有效地在字符串中找到给定的子序列,最大化连续字符的数量

Efficiently find a given subsequence in a string, maximizing the number of contiguous characters

长问题描述

模糊字符串匹配器实用程序,如 fzf or CtrlP 过滤字符串列表,以查找具有给定搜索字符串作为子序列的字符串。 例如,假设用户想要在文件列表中搜索特定照片。查找文件

/home/user/photos/2016/pyongyang_photo1.png

键入 ph2016png 就足够了,因为此搜索字符串是此文件名的子序列。 (请注意,这不是 LCS。整个搜索字符串必须是文件名的子序列。)

检查给定的搜索字符串是否是另一个字符串的子序列是微不足道的,但我想知道如何有效地获得最佳匹配:在上面的例子中,有多个可能的匹配项。一个是

/home/user/<b>ph</b>otos/<b>2016</b>/<b>p</b>yo<b>n</b>gyan<b>g</b>_photo1.png

但用户可能想到的是

/home/user/<b>ph</b>otos/<b>2016</b>/pyongyang_photo1.<b>png</b>

为了形式化,我将 "best" 匹配定义为由最少数量的子字符串组成的匹配。第一个示例匹配的数字是 5,第二个示例匹配的数字是 3。

我想到这个是因为获得最佳匹配来为每个结果分配分数以进行排序会很有趣。虽然我对近似解不感兴趣,但我对这个问题的兴趣主要是学术性质的。

tl;dr 问题描述

给定字符串 st,在 t 的等于 s 的子序列中找到一个最大化连续元素对数的子序列在 t.

到目前为止我尝试了什么

为了便于讨论,我们将搜索查询称为 s 并将要测试的字符串称为 t。问题的解决方案表示为 fuzzy(s, t)。我将使用 Python 的字符串切片符号。最简单的方法如下:

由于任何解决方案都必须按顺序使用 s 中的所有字符,解决此问题的算法可以从搜索 t 中首次出现的 s[0] 开始(索引为 i),然后使用两种解决方案中较好的一种

t[:i+1] + fuzzy(s[1:], t[i+1:])    # Use the character
t[:i]   + fuzzy(s,     t[i+1:])    # Skip it and use the next occurence 
                                   # of s[0] in t instead

这显然不是解决这个问题的最佳方案。相反,这是明显的蛮力。 (我尝试过同时搜索最后一次出现的 s[-1] 并在该问题的早期版本中使用此信息,但事实证明这种方法不起作用。)


→ 我的问题是:这个问题最有效的解决方案是什么?

这可能不是 有效的解决方案,但它是一种高效且易于实施的解决方案。为了说明,我会借用你的例子。设 /home/user/photos/2016/pyongyang_photo1.png 文件名 ph2016png 输入

第一步(预计算)是可选的,但可能有助于加快下一步(设置)的速度,尤其是当您将算法应用于许多文件名时。

预计算
创建一个 table 来计算输入中每​​个字符的出现次数。由于您可能只处理 ASCII 字符,因此 256 个条目就足够了(可能是 128 个,甚至更少,具体取决于字符集)。

"ph2016png"
['p'] : 2
['h'] : 1
['2'] : 1
['0'] : 1
['b'] : 0
...

设置
通过丢弃输入中不存在的字符,将文件名分割成子字符串。同时,检查输入的每个字符在文件名中出现的次数是否正确(如果已完成预计算)。最后,检查输入的每个字符是否按顺序出现在子字符串列表中。如果将子字符串列表作为单个字符串,对于该字符串的任何给定字符,在输入中在它之前找到的每个字符都必须在该字符串中在它之前找到。这可以在创建子字符串时完成。

"/home/user/photos/2016/pyongyang_photo1.png"
"h", "ph", "2016", "p", "ng", "ng", "ph", "1", "png"
'p' must come before "h", so throw this one away
"ph", "2016", "p", "ng", "ng", "ph", "1", "png"

核心
将最长的子字符串与输入匹配并跟踪最长的匹配。此匹配可以保留子字符串的开头(例如,将 ababa(子字符串)与 babaa(输入)匹配将导致 aba,而不是 baba),因为它更容易实施,尽管它不是必须的。如果你没有得到一个完整的匹配,使用最长的一个来再次分割子串,然后用下一个最长的子串重试。

Since there is no instance of incomplete match with your example,
let's take something else, made to illustrate the point.
Let's take "babaaababcb" as the filename, and "ababb" as input.
Substrings : "abaaabab", "b"
Longest substring : "abaaabab"

If you keep the beginning of matches
Longest match : "aba"
Slice "abaaabab" into "aba", "aabab"
-> "aba", "aabab", "b"
Retry with "aabab"
-> "aba", "a", "abab", "b"
Retry with "abab" (complete match)

Otherwise (harder to implement, not necessarily better performing, as shown in this example)
Longest match : "abab"
Slice "abaaabab" into "abaa", "abab"
-> "abaa", "abab", "b"
Retry with "abaa"
-> "aba", "a", "abab", "b"
Retry with "abab" (complete match)

如果您确实获得了完全匹配,请继续将输入以及子字符串列表一分为二,然后重复匹配最长的子字符串。

With "ph2016png" as input
Longest substring : "2016"
Complete match
Match substrings "h", "ph" with input "ph"
Match substrings "p", "ng", "ng", "ph", "1", "png" with input "png"

您一定会找到包含最少子串的子串序列,因为您先尝试最长的子串。如果输入不包含文件名中的许多短子字符串,这通常会表现良好。

我建议创建一个搜索树,其中每个节点代表大海捞针中与其中一个字符匹配的字符位置。

顶部节点是兄弟节点,代表大海捞针中第一个针字符的出现。

一个parent节点的children是表示大海捞针中下一个针字符出现的那些节点,但只是那些位于[=38]表示的位置之后的节点=]节点.

这在逻辑上意味着一些children被几个parent共享,所以这个结构并不是真正的树,而是一个有向无环图。一些兄弟 parent 甚至可能具有完全相同的 children。其他 parent 可能根本没有 children:它们是 dead-end,除非它们位于图表底部,其中叶子表示最后一个针字符的位置。

设置此图后,在其中进行 depth-first 搜索可以很容易地得出从某个节点开始仍需要的段数,然后在备选方案中将其最小化。

我在下面的 Python 代码中添加了一些注释。此代码可能仍需改进,但与您的解决方案相比,它似乎已经相当高效了。

def fuzzy_trincot(haystack, needle, returnSegments = False):
    inf = float('inf')

    def getSolutionAt(node, depth, optimalCount = 2):
        if not depth: # reached end of needle
            node['count'] = 0
            return
        minCount = inf # infinity ensures also that incomplete branches are pruned
        child = node['child']
        i = node['i']+1
        # Optimisation: optimalCount gives the theoretical minimum number of  
        # segments needed for any solution. If we find such case, 
        # there is no need to continue the search.
        while child and minCount > optimalCount:
            # If this node was already evaluated, don't lose time recursing again.
            # It works without this condition, but that is less optimal.
            if 'count' not in child:
                getSolutionAt(child, depth-1, 1)
            count = child['count'] + (i < child['i'])
            if count < minCount:
                minCount = count
            child = child['sibling']
        # Store the results we found in this node, so if ever we come here again,
        # we don't need to recurse the same sub-tree again.
        node['count'] = minCount

    # Preprocessing: build tree
    # A node represents a needle character occurrence in the haystack.
    # A node can have these keys:
    #   i:       index in haystack where needle character occurs
    #   child:   node that represents a match, at the right of this index, 
    #            for the next needle character
    #   sibling: node that represents the next match for this needle character
    #   count:   the least number of additional segments needed for matching the 
    #            remaining needle characters (only; so not counting the segments
    #            already taken at the left)
    root = { 'i': -2, 'child': None, 'sibling': None }
    # Take a short-cut for when needle is a substring of haystack
    if haystack.find(needle) != -1:
        root['count'] = 1
    else:
        parent = root
        leftMostIndex = 0
        rightMostIndex = len(haystack)-len(needle)
        for j, c in enumerate(needle):
            sibling = None
            child = None
            # Use of leftMostIndex is an optimisation; it works without this argument
            i = haystack.find(c, leftMostIndex)
            # Use of rightMostIndex is an optimisation; it works without this test
            while 0 <= i <= rightMostIndex:
                node = { 'i': i, 'child': None, 'sibling': None }
                while parent and parent['i'] < i:
                    parent['child'] = node
                    parent = parent['sibling']
                if sibling: # not first child
                    sibling['sibling'] = node
                else: # first child
                    child = node
                    leftMostIndex = i+1
                sibling = node
                i = haystack.find(c, i+1)
            if not child: return False
            parent = child
            rightMostIndex += 1
        getSolutionAt(root, len(needle))

    count = root['count']
    if not returnSegments:
        return count

    # Use the `returnSegments` option when you need the character content 
    # of the segments instead of only the count. It runs in linear time.

    if count == 1: # Deal with short-cut case 
        return [needle]
    segments = []
    node = root['child']
    i = -2
    start = 0
    for end, c in enumerate(needle):
        i += 1
        # Find best child among siblings
        while (node['count'] > count - (i < node['i'])):
            node = node['sibling']
        if count > node['count']:
            count = node['count']
            if end:
                segments.append(needle[start:end])
                start = end
        i = node['i']
        node = node['child']
    segments.append(needle[start:])
    return segments

可以使用可选的第三个参数调用该函数:

haystack = "/home/user/photos/2016/pyongyang_photo1.png"
needle = "ph2016png"

print (fuzzy_trincot(haystack, needle))

print (fuzzy_trincot(haystack, needle, True))

输出:

3
['ph', '2016', 'png']

由于函数优化为 return 只有计数,第二次调用将增加一点执行时间。