Longest common substring in a text that is inside a list of strings

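I've run into a problem that resembles the longest common substring problem, but with a twist. It goes as follows:

A list of strings lst and a string text are provided. The text may or may not contain substrings that are present in the list. I need the first longest substring of text that is in lst, considering that you start checking text from the back. "First from the back" means that you iterate over text starting from the last word, match the longest substring, and return as soon as you hit a character that breaks the substring match.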

For example, if

lst = ['abcd', 'x', 'xy', 'xyz', 'abcdxyz']
text = 'abcd abcd xyz xyz'

then the answer would be the last xyz in text, because you start checking from the back, it is in lst, and it is a substring of text.

Also, within text, the substrings can be separated by any character that is not in [A-Za-z], but usually they are separated by spaces.
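To make the separator rule concrete, here is a minimal sketch of how text could be split into words when every character outside [A-Za-z] counts as a separator. The helper name split_words is hypothetical. Note that applying the rule literally also drops digits, so an entry such as '2 street east' from the test cases would lose its '2'; whether digits should count as word characters is an open assumption.

```python
import re

def split_words(text):
    # Every run of ASCII letters is a word; any other character separates words
    return re.findall(r'[A-Za-z]+', text)

print(split_words('abcd abcd,xyz-xyz'))
# ['abcd', 'abcd', 'xyz', 'xyz']
```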

I need an algorithm that solves this. Pseudocode or a Python program would be fine.

Some test cases

lst = ['brisbane', 'east brisbane', '2 street east']
text = '2 street east brisbane'
# answer is east brisbane

lst = ['brisbane', 'east brisbane', '2 street east']
text = '2 street east east brisbane brisbane xyz'
# answer is brisbane

lst = ['sale', 'yarrabilba']
text = 'sale yarrabilba'
# answer is yarrabilba

lst = ['sale', 'yarrabilba']
text = 'abc fgh xyz'
# answer is None

text = 'A Street Name Some Words Suburb Name' 
lst = ['A Street Name', 'Suburb Name']
# answer is 'Suburb Name'

text = 'A Street Name Some Words Name Suburb Name' 
lst = ['A Street Name', 'Suburb Name', 'Name']
# answer is 'Suburb Name'

def findstem(arr):
 
    # Determine size of the array
    n = len(arr)
 
    # Take first word from array
    # as reference
    s = arr[0]
    l = len(s)
 
    res = ""
 
    for i in range(l):
        for j in range(i + 1, l + 1):
 
            # generating all possible substrings
            # of our reference string arr[0] i.e s
            stem = s[i:j]

            # Check whether the generated stem occurs in every
            # other word; all() avoids relying on a stale loop
            # index when the mismatch happens on the last word
            is_common = all(stem in arr[k] for k in range(1, n))

            # Keep the stem if it is common to all strings and
            # longer than the current result
            if is_common and len(res) < len(stem):
                res = stem
 
    return res

Let's use the better example from your comment:

In the original problem, text can be a string like 'A Street Name Some Words Suburb Name', and lst can be ['A Street Name', 'Suburb Name'], then I would like to match 'Suburb Name' only. It is not possible that 'A Street Name' comes after 'Suburb Name'

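If you had to find the first match in the sentence, the task would be easy: you could use a regular expression and re.finditer. So let's rework the input by reversing the order of the words and do exactly that!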

text = 'A Street Name Some Words Suburb Name'
lst  = ['A Street Name', 'Suburb Name']

import re

# define a helper function to reverse words
rev = lambda x: ' '.join(reversed(x.split()))

# invert words in the query
txet = rev(text)
# 'Name Suburb Words Some Name Street A'

# invert words in the searched strings
tsl  = [rev(e) for e in sorted(lst, key=len, reverse=True)]
# ['Name Street A', 'Name Suburb']

# find the "first" match in the reversed text
# (re.escape keeps regex metacharacters in lst literal)
m = re.finditer('|'.join(map(re.escape, tsl)), txet)
try:
    out = rev(next(m).group())
except StopIteration:
    out = None

Output: 'Suburb Name'

Example #2:

lst = ['brisbane', 'east brisbane', '2 street east']
text = '2 street east east brisbane xyz'

Output #2: 'east brisbane'
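
The word-reversal approach above can be wrapped into a reusable function and checked against the test cases from the question. This is a minimal sketch, not a definitive implementation: the function name last_longest_match is hypothetical, and the re.escape call, the letter lookarounds (which stop an entry from matching inside a longer word) and the filtering of blank entries are additions that are not part of the snippet above. It also assumes that words in text are separated by whitespace.

```python
import re

def last_longest_match(lst, text):
    """Return the entry of lst found closest to the end of text,
    preferring the longest entry when several match at the same word."""
    def rev(s):
        # Reverse the order of the whitespace-separated words
        return ' '.join(reversed(s.split()))

    # Ignore blank entries; they would match the empty string
    entries = [e for e in lst if e.strip()]
    if not entries:
        return None

    # Longest entries first, so the alternation tries them first
    patterns = [re.escape(rev(e)) for e in sorted(entries, key=len, reverse=True)]
    # Assumption: a match may not start or end in the middle of a run of letters
    regex = r'(?<![A-Za-z])(?:' + '|'.join(patterns) + r')(?![A-Za-z])'

    # The leftmost match in the reversed text is the last one in the original
    m = re.search(regex, rev(text))
    return rev(m.group()) if m else None

places = ['brisbane', 'east brisbane', '2 street east']
print(last_longest_match(places, '2 street east brisbane'))
# east brisbane
print(last_longest_match(places, '2 street east east brisbane brisbane xyz'))
# brisbane
print(last_longest_match(['sale', 'yarrabilba'], 'abc fgh xyz'))
# None
```

Because the alternation is ordered by length, ties are resolved in favour of the longer entry only when both candidates begin at the same word of the reversed text, which is the behaviour the test cases in the question ask for.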