Longest common substring in a text that is inside a list of strings
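I've come across a problem that is similar to the longest common substring problem, but with a modification. It is as follows:
A list of strings lst and a string text are provided. The text may or may not contain substrings that are present in the list. I need the first longest substring of text that is inside lst, given that you start checking text from the back. By "first" and "from the back" I mean that you iterate over text starting from the last word, match the longest substring, and return once you encounter a character that breaks the substring match.
For example, if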
lst = ['abcd', 'x', 'xy', 'xyz', 'abcdxyz']
text = 'abcd abcd xyz xyz'
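then the answer is the last xyz in text, because you start checking from the back of text, it is inside lst, and it is a substring of text.
'abcd' is not the answer because it occurs before xyz in text.
'abcdxyz' is not the answer because it is only a subsequence of text, not a contiguous substring.
Also, within text, substrings can be separated by any character that is not in [A-Za-z], but usually they are separated by spaces.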
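To make the separator rule concrete, here is a minimal sketch that splits a text into words on any character outside [A-Za-z]. The helper name split_words is hypothetical and not part of the question. Note that entries containing digits, such as '2 street east' in the test cases, would need a wider character class than the one stated here.

```python
import re

def split_words(text):
    """Return the maximal runs of ASCII letters found in text."""
    return re.findall(r'[A-Za-z]+', text)

words = split_words('abcd,abcd;xyz-xyz')
# words == ['abcd', 'abcd', 'xyz', 'xyz']
```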
I need an algorithm to solve this. Pseudocode or a Python program would be fine.
Some test cases:
lst = ['brisbane', 'east brisbane', '2 street east']
text = '2 street east brisbane'
# answer is east brisbane
lst = ['brisbane', 'east brisbane', '2 street east']
text = '2 street east east brisbane brisbane xyz'
# answer is brisbane
lst = ['sale', 'yarrabilba']
text = 'sale yarrabilba'
# answer is yarrabilba
lst = ['sale', 'yarrabilba']
text = 'abc fgh xyz'
# answer is None
text = 'A Street Name Some Words Suburb Name'
lst = ['A Street Name', 'Suburb Name']
# answer is 'Suburb Name'
text = 'A Street Name Some Words Name Suburb Name'
lst = ['A Street Name', 'Suburb Name', 'Name']
# answer is 'Suburb Name'
def findstem(arr):
    # Determine size of the array
    n = len(arr)
    # Take first word from array
    # as reference
    s = arr[0]
    l = len(s)
    res = ""
    for i in range(l):
        for j in range(i + 1, l + 1):
            # generating all possible substrings
            # of our reference string arr[0] i.e s
            stem = s[i:j]
            # Check if the generated stem is
            # common to all words
            common = all(stem in arr[k] for k in range(1, n))
            # If current substring is present in
            # all strings and its length is greater
            # than current result
            if common and len(res) < len(stem):
                res = stem
    return res
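For context, the function above appears to compute the classic longest common substring of all strings in arr, which ignores where a match sits in text. The sketch below is a compact variant under that assumption (the name longest_common_substring is hypothetical). It shows why that approach does not fit this problem: on the first example it picks 'abcd' rather than the trailing 'xyz'.

```python
def longest_common_substring(strings):
    """Return a longest substring shared by every string, or '' if none."""
    if not strings:
        return ""
    # Any common substring must fit inside the shortest string
    ref = min(strings, key=len)
    # Try longer candidates first so the first hit is a longest one
    for size in range(len(ref), 0, -1):
        for start in range(len(ref) - size + 1):
            cand = ref[start:start + size]
            if all(cand in s for s in strings):
                return cand
    return ""

result = longest_common_substring(['abcd abcd xyz xyz', 'abcdxyz'])
# result == 'abcd', not the 'xyz' that the question asks for
```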
Let's use the better example from your comment:
In the original problem, text can be a string like 'A Street Name Some Words Suburb Name', and lst can be ['A Street Name', 'Suburb Name'], then I would like to match 'Suburb Name' only. It is not possible that 'A Street Name' comes after 'Suburb Name'.
If you had to find the first match in the sentence, the task would be easy: you could use a regular expression and re.finditer. So let's reverse the words of the input and do exactly that!
text = 'A Street Name Some Words Suburb Name'
lst = ['A Street Name', 'Suburb Name']
import re
# define a helper function to reverse words
rev = lambda x: ' '.join(reversed(x.split()))
# invert words in the query
txet = rev(text)
# 'Name Suburb Words Some Name Street A'
# invert words in the searched strings
tsl = [rev(e) for e in sorted(lst, key=len, reverse=True)]
# ['Name Street A', 'Name Suburb']
# find "first" match
# escape each entry so regex metacharacters are matched literally
m = re.finditer('|'.join(map(re.escape, tsl)), txet)
try:
    out = rev(next(m).group())
except StopIteration:
    out = None
Output: 'Suburb Name'
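The sort by length in the code above matters because a regex alternation takes the first alternative that matches at a given position, not the longest one. A small sketch of that behavior, using the entries from the question's first example:

```python
import re

candidates = ['x', 'xy', 'xyz']

# Alternatives in list order: 'x' wins because it is tried first
unsorted_pattern = '|'.join(map(re.escape, candidates))
short_match = re.search(unsorted_pattern, 'xyz').group()

# Longest alternatives first: the full 'xyz' is matched
sorted_pattern = '|'.join(map(re.escape, sorted(candidates, key=len, reverse=True)))
long_match = re.search(sorted_pattern, 'xyz').group()
```

Here short_match is 'x' and long_match is 'xyz'.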
Example #2:
lst = ['brisbane', 'east brisbane', '2 street east']
text = '2 street east east brisbane xyz'
Output #2: 'east brisbane'
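The same idea can be wrapped in a function and checked against the test cases from the question. This is only a sketch under a few assumptions: words are separated by whitespace, every entry of lst starts and ends with a letter or digit (so that the added \b word boundaries behave as expected), and the name last_longest_match is hypothetical.

```python
import re

def last_longest_match(text, lst):
    """Return the entry of lst that matches closest to the end of text,
    preferring longer entries, or None if nothing matches."""
    def rev(s):
        # Reverse the word order (this also collapses runs of whitespace)
        return ' '.join(reversed(s.split()))

    # Longest entries first: alternation takes the first alternative that matches
    candidates = sorted((s for s in lst if s.strip()), key=len, reverse=True)
    if not candidates:
        return None
    # \b keeps an entry such as 'sale' from matching inside 'wholesale'
    pattern = r'\b(?:' + '|'.join(re.escape(rev(s)) for s in candidates) + r')\b'
    m = re.search(pattern, rev(text))
    return rev(m.group()) if m else None

lst = ['brisbane', 'east brisbane', '2 street east']
print(last_longest_match('2 street east brisbane', lst))
print(last_longest_match('2 street east east brisbane brisbane xyz', lst))
print(last_longest_match('abc fgh xyz', ['sale', 'yarrabilba']))
```

The three calls print east brisbane, brisbane and None, in line with the expected answers listed in the question.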