Python:子序列搜索
Python: Sub sequence search
我有一个序列 'abccabac' 和一个子序列 'abc'。
我需要获取序列中所有出现的子序列 'abc' 的索引。
这样做的内存有效方法是什么?
示例:
输入:
**sequence**: 'abccabac' **subsequence**: 'abc'
输出:
012
013
017
057
457
我能想到的绝对最简单的方法是使用 itertools.combinations
import itertools
sequence = "abccabac"
subsequence = "abc"
for combo in itertools.combinations(enumerate(sequence), len(subsequence)):
if "".join(pair[1] for pair in combo) == subsequence:
print([pair[0] for pair in combo])
如果您的实际序列包含许多甚至不在子序列中的字符,那么在开始 combinations
之前过滤掉不相关的字符肯定会显着提高效率:
char_generator = (pair for pair in enumerate(sequence) if pair[1] in subsequence)
for combo in itertools.combinations(char_generator, len(subsequence)):
...
您也可以使用 all
而不是连接每个组合,它只会检查字符,直到发现一个字符不相等:
char_generator = (pair for pair in enumerate(sequence) if pair[1] in subsequence)
for combo in itertools.combinations(char_generator, len(subsequence)):
if all(pair[1]==char for pair,char in zip(combo,subsequence)):
print([pair[0] for pair in combo])
这是我刚刚放在一起的一个不同的替代方案,我想算法可以进一步优化,但这是我能想到的最好的。 sub_sequence_generator
为子序列的每个字母创建一个索引列表,然后每个递归级别的 sub_helper
遍历从最后一个索引之后开始的子序列的下一个字母的所有索引字母.
import itertools
import bisect
def sub_helper(cur_i, combo, all_options):
#cur_i is the index of the letter in the subsequence
#cur_idx is the index of the sequence which contains that letter
if cur_i>0:
prev_i = combo[cur_i-1]
else:
prev_i = -1
cur_options = all_options[cur_i]
start_i = bisect.bisect(cur_options,prev_i)
cur_i_gen = itertools.islice(cur_options,start_i,None)
if cur_i+1 == len(all_options): #last one:
for cur_idx in cur_i_gen:
combo[cur_i] = cur_idx
yield tuple(combo)
else:
for cur_idx in cur_i_gen:
combo[cur_i] = cur_idx
yield from sub_helper(cur_i+1, combo, all_options)
def sub_sequence_generator(sequence,sub_seq):
indices_map = {c:[] for c in set(sub_seq)} #create a list for each character
for i,c in enumerate(sequence):
try:
indices_map[c].append(i,) #the helper is simpler if they are all stored as single element tuples
except KeyError:
pass
# now we have indices for each character of the sub_sequence
# we just need to put them in a sequence that corelates to the subsequence
chr_indices = tuple(indices_map[c] for c in sub_seq)
return sub_helper(0,[None]*len(chr_indices), chr_indices)
sequence = "abccabac"
subsequence = "abc"
for idxs in sub_sequence_generator(sequence,subsequence):
print(idxs)
就内存而言,这将为子序列的每个字符创建一个列表,其中包含该字符所在的主序列中的索引。这些列表的元组,以及不断更新的索引列表和每次组合 yield
ed 的元组,以及像 islice
这样的迭代器,所以这是非常有效的内存,但是因为我尚未对其进行广泛测试,我不能保证它没有错误。
我有一个序列 'abccabac' 和一个子序列 'abc'。 我需要获取序列中所有出现的子序列 'abc' 的索引。 这样做的内存有效方法是什么?
示例:
输入:
**sequence**: 'abccabac' **subsequence**: 'abc'
输出:
012
013
017
057
457
我能想到的绝对最简单的方法是使用 itertools.combinations
import itertools
sequence = "abccabac"
subsequence = "abc"
for combo in itertools.combinations(enumerate(sequence), len(subsequence)):
if "".join(pair[1] for pair in combo) == subsequence:
print([pair[0] for pair in combo])
如果您的实际序列包含许多甚至不在子序列中的字符,那么在开始 combinations
之前过滤掉不相关的字符肯定会显着提高效率:
char_generator = (pair for pair in enumerate(sequence) if pair[1] in subsequence)
for combo in itertools.combinations(char_generator, len(subsequence)):
...
您也可以使用 all
而不是连接每个组合,它只会检查字符,直到发现一个字符不相等:
char_generator = (pair for pair in enumerate(sequence) if pair[1] in subsequence)
for combo in itertools.combinations(char_generator, len(subsequence)):
if all(pair[1]==char for pair,char in zip(combo,subsequence)):
print([pair[0] for pair in combo])
这是我刚刚放在一起的一个不同的替代方案,我想算法可以进一步优化,但这是我能想到的最好的。 sub_sequence_generator
为子序列的每个字母创建一个索引列表,然后每个递归级别的 sub_helper
遍历从最后一个索引之后开始的子序列的下一个字母的所有索引字母.
import itertools
import bisect
def sub_helper(cur_i, combo, all_options):
#cur_i is the index of the letter in the subsequence
#cur_idx is the index of the sequence which contains that letter
if cur_i>0:
prev_i = combo[cur_i-1]
else:
prev_i = -1
cur_options = all_options[cur_i]
start_i = bisect.bisect(cur_options,prev_i)
cur_i_gen = itertools.islice(cur_options,start_i,None)
if cur_i+1 == len(all_options): #last one:
for cur_idx in cur_i_gen:
combo[cur_i] = cur_idx
yield tuple(combo)
else:
for cur_idx in cur_i_gen:
combo[cur_i] = cur_idx
yield from sub_helper(cur_i+1, combo, all_options)
def sub_sequence_generator(sequence,sub_seq):
indices_map = {c:[] for c in set(sub_seq)} #create a list for each character
for i,c in enumerate(sequence):
try:
indices_map[c].append(i,) #the helper is simpler if they are all stored as single element tuples
except KeyError:
pass
# now we have indices for each character of the sub_sequence
# we just need to put them in a sequence that corelates to the subsequence
chr_indices = tuple(indices_map[c] for c in sub_seq)
return sub_helper(0,[None]*len(chr_indices), chr_indices)
sequence = "abccabac"
subsequence = "abc"
for idxs in sub_sequence_generator(sequence,subsequence):
print(idxs)
就内存而言,这将为子序列的每个字符创建一个列表,其中包含该字符所在的主序列中的索引。这些列表的元组,以及不断更新的索引列表和每次组合 yield
ed 的元组,以及像 islice
这样的迭代器,所以这是非常有效的内存,但是因为我尚未对其进行广泛测试,我不能保证它没有错误。