寻找一个 python 函数来查找字符串中最长的连续重复子串
Looking for a python function to find the longest sequentially repeated substring in a string
我正在用 DNA 序列进行一些编码,我对查找连续重复的函数很感兴趣(这可能代表引物可以 'slip' 也就是做坏事的地方)。
我感兴趣的示例如下:
longest_repeat('ATTTTCCATGATGATG')
这将输出重复长度和坐标,在本例中为 9 长和 7:15。该函数应该在最后提取 ATGATGATG,因为它比 TTTT 重复和 TGATGA 重复长,它只会报告 ATGATGATG。如果是领带,我希望它可以报告所有领带,或者至少报告其中之一。
最好设置一个阈值,仅在这些连续重复超过特定长度时才报告它们。
我在 python 方面有一些经验,但这个具体问题让我很困惑,因为如果我编码效率低下并放入一个 50 个字符长的字符串,它可能会花很长时间。感谢所有帮助!
以下将非常有效地工作。它 returns 最长的序列,它的长度,它的起始索引和它的结束索引。如果有多个最大长度的序列,结果将是它们的列表。函数 longest(s, threshold) 中的第二个参数是所需的 threshold-minimum 长度:
import numpy as np
def b(n): #it returns the factors of an integer. It will be used in next function
r = np.arange(1, int(n ** 0.5) + 1)
x = r[np.mod(n, r) == 0]
return set(np.concatenate((x, n / x), axis=None))
def isseq(s): #it tests if a string is a sequence. Using the result from previous function it compares all smaller parts of the devided string to check if they are equal
l=[int(p) for p in sorted(list(b(len(s))))[:-1]]
for i in l:
if len(set(s[k*i:i*(k+1)] for k in range(len(s)//i)))==1:
return True
return False
def longest(s, threshold): #the main function that returns the lenghtier sequense or a list of them if they are multiple, using a threshold as minimum length
m=[]
for i in range(len(s), threshold-1, -1):
for k in range(len(s)-i+1):
if isseq(s[k:k+i]):
m.append([s[k:k+i], i, k, k+i-1])
if len(m)>0:
return m
return False
示例:
>>>s='ATTTTCCATGATGATGGST'
>>> longest(s, 1)
[['ATGATGATG', 9, 7, 15]]
>>> s='ATTTTCCATGATGATGGSTLWELWELWEGFRJGHIJH'
>>> longest(s, 1)
[['ATGATGATG', 9, 7, 15], ['LWELWELWE', 9, 19, 27]]
>>>s='ATTTTCCATGATGATGGSTWGTKWKWKWKWKWKWKWKWKWKWKWFRGWLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERFGTFRGFTRUFGFGRFGRGBHJ'
>>> longest(longs, 1)
[['LWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWER', 64, 48, 111]]
这是一个解决方案:
def longest_repeat(seq, threshold):
results = []
longest = threshold
# starting position
for i in range(len(seq)):
# pattern period
for p in range(1, (len(seq)-i)//2+1):
# skip unecessary combinations
if results != [] and results[-1][0] == i and results[-1][3] % p == 0: continue
# max possible number of repetitions
repetitions = len(seq)//p
# position within the pattern's period
for k in range(p):
# get the max repetitions the k-th character in the period can support
m = 1
while i+k+m*p < len(seq) and seq[i+k] == seq[i+k+m*p]:
m += 1
repetitions = min(m, repetitions)
# check if we're already below the best result so far
if repetitions*p < longest: break
# save the result if it's good
if repetitions > 1 and repetitions*p >= longest:
# overwrite lesser results
if repetitions*p > longest: results = []
# store the current one (with ample information)
results += [(i, seq[i:i+p], repetitions, repetitions*p)]
longest = max(longest, repetitions*p)
return results
逻辑是,您 运行 通过序列中的每个起始位置 (i
),检查每个合理的模式周期 (p
),对于该组合,您检查是否它们产生的子串至少与迄今为止最好的子串一样好(或阈值,如果尚未找到结果)。
结果是 (starting index, period string, repetitions, total length)
形式的元组列表。 运行 你的榜样
threshold = 5
seq = 'ATTTCCATGATGATG'
t = time.time()
results = longest_repeat(seq, threshold)
print("execution time :", time.time()-t)
for t in results:
print(t)
我们得到
exec : 0.00010848045349121094
(6, 'ATG', 3, 9)
从那里,获得完整匹配的字符串是微不足道的(只需执行 period_string * repetitions
)
对于 700 个字符的随机输入,执行时间约为 6.8 秒,而使用@IoaTzimas 的答案时约为 20.2 秒。
我正在用 DNA 序列进行一些编码,我对查找连续重复的函数很感兴趣(这可能代表引物可以 'slip' 也就是做坏事的地方)。
我感兴趣的示例如下:
longest_repeat('ATTTTCCATGATGATG')
这将输出重复长度和坐标,在本例中为 9 长和 7:15。该函数应该在最后提取 ATGATGATG,因为它比 TTTT 重复和 TGATGA 重复长,它只会报告 ATGATGATG。如果是领带,我希望它可以报告所有领带,或者至少报告其中之一。
最好设置一个阈值,仅在这些连续重复超过特定长度时才报告它们。
我在 python 方面有一些经验,但这个具体问题让我很困惑,因为如果我编码效率低下并放入一个 50 个字符长的字符串,它可能会花很长时间。感谢所有帮助!
以下将非常有效地工作。它 returns 最长的序列,它的长度,它的起始索引和它的结束索引。如果有多个最大长度的序列,结果将是它们的列表。函数 longest(s, threshold) 中的第二个参数是所需的 threshold-minimum 长度:
import numpy as np
def b(n): #it returns the factors of an integer. It will be used in next function
r = np.arange(1, int(n ** 0.5) + 1)
x = r[np.mod(n, r) == 0]
return set(np.concatenate((x, n / x), axis=None))
def isseq(s): #it tests if a string is a sequence. Using the result from previous function it compares all smaller parts of the devided string to check if they are equal
l=[int(p) for p in sorted(list(b(len(s))))[:-1]]
for i in l:
if len(set(s[k*i:i*(k+1)] for k in range(len(s)//i)))==1:
return True
return False
def longest(s, threshold): #the main function that returns the lenghtier sequense or a list of them if they are multiple, using a threshold as minimum length
m=[]
for i in range(len(s), threshold-1, -1):
for k in range(len(s)-i+1):
if isseq(s[k:k+i]):
m.append([s[k:k+i], i, k, k+i-1])
if len(m)>0:
return m
return False
示例:
>>>s='ATTTTCCATGATGATGGST'
>>> longest(s, 1)
[['ATGATGATG', 9, 7, 15]]
>>> s='ATTTTCCATGATGATGGSTLWELWELWEGFRJGHIJH'
>>> longest(s, 1)
[['ATGATGATG', 9, 7, 15], ['LWELWELWE', 9, 19, 27]]
>>>s='ATTTTCCATGATGATGGSTWGTKWKWKWKWKWKWKWKWKWKWKWFRGWLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERFGTFRGFTRUFGFGRFGRGBHJ'
>>> longest(longs, 1)
[['LWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWER', 64, 48, 111]]
这是一个解决方案:
def longest_repeat(seq, threshold):
results = []
longest = threshold
# starting position
for i in range(len(seq)):
# pattern period
for p in range(1, (len(seq)-i)//2+1):
# skip unecessary combinations
if results != [] and results[-1][0] == i and results[-1][3] % p == 0: continue
# max possible number of repetitions
repetitions = len(seq)//p
# position within the pattern's period
for k in range(p):
# get the max repetitions the k-th character in the period can support
m = 1
while i+k+m*p < len(seq) and seq[i+k] == seq[i+k+m*p]:
m += 1
repetitions = min(m, repetitions)
# check if we're already below the best result so far
if repetitions*p < longest: break
# save the result if it's good
if repetitions > 1 and repetitions*p >= longest:
# overwrite lesser results
if repetitions*p > longest: results = []
# store the current one (with ample information)
results += [(i, seq[i:i+p], repetitions, repetitions*p)]
longest = max(longest, repetitions*p)
return results
逻辑是,您 运行 通过序列中的每个起始位置 (i
),检查每个合理的模式周期 (p
),对于该组合,您检查是否它们产生的子串至少与迄今为止最好的子串一样好(或阈值,如果尚未找到结果)。
结果是 (starting index, period string, repetitions, total length)
形式的元组列表。 运行 你的榜样
threshold = 5
seq = 'ATTTCCATGATGATG'
t = time.time()
results = longest_repeat(seq, threshold)
print("execution time :", time.time()-t)
for t in results:
print(t)
我们得到
exec : 0.00010848045349121094
(6, 'ATG', 3, 9)
从那里,获得完整匹配的字符串是微不足道的(只需执行 period_string * repetitions
)
对于 700 个字符的随机输入,执行时间约为 6.8 秒,而使用@IoaTzimas 的答案时约为 20.2 秒。