Counting the frequency of a repeated-character substring in Python: match the full run of the character, not the minimal match that count() uses

The calls to required_function() and the outputs I expect from it are as follows:

>>> print(required_function("1112211000022", "1"))
0
>>> print(required_function("1112211000022", "11"))
1
>>> print(required_function("1112211000022", "111"))
1
>>> print(required_function("1112211000022", "0"))
0
>>> print(required_function("1112211000022", "00"))
0
>>> print(required_function("1112211000022", "0000"))
1
>>> print(required_function("1112211000022", "22"))
2

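I tried Python's count() method, but it returns the number of non-overlapping matches of the search substring, so it also counts matches that sit inside a longer run of the same character.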

"1112211000022".count('11') # returns 2, but the desired output is 1: only one run is exactly "11"
"1112211000022".count('1')  # returns 5, but the desired output is 0: no run is exactly "1"
"1112211000022".count('0')  # returns 4, but the desired output is 0: no run is exactly "0"
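
One way to state the requirement precisely is to split the string into maximal runs of identical characters and count only the runs that equal the target. The following is a minimal sketch of that idea using itertools.groupby and collections.Counter; the helper name run_counts is hypothetical and not part of the question.

```python
from collections import Counter
from itertools import groupby

def run_counts(s):
    """Tally the maximal runs of identical characters in s, keyed by the run itself."""
    # groupby yields one group per stretch of consecutive equal characters.
    return Counter("".join(group) for _, group in groupby(s))

counts = run_counts("1112211000022")
# The runs are '111', '22', '11', '0000', '22'.
print(counts["22"])   # 2
print(counts["11"])   # 1
print(counts["1"])    # 0 (a Counter returns 0 for missing keys)
```

Because the tally is built in a single pass, it answers any number of targets for the same sequence without rescanning it.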

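Goal: this function will be used to count occurrences of amino acids in protein sequences. I am also trying to write an efficient version myself and will share it once it is done. The function needs to be time-efficient, because a single file for a single species contains about 50,000 protein sequences with an average length of 250.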

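If you are going to do this kind of analysis, you should become proficient with regular expressions. They are a common tool for analysing sequences like these. Your current problem is easy to handle with negative lookahead/lookbehind: build a regular expression that finds the target only where it is neither preceded nor followed by the character in the pattern. One approach is: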

import re

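def required_function(s, t):
    # t is assumed to be a non-empty run of one repeated character, e.g. "11" or "0000".
    c = re.escape(t[0])
    # Match t only where it is neither preceded nor followed by the same character.
    rx = re.compile(f'(?<!{c}){re.escape(t)}(?!{c})')
    return len(rx.findall(s))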

assert required_function("1112211000022", "1") == 0
assert required_function("1112211000022", "11") == 1
assert required_function("1112211000022", "111") == 1
assert required_function("1112211000022", "0") == 0
assert required_function("1112211000022", "00") == 0
assert required_function("1112211000022", "0000") == 1
assert required_function("1112211000022", "22") == 2

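This will handle 50,000 examples in a matter of milliseconds, and it can be improved further by reusing the compiled regular expressions wherever that is sensible.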
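
As a rough sketch of what reusing the compiled pattern could look like, the version below caches one compiled regex per target with functools.lru_cache and applies it to a whole list of sequences. The names run_pattern and count_in_sequences and the sample data are hypothetical; real input would come from the sequence file.

```python
import re
from functools import lru_cache

@lru_cache(maxsize=None)
def run_pattern(t):
    """Compile, once per target, a regex that matches t only as a complete run."""
    # Assumes t is a non-empty run of one repeated character.
    c = re.escape(t[0])
    return re.compile(f"(?<!{c}){re.escape(t)}(?!{c})")

def count_in_sequences(sequences, t):
    """Total number of exact-length runs of t across all sequences."""
    rx = run_pattern(t)  # compiled on the first call, served from the cache afterwards
    return sum(len(rx.findall(s)) for s in sequences)

# Hypothetical sample data standing in for the protein sequences.
sequences = ["1112211000022", "221100", "0000"]
print(count_in_sequences(sequences, "22"))    # 3
print(count_in_sequences(sequences, "0000"))  # 2
```

With the pattern compiled once per target, the per-sequence work is a single findall call, which keeps the loop over tens of thousands of sequences simple.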