Itertools stop characters repeating in a row

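I've written the following code to make all strings 20 characters long using combinations of A, T, G and C.

However, I want to avoid more than 3 of the same character in a row, so I have added an if statement to check for this. The problem is that this check runs after the itertools code, so it's a bit slow. Is there a way to get itertools to produce this result directly, without having to run itertools and then the if check?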

import sys
import itertools
import re

x = ["A", "T", "G", "C"]
for i in itertools.product(x, repeat=20):
    i = "".join(i)
    # Skip any string in which one character appears three times in a row.
    if re.search(r"(\w)\1\1", i):
        continue
    else:
        sys.stdout.write(i)

On the face of it, the question appears to be asking this:

How can I filter this enormous list of strings without the pain of having to construct the whole list first?

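The answer is: you're already doing it! The things in itertools produce lazily generated sequences, which are built up iteratively. So your existing code does not generate an enormous list of billions of strings.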
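To make the laziness concrete, here is a minimal sketch (the variable names are only for illustration). It asks itertools.product for the same 20-character product but consumes just the first three results with itertools.islice, so the rest are never built:

```python
import itertools

# itertools.product returns an iterator; no tuple exists until it is requested.
candidates = itertools.product("ATGC", repeat=20)

# Consume only the first three results; the remaining 4**20 - 3 are never created.
first_three = ["".join(t) for t in itertools.islice(candidates, 3)]
print(first_three)
```

This returns immediately, which would not be possible if the full product were built up front.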

But there's a more interesting question you might have meant to ask:

If I generate the triplet-free strings by generating all the strings and filtering out the ones with triplets in, my code is having to do extra work because most of the strings generated will have triplets in them. Suppose the strings are generated in lexicographic order; then the first 4**17 of them will begin AAA, and we really ought to be able to skip over all of those. How can we do better?

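Unfortunately, if you want to do this then you'll have to write your own code for it; itertools doesn't provide this kind of "pattern-filtered product" functionality.

It might look something like this: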

# Generate all n-tuples with the property that their k-th element
# is one of the things returned by successors(initial (k-1)-tuple).
# So e.g. the first element is one of the things returned by
# successors(()).
def restricted_tuples(successors, n):
    assert n >= 0
    if n == 0:
        # Base case: the empty tuple is the only 0-tuple.
        yield ()
    else:
        for start in restricted_tuples(successors, n - 1):
            for t in successors(start):
                yield start + (t,)

# Yield every letter that may follow `start` without creating
# three identical letters in a row.
def successors_no_triples(start, alphabet):
    if len(start) < 2 or start[-1] != start[-2]:
        for t in alphabet:
            yield t
    else:
        # The last two letters match, so a third copy is not allowed.
        banned = start[-1]
        for t in alphabet:
            if t != banned:
                yield t

print([''.join(x) for x in restricted_tuples(lambda start: successors_no_triples(start, 'ABC'), 5)])

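The print at the end is just for illustration. If you wanted to print out all the many billions of strings from the original questioner's case, you'd do better to iterate over the sequence generated by restricted_tuples and stringify and print each one separately.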
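As a rough sketch of that streaming style, the snippet below writes each string as soon as it is produced instead of collecting a list. To keep it self-contained it uses a hypothetical stand-in generator, no_triple_strings, which builds strings directly rather than tuples; it is not the code above, only the same pruning idea in compact form. The islice cap of 5 is there just so the demo terminates quickly.

```python
import itertools
import sys

def no_triple_strings(alphabet, n, prefix=""):
    """Lazily yield each length-n string with no letter three times in a row."""
    if len(prefix) == n:
        yield prefix
        return
    for ch in alphabet:
        # Skip ch if it would extend a run of two identical letters to three.
        if len(prefix) >= 2 and prefix[-1] == prefix[-2] == ch:
            continue
        yield from no_triple_strings(alphabet, n, prefix + ch)

# Write one string per line as it is generated; nothing is stored in a list.
for s in itertools.islice(no_triple_strings("ATGC", 20), 5):
    sys.stdout.write(s + "\n")
```

Dropping the islice wrapper would stream the full sequence, with memory use that stays flat because only the current prefix chain is alive at any time.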

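Incidentally, the number of sequences of length 20 over 4 letters with this property turns out to be 415,289,569,968. If you try to generate them all you'll be waiting a while, especially if you actually want to do anything with each one.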
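That count can be checked without generating anything. The sketch below is one way to do it (the function name count_no_triples is made up for this example): it tracks how many valid strings end in a run of one letter and how many end in a run of two, and extends both counts one position at a time.

```python
def count_no_triples(n, k):
    """Count length-n strings over k letters with no letter three times in a row."""
    if n == 0:
        return 1
    # run1: valid strings whose final run has length exactly one;
    # run2: valid strings whose final run has length exactly two.
    run1, run2 = k, 0
    for _ in range(n - 1):
        # Appending a different letter (k - 1 choices) starts a new run of one;
        # repeating the last letter turns a run of one into a run of two.
        run1, run2 = (k - 1) * (run1 + run2), run1
    return run1 + run2

print(f"{count_no_triples(20, 4):,}")  # prints 415,289,569,968
```

For comparison, 4**20 is 1,099,511,627,776, so roughly 38% of all 20-character strings survive the filter.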