如何生成此自定义字母数字序列?

How to generate this custom alpha-numeric sequence?

我想创建一个程序来生成特定的 7 长字符串。

必须遵循以下规则:

  1. 0-9在a-z之前,在A-Z之前

  2. 长度为 7 个字符。

  3. 每个字符必须不同于两个关闭(例如'NN'是不允许的)

  4. 我需要所有可能的组合,从 0000000 递增到 ZZZZZZZ,但不是随机序列

我已经用这段代码完成了:

from string import digits, ascii_uppercase, ascii_lowercase
from itertools import product

chars = digits + ascii_lowercase + ascii_uppercase

for n in range(7, 8):
    for comb in product(chars, repeat=n):
        if (comb[6] != comb[5] and comb[5] != comb[4] and comb[4] != comb[3] and comb[3] != comb[2] and comb[2] != comb[1] and comb[1] != comb[0]):
            print ''.join(comb)

但它根本不高效,因为我必须等待很长时间才能进行下一次组合。

有人可以帮助我吗?

极端情况这里不处理,可以这样处理

import random
from string import digits, ascii_uppercase, ascii_lowercase

len1 = random.randint(1, 7)
len2 = random.randint(1, 7-len1)
len3 = 7 - len1 - len2
print len1, len2, len3
result = ''.join(random.sample(digits, len1) + random.sample(ascii_lowercase, len2) + random.sample(ascii_uppercase, len3))

试试这个

import string
import random

a = ''.join(random.choice(string.ascii_lowercase + string.ascii_uppercase + string.digits) for _ in range(7))
print(a)

如果您想要的是符合上述规则的随机字符串,您可以使用如下内容:

def f():
  digitLen = random.randrange(8)
  smallCharLen = random.randint(0, 7 - digitLen)
  capCharLen = 7 - (smallCharLen + digitLen)
  print (str(random.randint(0,10**digitLen-1)).zfill(digitLen) +
      "".join([random.choice(ascii_lowercase) for i in range(smallCharLen)]) +
      "".join([random.choice(ascii_uppercase) for i in range(capCharLen)]))

我没有添加重复字符规则,但是如果你有字符串,使用字典很容易过滤掉不需要的字符串。您还可以通过对段长度设置条件来固定每个段的长度。

编辑:一个小错误。

原始实现生成第一个结果需要很长时间的原因是从 0000000 开始时需要很长时间才能达到 0101010 的第一个有效值使用产品时。

这是一个生成有效序列而不是丢弃无效序列的递归版本:

from string import digits, ascii_uppercase, ascii_lowercase
from sys import argv
from itertools import combinations_with_replacement, product

all_chars=[digits, ascii_lowercase, ascii_uppercase]

def seq(char_sets, start=None):
    for char_set in char_sets:
        for val in seqperm(char_set, start):
            yield val

def seqperm(char_set, start=None, exclude=None):
    left_chars, remaining_chars=char_set[0], char_set[1:]
    if start:
        try:
            left_chars=left_chars[left_chars.index(start[0]):]
            start=start[1:]
        except:
            left_chars=''
    for left in left_chars:
        if left != exclude:
            if len(remaining_chars) > 0:
                for right in seqperm(remaining_chars, start, left):
                    yield left + right
            else:
                yield left

if __name__ == "__main__":
    count=int(argv[1])
    start=None
    if len(argv) == 3:
        start=argv[2]
    # char_sets=list(combinations_with_replacement(all_chars, 7))
    char_sets=[[''.join(all_chars)] * 7]
    for idx, val in enumerate(seq(char_sets, start)):
        if idx == count:
            break
        print idx, val

运行如下:

./permute.py 10 

输出:

0 0101010
1 0101012
2 0101013
3 0101014
4 0101015
5 0101016
6 0101017
7 0101018
8 0101019
9 010101a

如果您传递一个额外的参数,那么脚本会跳到序列中以第三个参数开头的部分,如下所示:

./permute.py 10 01234Z

如果要求仅生成小写字母始终跟随数字且大写字母始终跟随数字和小写字母的排列,则注释掉行 char_sets=[[''.join(all_chars)] * 7] 并使用行 char_sets=list(combinations_with_replacement(all_chars, 7)).

上述命令行的示例输出 char_sets=list(combinations_with_replacement(all_chars, 7)):

0 01234ZA
1 01234ZB
2 01234ZC
3 01234ZD
4 01234ZE
5 01234ZF
6 01234ZG
7 01234ZH
8 01234ZI
9 01234ZJ

char_sets=[[''.join(all_chars)] * 7]:

相同命令行的示例输出
0 01234Z0
1 01234Z1
2 01234Z2
3 01234Z3
4 01234Z4
5 01234Z5
6 01234Z6
7 01234Z7
8 01234Z8
9 01234Z9

无需递归即可实现上述内容,如下所示。性能特征变化不大:

from string import digits, ascii_uppercase, ascii_lowercase
from sys import argv
from itertools import combinations_with_replacement, product, izip_longest

all_chars=[digits, ascii_lowercase, ascii_uppercase]

def seq(char_sets, start=''):
    for char_set in char_sets:
        for val in seqperm(char_set, start):
            yield val

def seqperm(char_set, start=''):
    iters=[iter(chars) for chars in char_set]
    # move to starting point in sequence if specified
    for char, citer, chars in zip(list(start), iters, char_set):
        try:
            for _ in range(0, chars.index(char)):
                citer.next()
        except ValueError:
            raise StopIteration
    pos=0
    val=''
    while True:
        citer=iters[pos]
        try:
            char=citer.next()
            if val and val[-1] == char:
                char=citer.next()
            if pos == len(char_set) - 1:
                yield val+char
            else:
                val = val + char
                pos += 1
        except StopIteration:
            if pos == 0:
                raise StopIteration
            iters[pos] = iter(chars)
            pos -= 1
            val=val[:pos]

if __name__ == "__main__":
    count=int(argv[1])
    start=''
    if len(argv) == 3:
        start=argv[2]
    # char_sets=list(combinations_with_replacement(all_chars, 7))
    char_sets=[[''.join(all_chars)] * 7]
    for idx, val in enumerate(seq(char_sets, start)):
        if idx == count:
            break
        print idx, val

带缓存的递归版本也是可能的,它生成结果的速度更快,但灵活性较低。

编辑:我已经更新了解决方案以使用长度大于 4 的缓存短序列。这显着加快了计算速度。使用简单版本,生成所有长度为 7 的序列需要 18.5 小时,但使用新方法只需 4.5 小时。

我将让文档字符串完成所有描述解决方案的讨论。

"""
Problem:
    Generate a string of N characters that only contains alphanumerical
    characters. The following restrictions apply:
        * 0-9 must come before a-z, which must come before A-Z
        * it's valid to not have any digits or letters in a sequence
        * no neighbouring characters can be the same
        * the sequences must be in an order as if the string is base62, e.g.,
          01010...01019, 0101a...0101z, 0101A...0101Z, 01020...etc

Solution:
    Implement a recursive approach which discards invalid trees. For example,
    for "---" start with "0--" and recurse. Try "00-", but discard it for
    "01-". The first and last sequences would then be "010" and "ZYZ".

    If the previous character in the sequence is a lowercase letter, such as
    in "02f-", shrink the pool of available characters to a-zA-Z. Similarly,
    for "9gB-", we should only be working with A-Z.

    The input also allows to define a specific sequence to start from. For
    example, for "abGH", each character will have access to a limited set of
    its pool. In this case, the last letter can iterate from H to Z, at which
    point it'll be free to iterate its whole character pool next time around.

    When specifying a starting sequence, if it doesn't have enough characters
    compared to `length`, it will be padded to the right with characters free
    to explore their character pool. For example, for length 4, the starting
    sequence "29" will be transformed to "29  ", where we will deal with two
    restricted characters temporarily.

    For long lengths the function internally calls a routine which relies on
    fewer recursions and cached results. Length 4 has been chosen as optimal
    in terms of precomputing time and memory demands. Briefly, the sequence is
    broken into a remainder and chunks of 4. For each preceeding valid
    subsequence, all valid following subsequences are fetched. For example, a
    sequence of six would be split into "--|----" and for "fB|----" all
    subsequences of 4 starting A, C, D, etc would be produced.

Examples:
    >>> for i, x in enumerate(generate_sequences(7)):
    ...    print i, x
    0, 0101010
    1, 0101012
    etc

    >>> for i, x in enumerate(generate_sequences(7, '012abcAB')):
    ...    print i, x
    0, 012abcAB
    1, 012abcAC
    etc

    >>> for i, x in enumerate(generate_sequences(7, 'aB')):
    ...    print i, x
    0, aBABABA
    1, aBABABC
    etc
"""

import string

ALLOWED_CHARS = (string.digits + string.ascii_letters,
                 string.ascii_letters,
                 string.ascii_uppercase,
                 )
CACHE_LEN = 4

def _generate_sequences(length, sequence, previous=''):
    char_set = ALLOWED_CHARS[previous.isalpha() * (2 - previous.islower())]
    if sequence[-length] != ' ':
        char_set = char_set[char_set.find(sequence[-length]):]
        sequence[-length] = ' '
    char_set = char_set.replace(previous, '')

    if length == 1:
        for char in char_set:
            yield char
    else:
        for char in char_set:
            for seq in _generate_sequences(length-1, sequence, char):
                yield char + seq

def _generate_sequences_cache(length, sequence, cache, previous=''):
    sublength = length if length == CACHE_LEN else min(CACHE_LEN, length-CACHE_LEN)
    subseq = cache[sublength != CACHE_LEN]
    char_set = ALLOWED_CHARS[previous.isalpha() * (2 - previous.islower())]
    if sequence[-length] != ' ':
        char_set = char_set[char_set.find(sequence[-length]):]
        index = len(sequence) - length
        subseq0 = ''.join(sequence[index:index+sublength]).strip()
        sequence[index:index+sublength] = [' '] * sublength
        if len(subseq0) > 1:
            subseq[char_set[0]] = tuple(
                    s for s in subseq[char_set[0]] if s.startswith(subseq0))
    char_set = char_set.replace(previous, '')

    if length == CACHE_LEN:
        for char in char_set:
            for seq in subseq[char]:
                yield seq
    else:
        for char in char_set:
            for seq1 in subseq[char]:
                for seq2 in _generate_sequences_cache(
                                length-sublength, sequence, cache, seq1[-1]):
                    yield seq1 + seq2

def precompute(length):
    char_set = ALLOWED_CHARS[0]
    if length > 1:
        sequence = [' '] * length
        result = {}
        for char in char_set:
            result[char] = tuple(char + seq for seq in  _generate_sequences(
                                                     length-1, sequence, char))
    else:
        result = {char: tuple(char) for char in ALLOWED_CHARS[0]}
    return result

def generate_sequences(length, sequence=''):
    # -------------------------------------------------------------------------
    # Error checking: consistency of the value/type of the arguments
    if not isinstance(length, int):
        msg = 'The sequence length must be an integer: {}'
        raise TypeError(msg.format(type(length)))
    if length < 0:
        msg = 'The sequence length must be greater or equal than 0: {}'
        raise ValueError(msg.format(length))
    if not isinstance(sequence, str):
        msg = 'The sequence must be a string: {}'
        raise TypeError(msg.format(type(sequence)))
    if len(sequence) > length:
        msg = 'The sequence has length greater than {}'
        raise ValueError(msg.format(length))
    # -------------------------------------------------------------------------
    if not length:
        yield ''
    else:
        # ---------------------------------------------------------------------
        # Error checking: the starting sequence, if provided, must be valid
        if any(s not in ALLOWED_CHARS[0]+' ' for s in sequence):
            msg = 'The sequence contains invalid characters: {}'
            raise ValueError(msg.format(sequence))
        if sequence.strip() != sequence.replace(' ', ''):
            msg = 'Uninitiated characters in the middle of the sequence: {}'
            raise ValueError(msg.format(sequence.strip()))
        sequence = sequence.strip()
        if any(a == b for a, b in zip(sequence[:-1], sequence[1:])):
            msg = 'No neighbours must be the same character: {}'
            raise ValueError(msg.format(sequence))
        char_type = [s.isalpha() * (2 - s.islower()) for s in sequence]
        if char_type != sorted(char_type):
            msg = '0-9 must come before a-z, which must come before A-Z: {}'
            raise ValueError(msg.format(sequence))
        # ---------------------------------------------------------------------
        sequence = list(sequence.ljust(length))
        if length <= CACHE_LEN:
            for s in _generate_sequences(length, sequence):
                yield s
        else:
            remainder = length % CACHE_LEN
            if not remainder:
                cache = tuple((precompute(CACHE_LEN),))
            else:
                cache = tuple((precompute(CACHE_LEN), precompute(remainder)))
            for s in _generate_sequences_cache(length, sequence, cache):
                yield s

我在 generate_sequences() 函数中包含了彻底的错误检查。为了简洁起见,如果您可以保证调用该函数的人永远不会使用无效输入,则可以删除它们。具体来说,无效的起始序列。

计算特定长度的序列数

虽然该函数将按顺序生成序列,但我们可以执行一个简单的组合计算来计算总共存在多少个有效序列。

序列可以有效地分解为 3 个独立的子序列。一般来说,一个序列可以包含 0 到 7 个数字,后跟 0 到 7 个小写字母,然后是 0 到 7 个大写字母。只要它们的总和是 7。这意味着我们可以有分区 (1, 3, 3),或 (2, 1, 3),或 (6, 0, 1),等等。我们可以使用 stars and bars to calculate the various combinations of splitting a sum of N into k bins. There is already an implementation for ,我们将借用它。前几个分区是:

[0, 0, 7]
[0, 1, 6]
[0, 2, 5]
[0, 3, 4]
[0, 4, 3]
[0, 5, 2]
[0, 6, 1]
...

接下来,我们需要计算一个分区中有多少个有效序列。由于数字子序列独立于小写字母,小写字母独立于大写字母,我们可以单独计算它们并将它们相乘。

那么,对于长度为 4 的数字,我们可以有多少种组合?第一个字符可以是 10 个数字中的任何一个,但第二个字符只有 9 个选项(十个减去前一个字符的数字)。第三个字母也类似,依此类推。所以有效子序列的总数是10*9*9*9。同样,对于长度为 3 的字母,我们得到 26*25*25。总的来说,对于分区,比如 (2, 3, 2),我们有 10*9*26*25*25*26*25 = 950625000 种组合。

import itertools as it

def partitions(n, k):
    for c in it.combinations(xrange(n+k-1), k-1):
        yield [b-a-1 for a, b in zip((-1,)+c, c+(n+k-1,))]

def count_subsequences(pool, length):
    if length < 2:
        return pool**length
    return pool * (pool-1)**(length-1)

def count_sequences(length):
    counts = [[count_subsequences(i, j) for j in xrange(length+1)] \
              for i in [10, 26]]

    print 'Partition {:>18}'.format('Sequence count')

    total = 0
    for a, b, c in partitions(length, 3):
        subtotal = counts[0][a] * counts[1][b] * counts[1][c]
        total += subtotal
        print '{} {:18}'.format((a, b, c), subtotal)
    print '\nTOTAL {:22}'.format(total)

总的来说,我们观察到虽然快速生成序列不是问题,但数量太多可能需要很长时间。长度 7 有 78550354750(785 亿)个有效序列,并且随着长度的增加,这个数字大约只增加 25 倍。

采用与@julian 类似的方法

from string import digits, ascii_uppercase, ascii_lowercase
from itertools import product, tee, chain, izip, imap

def flatten(listOfLists):
    "Flatten one level of nesting"
    #recipe of itertools
    return chain.from_iterable(listOfLists)

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    #recipe of itertools
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def eq_pair(x):
    return x[0]==x[1]

def comb_noNN(alfa,size):
    if size>0:
        for candidato in product(alfa,repeat=size):
            if not any( imap(eq_pair,pairwise(candidato)) ):
                yield candidato
    else:
        yield tuple()

def my_string(N=7):
    for a in range(N+1):
        for b in range(N-a+1):
            for c in range(N-a-b+1):
                if sum([a,b,c])==N:
                    for letras in product(
                            comb_noNN(digits,c),
                            comb_noNN(ascii_lowercase,b),
                            comb_noNN(ascii_uppercase,a)
                            ):
                        yield "".join(flatten(letras))

comb_noNN 生成遵循规则 3 的特定大小的所有字符组合,然后在 my_string 中检查加起来为 N 的所有长度组合并生成遵循规则 1 的所有字符串单独生成每个数字,小写和大写字母。

for i,x in enumerate(my_string())

的一些输出
0, '0101010'
...
100, '0101231'
...
491041580, '936gzrf'
...
758790032, '27ktxfi' 
...