生成具有值出现次数限制的值的组合

Question

我的任务是：

生成一个唯一组合列表，其中每个组合都有一定的长度 (var com_len) 并包含给定列表 (var values) 中的值 (int)，
每个组合都是通过从给定列表中取随机值创建的（随机性非常重要！），
每个组合里面的值必须是唯一的，组合里面的值不能重复，
组合中的值必须排序，
计算整个组合集合中每个值的出现次数（var counter），
每个值在整个数据集中出现的次数必须尽可能接近给定次数 (var counter_expected)。 "as close as possible" 意思是，随着脚本的进行计算每个出现的值，如果没有更多的组合可以创建，就结束脚本。

例如，我需要生成一个组合列表，其中每个组合的长度为 3，内部具有唯一的排序值，每个值都来自范围 (256) 并且每个值都出现在生成的所有组合中尽可能接近100次。

我的问题是，如何有效地检测到没有可以创建的值的唯一组合来停止循环。

当脚本快要结束时，问题出现了，但仍然有可用的值， len(available_values) > com_len 但不可能创建任何尚未出现的新独特组合。

目前创建的代码：

import numpy as np
import random

com_len = 3
length = 256
counter = np.zeros(length)
values = range(length)
exclude_values = []
counter_expected = 100
done = []

mask = np.ones(len(np.array(values)), np.bool)
mask[exclude_values] = False
available_values = set(values) - set(exclude_values)
available_values = list(available_values)

ii = 0
while True:
    """print progress"""
    ii = ii + 1
    if not ii % 1000: print('.', end='')

    #POSSIBLE CONDITION HERE
    ex = random.sample(available_values, k=com_len)
    ex.sort()
    if ex in done: continue
    done.append(ex)

    counter[ex] = counter[ex] + 1

    for l_ in ex:
        if counter[l_] == counter_expected:
            del available_values[available_values.index(l_)]

    if len(available_values) < com_len:break

    if all(counter[mask] == counter_expected): break
    #OR HERE

注意：脚本通常会成功结束，因为 len(available_values) < com_len 或 all(counter[mask] == counter_expected) 条件中断While 循环。多次尝试运行脚本，在某个时候，您会观察到脚本进入无限循环，如 len(available_values) >= com_len 但不再有新的独特条件可以创建，因此计数器不会增加。

我需要一个有效的条件来停止脚本。在这里使用 itertools.combinations 不是一个选项，因为 available_values 列表可能很长，即开头有 10k 个元素。

当 len(available_values) 达到一定水平时，蛮力将使用 itertools.combinations 并检查是否有任何尚未创建的组合，但这是一个丑陋的解决方案。

可能有更好的方法，我没有想到但你可能会想到。我会很感激你的帮助。

Answer 1

更新的最终答案：

我想出了以下符合我需要的代码。备注：

该功能在很多方面都不是最好的，但它的工作非常好！
该函数有 3 种数据生成模式：生成总组合数、生成每个值在所有组合中出现次数最少的组合、生成每个值出现 "max" 次的组合值出现在所有组合中（"max" 表示 "as close as possible to max value"）。
该功能允许在选定范围内即时更改组合的长度或提供特定数字。
根据参数，该函数可以执行冗余次数的迭代，如 'Total number of errors' 所示。但是...
速度很快！它使用集合和元组来获得出色的性能。唯一的问题可能发生在 itertools.combinations 触发返回吨（数百万）组合时，但就我而言，到目前为止从未发生过。

代码：

import numpy as np
import random
import itertools
from decimal import Decimal

def get_random_combinations(values, exclude, length, mode, limit, min_ = 0, max_ = 0):
    done = set()

    try:
        """Creating counter"""
        counter = np.zeros(len(values), np.uint)

        """Create a mask for excluded values"""
        """"""
        mask = np.ones(len(np.array(values)), np.bool)
        mask[exclude] = False

        """available values to create combinations"""
        values_a = set(values) - set(exclude)
        values_a = list(values_a)

        if length == 1: 
            if mode == 'total':
                """generate just data_number of examples"""
                for ii in range(limit):
                    comb = random.sample(values_a, 1)[0]
                    del values_a[values_a.index(comb)]
                    done.add(tuple([comb]))
            else:
                """generate one example for each comb"""
                for comb in values_a: done.add(tuple([comb]))
        else: 
            """total number of combinations"""
            if isinstance(length, str): rr = np.mean([min_, max_])
            else: rr = length
            nn = len(values_a)
            comb_max = int(Decimal(np.math.factorial(nn)) / Decimal(np.math.factorial(rr) * np.math.factorial(nn-rr)))
            err_limit = int(comb_max * 0.01)
            if err_limit > 10000: err_limit = 10000

            """initiate variables"""
            #should itertools be used to generate the rest of combinations
            gen_comb = False 
            #has all combinations generated by itertools ended
            comb_left_0 = False 
            #has the limit of errors been reached to generate itertools combinations
            err_limit_reached = False 
            #previous combination 
            ll_prev = 0 
            dd = 0 #done counter
            comb_left = set() #itertools  combinations
            err = 0 #errors counter

            """options variables for statistics"""
            err_t = 0 #total number of errors
            gen = 0 #total number of generations of itertools.combinations
            ii = 0 #total number of iterations

            print('GENERATING LIST OF COMBINATIONS')
            while True:
                """print progress"""
                ii = ii + 1
                if not dd % 1000: print('.', end='')

                """check if length of combs is random or not"""
                if isinstance(length, str): 
                    """change max_ length of combinations to 
                    \the length of available values"""
                    if len(values_a) < max_: max_ = len(values_a)
                    ll = random.randint(min_, max_)
                else: ll = length

                if ll != ll_prev: gen_comb = True

                """generate combinations only when err limit is reached or 
                the length of combinations has changed"""
                if err_limit_reached and gen_comb: 
                    gen = gen + 1
                    """after reaching the max number of consecutive errors, start generating combinations via itertools"""
                    """generation is done at this point to prevent generation for a very long list"""
                    """generation is also done when when length of a combination changes"""
                    comb_left = set(itertools.combinations(values_a, ll)) - done
                    """break if there are no elements left"""
                    if not len(comb_left): break

                """if combinations has already been generated, used them"""
                if comb_left:
                    """take random sample from the set"""
                    comb = random.sample(comb_left, 1)[0]
                    """remove it from the set"""
                    comb_left.remove(comb)
                    """check if it was the last combination to break the loop at the end"""
                    if not len(comb_left): comb_left_0 = True
                else: 
                    """generate random combination"""
                    comb = tuple(sorted(random.sample(values_a, ll)))

                """set previous length"""
                ll_prev = ll
                """reset gen_comb"""
                gen_comb = False

                """check if combination is new"""
                if comb not in done: found = True
                else:
                    """otherwise, iterate errors"""
                    err = err + 1
                    err_t = err_t + 1
                    found = False
                    if err > err_limit: err_limit_reached = gen_comb = True

                if found:
                    """reset err"""
                    err = 0
                    dd = dd + 1
                    """add combination to done"""    
                    done.add(comb)

                    """increase counter for the combs"""
                    counter[list(comb)] = counter[list(comb)] + 1

                    """check if seekeing the max number of combinations or min"""
                    if mode == 'max':
                        """for max, we must remove the elements which reached the limit"""
                        for l_ in list(comb):
                            if counter[l_] == limit:
                                del values_a[values_a.index(l_)]

                """if length of available elements is smaller than the possible length of the combinations"""
                if isinstance(length, str):
                    """for random length, choose the minimal length"""
                    if len(values_a) < min_: break
                else: 
                    if len(values_a) < ll: break
                """if all elements reached the limit"""
                if mode == 'total':
                    if len(done) >= limit: break
                else: #min, max
                    if all(counter[mask] >= limit): break
                """if the number of consecutive errors reached 
                the total number of combinations, break as you may never 
                draw a valid combination"""
                if err > comb_max: break
                """if it was the last combination left"""
                if comb_left_0: break
    except Exception as e: print(e)
    print('')

    print('Combinations generated: ' + str(dd))        
    print('Total number of iterations: ' + str(ii))
    print('Final value of err: ' + str(err))
    print('Total number of errors: ' + str(err_t))
    print('How many times itertools.combinations was used: ' + str(gen))

    return done

"""range of values to create combinations"""
values = range(256)
"""values excluded from the combinations"""
exclude = [0,255]
"""length of combinations, if is string, the number of layers 
is generated randomly withing (min_, max_) range """
length = 'r'
"""mode of how the combinations are generated:
min: minimal number of times the value appears across all combinations 
(limited down by the limit value)
max: max number of times the value appears across all combinations (limited         
max by the limit value)
total: total number of combinations  (limited the limit value)"""
mode = 'max' 
"""limit used for the mode combinations' generation"""
limit = 1000
"""min_ and max_ are used when length is string, 
length is generated randomly within (min_, max_) range"""
min_ = 4
max_ = 12
done = get_random_combinations(values, exclude, length, mode, limit, min_, max_)

Answer 2

一共n choose k possible combinations meeting your criteria, with n = length and k = com_len. n choose k evaluates to n! / (k! * (n - k)!). If you generate all distinct possibilities, each of the n values appears (n - 1)! / ((k - 1)! * (n - k)!) times (https://math.stackexchange.com/q/26619/295281）。你应该能够解决这个问题，假设 z <= (n - 1)! / ((k - 1)! * (n - k)!)，其中 z = counter_expected.

以你的例子为例：

n = 256
k = 3
z = 100 <= 32385

通常生成组合的一种常用方法是通过长度为 n 的布尔数组步进 k 位，始终递增可能的最低位。每当更高的位增加时，它下面的所有位都会重置为初始位置。这是一个示例序列：

0 0 0 0 3 2 1
0 0 0 3 0 2 1
0 0 0 3 2 0 1
0 0 0 3 2 1 0
0 0 3 0 0 2 1
...
3 2 0 0 1 0 0
3 2 0 1 0 0 0
3 2 1 0 0 0 0

我已经标记了这些位置，这样您就可以看到，如果首先对值进行排序，则组合将始终排序。请记住，您可以将其实现为 n 布尔值或 k 索引的数组。两者各有优缺点。

对于您的特定用例，有一个转折点。一旦计数超过一定数量，您就不要使用一点。有多种方法可以逐步执行这些位，但它们都归结为具有大小 n 计数器数组。

如果 n * z 是 k 的倍数，您将自动获得所有箱子中的准确计数。 n 和 z 本身实际上都不必是 k 的倍数。但是，如果不是这样，您将不可避免地出现下溢或上溢。直观地说，您希望一次生成 n * z 总值的目标，k。很明显，一个必须是后者的倍数才能使这成为可能。

您可以有两种退出条件。给定所有位的总累计计数，s，

s >= n * z：所有位的计数至少为z。最多 k - 1 位的计数为 z + 1.
s > n * z - k：除了最多 k - 1 位之外，所有位的计数都是 z，因此再添加一个组合将导致条件 1.

要讨论的最后一个设计选择是位移动的顺序。由于生成一系列组合会耗尽一个 bin，因此我希望能够使耗尽的 bin 以可预测的顺序在数组存储桶的一侧按顺序累积。这将从算法中删除大量检查。因此，我不会增加可能的最低位，而是增加可能的最高位，并在重置时增加它下面的一位。在那种情况下，耗尽的桶将始终是最低位。

因此，让我们最终停止做出未经证实的听起来像是数学的陈述，并展示一个实现：

def generate_combos(n, k, z):
    full_locs = np.arange(k + 1, dtype=np.uint)
    full_locs[k] = n                      # makes partial vectorization easier
    locs = full_locs[:k]                  # bit indices
    counts = np.zeros(n, dtype=np.uint)   # counter buckets
    values = np.arange(n, dtype=np.uint)  # values
    min_index = 0                         # index of lowest non-exhausted bin

    for _ in range((n * z) // k):
        counts[locs] += 1
        yield values[locs]

        if counts[min_index] == z:
            # if lowest bin filled, shift and reset
            min_index += np.argmax(counts[min_index:] < z)
            locs[:] = min_index + np.arange(k)
        else:
            # otherwise, increment highest available counter
            i = np.flatnonzero(np.diff(full_locs) > 1)
            if i.size:
                i = i[-1]
                locs[i] += 1
                # reset the remainder
                locs[i + 1:] = locs[i] + np.arange(1, k - i)
            else:
                break

这使用了条件2。如果你想要条件1，在后面添加以下行：

if counters[-1] < z:
    yield values[-k:]

将循环更改为类似 for _ in range(-((n * z) // -k)): 的内容（由提供）无济于事，因为计数器并非设计用于处理它。

这里是 IDEOne link 显示 generate_combos(256, 3, 10) 的前一百个元素：

请注意，在前 10 个元素之后，0 和 1 都出现了 10 次。 2 和 3 出现了一次，所以它们只在 9 次迭代后就用完了，依此类推。

Answer 3

这是另一个关注随机方面的答案。

您有 n 选择 k 种可能性，您希望随机抽样以获得每个值的大约 z 次出现（使用我在另一个答案中的符号）。我假设如果您采用 (n * z) // k k 大小的样本，并且您的随机数生成器实际上是统一的，您将自动获得每个元素大约 z 的出现次数。在您的示例中，使用 n=256、k=3 和 z=100，在 8533 个中，256 个 bin 中的分布确实相当均匀。

如果您愿意接受某种程度的均匀性不完善，python 的 random.sample 是一个不错的选择。人口是从零到 n 的所有整数选择 k.

n 选择 k 在这种情况下是 256 * 255 * 254 / 6 = 2763520。这超出了有符号 32 位整数的范围，但很适合无符号整数。更好的是，您可以简单地使用 Python 的无限精度整数。

诀窍是将这些数字映射到唯一的值组合。这是通过 combinatorial number system, as described here.

完成的

from random import sample
from scipy.misc import combs

def gen_samples(n, k, z):
    codes = sample(range(combs(n, k)), (n * z) // k)
    return [unrank(n, k, code) for code in codes]

def unrank(n, k, i):
    """
    Implementation of Lehmer's greedy algorithm left as
    exercise for the reader
    """
    # return k-element sequence

有关取消排名的提示，请参阅 here。

生成具有值出现次数限制的值的组合

Generating comabinations of values with the limit of the number of appearances of the values

python

combinations

appearance