Pythonic way to create array with entries unique for row and column

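I have a number of samples and want to randomly pick a subset of a defined length, then repeat this until each sample has appeared 3 times, with no sample appearing twice in any given row.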

For example:

samples=range(12)
l=6
repeats=3

I would like to have 6 rows of 6 samples each. I want something like this:

[1, 2, 11, 7, 0, 3]
[2, 5, 0, 7, 10, 3]
[11, 0, 8, 7, 6, 1]
[4, 11, 5, 9, 3, 6]
[4, 9, 8, 1, 10, 2]
[9, 5, 6, 4, 8, 10]

I tried the following, but it only works in the one (accidental) case where the samples happen to be picked evenly. Usually I get

ValueError: sample larger than population

Code:

import random
samples=range(12)
measured={key:0 for key in samples}
while len(samples)>0:
    sample=random.sample(samples,6)
    print sample
    for s in sample:
        measured[s]+=1
        if measured[s]==3:
            samples.remove(s)

I wondered whether there is a way to tweak numpy.random.choice or to use something from itertools.permutations, but these approaches don't work because of the constraints above.

Is there a sampling method I've overlooked, or do I need to use nested loops/ifs?
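The error in the question can be reproduced in isolation. This is a minimal sketch, assuming that the loop has reached a point where only five samples are still below their quota of 3 (the list `remaining` is a made-up example, not taken from an actual run):

```python
import random

# Assumed state: only five samples have been picked fewer than 3 times.
remaining = [0, 1, 2, 3, 4]

try:
    # Asking for 6 distinct items from a population of 5 cannot succeed.
    random.sample(remaining, 6)
except ValueError as exc:
    message = str(exc)
    print(message)
```

Because samples are removed as soon as they reach their quota, the population can shrink below the row length before every sample has been used up, which is when the exception is raised.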

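Now that you've clarified what you want, here is a revised version of my original answer: a pure-Python implementation based on the constraints. Adapting the original answer was fairly easy, so I also added code to limit the number of iterations and to print a short report at the end verifying that the result meets all the criteria.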

from collections import Counter
from itertools import chain
from pprint import pprint
import random


def pick_subset(population, length, repeat, max_iterations=1000000):
    iterations = 0

    while iterations < max_iterations:
        # Draw rows until every sample value occurs exactly "repeat" times.
        while iterations < max_iterations:
            iterations += 1
            subset = [random.sample(population, length) for i in range(length)]
            measure = Counter(chain.from_iterable(subset))
            if all(count == repeat for count in measure.values()):
                break

        # Check that no sample occurs more than once in any row. For a population
        # of distinct values random.sample already guarantees this.
        if all(all(count < 2 for count in Counter(row).values())
               for row in subset):
            break

    if iterations >= max_iterations:
        raise RuntimeError("Couldn't match criteria after {:,d}".format(iterations))
    else:
        print('Succeeded after {:,d} iterations'.format(iterations))
        return subset


samples = range(12)
length = 6
repeat = 3

subset = pick_subset(samples, length, repeat)

print('')
print('Selected subset:')
pprint(subset)

# Show that each sample occurs exactly three times.
freq_counts = Counter(chain.from_iterable(subset))
print('')
print('Overall sample frequency counts:')
print(', '.join(
        '{:2d}: {:d}'.format(sample, cnt) for sample, cnt in freq_counts.items()))


# Show that no sample occurs more than once in each row.
print('')
print('Sample frequency counts per row:')
for i, row in enumerate(subset):
    freq_counts = Counter(row)
    print('  row[{}]: {}'.format(i, ', '.join(
            '{:2d}: {:d}'.format(sample, cnt) for sample, cnt in freq_counts.items())))

Sample output:

Succeeded after 123,847 iterations

Selected subset:
[[4, 9, 10, 2, 5, 7],
 [5, 8, 6, 0, 11, 1],
 [1, 8, 3, 10, 7, 0],
 [7, 3, 2, 4, 11, 9],
 [0, 10, 11, 6, 1, 2],
 [8, 3, 9, 4, 6, 5]]

Overall sample frequency counts:
 0: 3,  1: 3,  2: 3,  3: 3,  4: 3,  5: 3,  6: 3,  7: 3,  8: 3,  9: 3, 10: 3, 11: 3

Sample frequency counts per row:
  row[0]:  2: 1,  4: 1,  5: 1,  7: 1,  9: 1, 10: 1
  row[1]:  0: 1,  1: 1,  5: 1,  6: 1,  8: 1, 11: 1
  row[2]:  0: 1,  1: 1,  3: 1,  7: 1,  8: 1, 10: 1
  row[3]:  2: 1,  3: 1,  4: 1,  7: 1,  9: 1, 11: 1
  row[4]:  0: 1,  1: 1,  2: 1,  6: 1, 10: 1, 11: 1
  row[5]:  3: 1,  4: 1,  5: 1,  6: 1,  8: 1,  9: 1
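The run above needed well over a hundred thousand random draws. As a possible alternative (not part of the answer above), the rows can be built constructively so that no retry is needed. This is only a sketch under stated assumptions: `pick_rows` and its `seed` parameter are hypothetical names, the samples are distinct, `len(samples) * repeat` is divisible by `length`, and the result is not claimed to be uniformly distributed over all valid arrangements. The idea is that a sample whose remaining quota equals the number of rows still to be filled must go into every remaining row, and the rest of each row is drawn at random from the other unfinished samples.

```python
import random
from collections import Counter


def pick_rows(samples, length, repeat, seed=None):
    """Hypothetical helper: build rows without rejection sampling."""
    rng = random.Random(seed)
    total = len(samples) * repeat
    if total % length or length > len(samples):
        raise ValueError("need length <= len(samples) and "
                         "len(samples) * repeat divisible by length")
    n_rows = total // length

    remaining = {s: repeat for s in samples}  # uses still owed per sample
    rows = []
    for rows_left in range(n_rows, 0, -1):
        # Samples that must appear in every remaining row to reach their quota.
        forced = [s for s, c in remaining.items() if c == rows_left]
        # Samples that may appear in this row but do not have to.
        optional = [s for s, c in remaining.items() if 0 < c < rows_left]
        row = forced + rng.sample(optional, length - len(forced))
        rng.shuffle(row)
        for s in row:
            remaining[s] -= 1
        rows.append(row)
    return rows


rows = pick_rows(list(range(12)), length=6, repeat=3, seed=0)
for row in rows:
    print(row)
# Every sample should be counted exactly 3 times.
print(Counter(s for row in rows for s in row))
```

Each row comes from `random.sample` plus forced entries that are disjoint from it, so a row never contains the same sample twice, and the forced-entry rule keeps every quota reachable.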

I may have misunderstood, but based on your title, you actually want a grid of numbers from samples that satisfies the following conditions:

  1. The entries in every row and every column are unique
  2. Each element of samples is repeated at most repeats times

I don't think there is a simple way to do this, because each element of the grid depends on the other items in the grid.

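One possible solution is to fill your grid one element at a time, working through it from the first element (top left) to the last element (bottom right). At each position, you choose randomly from a set of "valid" values: those that have not yet been chosen for that row or column and that have not yet been chosen repeats times.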
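The rule for a single cell can be sketched in plain Python. The helper name `valid_values` and the use of `None` for empty cells are assumptions made for this illustration only:

```python
def valid_values(grid, i, j, samples, counts, repeats):
    """Hypothetical helper: values still allowed at grid[i][j].

    Empty cells are represented by None.
    """
    used_in_row = {v for v in grid[i] if v is not None}
    used_in_col = {row[j] for row in grid if row[j] is not None}
    return [
        v for v in samples
        if v not in used_in_row
        and v not in used_in_col
        and counts.get(v, 0) < repeats
    ]


# A partially filled 3x3 grid: 0, 1 and 2 have each been used once.
grid = [[0, 1, None],
        [2, None, None],
        [None, None, None]]
counts = {0: 1, 1: 1, 2: 1}

# For the middle cell, 1 is blocked by its column and 2 by its row.
print(valid_values(grid, 1, 1, range(4), counts, repeats=2))  # [0, 3]
```

If this list comes back empty for some cell, the fill has reached a dead end and has to start over.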

However, this approach is not guaranteed to find a solution every time. You can define a function that keeps retrying until it finds one.

Here is one implementation I came up with using numpy:

import numpy as np

samples=range(12)
l=6
repeats=3

def try_make_grid(samples, l, repeats, max_tries=10):
    try_number = 0
    while(try_number < max_tries):
        try:
            # initialize lxl grid to nan
            grid = np.zeros((l, l))*np.nan

            counts = {s: 0 for s in samples}  # counts of each sample
            count_exhausted = set()           # which samples have been exhausted
            for i in range(l):
                for j in range(l):
                    # can't use values that already happened in this row or column
                    invalid_values = set(np.concatenate([grid[:,j], grid[i,:]]))
                    valid_values = [
                        v for v in samples if v not in invalid_values|count_exhausted
                    ]
                    this_choice = np.random.choice(a=valid_values)
                    grid[i,j] = this_choice

                    # update the count and check to see if this_choice is now exhausted
                    counts[this_choice] += 1
                    if counts[this_choice] >= repeats:
                        count_exhausted.add(this_choice)
            print("Successful on try number %d" % try_number)
            return grid
        except ValueError:  # np.random.choice had no valid values to pick from
            try_number += 1
    print("Unsuccessful")

Example grid:

np.random.seed(42)
grid = try_make_grid(samples, l, repeats)
#Successful on try number 6
print(grid)
#[[10.  5.  8. 11.  3.  0.]
# [ 0. 11.  4.  8.  2.  5.]
# [ 1.  6.  0.  2.  7.  3.]
# [ 3.  2.  7. 10. 11.  9.]
# [ 4.  1.  9.  6.  8.  7.]
# [ 6.  9. 10.  5.  1.  4.]]

As you can see, every row and every column is unique, and each value is chosen no more than repeats times (in this case, they were all chosen exactly repeats times).

from collections import Counter
print(Counter(grid.ravel()))
#Counter({10.0: 3,
#         5.0: 3,
#         8.0: 3,
#         11.0: 3,
#         3.0: 3,
#         0.0: 3,
#         4.0: 3,
#         2.0: 3,
#         1.0: 3,
#         6.0: 3,
#         7.0: 3,
#         9.0: 3})
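The Counter above only confirms the overall counts. The row and column claims can be checked as well. This is a minimal sketch in plain Python, where `check_grid` is a hypothetical helper and the grid is copied by hand from the example output above as integers:

```python
from collections import Counter


def check_grid(grid, repeats):
    """Hypothetical helper: True if no row or column repeats a value and
    no value is used more than `repeats` times overall."""
    rows_ok = all(len(set(row)) == len(row) for row in grid)
    cols_ok = all(len(set(col)) == len(col) for col in zip(*grid))
    counts = Counter(v for row in grid for v in row)
    counts_ok = all(c <= repeats for c in counts.values())
    return rows_ok and cols_ok and counts_ok


# Grid copied from the example output above.
grid = [[10, 5, 8, 11, 3, 0],
        [0, 11, 4, 8, 2, 5],
        [1, 6, 0, 2, 7, 3],
        [3, 2, 7, 10, 11, 9],
        [4, 1, 9, 6, 8, 7],
        [6, 9, 10, 5, 1, 4]]

print(check_grid(grid, repeats=3))
```

For the numpy array returned by try_make_grid, passing `grid.tolist()` should work the same way.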