Pythonic way to create array with entries unique for row and column
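I have a number of samples and want to randomly pick a subset of a defined length, repeating the process until every sample has appeared 3 times, with no sample appearing twice in a given row.
For example: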
samples=range(12)
l=6
repeats=3
I would like to have 6 rows with 6 samples each.
I want something like this:
[1, 2, 11, 7, 0, 3]
[2, 5, 0, 7, 10, 3]
[11, 0, 8, 7, 6, 1]
[4, 11, 5, 9, 3, 6]
[4, 9, 8, 1, 10, 2]
[9, 5, 6, 4, 8, 10]
I tried the following, but it only works in the one (chance) case where the samples happen to be picked evenly. Usually I get
ValueError: sample larger than population
Code:
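import random

samples = list(range(12))  # a list, so that .remove() also works on Python 3
measured = {key: 0 for key in samples}
while len(samples) > 0:
    # Raises ValueError as soon as fewer than 6 samples remain
    sample = random.sample(samples, 6)
    print(sample)
    for s in sample:
        measured[s] += 1
        if measured[s] == 3:
            samples.remove(s)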
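I was wondering whether there is a way to tweak numpy.random.choice or to use itertools.permutations, but neither works because of the constraints above.
Is there a sampling method that I am overlooking, or do I need nested loops/ifs?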
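Now that you have clarified what you want, here is a revised version of my original answer: a pure-Python implementation driven by the constraints. Changing the original answer was fairly easy, so I also added code to limit the number of iterations and to print a small report at the end verifying that the result meets all the criteria.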
from collections import Counter
from itertools import chain
from pprint import pprint
import random

def pick_subset(population, length, repeat, max_iterations=1000000):
    iterations = 0
    while iterations < max_iterations:
        # Get a subset in which every sample value occurs exactly "repeat" times.
        while iterations < max_iterations:
            iterations += 1
            subset = [random.sample(population, length) for i in range(length)]
            measure = Counter(chain.from_iterable(subset))
            if all(count == repeat for count in measure.values()):
                break
        # Check that no value occurs more than once in any row.
        # (random.sample draws without replacement, so this always holds.)
        if all(all(count < 2 for count in Counter(row).values())
               for row in subset):
            break
    if iterations >= max_iterations:
        raise RuntimeError("Couldn't match criteria after {:,d}".format(iterations))
    else:
        print('Succeeded after {:,d} iterations'.format(iterations))
        return subset
samples = range(12)
length = 6
repeat = 3
subset = pick_subset(samples, length, repeat)
print('')
print('Selected subset:')
pprint(subset)

# Show that each sample occurs exactly three times.
freq_counts = Counter(chain.from_iterable(subset))
print('')
print('Overall sample frequency counts:')
print(', '.join(
    '{:2d}: {:d}'.format(sample, cnt) for sample, cnt in freq_counts.items()))

# Show that no sample occurs more than once in each row.
print('')
print('Sample frequency counts per row:')
for i, row in enumerate(subset):
    freq_counts = Counter(row)
    print('  row[{}]: {}'.format(i, ', '.join(
        '{:2d}: {:d}'.format(sample, cnt) for sample, cnt in freq_counts.items())))
Sample output:
Succeeded after 123,847 iterations
Selected subset:
[[4, 9, 10, 2, 5, 7],
[5, 8, 6, 0, 11, 1],
[1, 8, 3, 10, 7, 0],
[7, 3, 2, 4, 11, 9],
[0, 10, 11, 6, 1, 2],
[8, 3, 9, 4, 6, 5]]
Overall sample frequency counts:
0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3, 7: 3, 8: 3, 9: 3, 10: 3, 11: 3
Sample frequency counts per row:
row[0]: 2: 1, 4: 1, 5: 1, 7: 1, 9: 1, 10: 1
row[1]: 0: 1, 1: 1, 5: 1, 6: 1, 8: 1, 11: 1
row[2]: 0: 1, 1: 1, 3: 1, 7: 1, 8: 1, 10: 1
row[3]: 2: 1, 3: 1, 4: 1, 7: 1, 9: 1, 11: 1
row[4]: 0: 1, 1: 1, 2: 1, 6: 1, 10: 1, 11: 1
row[5]: 3: 1, 4: 1, 5: 1, 6: 1, 8: 1, 9: 1
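The run above needed over a hundred thousand random draws before one happened to satisfy the counts. As a rough alternative sketch (not part of the answer above), the rows can be built directly by shuffling the population `repeat` times and cutting each shuffle into rows. The function name `deal_rows` and its `rng` parameter are made up for illustration, and the sketch assumes that `length` divides `len(population)`, which holds for 12 samples and rows of 6. It does not draw uniformly from all valid arrangements: rows cut from the same shuffle never share a sample.

```python
import random


def deal_rows(population, length, repeat, rng=random):
    """Build rows by cutting `repeat` independent shuffles into chunks.

    Assumption of this sketch: len(population) is a multiple of `length`,
    so every row lies inside a single shuffle and cannot hold duplicates.
    """
    population = list(population)
    if len(population) % length != 0:
        raise ValueError("length must divide len(population) for this sketch")
    rows = []
    for _ in range(repeat):
        shuffled = population[:]
        rng.shuffle(shuffled)
        # Cut this shuffle into consecutive rows of `length` items.
        for start in range(0, len(shuffled), length):
            rows.append(shuffled[start:start + length])
    rng.shuffle(rows)  # mix rows that came from different shuffles
    return rows


rows = deal_rows(range(12), length=6, repeat=3)
for row in rows:
    print(row)
```

Every sample appears once per shuffle, so it ends up in exactly `repeat` rows, and no retry loop is needed.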
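I may be misunderstanding, but based on your title, you actually want a grid of numbers from samples that satisfies the following conditions:
- The entries in every row and every column are unique
- Each element of samples is repeated at most repeats times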
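I don't think there is an easy way to do this, because each element in the grid depends on the other items in the grid.
One possible solution is to fill your grid one element at a time, working from the first element (top left) to the last element (bottom right). At each position, you randomly choose from a set of "valid" values: those that have not yet been picked for that row or column and have not already been picked repeats times.
However, this approach is not guaranteed to find a solution every time. You can define a function that keeps trying arrangements until it finds one.
Here is one implementation I came up with using numpy: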
import numpy as np

samples = range(12)
l = 6
repeats = 3

def try_make_grid(samples, l, repeats, max_tries=10):
    try_number = 0
    while try_number < max_tries:
        try:
            # initialize l x l grid to nan
            grid = np.full((l, l), np.nan)
            counts = {s: 0 for s in samples}  # counts of each sample
            count_exhausted = set()  # which samples have been exhausted
            for i in range(l):
                for j in range(l):
                    # can't use values that already happened in this row or column
                    invalid_values = set(np.concatenate([grid[:, j], grid[i, :]]))
                    valid_values = [
                        v for v in samples if v not in invalid_values | count_exhausted
                    ]
                    # raises ValueError when valid_values is empty (dead end)
                    this_choice = np.random.choice(a=valid_values)
                    grid[i, j] = this_choice
                    # update the count and check to see if this_choice is now exhausted
                    counts[this_choice] += 1
                    if counts[this_choice] >= repeats:
                        count_exhausted.add(this_choice)
            print("Successful on try number %d" % try_number)
            return grid
        except ValueError:
            # dead end: discard this grid and start over
            try_number += 1
    print("Unsuccessful")
Example grid:
np.random.seed(42)
grid = try_make_grid(samples, l, repeats)
#Successful on try number 6
print(grid)
#[[10. 5. 8. 11. 3. 0.]
# [ 0. 11. 4. 8. 2. 5.]
# [ 1. 6. 0. 2. 7. 3.]
# [ 3. 2. 7. 10. 11. 9.]
# [ 4. 1. 9. 6. 8. 7.]
# [ 6. 9. 10. 5. 1. 4.]]
As you can see, every row and every column is unique, and each value was picked no more than repeats times (in this case, each was picked exactly repeats times).
from collections import Counter
print(Counter(grid.ravel()))
#Counter({10.0: 3,
# 5.0: 3,
# 8.0: 3,
# 11.0: 3,
# 3.0: 3,
# 0.0: 3,
# 4.0: 3,
# 2.0: 3,
# 1.0: 3,
# 6.0: 3,
# 7.0: 3,
# 9.0: 3})
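The retry loop above throws the whole grid away whenever a cell has no valid value left. A possible variant, sketched here in pure Python, undoes only the most recent choices instead (backtracking). This is a swapped-in technique rather than the approach of the answer above. The names `make_grid_backtracking` and `check_grid` are hypothetical, and the worst-case running time is still exponential, so treat it as an illustration only. `check_grid` tests all three conditions at once, including the row and column uniqueness that the Counter output does not show.

```python
import random


def make_grid_backtracking(samples, size, repeats, rng=random):
    """Fill a size x size grid cell by cell, undoing choices on dead ends.

    Returns a list of rows, or None if no valid grid exists.
    """
    samples = list(samples)
    grid = [[None] * size for _ in range(size)]
    counts = {s: 0 for s in samples}

    def fill(pos):
        if pos == size * size:
            return True
        i, j = divmod(pos, size)
        in_row = set(grid[i][:j])
        rows_below = size - i - 1
        # Prune: a sample fits at most once per remaining row and at most
        # its remaining quota; if that cannot cover the empty cells, stop.
        capacity = sum(min(repeats - counts[s], rows_below + (s not in in_row))
                       for s in samples)
        if capacity < size * size - pos:
            return False
        used = in_row | {grid[r][j] for r in range(i)}
        candidates = [s for s in samples
                      if s not in used and counts[s] < repeats]
        rng.shuffle(candidates)
        for s in candidates:
            grid[i][j] = s
            counts[s] += 1
            if fill(pos + 1):
                return True
            counts[s] -= 1  # undo and try the next candidate
        grid[i][j] = None
        return False

    return grid if fill(0) else None


def check_grid(grid, repeats):
    """True if rows and columns are duplicate-free and no value exceeds repeats."""
    size = len(grid)
    rows_ok = all(len(set(row)) == size for row in grid)
    cols_ok = all(len(set(col)) == size for col in zip(*grid))
    flat = [v for row in grid for v in row]
    counts_ok = all(flat.count(v) <= repeats for v in set(flat))
    return rows_ok and cols_ok and counts_ok


grid = make_grid_backtracking(range(12), size=6, repeats=3)
if grid is not None:
    for row in grid:
        print(row)
    print(check_grid(grid, repeats=3))
```

A NumPy grid such as the one returned by try_make_grid could be passed to `check_grid` after converting it with `grid.tolist()`.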