Generate random and unique indexes with a range for n number of combinations
I want to do a random search over parameters, but I don't know how to generate random yet unique combinations of indexes within a given range. For example, I have these parameters:
hyperparams = {
    'size': [200, 300, 400],
    'min_count': [1, 2, 3, 4, 5],
    'iter': [50, 100],
    'window': [4, 5, 7, 10],
    'alpha': [0.025, 0.01],
    'min_alpha': [0.025, 1e-4],
}
I want to generate unique combinations of them, with each index drawn from its own range. Let's say this yields 500 possible combinations. I want to take only 100 of them at random, but within those 100, none may be a duplicate, i.e.
random_and_unique_combination = [1, 3, 2, 1, 2, 1]
where...
Index 0 is the size.
Index 1 is min_count.
Index 2 is iter.
and so on...
So later on I access the dictionary with:
size = hyperparams['size'][1]
min_count = hyperparams['min_count'][3]
iter = hyperparams['iter'][2]
Here's a hashing-based approach to generate unique combinations, without generating all of them, only the required number -
import numpy as np

num_comb = 100  # number of combinations needed

# We need ordered keys to maintain the indexing needed:
# Index 0 is the size, Index 1 is min_count, Index 2 is iter...
ordered_keys = ['size', 'min_count', 'iter', 'window', 'alpha', 'min_alpha']
lens = np.array([len(hyperparams[i]) for i in ordered_keys])
prod_lens = lens.cumprod()

# draw unique flat indices without replacement, then decode each one
# into a tuple of per-parameter indices (mixed-radix decomposition)
idx = np.random.choice(prod_lens[-1], num_comb, replace=False)

N = len(lens)
out = np.zeros((num_comb, N), dtype=int)
r = idx
for i in range(2, N + 1):
    d = r // prod_lens[-i]
    r = r - d * prod_lens[-i]
    out[:, -i + 1] = d
out[:, 0] = r
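Each row of out can then be mapped back to concrete hyperparameter values through ordered_keys; a minimal usage sketch (not part of the original answer):

# map each row of `out` back to actual parameter values
for row in out[:3]:
    combo = {key: hyperparams[key][j] for key, j in zip(ordered_keys, row)}
    print(combo)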
Runtime test
Timings on the three approaches posted thus far that avoid generating all combinations and are truly random - @norok2's Edit 1, @scnerd's, and the one from this post - across three output lengths -
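The timed names below refer to each answer's snippet wrapped as a function; for reference, a hypothetical sketch of the divakar wrapper (assuming import numpy as np from above):

def divakar(hyperparams, num_comb=100):
    ordered_keys = list(hyperparams)  # relies on dict insertion order (Python 3.7+)
    lens = np.array([len(hyperparams[k]) for k in ordered_keys])
    prod_lens = lens.cumprod()
    idx = np.random.choice(prod_lens[-1], num_comb, replace=False)
    N = len(lens)
    out = np.zeros((num_comb, N), dtype=int)
    r = idx
    for i in range(2, N + 1):
        d = r // prod_lens[-i]
        r = r - d * prod_lens[-i]
        out[:, -i + 1] = d
    out[:, 0] = r
    return out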
In [442]: hyperparams = {
...: 'size': [200, 300, 400],
...: 'min_count': [1, 2, 3, 4, 5],
...: 'iter': [50, 100],
...: 'window': [4, 5, 7, 10],
...: 'alpha': [0.025, 0.01],
...: 'min_alpha': [0.025, 1e-4],
...: }
In [443]: %timeit norok2_edit1(hyperparams, num=100)
...: %timeit scnerd(hyperparams, num=100)
...: %timeit divakar(hyperparams, num_comb=100)
...:
1000 loops, best of 3: 612 µs per loop
1000 loops, best of 3: 1.03 ms per loop
10000 loops, best of 3: 57.9 µs per loop
In [444]: %timeit norok2_edit1(hyperparams, num=200)
...: %timeit scnerd(hyperparams, num=200)
...: %timeit divakar(hyperparams, num_comb=200)
...:
1000 loops, best of 3: 1.39 ms per loop
100 loops, best of 3: 2 ms per loop
10000 loops, best of 3: 66.5 µs per loop
In [445]: %timeit norok2_edit1(hyperparams, num=400)
...: %timeit scnerd(hyperparams, num=400)
...: %timeit divakar(hyperparams, num_comb=400)
...:
100 loops, best of 3: 4.5 ms per loop
100 loops, best of 3: 4.01 ms per loop
10000 loops, best of 3: 77.5 µs per loop
If I understand correctly, you want a sequence of non-repeating tuples of numbers within a given range.
EDIT 0:
I believe your best option is to first create all possible combinations and then shuffle them:
import itertools
import random

def random_unique_combinations_k0(items, k):
    # generate all possible combinations
    combinations = list(itertools.product(*[item for item in items]))
    # shuffle them
    random.shuffle(combinations)
    for combination in itertools.islice(combinations, k):
        yield combination
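A hypothetical invocation, using the hyperparams dict from the question:

items = list(hyperparams.values())
combos = list(random_unique_combinations_k0(items, 100))
print(combos[0])  # one random, unique combination of the values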
EDIT 1:
If generating all the combinations is too expensive memory-wise, you may want to proceed by trial and error, rejecting non-unique combinations. One way of doing it would be:
import itertools
import random
import functools

def prod(items):
    return functools.reduce(lambda x, y: x * y, items)

def random_unique_combinations_k1(items, k):
    max_lens = [len(list(item)) for item in items]
    max_num_combinations = prod(max_lens)
    # use `set` to ensure uniqueness
    index_combinations = set()
    # make sure that with the chosen number the next loop can exit
    # WARNING: if `k` is too close to the total number of combinations,
    # it may take a while until the next valid combination is found
    while len(index_combinations) < min(k, max_num_combinations):
        index_combinations.add(tuple(
            random.randint(0, max_len - 1) for max_len in max_lens))
    # make sure their order is shuffled
    # (`set` seems to sort its content)
    index_combinations = list(index_combinations)
    random.shuffle(index_combinations)
    for index_combination in itertools.islice(index_combinations, k):
        yield tuple(item[i] for i, item in zip(index_combination, items))
(This could also be implemented with just a list, checking for uniqueness before adding each combination, which would also make random.shuffle() redundant, but from my tests these variants were slower than using a set.)
EDIT 2:
Probably the least memory-hungry option is to actually shuffle the input sequences and then use itertools.product() on them:
import random
import itertools

def pseudo_random_unique_combinations_k(items, k):
    # materialize and shuffle each input sequence in place
    comb_gens = [list(item) for item in items]
    for comb_gen in comb_gens:
        random.shuffle(comb_gen)
    # get the first `k` combinations of the shuffled sequences
    combinations = list(itertools.islice(itertools.product(*comb_gens), k))
    random.shuffle(combinations)
    for combination in itertools.islice(combinations, k):
        yield tuple(combination)
This will obviously sacrifice some randomness: since itertools.product() varies the last sequence fastest, the first k results all share the leading values of the earlier (shuffled) sequences.
EDIT 3:
Following @Divakar's approach, I wrote yet another version, which seems to be fairly efficient, but it is likely to be limited by the capabilities of random.sample():
import random
import functools

def prod(items):
    return functools.reduce(lambda x, y: x * y, items)

def random_unique_combinations_k3(items, k):
    max_lens = [len(list(item)) for item in items]
    max_num_combinations = prod(max_lens)
    for i in random.sample(range(max_num_combinations), k):
        index_combination = []
        for max_len in max_lens:
            index_combination.append(i % max_len)
            i = i // max_len
        yield tuple(item[j] for j, item in zip(index_combination, items))
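To illustrate the decoding: with lengths [3, 5, 2, 4, 2, 2], the flat index 137 decodes as 137 % 3 = 2 (then 137 // 3 = 45), 45 % 5 = 0 (then 9), 9 % 2 = 1 (then 4), 4 % 4 = 0 (then 1), 1 % 2 = 1 (then 0), 0 % 2 = 0, giving the index tuple (2, 0, 1, 0, 1, 0).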
Tests
On the requested input they all perform reasonably fast, with approach 0 being the fastest (surprisingly, even faster than approach 2, the pseudo approach), approach 1 the slowest, and approach 3 in between. The sklearn.model_selection.ParameterSampler method is comparable in speed to approach 1.
items = [v for k, v in hyperparams.items()]
num = 100
%timeit list(random_unique_combinations_k0(items, num))
615 µs ± 4.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(random_unique_combinations_k1(items, num))
2.51 ms ± 33.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit list(pseudo_random_unique_combinations_k(items, num))
179 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit list(random_unique_combinations_k3(items, num))
570 µs ± 35.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# the `sklearn` method, which is slightly different in that it is
# also accessing the underlying dictionary
from sklearn.model_selection import ParameterSampler
%timeit list(ParameterSampler(hyperparams, n_iter=num))
2.86 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As a side note, I would make sure that your hyperparams is a collections.OrderedDict, since a plain dict is not guaranteed to be ordered across different versions of Python.
For slightly larger objects, we start to see the limits:
items = [range(50)] * 5
num = 1000
%timeit list(random_unique_combinations_k0(items, num))
# Memory Error
%timeit list(random_unique_combinations_k1(items, num))
19.3 ms ± 273 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(pseudo_random_unique_combinations_k(items, num))
1.82 ms ± 14.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(random_unique_combinations_k3(items, num))
2.31 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Even more so for larger objects:
items = [range(50)] * 50
num = 1000
%timeit list(random_unique_combinations_k0(items, num))
# Memory Error
%timeit list(random_unique_combinations_k1(items, num))
149 ms ± 3.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(pseudo_random_unique_combinations_k(items, num))
4.92 ms ± 20.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit list(random_unique_combinations_k3(items, num))
# OverflowError
Summary:
Approach 0 may not fit in memory, approach 1 is the slowest but probably the most robust, approach 3 offers the best performance as long as it does not run into overflow issues, while approach 2 (pseudo) is the fastest and least memory-hungry, but it produces combinations that are somewhat "less random".
scikit-learn needs to solve this exact problem for RandomizedSearchCV, and they have a separate class ParameterSampler that you can use:
In [1]: from sklearn.model_selection import ParameterSampler
In [2]: list(ParameterSampler({'a': [1,2,3], 'b': ['x', 'y', 'z'], 'c': [0.1, 0.2, 0.3]}, n_iter=5))
Out[2]:
[{'a': 3, 'b': 'z', 'c': 0.2},
{'a': 3, 'b': 'y', 'c': 0.1},
{'a': 3, 'b': 'z', 'c': 0.1},
{'a': 3, 'b': 'x', 'c': 0.2},
{'a': 1, 'b': 'y', 'c': 0.3}]
These aren't indexes, but you can easily work around that minor issue by replacing your value lists with index lists:
In [1]: from sklearn.model_selection import ParameterSampler
In [2]: params = {'a': [1,2,3], 'b': ['x', 'y', 'z'], 'c': [0.1, 0.2, 0.3]}
In [3]: param_idxs = {key: list(range(len(vals))) for key, vals in params.items()}
In [4]: list(ParameterSampler(param_idxs, n_iter=5))
Out[4]:
[{'a': 1, 'b': 1, 'c': 0},
{'a': 1, 'b': 0, 'c': 2},
{'a': 0, 'b': 2, 'c': 1},
{'a': 1, 'b': 0, 'c': 1},
{'a': 2, 'b': 0, 'c': 0}]
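To recover the actual values from the sampled index dicts, a minimal sketch (assuming the params dict from above):

sampled = list(ParameterSampler(param_idxs, n_iter=5))
configs = [{key: params[key][idx] for key, idx in s.items()} for s in sampled]
print(configs)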
And according to the documentation, you won't get any repeats:
If all parameters are presented as a list, sampling without replacement is performed.
A quick glance at the current source code indicates that when all your parameters are given as lists, it generates all possible options and performs random sampling from that. That won't be an issue in most cases, but if you have a huge number of hyperparameter options, it could eat a lot of memory.
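If the grid really is too large to materialize, one workaround (a sketch, not from the original answer) is to pass distributions instead of lists, since ParameterSampler only expands the full grid when every parameter is a list; note that sampling is then done with replacement, so duplicates are no longer excluded:

from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# distributions avoid materializing the grid, at the cost of possible duplicates
param_dists = {'a': randint(0, 3), 'b': randint(0, 3), 'c': randint(0, 3)}
print(list(ParameterSampler(param_dists, n_iter=5)))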