Return list of weighted objects with semi-randomized ranking

Suppose I have a list of objects (in Python) that looks like this, where each entry holds an identifier and a ranking/weighting:

objects = [
    ("object_1", 0.50),
    ("object_2", 0.75),
    ("object_3", 0.25),
    ("object_4", 0.01),
    ("object_5", 0.99),
]

I want to return the same objects list, but in a semi-random order driven by the weights. That is, I don't always want to return:

[
    ("object_5", 0.99),
    ("object_2", 0.75),
    ("object_1", 0.50),
    ("object_3", 0.25),
    ("object_4", 0.01),
]

Instead, I'd like to allow some non-determinism, so that the returned list generally looks like the one above but could also look like:

[
    ("object_5", 0.99),
    ("object_1", 0.50),
    ("object_2", 0.75),
    ("object_4", 0.01),
    ("object_3", 0.25),
]

EDIT: I think my question is different from the other question, because order matters here; in that question, order doesn't matter (again, I think!).

If I'm not mistaken, one approach is weighted sampling without replacement:

from random import choices


def weighted_sample_without_replacement(population, weights, k=1):
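    # Draw k distinct items; heavier items tend to be drawn earlier.
    # Assumes at least k items have a positive weight.
    weights = list(weights)  # copy, because drawn items get their weight zeroed
    positions = range(len(population))
    indices = []
    while True:
        needed = k - len(indices)
        if not needed:
            break
        for i in choices(positions, weights, k=needed):
            # Skip positions that were already taken earlier in this batch.
            if weights[i]:
                weights[i] = 0.0  # exclude this position from later draws
                indices.append(i)
    return [population[i] for i in indices]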


data = [
    ("object_5", 0.99),
    ("object_2", 0.75),
    ("object_1", 0.50),
    ("object_3", 0.25),
    ("object_4", 0.01),
]

_, weights = zip(*data)
sample = weighted_sample_without_replacement(data, weights, k=len(data))
print(sample)

Output (single run)

[('object_2', 0.75), ('object_5', 0.99), ('object_3', 0.25), ('object_1', 0.5), ('object_4', 0.01)]
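For comparison, here is a minimal sketch of another way to get a weighted order without replacement in a single sort. As far as I know this is the key-based method usually attributed to Efraimidis and Spirakis: each item gets the key u ** (1 / w), with u uniform in [0, 1), and the items are sorted by key in descending order. The name weighted_shuffle and the rng parameter are my own choices for illustration, and the sketch assumes every weight is strictly positive.

```python
import random

data = [
    ("object_5", 0.99),
    ("object_2", 0.75),
    ("object_1", 0.50),
    ("object_3", 0.25),
    ("object_4", 0.01),
]


def weighted_shuffle(items, rng=None):
    """Return all items in a weighted random order (hypothetical helper).

    Assumes items are (identifier, weight) pairs with weight > 0.
    """
    rng = rng or random.Random()
    keyed = []
    for name, weight in items:
        if weight <= 0:
            raise ValueError("weights must be strictly positive")
        # A larger weight pushes u ** (1 / weight) toward 1, so the item
        # tends to sort earlier.
        keyed.append((rng.random() ** (1.0 / weight), (name, weight)))
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in keyed]


print(weighted_shuffle(data))
```

Passing a seeded random.Random instance as rng makes a run reproducible, which is handy when testing.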

A basic experimental analysis seems to confirm my hypothesis:

from collections import defaultdict
from operator import itemgetter

_, weights = zip(*data)
counts = defaultdict(lambda : defaultdict(int))
for _ in range(1000):
    sample = weighted_sample_without_replacement(data, weights, k=len(data))
    for i, (key, _) in enumerate(sample):
        counts[i][key] += 1

for key, values in counts.items():
    print(key, sorted(values.items(), key=itemgetter(1), reverse=True))

Output (experiment)

0 [('object_5', 415), ('object_2', 290), ('object_1', 186), ('object_3', 106), ('object_4', 3)]
1 [('object_2', 322), ('object_5', 309), ('object_1', 241), ('object_3', 119), ('object_4', 9)]
2 [('object_1', 319), ('object_2', 259), ('object_3', 209), ('object_5', 199), ('object_4', 14)]
3 [('object_3', 533), ('object_1', 239), ('object_2', 126), ('object_5', 75), ('object_4', 27)]
4 [('object_4', 947), ('object_3', 33), ('object_1', 15), ('object_2', 3), ('object_5', 2)]

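'object_5' ended up in one of the first two positions 724 times out of 1000, while 'object_4' ended up in the last position 947 times out of 1000. For a better visualization of the results, see the figure below (the visualization was generated from an additional run of the experimental setup):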

The code to reproduce the experiment can be found here.
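As a quick sanity check on the first row of the experiment, the first draw in weighted sampling picks each item with probability weight / sum(weights). The sketch below computes those theoretical values for the same data; the variable names are mine.

```python
data = [
    ("object_5", 0.99),
    ("object_2", 0.75),
    ("object_1", 0.50),
    ("object_3", 0.25),
    ("object_4", 0.01),
]

# The weights sum to 2.5, so each first-position probability is weight / 2.5.
total = sum(weight for _, weight in data)
first_position_probability = {name: weight / total for name, weight in data}

for name, probability in first_position_probability.items():
    print(f"{name}: {probability:.3f}")
# object_5: 0.396
# object_2: 0.300
# object_1: 0.200
# object_3: 0.100
# object_4: 0.004
```

These values are close to the observed first-position frequencies of 415, 290, 186, 106 and 3 out of 1000.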

If you can guarantee that the weight values are always in [0, 1), then the following code will work!

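from random import random
from typing import Any, List, Tuple


def weighted_sample_without_replacement(
    population: List[Tuple[Any, float]], weights: tuple
) -> List[Tuple[Any, float]]:
    # Scale each weight by a fresh random factor in [0, 1) and sort on the result.
    # reverse=True puts the largest noisy weights first, matching the ordering
    # asked for in the question; `weights` is unused and kept only for the signature.
    return sorted(population, key=lambda x: x[1] * random(), reverse=True)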

where population looks like:

[
    ("object_5", 0.99),
    ("object_2", 0.75),
    ("object_1", 0.50),
    ("object_3", 0.25),
    ("object_4", 0.01),
]

and weights looks like:

(
    0.99,
    0.75,
    0.50,
    0.25,
    0.01,
)
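To get a feel for how this sort-based idea behaves, here is a small sketch that tallies which object lands first and last over many runs. It assumes the highest noisy weight should come first (hence reverse=True), and the names noisy_weight_order and rng are hypothetical ones I picked for the example. Note that the resulting distribution need not match weighted sampling without replacement: with weights 1 and 0.5, for instance, the lighter item comes first with probability 1/4 here, versus 1/3 when sampling sequentially by weight.

```python
import random
from collections import Counter

population = [
    ("object_5", 0.99),
    ("object_2", 0.75),
    ("object_1", 0.50),
    ("object_3", 0.25),
    ("object_4", 0.01),
]


def noisy_weight_order(items, rng):
    # Multiply each weight by a fresh uniform factor in [0, 1) and sort so
    # that the largest noisy weight comes first.
    return sorted(items, key=lambda item: item[1] * rng.random(), reverse=True)


rng = random.Random(42)  # seeded so the tallies are reproducible
runs = 10_000
first_counts = Counter()
last_counts = Counter()
for _ in range(runs):
    ordered = noisy_weight_order(population, rng)
    first_counts[ordered[0][0]] += 1
    last_counts[ordered[-1][0]] += 1

print("first:", first_counts.most_common())
print("last:", last_counts.most_common())
```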