有效地从列表中提取元素而不重复

Question

我需要从列表中不重复地提取值的确切数量（从累积分布中提取）（并检查来自 networkx 有向图的另一个约束）。

我的列表有超过 100000 个元素（来自 networkx DiGraph 的某些节点）

我的代码如下所示：

def extractRandomSubset(alist,number,G #the graph):
    res={}
    count=0
    while count<number:
        el=random.choice(alist) #number extraction
        if el not in res.values() and G.nodes[el]["weight"]!=0: #my two constraints
            res[count]=el
            count+=1                    
    return res

我的代码需要 0.001 秒到 0.5 秒，我需要执行 100000 次，所以会花费很长时间。有没有办法将其提高到运行更快？

编辑：我使用的类型都可以更改（例如 alist 也可以是其他一些数据类型），我输入的是对部分数据（较小的样本）工作良好的类型.唯一的问题是我需要对结果进行排序（所以设置不是一个好的选择）

Answer 1

最简单的解决方案是打乱整个列表并取符合第二个条件的前 count 个值：

def extractRandomSubset(_list,number,G):
    from random import shuffle
    shuffle(_list) # now items in _list are shuffled randomly
    ret = []
    for el in _list:
        if len(ret) >= number:
            return ret
        if G.nodes[el]["weight"]!=0:
            ret.append(el)
    # handle the case where you don't have enough nodes with nonzero weights in G
    raise ValueError('Not enough nodes with nonzero weight!')

同样想法的另一个版本首先过滤掉所有没有正权重的值然后洗牌：

def extractRandomSubset(_list,number,G):
    from random import shuffle
    valid_items = [el for el in _list if G.nodes[el]["weight"]!=0]
    shuffle(valid_items)
    if len(valid_items) < number:
        raise ValueError('Not enough nodes with nonzero weight!')
    return valid_items[:number]

就个人而言，我更喜欢这个选项，因为：

语法更加清晰，代码中表达的意图清晰
通过一次不向列表中添加一项来避免多次分配（这可能是减速的一个重要来源）

哪个最快完全取决于 _list、G 中元素的数量和 number 的值（我希望大列表和低 number 值，第一个选项可能更快）。

第二个版本的代码也可以使用random.sample()代替shuffle + take head来实现（是等价的但是可能会有不同的表现，具体取决于具体的参数值）：

def extractRandomSubset(_list,number,G):
    from random import shuffle, sample
    valid_items = [el for el in _list if G.nodes[el]["weight"]!=0]
    return sample(valid_items, number)

PS：调用变量 list 是个坏主意（因此我使用 _list 代替），或任何其他掩盖 python keyworkds/types.

有效地从列表中提取元素而不重复

Extract element from a list without repetition efficiently

python

list

networkx