迭代计算列表中的元素并将计数存储在字典中

Question

我有一段代码循环遍历一组节点并计算将给定节点连接到网络中每个其他节点的路径长度。对于每个节点，我的代码 returns 是一个列表，b 包含整数值，为每个可能的连接提供路径长度。我想计算给定路径长度的出现次数，以便创建直方图。

local_path_length_hist = {}
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    a = dist.a
    #Delete some erroneous entries
    b = a[a!=2147483647]
    for dist in b:
        if dist in local_path_length_hist:
            local_path_length_hist[dist]+=1
        else:
            local_path_length_hist[dist]=1

就字典更新而言，这大概是非常粗糙的编码。有更好的方法吗？创建此直方图的最有效方法是什么？

Answer 1

检查元素是否存在于 dict 中并不是真正必要的。您可以只使用 collections.defaultdict。它的初始化接受可调用对象（如函数），如果您想访问（或分配一些东西）不存在的元素以生成值（即生成默认值的函数），将调用该对象。对于您的情况，它可以只是 int。即

import collections
local_path_length_hist = collections.defaultdict(int)
# you could say collections.defaultdict(lambda : 0) instead
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    a = dist.a
    #Delete some erroneous entries
    b = a[a!=2147483647]
    for dist in b:
        local_path_length_hist[dist] += 1

你可以这样把最后两行合二为一，但真的没有意义。

Answer 2

因为 gt.shortest_distance returns 一个 ndarray，numpy 数学是最快的：

max_dist = len(vertices) - 1
hist_length = max_dist + 2
no_path_dist = max_dist + 1
hist = np.zeros(hist_length) 
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    hist += np.bincount(dist.a.clip(max=no_path_dist))

我使用 ndarray 方法 clip 将 gt.shortest_distance 返回的 2147483647 值装箱在 hist 的最后位置。如果不使用 clip，hist's size 在 64 位 Python 上必须是 2147483647 + 1，否则 bincount 会产生 ValueError 在 32 位上 Python。所以 hist 的最后位置将包含所有非路径的计数；您可以在直方图分析中忽略此值。

如以下时间所示，使用 numpy 数学获得直方图比使用 defaultdicts 或 counters（Python 3.4):

# vertices      numpy    defaultdict    counter
    9000       0.83639    38.48990     33.56569
   25000       8.57003    314.24265    262.76025
   50000      26.46427   1303.50843   1111.93898

我的计算机速度太慢，无法使用 9 * (10**6) 个顶点进行测试，但对于不同数量的顶点（正如我们所期望的），相对时间似乎非常一致。

时序码:

from collections import defaultdict, Counter
import numpy as np
from random import randint, choice
from timeit import repeat

# construct distance ndarray such that:
# a) 1/3 of values represent no path
# b) 2/3 of values are a random integer value [0, (num_vertices - 1)]
num_vertices = 50000
no_path_length = 2147483647
distances = []
for _ in range(num_vertices):
    rand_dist = randint(0,(num_vertices-1))
    distances.append(choice((no_path_length, rand_dist, rand_dist)))
dist_a = np.array(distances)

def use_numpy_math():
    max_dist = num_vertices - 1
    hist_length = max_dist + 2
    no_path_dist = max_dist + 1
    hist = np.zeros(hist_length, dtype=np.int)
    for _ in range(num_vertices):
        hist += np.bincount(dist_a.clip(max=no_path_dist))

def use_default_dict():
    d = defaultdict(int)
    for _ in range(num_vertices):
        for dist in dist_a:
            d[dist] += 1

def use_counter():
    hist = Counter()
    for _ in range(num_vertices):
        hist.update(dist_a)

t1 = min(repeat(stmt='use_numpy_math()', setup='from __main__ import use_numpy_math',
                repeat=3, number=1))
t2 = min(repeat(stmt='use_default_dict()', setup='from __main__ import use_default_dict',
                repeat= 3, number=1))
t3 = min(repeat(stmt='use_counter()', setup='from __main__ import use_counter',
                repeat= 3, number=1))

print('%0.5f, %0.5f. %0.5f' % (t1, t2, t3))

Answer 3

我认为您可以完全绕过这段代码。您的问题带有 graph-tool. Take a look at this section of their documentation: graph_tool.stats.vertex_hist 标记。

链接文档摘录：

graph_tool.stats.vertex_hist(g, deg, bins=[0, 1], float_count=True)
Return the vertex histogram of the given degree type or property.

Parameters:
g : Graph Graph to be used.
deg : string or PropertyMap
Degree or property to be used for the histogram. It can be either “in”, “out” or “total”, for in-,
out-, or total degree of the vertices. It can also be a vertex property map.
bins : list of bins (optional, default: [0, 1])
List of bins to be used for the histogram. The values given represent the edges of the bins
(i.e. lower and upper bounds). If the list contains two values, this will be used to automatically
create an appropriate bin range, with a constant width given by the second value, and starting
from the first value.
float_count : bool (optional, default: True)
If True, the counts in each histogram bin will be returned as floats. If False, they will be
returned as integers.

Returns: counts : ndarray
The bin counts.
bins : ndarray
The bin edges.

这会将 return 边缘分组为 ndarray 中的直方图。然后，您只需获取 ndarray 列的长度即可生成直方图。

Answer 4

collections 模块中有一个名为 Counter 的实用程序。这比使用 defaultdict(int)

更干净

from collections import Counter
hist = Counter()
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    a = dist.a
    #Delete some erroneous entries
    b = a[a!=2147483647]
    hist.update(b)

迭代计算列表中的元素并将计数存储在字典中

Iteratively count elements in list and store count in dictionary

python

performance

dictionary

list

graph-tool