Python: creating undirected weighted graph from a co-occurrence matrix

Question

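I am building a project in Python 2.7 that takes Twitter data and analyzes it. The main idea is to collect tweets and find the most common hashtags used in that collection; then I need to create a graph in which the hashtags are the nodes. If two hashtags appear in the same tweet, that is an edge in the graph, and the weight of that edge is the number of co-occurrences. So I am trying to build a dictionary of dictionaries using defaultdict(lambda : defaultdict(int)) and then create the graph using networkx.from_dict_of_dicts.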

My code for creating the co-occurrence matrix is:

from collections import defaultdict


def coocurrence(common_entities):
    com = defaultdict(lambda: defaultdict(int))

    # Build co-occurrence matrix
    for i in range(len(common_entities) - 1):
        for j in range(i + 1, len(common_entities)):
            w1, w2 = sorted([common_entities[i], common_entities[j]])
            if w1 != w2:
                com[w1][w2] += 1

    return com

However, in order to use networkx.from_dict_of_dicts I need it to be in this format: com= {0: {1:{'weight':1}}}

Do you have any ideas how I can solve this? Or is there a different way of creating a graph like this?

Answer 1

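First of all, I would sort the entities up front, so you aren't repeatedly running a sort inside the loop. Then I would use itertools.combinations to get the pairs. A straightforward translation of what you need with those changes is: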

from itertools import combinations
from collections import defaultdict


def coocurrence (common_entities):

    com = defaultdict(lambda : defaultdict(lambda: {'weight':0}))

    # Build co-occurrence matrix
    for w1, w2 in combinations(sorted(common_entities), 2):
        if w1 != w2:
            com[w1][w2]['weight'] += 1

    return com

print coocurrence('abcaqwvv')
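
The nested dictionary returned here has the {node: {neighbor: {'weight': n}}} shape the question asks for, so it should be possible to pass it straight to networkx.from_dict_of_dicts. A minimal sketch, assuming networkx is installed (the import is guarded so the counting part still runs without it); build_graph is a hypothetical helper name, not something from the answer:

```python
from collections import defaultdict
from itertools import combinations

try:
    import networkx as nx  # third-party; assumed to be installed
except ImportError:
    nx = None


def coocurrence(common_entities):
    # Same one-loop version as above, repeated so the sketch is self-contained.
    com = defaultdict(lambda: defaultdict(lambda: {'weight': 0}))
    for w1, w2 in combinations(sorted(common_entities), 2):
        if w1 != w2:
            com[w1][w2]['weight'] += 1
    return com


def build_graph(weights):
    # Hypothetical helper: from_dict_of_dicts returns an undirected
    # nx.Graph unless create_using says otherwise.
    if nx is None:
        raise RuntimeError('networkx is not installed')
    return nx.from_dict_of_dicts(weights)


com = coocurrence('abcaqwvv')
if nx is not None:
    graph = build_graph(com)
    # Each pair is stored once (w1 < w2), but the graph is undirected,
    # so the weight is reachable from either endpoint.
    assert graph['v']['a']['weight'] == com['a']['v']['weight']
```

Because each pair is stored only under its alphabetically smaller member, the dictionary itself is not symmetric; the undirected graph takes care of that.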

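It may be more efficient (less indexing and fewer objects created) to build something else first and then generate the final answer in a second loop. The second loop won't run for as many iterations as the first, because all the counts have already been computed. Also, since the second loop runs fewer iterations, deferring the if statement to the second loop may save more time. As usual, run timeit on several variations if you care, but here is one possible example of a two-loop solution: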

def coocurrence (common_entities):

    com = defaultdict(int)

    # Build co-occurrence matrix
    for w1, w2 in combinations(sorted(common_entities), 2):
        com[w1, w2] += 1

    result = defaultdict(dict)
    for (w1, w2), count in com.items():
        if w1 != w2:
            result[w1][w2] = {'weight': count}
    return result

print coocurrence('abcaqwvv')
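
To follow the advice about timing, the two variants could be compared with the standard timeit module. A rough sketch: one_loop, two_loops and as_plain_dict are hypothetical names introduced here, and the input string and repeat count are arbitrary assumptions. No timings are claimed; the result will depend on the data and the interpreter.

```python
import timeit
from collections import defaultdict
from itertools import combinations


def one_loop(common_entities):
    # Variant 1: build the nested {'weight': n} dicts directly.
    com = defaultdict(lambda: defaultdict(lambda: {'weight': 0}))
    for w1, w2 in combinations(sorted(common_entities), 2):
        if w1 != w2:
            com[w1][w2]['weight'] += 1
    return com


def two_loops(common_entities):
    # Variant 2: count flat (w1, w2) keys first, then reshape once.
    com = defaultdict(int)
    for w1, w2 in combinations(sorted(common_entities), 2):
        com[w1, w2] += 1
    result = defaultdict(dict)
    for (w1, w2), count in com.items():
        if w1 != w2:
            result[w1][w2] = {'weight': count}
    return result


def as_plain_dict(nested):
    # Strip the defaultdict wrappers so the two results compare cleanly.
    return {w1: dict(inner) for w1, inner in nested.items()}


data = 'abcaqwvv' * 20  # arbitrary test input (an assumption)

# Check that both variants agree before comparing their speed.
assert as_plain_dict(one_loop(data)) == as_plain_dict(two_loops(data))

for func in (one_loop, two_loops):
    seconds = timeit.timeit(lambda: func(data), number=20)  # 20 is arbitrary
    print('%s: %.4f s' % (func.__name__, seconds))
```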

Answer 2

This is the working code, and the best version I found:

from collections import defaultdict
from itertools import combinations


def coocurrence(*inputs):
    com = defaultdict(int)

    for named_entities in inputs:
        # Build co-occurrence matrix
        for w1, w2 in combinations(sorted(named_entities), 2):
            com[w1, w2] += 1
            com[w2, w1] += 1  # Including both directions

    result = defaultdict(dict)
    for (w1, w2), count in com.items():
        if w1 != w2:
            result[w1][w2] = {'weight': count}
    return result
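
The question also asks whether the graph could be created a different way. One possibility, sketched here and not taken from either answer, is to skip the dict of dicts and count pairs with collections.Counter, then hand (u, v, weight) triples to Graph.add_weighted_edges_from. The helper name weighted_edges and the sample tweets are made up for illustration. Unlike the code above, this sketch counts a hashtag repeated inside one tweet only once, which is an assumption about the desired semantics. The networkx import is guarded so the counting part runs without it.

```python
from collections import Counter
from itertools import combinations

try:
    import networkx as nx  # third-party; assumed to be installed
except ImportError:
    nx = None


def weighted_edges(tweets):
    # Hypothetical helper: one (tag_a, tag_b, count) triple per pair.
    counts = Counter()
    for hashtags in tweets:
        # set() drops repeats within one tweet (an assumption);
        # sorted() gives every pair a single canonical key.
        for pair in combinations(sorted(set(hashtags)), 2):
            counts[pair] += 1
    return [(w1, w2, n) for (w1, w2), n in sorted(counts.items())]


# Made-up sample data: one list of hashtags per tweet.
tweets = [
    ['python', 'graph'],
    ['graph', 'networkx', 'python'],
    ['python', 'python', 'matrix'],
]
edges = weighted_edges(tweets)

graph = None
if nx is not None:
    graph = nx.Graph()  # undirected
    graph.add_weighted_edges_from(edges)  # stores n under the 'weight' key
```

With an undirected nx.Graph there is no need to store both directions, since graph[u][v] and graph[v][u] refer to the same edge data.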
