Group by year in NetworkX to calculate the annual number of connections

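I have a dataframe with two ID columns and a year column. Two IDs in the same row are connected. I want to group by year to compute the total number of connections each ID has per year.

I use NetworkX to count the connections, taking only ID1 and ID2 into account, but I don't know how to group by year.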

import pandas as pd
import networkx as nx

d = {'ID1': [21, 21, 21, 21, 21], 'ID2': [343252, 44134, 41314, 161345, 89479], 'year': [2010, 2010, 2010, 2011, 2011]}
df = pd.DataFrame(data=d)

# Build an undirected graph with one edge per (ID1, ID2) row
G = nx.from_pandas_edgelist(df, 'ID1', 'ID2')

# Number of neighbors of each node (named `connections` to avoid shadowing the built-in `dict`)
connections = {x: len(G[x]) for x in G.nodes}
s = pd.Series(connections, name='connections')
df1 = s.to_frame().sort_values('connections', ascending=False)

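This gives me the number of connections regardless of year.

What I would like to do is create a graph for each year present in the dataset (the data spans 30 years), compute the connections in that year, and add the result to the dataset. Is there any modification I can make to do this, considering that the dataset is rather large? I thought about looping over the years, filtering the data and building a graph for each one, but with millions of rows that is inefficient.

Two suggestions for how to do this: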

def count_connects(sdf):
    """Return the number of neighbors of each node in the sub-dataframe sdf."""
    G = nx.from_pandas_edgelist(sdf, "ID1", "ID2")
    return pd.DataFrame.from_dict(
        {n: len(G[n]) for n in G.nodes}, orient="index"
    )

# Version 1: long format, one row per (year, node)
df_connects = (
    df.groupby("year").apply(count_connects)
      .reset_index(level=1)
      .rename(columns={"level_1": "node", 0: "connections"})
)

# Version 2: wide format, one row per node and one column per year
df_connects = pd.concat(
    [
        count_connects(sdf).rename(columns={0: year})
        for year, sdf in df.groupby("year", as_index=False)
    ],
    axis="columns"
)

Results for the sample dataframe:

Version 1:

        node  connections
year                     
2010      21            3
2010  343252            1
2010   44134            1
2010   41314            1
2011      21            2
2011  161345            1
2011   89479            1

Version 2:

        2010  2011
21       3.0   2.0
41314    1.0   NaN
44134    1.0   NaN
89479    NaN   1.0
161345   NaN   1.0
343252   1.0   NaN
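
Since the question also asks about adding the counts back to the data, here is a minimal sketch of one way that might be done with a left merge. It assumes the counts are in the long format of Version 1 with `year` moved into a regular column (for example via `.reset_index()`); to keep the snippet self-contained, they are typed in by hand here. The column name `ID1_connections` is made up for illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID1": [21, 21, 21, 21, 21],
    "ID2": [343252, 44134, 41314, 161345, 89479],
    "year": [2010, 2010, 2010, 2011, 2011],
})

# Long-format counts, typed in by hand to match the Version 1 result
# (assumption: `year` is a regular column, e.g. after .reset_index())
df_connects = pd.DataFrame({
    "year": [2010, 2010, 2010, 2010, 2011, 2011, 2011],
    "node": [21, 343252, 44134, 41314, 21, 161345, 89479],
    "connections": [3, 1, 1, 1, 2, 1, 1],
})

# Attach the yearly count of each ID1 to the original rows.
# `ID1_connections` is a hypothetical column name.
df_merged = df.merge(
    df_connects.rename(columns={"node": "ID1", "connections": "ID1_connections"}),
    on=["year", "ID1"],
    how="left",
)
print(df_merged)
```

The same pattern, with `node` renamed to `ID2` instead, would attach the counts for the second ID column.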

To get a sense of how long this takes, I tried the following (1,000 possible ID1 values, 10,000 possible ID2 values, 2 years, 2 million rows in total):

from random import randint
from time import perf_counter

num_nodes_1 = 1_000
num_nodes_2 = 10_000
num_years = 2
start_year = 1999
num_rows_per_year = 1_000_000

df = pd.DataFrame(
    [
        [randint(1, num_nodes_1), randint(1, num_nodes_2), start_year + year]
        for year in range(num_years)
        for _ in range(num_rows_per_year)
    ],
    columns=["ID1", "ID2", "year"]
)
print(df)

start = perf_counter()
df_connects = (
    df.groupby("year").apply(count_connects)
      .reset_index(level=1)
      .rename(columns={"level_1": "node", 0: "connections"})
)
end = perf_counter()
print(f"Duration version 1: {end - start:.2f} seconds")

start = perf_counter()
df_connects = pd.concat(
    [
        count_connects(sdf).rename(columns={0: year})
        for year, sdf in df.groupby("year", as_index=False)
    ],
    axis="columns"
)
end = perf_counter()
print(f"Duration version 2: {end - start:.2f} seconds")

That didn't take too long:

Duration version 1: 10.58 seconds
Duration version 2: 11.06 seconds
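
If building one graph per year ever becomes the bottleneck, a pandas-only variant may be worth trying. This is an untimed sketch, not part of the approach above: it lists every edge in both directions, drops duplicates so that repeated edges count once (as they would in an `nx.Graph`), and then counts rows per `(year, node)`. The function name `count_connects_pandas` and the intermediate column name `neighbor` are made up for illustration.

```python
import pandas as pd

def count_connects_pandas(df):
    """Count distinct neighbors per (year, node) without building graphs."""
    cols = ["year", "node", "neighbor"]
    # List every edge in both directions so each endpoint shows up as "node"
    forward = df.rename(columns={"ID1": "node", "ID2": "neighbor"})[cols]
    backward = df.rename(columns={"ID2": "node", "ID1": "neighbor"})[cols]
    edges = pd.concat([forward, backward], ignore_index=True)
    # Repeated edges within a year should count only once
    edges = edges.drop_duplicates()
    return (
        edges.groupby(["year", "node"]).size()
             .rename("connections")
             .reset_index()
    )

# Sample data from the question
df = pd.DataFrame({
    "ID1": [21, 21, 21, 21, 21],
    "ID2": [343252, 44134, 41314, 161345, 89479],
    "year": [2010, 2010, 2010, 2011, 2011],
})
result = count_connects_pandas(df)
print(result)
```

The result is in long format with columns `year`, `node` and `connections`, so it should be comparable to Version 1 apart from row order and `year` being a regular column.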