Group by year in NetworkX to calculate annual number of connections
I have a dataframe with two IDs and a year. Two IDs in the same row indicate a connection. I want to group by year to count the total number of connections a given ID has in each year.
I use NetworkX to count the connections, considering only ID1 and ID2, but I don't know how to group by year.
import pandas as pd
import networkx as nx

d = {'ID1': [21, 21, 21, 21, 21],
     'ID2': [343252, 44134, 41314, 161345, 89479],
     'year': [2010, 2010, 2010, 2011, 2011]}
df = pd.DataFrame(data=d)

G = nx.from_pandas_edgelist(df, 'ID1', 'ID2')

# count the neighbours of every node
# (avoid naming the dict `dict`, which shadows the built-in)
degrees = {}
for x in G.nodes:
    degrees[x] = len(G[x])

s = pd.Series(degrees, name='connections')
df1 = s.to_frame().sort_values('connections', ascending=False)
This gives me the number of connections regardless of the year.
What I would like to do is build a graph for every year present in the dataset (the data spans 30 years), count the connections for that year, and add them to the database. Given that I have a fairly large database of variables, is there a modification I can make to achieve this? I thought about looping over the years, filtering the data, and building one graph per year, but since I have millions of rows that seemed inefficient.
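As a side note on the snippet above: for a simple `nx.Graph`, `len(G[x])` is the node's degree, so the whole loop can be replaced with NetworkX's `G.degree` view. A minimal sketch on the same sample data:

```python
import pandas as pd
import networkx as nx

d = {'ID1': [21, 21, 21, 21, 21],
     'ID2': [343252, 44134, 41314, 161345, 89479],
     'year': [2010, 2010, 2010, 2011, 2011]}
df = pd.DataFrame(data=d)
G = nx.from_pandas_edgelist(df, 'ID1', 'ID2')

# G.degree is a view of (node, degree) pairs; for a simple Graph
# the degree equals len(G[x]) for every node x.
s = pd.Series(dict(G.degree), name='connections')
df1 = s.to_frame().sort_values('connections', ascending=False)
print(df1)
```

Node 21 appears in every row, so it comes out on top with 5 connections.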
Two suggested modifications to your approach:
def count_connects(sdf):
    G = nx.from_pandas_edgelist(sdf, "ID1", "ID2")
    return pd.DataFrame.from_dict(
        {n: len(G[n]) for n in G.nodes}, orient="index"
    )

# Version 1: one long frame with a (year, node) index
df_connects = (
    df.groupby("year").apply(count_connects)
      .reset_index(level=1)
      .rename(columns={"level_1": "node", 0: "connections"})
)

# Version 2: one column per year, NaN where a node is absent that year
df_connects = pd.concat(
    [
        count_connects(sdf).rename(columns={0: year})
        for year, sdf in df.groupby("year", as_index=False)
    ],
    axis="columns"
)
Results for the sample dataframe:
        node  connections
year
2010      21            3
2010  343252            1
2010   44134            1
2010   41314            1
2011      21            2
2011  161345            1
2011   89479            1

        2010  2011
21       3.0   2.0
41314    1.0   NaN
44134    1.0   NaN
89479    NaN   1.0
161345   NaN   1.0
343252   1.0   NaN
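The two layouts carry the same information: Version 1's long frame can be pivoted into Version 2's wide one with `DataFrame.pivot`. A minimal sketch, with the long frame rebuilt by hand from the sample output above:

```python
import pandas as pd

# Long layout as produced by Version 1 (reconstructed for illustration)
long = pd.DataFrame({
    "year":        [2010, 2010, 2010, 2010, 2011, 2011, 2011],
    "node":        [21, 343252, 44134, 41314, 21, 161345, 89479],
    "connections": [3, 1, 1, 1, 2, 1, 1],
})

# Rows become nodes, columns become years -- Version 2's layout.
# Nodes missing in a year get NaN, which is why the counts turn float.
wide = long.pivot(index="node", columns="year", values="connections")
print(wide)
```

Going the other way, `wide.stack()` recovers the long layout (dropping the NaN entries).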
To get an idea of how long this takes, I tried the following (1,000 possible ID1s, 10,000 possible ID2s, 2 years, 2,000,000 rows in total):
from random import randint
from time import perf_counter

num_nodes_1 = 1_000
num_nodes_2 = 10_000
num_years = 2
start_year = 1999
num_rows_per_year = 1_000_000

df = pd.DataFrame(
    [
        [randint(1, num_nodes_1), randint(1, num_nodes_2), start_year + year]
        for year in range(num_years)
        for _ in range(num_rows_per_year)
    ],
    columns=["ID1", "ID2", "year"]
)
print(df)

start = perf_counter()
df_connects = (
    df.groupby("year").apply(count_connects)
      .reset_index(level=1)
      .rename(columns={"level_1": "node", 0: "connections"})
)
end = perf_counter()
print(f"Duration version 1: {end - start:.2f} seconds")

start = perf_counter()
df_connects = pd.concat(
    [
        count_connects(sdf).rename(columns={0: year})
        for year, sdf in df.groupby("year", as_index=False)
    ],
    axis="columns"
)
end = perf_counter()
print(f"Duration version 2: {end - start:.2f} seconds")
It didn't take too long:
Duration version 1: 10.58 seconds
Duration version 2: 11.06 seconds
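If building a NetworkX graph per year ever becomes the bottleneck, the same counts can be obtained in pandas alone: since `from_pandas_edgelist` builds a simple graph, a node's number of connections in a year is just its number of distinct partners that year. A sketch on the sample data, not a drop-in replacement:

```python
import pandas as pd

d = {'ID1': [21, 21, 21, 21, 21],
     'ID2': [343252, 44134, 41314, 161345, 89479],
     'year': [2010, 2010, 2010, 2011, 2011]}
df = pd.DataFrame(d)

# Treat every row as an undirected edge: stack both directions so each
# node appears once per edge it participates in.
both = pd.concat(
    [
        df.rename(columns={"ID1": "node", "ID2": "partner"}),
        df.rename(columns={"ID2": "node", "ID1": "partner"}),
    ],
    ignore_index=True,
)

# Distinct partners per (year, node) = degree in that year's simple graph
connections = (
    both.groupby(["year", "node"])["partner"]
        .nunique()
        .rename("connections")
        .reset_index()
)
print(connections)
```

This deduplicates repeated edges the same way `nx.Graph` does; whether it beats the NetworkX versions on millions of rows would need benchmarking on the real data.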