Pythonically create an adjacency matrix from a spreadsheet
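I have a spreadsheet that lists, for each person, the names of the people they report having worked with on several projects. If I import it into pandas as a DataFrame, it looks like this: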
              1                  2
Jane   ['Fred', 'Joe']    ['Joe', 'Fred', 'Bob']
Fred   ['Alex']           ['Jane']
Terry  NaN                ['Bob']
Bob    ['Joe']            ['Jane', 'Terry']
Alex   ['Fred']           NaN
Joe    ['Jane']           ['Jane']
I would like to create an adjacency matrix that looks like this:
       Jane  Fred  Terry  Bob  Alex  Joe
Jane      0     2      0    1     0    2
Fred      1     0      0    0     1    0
Terry     0     0      0    1     0    0
Bob       1     0      1    0     0    1
Alex      0     1      0    0     0    0
Joe       2     0      0    0     0    0
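This matrix will generally not be symmetric, because people's reports are inconsistent with one another. I have been building the adjacency matrix by looping over the DataFrame and incrementing the matrix elements accordingly. Looping over a DataFrame is apparently discouraged and inefficient, so does anyone have a suggestion for how to do this more pythonically?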
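The loop itself is not shown in the question. For reference, a minimal sketch of what such a baseline might look like, assuming the names are the index and each cell holds either a list of names or NaN (the variable names `names`, `adjacency`, `reporter` and `coworker` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's data, with the names as the index.
names = ['Jane', 'Fred', 'Terry', 'Bob', 'Alex', 'Joe']
df = pd.DataFrame(
    {
        '1': [['Fred', 'Joe'], ['Alex'], np.nan, ['Joe'], ['Fred'], ['Jane']],
        '2': [['Joe', 'Fred', 'Bob'], ['Jane'], ['Bob'], ['Jane', 'Terry'], np.nan, ['Jane']],
    },
    index=names,
)

# Start from an all-zero square matrix and increment one cell per reported name.
adjacency = pd.DataFrame(0, index=names, columns=names)
for reporter, row in df.iterrows():
    for cell in row:
        if isinstance(cell, list):  # NaN cells are floats, so they are skipped
            for coworker in cell:
                adjacency.loc[reporter, coworker] += 1

print(adjacency)
```

This gives the matrix in the requested row and column order, but it touches one cell at a time, which is the part the answers below replace with vectorised pandas operations.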
Here is one approach:
import ast
import io

import pandas as pd

data = '''          1                  2
Jane   ['Fred', 'Joe']    ['Joe', 'Fred', 'Bob']
Fred   ['Alex']           ['Jane']
Terry  NaN                ['Bob']
Bob    ['Joe']            ['Jane', 'Terry']
Alex   ['Fred']           NaN
Joe    ['Jane']           ['Jane']'''

# Columns are separated by two or more whitespace characters; the names become the index.
# NaN cells become '[]' so that every cell parses to a list with ast.literal_eval.
# If your columns already hold lists rather than their string representations, skip the
# literal_eval step and replace NaN with empty lists another way (fillna does not accept a list).
# On pandas >= 2.1, DataFrame.map is the preferred name for applymap.
df = pd.read_csv(io.StringIO(data), sep=r'\s\s+', engine='python').fillna('[]').applymap(ast.literal_eval)
df['all'] = df['1'] + df['2']  # merge the lists of columns 1 and 2
df_edges = df[['all']].explode('all').reset_index()  # one row per (person, coworker) pair
df_edges = df_edges.groupby(['index', 'all'])['all'].count().reset_index(name="count")  # count each pair
df_edges.pivot(index='index', columns='all', values='count').fillna(0).astype(int)  # pivot into the adjacency matrix
Output:
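index | Alex | Bob | Fred | Jane | Joe | Terry |
---|---|---|---|---|---|---|
Alex | 0 | 0 | 1 | 0 | 0 | 0 |
Bob | 0 | 0 | 0 | 1 | 1 | 1 |
Fred | 1 | 0 | 0 | 1 | 0 | 0 |
Jane | 0 | 1 | 2 | 0 | 2 | 0 |
Joe | 0 | 0 | 0 | 2 | 0 | 0 |
Terry | 0 | 1 | 0 | 0 | 0 | 0 |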
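The pivoted result comes out in alphabetical order, whereas the matrix in the question uses the spreadsheet's row order. If that order matters, `reindex` can rearrange both axes. A minimal sketch, assuming the pivoted matrix is stored in a variable called `adj` (rebuilt here by hand from the output above so the snippet runs on its own):

```python
import pandas as pd

# Stand-in for the pivoted result, typed in from the printed output above.
names_sorted = ['Alex', 'Bob', 'Fred', 'Jane', 'Joe', 'Terry']
adj = pd.DataFrame(
    [
        [0, 0, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 1],
        [1, 0, 0, 1, 0, 0],
        [0, 1, 2, 0, 2, 0],
        [0, 0, 0, 2, 0, 0],
        [0, 1, 0, 0, 0, 0],
    ],
    index=names_sorted,
    columns=names_sorted,
)

# Row/column order taken from the question's desired matrix.
wanted = ['Jane', 'Fred', 'Terry', 'Bob', 'Alex', 'Joe']

# fill_value=0 also adds an all-zero row or column for anyone missing from one axis.
adj_ordered = adj.reindex(index=wanted, columns=wanted, fill_value=0)
print(adj_ordered)
```

Passing the full list of names to both `index` and `columns` also keeps the matrix square when someone reports coworkers but is never named by anyone else, or the other way round.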
Here is the sample data I used.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jane', 'Fred', 'Terry', 'Bob', 'Alex', 'Joe'],
    '1': [['Fred', 'Joe'], ['Alex'], np.nan, ['Joe'], ['Fred'], ['Jane']],
    '2': [['Joe', 'Fred', 'Bob'], ['Jane'], ['Bob'], ['Jane', 'Terry'], np.nan, ['Jane']]
})
df.head()
    Name            1                 2
0   Jane  [Fred, Joe]  [Joe, Fred, Bob]
1   Fred       [Alex]            [Jane]
2  Terry          NaN             [Bob]
3    Bob        [Joe]     [Jane, Terry]
4   Alex       [Fred]               NaN
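I created the adjacency matrix with pandas in three simple steps.
First, I melted the data so that a single column holds all the connections between names, and dropped the variable column.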
dff = df.melt(id_vars=['Name']).drop('variable', axis=1)
     Name             value
0    Jane       [Fred, Joe]
1    Fred            [Alex]
2   Terry               NaN
3     Bob             [Joe]
4    Alex            [Fred]
5     Joe            [Jane]
6    Jane  [Joe, Fred, Bob]
7    Fred            [Jane]
8   Terry             [Bob]
9     Bob     [Jane, Terry]
10   Alex               NaN
11    Joe            [Jane]
Second, I used the explode method to split each list into separate rows, one per name.
dff = dff.explode('value')
     Name  value
0    Jane   Fred
0    Jane    Joe
1    Fred   Alex
2   Terry    NaN
3     Bob    Joe
4    Alex   Fred
5     Joe   Jane
6    Jane    Joe
6    Jane   Fred
6    Jane    Bob
7    Fred   Jane
8   Terry    Bob
9     Bob   Jane
9     Bob  Terry
10   Alex    NaN
11    Joe   Jane
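Finally, to create the adjacency matrix, I used pandas crosstab, which simply counts how often each pair of values occurs in the two specified columns.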
pd.crosstab(dff['Name'], dff['value'])
value  Alex  Bob  Fred  Jane  Joe  Terry
Name
Alex      0    0     1     0    0      0
Bob       0    0     0     1    1      1
Fred      1    0     0     1    0      0
Jane      0    1     2     0    2      0
Joe       0    0     0     2    0      0
Terry     0    1     0     0    0      0
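The three steps can also be chained into one expression. The sketch below wraps them in a helper; the function name `adjacency_from_reports` and its `name_col` parameter are made up for illustration, and it assumes every column other than the name column holds either a list of names or NaN:

```python
import numpy as np
import pandas as pd


def adjacency_from_reports(df, name_col='Name'):
    """Count how often each person in `name_col` names each coworker.

    Hypothetical helper: every column other than `name_col` is assumed
    to hold either a list of names or NaN.
    """
    pairs = (
        df.melt(id_vars=[name_col])   # one row per (person, project) cell
          .drop('variable', axis=1)   # the project label is not needed
          .explode('value')           # one row per (person, coworker) pair
    )
    # Rows whose value is NaN do not contribute to the counts.
    return pd.crosstab(pairs[name_col], pairs['value'])


df = pd.DataFrame({
    'Name': ['Jane', 'Fred', 'Terry', 'Bob', 'Alex', 'Joe'],
    '1': [['Fred', 'Joe'], ['Alex'], np.nan, ['Joe'], ['Fred'], ['Jane']],
    '2': [['Joe', 'Fred', 'Bob'], ['Jane'], ['Bob'], ['Jane', 'Terry'], np.nan, ['Jane']],
})

adj = adjacency_from_reports(df)
print(adj)
```

Because the helper melts every non-name column, it should work unchanged if the spreadsheet gains further project columns, as long as they follow the same list-or-NaN layout.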