Pythonically create adjacency matrix from spreadsheet

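I have a spreadsheet that lists, for each person, the names of the people they reported working with on several projects. If I import it into pandas as a dataframe, it looks like this: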

       1                  2
Jane   ['Fred', 'Joe']    ['Joe', 'Fred', 'Bob']
Fred   ['Alex']           ['Jane']
Terry  NaN                ['Bob']
Bob    ['Joe']            ['Jane', 'Terry']
Alex   ['Fred']           NaN
Joe    ['Jane']           ['Jane']

I want to create an adjacency matrix that looks like this:

      Jane  Fred  Terry  Bob  Alex  Joe
Jane  0     2     0      1    0     2
Fred  1     0     0      0    1     0
Terry 0     0     0      1    0     0
Bob   1     0     1      0    0     1
Alex  0     1     0      0    0     0
Joe   2     0     0      0    0     0

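This matrix will generally not be symmetric, because people's reports are inconsistent with one another. So far I have been building the adjacency matrix by iterating over the dataframe and incrementing the matrix elements accordingly. Looping over a dataframe is apparently discouraged and inefficient, so does anyone have suggestions for how to do this more pythonically?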
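For reference, the loop-based approach described above might look roughly like the sketch below. The dataframe is rebuilt by hand from the table in the question, and the names `adj`, `reporter`, and `colleague` are illustrative assumptions, not taken from the original code.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the table from the question (names as the index).
df = pd.DataFrame(
    {
        "1": [["Fred", "Joe"], ["Alex"], np.nan, ["Joe"], ["Fred"], ["Jane"]],
        "2": [["Joe", "Fred", "Bob"], ["Jane"], ["Bob"], ["Jane", "Terry"], np.nan, ["Jane"]],
    },
    index=["Jane", "Fred", "Terry", "Bob", "Alex", "Joe"],
)

# Start from an all-zero square matrix labelled by name on both axes.
names = list(df.index)
adj = pd.DataFrame(0, index=names, columns=names)

# Walk every cell and bump the (reporter, colleague) entry once per mention.
for reporter, row in df.iterrows():
    for cell in row:
        if isinstance(cell, list):  # NaN cells are floats, so they are skipped
            for colleague in cell:
                adj.loc[reporter, colleague] += 1

print(adj)
```

This works, but it touches the dataframe one element at a time, which is the part worth replacing.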

Here is one approach:

import ast
import io

import pandas as pd

data = '''       1                  2
Jane   ['Fred', 'Joe']    ['Joe', 'Fred', 'Bob']
Fred   ['Alex']           ['Jane']
Terry  NaN                ['Bob']
Bob    ['Joe']            ['Jane', 'Terry']
Alex   ['Fred']           NaN
Joe    ['Jane']           ['Jane']'''

# Parse the text table; NaN cells become the string '[]' so that every cell can be parsed into a list.
# If your columns already hold lists rather than string representations, skip the parsing step and
# replace NaN with empty lists yourself (fillna does not accept a list as its fill value).
df = pd.read_csv(io.StringIO(data), sep=r'\s\s+', engine='python').fillna('[]')
df = df.apply(lambda col: col.map(ast.literal_eval))
df['all'] = df['1'] + df['2']  # merge the lists of columns 1 and 2

df_edges = df[['all']].explode('all').reset_index()  # create a new df by exploding the combined list
df_edges = df_edges.groupby(['index', 'all'])['all'].count().reset_index(name='count')  # group by and count the pairs

# create the adjacency matrix with pivot; pairs that never occur are filled with 0
df_edges.pivot(index='index', columns='all', values='count').fillna(0).astype(int)

Output:

index  Alex  Bob  Fred  Jane  Joe  Terry
Alex      0    0     1     0    0      0
Bob       0    0     0     1    1      1
Fred      1    0     0     1    0      0
Jane      0    1     2     0    2      0
Joe       0    0     0     2    0      0
Terry     0    1     0     0    0      0
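The result above is sorted alphabetically, while the matrix in the question uses the spreadsheet's row order. A minimal sketch of reordering it with `reindex` follows. The `counts` frame is typed in by hand from the output above, and the variable names `counts`, `order`, and `adjacency` are assumptions made for illustration.

```python
import pandas as pd

# Edge counts copied from the output above: rows are the reporter, columns the person named.
alphabetical = ["Alex", "Bob", "Fred", "Jane", "Joe", "Terry"]
counts = pd.DataFrame(
    [
        [0, 0, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 1],
        [1, 0, 0, 1, 0, 0],
        [0, 1, 2, 0, 2, 0],
        [0, 0, 0, 2, 0, 0],
        [0, 1, 0, 0, 0, 0],
    ],
    index=alphabetical,
    columns=alphabetical,
)

# Row order of the original spreadsheet.
order = ["Jane", "Fred", "Terry", "Bob", "Alex", "Joe"]

# Reorder both axes; fill_value=0 also covers anyone missing from one of the axes.
adjacency = counts.reindex(index=order, columns=order, fill_value=0)
print(adjacency)
```

Reindexing both axes with the same list also keeps the matrix square if some person never reports anyone or is never named.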

Here is the data sample I used.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jane', 'Fred', 'Terry', 'Bob', 'Alex', 'Joe'],
    '1': [['Fred', 'Joe'], ['Alex'], np.nan, ['Joe'], ['Fred'], ['Jane']],
    '2': [['Joe', 'Fred', 'Bob'], ['Jane'], ['Bob'], ['Jane', 'Terry'], np.nan, ['Jane']]
})

df.head()
    Name            1                 2
0   Jane  [Fred, Joe]  [Joe, Fred, Bob]
1   Fred       [Alex]            [Jane]
2  Terry          NaN             [Bob]
3    Bob        [Joe]     [Jane, Terry]
4   Alex       [Fred]               NaN

I created the adjacency matrix with pandas in three simple steps.

First, I melted the data so that a single column holds all the connections between names, and dropped the variable column.

dff = df.melt(id_vars=['Name']).drop('variable', axis=1)
     Name             value
0    Jane       [Fred, Joe]
1    Fred            [Alex]
2   Terry               NaN
3     Bob             [Joe]
4    Alex            [Fred]
5     Joe            [Jane]
6    Jane  [Joe, Fred, Bob]
7    Fred            [Jane]
8   Terry             [Bob]
9     Bob     [Jane, Terry]
10   Alex               NaN
11    Joe            [Jane]

Second, I used the explode method to split each list into separate rows, one per name.

dff = dff.explode('value')
     Name  value
0    Jane   Fred
0    Jane    Joe
1    Fred   Alex
2   Terry    NaN
3     Bob    Joe
4    Alex   Fred
5     Joe   Jane
6    Jane    Joe
6    Jane   Fred
6    Jane    Bob
7    Fred   Jane
8   Terry    Bob
9     Bob   Jane
9     Bob  Terry
10   Alex    NaN
11    Joe   Jane

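Finally, to create the adjacency matrix, I used crosstab in pandas, which simply counts how often each pair of values occurs across the two specified columns.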

pd.crosstab(dff['Name'], dff['value'])
value  Alex  Bob  Fred  Jane  Joe  Terry
Name                                    
Alex      0    0     1     0    0      0
Bob       0    0     0     1    1      1
Fred      1    0     0     1    0      0
Jane      0    1     2     0    2      0
Joe       0    0     0     2    0      0
Terry     0    1     0     0    0      0
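The three steps above can be collected into one helper, sketched below. The function name `adjacency_from_reports` and the column name `colleague` are hypothetical, chosen only for this example. The final `reindex` is an addition to the steps above: it restores the spreadsheet's row order and keeps the matrix square.

```python
import numpy as np
import pandas as pd


def adjacency_from_reports(df, name_col="Name"):
    """Count how often each person in name_col names each other person."""
    # Step 1: melt so one column holds every reported list, then drop the project label.
    long = df.melt(id_vars=[name_col], value_name="colleague").drop(columns="variable")
    # Step 2: one row per named colleague; rows with no report (NaN) are dropped explicitly.
    long = long.explode("colleague").dropna(subset=["colleague"])
    # Step 3: count the occurrences of each (reporter, colleague) pair.
    counts = pd.crosstab(long[name_col], long["colleague"])
    # Assumption: only people listed in name_col are kept, in their original order.
    order = list(df[name_col])
    return counts.reindex(index=order, columns=order, fill_value=0)


df = pd.DataFrame({
    "Name": ["Jane", "Fred", "Terry", "Bob", "Alex", "Joe"],
    "1": [["Fred", "Joe"], ["Alex"], np.nan, ["Joe"], ["Fred"], ["Jane"]],
    "2": [["Joe", "Fred", "Bob"], ["Jane"], ["Bob"], ["Jane", "Terry"], np.nan, ["Jane"]],
})

result = adjacency_from_reports(df)
print(result)
```

Note that anyone who is named in a list but has no row of their own would be dropped by the `reindex`; extend `order` if such people should appear in the matrix.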