How can I create an adjacency matrix from a long list of source/target pairs?
Given the following data:
Class Name
====== =============
Math John Smith
-------------------------
Math Jenny Simmons
-------------------------
English Sarah Blume
-------------------------
English John Smith
-------------------------
Chemistry Roger Tisch
-------------------------
Chemistry Jenny Simmons
-------------------------
Physics Sarah Blume
-------------------------
Physics Jenny Simmons
I have a list of classes and the names of the students in each, like this:
[
{class: 'Math', student: 'John Smith'},
{class: 'Math', student: 'Jenny Simmons'},
{class: 'English', student: 'Sarah Blume'},
{class: 'English', student: 'John Smith'},
{class: 'Chemistry', student: 'John Smith'},
{class: 'Chemistry', student: 'Jenny Simmons'},
{class: 'Physics', student: 'Sarah Blume'},
{class: 'Physics', student: 'Jenny Simmons'},
]
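I want to use this as input to create an adjacency matrix that shows the number of students in common between each pair of classes.
How can I do this in python/pandas in the most performant way? My list contains ~19M of these class/student pairs (~240MB).
You can prepare the data for the adjacency matrix like this: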
import pandas as pd

# df is assumed to be a DataFrame with the columns 'class' and 'student'

# create the "class tuples" by joining the dataframe with itself
df_cross = df.merge(df, on='student', suffixes=('_left', '_right'))

# remove the duplicate tuples
# --> this gives you an upper (or lower) triangular
# matrix without the diagonal
# if you would rather have a full matrix,
# just change the >= to == below
del_indexer = (df_cross['class_left'] >= df_cross['class_right'])
df_cross.drop(df_cross[del_indexer].index, inplace=True)

# create the counts / lists
groupby_obj = df_cross.groupby(['class_left', 'class_right'])
result = groupby_obj.count()
result.columns = ['value']

# if you want lists of the student names that each
# course combination has in common, keep the following
# line; otherwise just remove it (with a dataset of
# the size you mentioned, it will probably consume
# a lot of memory)
result['students'] = groupby_obj['student'].agg(list)
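For reference, here is a minimal end-to-end sketch of the same steps that can be run as-is. It assumes the input is the list of class/student records from the question, loaded into a DataFrame; the names `records`, `pairs` and `counts` are only illustrative, and the duplicate pairs are filtered with a boolean mask instead of `drop`.

```python
import pandas as pd

# Sample rows copied from the question's list of class/student pairs.
records = [
    {"class": "Math", "student": "John Smith"},
    {"class": "Math", "student": "Jenny Simmons"},
    {"class": "English", "student": "Sarah Blume"},
    {"class": "English", "student": "John Smith"},
    {"class": "Chemistry", "student": "John Smith"},
    {"class": "Chemistry", "student": "Jenny Simmons"},
    {"class": "Physics", "student": "Sarah Blume"},
    {"class": "Physics", "student": "Jenny Simmons"},
]
df = pd.DataFrame(records)

# Self-join on student: one row per (student, class_left, class_right).
pairs = df.merge(df, on="student", suffixes=("_left", "_right"))

# Keep each unordered class pair once (upper triangle, no diagonal).
pairs = pairs[pairs["class_left"] < pairs["class_right"]]

# Count the shared students per class pair.
counts = pairs.groupby(["class_left", "class_right"])["student"].count()
print(counts)
```

Filtering with a mask avoids the in-place `drop`, which may matter on a large frame, but that is a guess about performance rather than something measured here.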
The full output looks like this:
Out[133]:
                        value                     students
class_left class_right
Chemistry  English          1                 [John Smith]
           Math             2  [John Smith, Jenny Simmons]
           Physics          1              [Jenny Simmons]
English    Math             1                 [John Smith]
           Physics          1                [Sarah Blume]
Math       Physics          1              [Jenny Simmons]
You can then pivot it using @piRSquared's method, or like this:
result['value'].unstack()
Out[137]:
class_right  English  Math  Physics
class_left
Chemistry        1.0   2.0      1.0
English          NaN   1.0      1.0
Math             NaN   NaN      1.0
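If a full symmetric matrix is wanted instead of the triangular one, one option is to fill the gaps with zeros and add the transpose. The sketch below rebuilds the pair counts from the output shown earlier so that it runs on its own; the names `counts`, `upper` and `full` are illustrative, not part of the answer.

```python
import pandas as pd

# Upper-triangular pair counts, copied from the grouped output shown earlier.
counts = pd.Series(
    [1, 2, 1, 1, 1, 1],
    index=pd.MultiIndex.from_tuples(
        [
            ("Chemistry", "English"),
            ("Chemistry", "Math"),
            ("Chemistry", "Physics"),
            ("English", "Math"),
            ("English", "Physics"),
            ("Math", "Physics"),
        ],
        names=["class_left", "class_right"],
    ),
    name="value",
)

# Pivot to a matrix, using 0 instead of NaN for missing pairs.
upper = counts.unstack(fill_value=0)

# Make the matrix square: every class appears as both a row and a column.
classes = sorted(set(upper.index) | set(upper.columns))
upper = upper.reindex(index=classes, columns=classes, fill_value=0)

# Mirror the upper triangle into the lower one; the diagonal stays 0.
full = upper + upper.T
print(full)
```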
Or, if you also need the names:
result.unstack()
Out[138]:
              value                   students
class_right English Math Physics       English                         Math          Physics
class_left
Chemistry       1.0  2.0     1.0  [John Smith]  [John Smith, Jenny Simmons]  [Jenny Simmons]
English         NaN  1.0     1.0           NaN                 [John Smith]    [Sarah Blume]
Math            NaN  NaN     1.0           NaN                          NaN  [Jenny Simmons]
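Since the question mentions ~19M pairs, note that the self-join produces one row per student for every pair of that student's classes, so it can grow far beyond the input size. A different technique, not the one used in this answer, is to build a sparse student-by-class incidence matrix and multiply its transpose by itself. The following is only a sketch under the assumption that SciPy is available and that each class/student pair should count once; all variable names are illustrative.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Sample rows copied from the question's list of class/student pairs.
records = [
    {"class": "Math", "student": "John Smith"},
    {"class": "Math", "student": "Jenny Simmons"},
    {"class": "English", "student": "Sarah Blume"},
    {"class": "English", "student": "John Smith"},
    {"class": "Chemistry", "student": "John Smith"},
    {"class": "Chemistry", "student": "Jenny Simmons"},
    {"class": "Physics", "student": "Sarah Blume"},
    {"class": "Physics", "student": "Jenny Simmons"},
]
# Drop repeated pairs so every incidence entry is exactly 1.
df = pd.DataFrame(records).drop_duplicates()

# Map each student and each class to an integer code.
student_codes, students = pd.factorize(df["student"])
class_codes, classes = pd.factorize(df["class"], sort=True)

# Sparse incidence matrix: rows are students, columns are classes.
incidence = sparse.coo_matrix(
    (np.ones(len(df), dtype=np.int64), (student_codes, class_codes)),
    shape=(len(students), len(classes)),
).tocsr()

# Entry (i, j) counts the students enrolled in both class i and class j.
shared = (incidence.T @ incidence).toarray()
np.fill_diagonal(shared, 0)  # the diagonal would otherwise hold class sizes

adjacency = pd.DataFrame(shared, index=classes, columns=classes)
print(adjacency)
```

The result is a dense class-by-class matrix, which should stay small as long as the number of distinct classes is modest; this approach yields the counts only, not the lists of student names.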