Python: how to count collaborations between pairs in pandas dataframe?

I have a dataframe like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
                   'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]})
df 
Item Name  Weight
A    Tom     2
A    John    2
A    Paul    2
B    Tom     3
B    Frank   3
C    Tom     5
C    John    5
C    Richard 5
C    James   5 

For each person, I want the list of people who shared an item with them, and a score where each shared item contributes 1/Weight:
df1 
Name              People                          Times
Tom     [John, Paul, Frank, Richard, James]       [(1/2+1/5),1/2,1/3,1/5,1/5]
John    [Tom, Paul, Richard, James]               [(1/2+1/5),1/2,1/5,1/5]
Paul    [Tom, John]                               [1/2,1/2]
Frank   [Tom]                                     [1/3]
Richard [Tom, John, James]                        [1/5,1/5,1/5]
James   [Tom, John, Richard]                      [1/5,1/5,1/5]
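To make the target numbers concrete, here is a plain-Python sketch (not part of the original post) that accumulates 1/Weight for every ordered pair of people sharing an item; it assumes each item has a single weight, as in the sample data:

```python
from collections import defaultdict
from itertools import permutations

import pandas as pd

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
                   'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]})

# Accumulate 1/Weight for every ordered pair that shares an item.
times = defaultdict(float)
for _, group in df.groupby('Item'):
    weight = group['Weight'].iloc[0]  # one weight per item, as in the sample
    for a, b in permutations(group['Name'], 2):
        times[(a, b)] += 1.0 / weight
```

Tom and John share items A (weight 2) and C (weight 5), so `times[('Tom', 'John')]` comes out as 1/2 + 1/5 = 0.7, matching the table above.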

To count the collaborations without taking Weight into account, I did:

# merge M:N by column Item
df1 = pd.merge(df, df, on=['Item'])

# remove self-pairs (Name_x == Name_y)
df1 = df1[df1['Name_x'] != df1['Name_y']]

# create lists
df1 = df1.groupby('Name_x')['Name_y'].apply(lambda x: x.tolist()).reset_index()
print(df1)
    Name_x                                     Name_y
0    Frank                                      [Tom]
1    James                       [Tom, John, Richard]
2     John           [Tom, Paul, Tom, Richard, James]
3     Paul                                [Tom, John]
4  Richard                         [Tom, John, James]
5      Tom  [John, Paul, Frank, John, Richard, James]


# get counts with np.unique
df1['People'] = df1['Name_y'].apply(lambda a: np.unique(a, return_counts=True)[0])
df1['times'] = df1['Name_y'].apply(lambda a: np.unique(a, return_counts=True)[1])
# remove column Name_y
df1 = df1.drop('Name_y', axis=1).rename(columns={'Name_x': 'Name'})
print(df1)
      Name                               People            times
0    Frank                                [Tom]              [1]
1    James                 [John, Richard, Tom]        [1, 1, 1]
2     John          [James, Paul, Richard, Tom]     [1, 1, 1, 2]
3     Paul                          [John, Tom]           [1, 1]
4  Richard                   [James, John, Tom]        [1, 1, 1]
5      Tom  [Frank, James, John, Paul, Richard]  [1, 1, 2, 1, 1]
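The same unweighted counts can also be reproduced without numpy using collections.Counter, which makes the counting step explicit (a sketch, not the original code):

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
                   'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]})

# Self-join on Item, then drop the rows pairing a person with themselves.
merged = pd.merge(df, df, on='Item')
merged = merged[merged['Name_x'] != merged['Name_y']]

# Unweighted collaboration counts per person.
counts = {name: Counter(group['Name_y'])
          for name, group in merged.groupby('Name_x')}
```

Here `counts['Tom']['John']` is 2 (items A and C), matching the `times` column above.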

In the last dataframe I have the collaboration counts for all pairs, but I want their weighted counts instead.

Starting from:

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
                   'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]})

df1 = pd.merge(df, df, on=['Item'])
df1 = df1[~(df1['Name_x'] == df1['Name_y'])].set_index(['Name_x', 'Name_y']).drop(['Item', 'Weight_y'], axis=1)

You can use .apply() to create the values, and .unstack() to pivot into wide format:

collab = df1.groupby(level=['Name_x', 'Name_y']).apply(lambda x: np.sum(1.0 / x)).unstack().loc[:, 'Weight_x']

Name_y      Frank  James  John  Paul  Richard       Tom
Name_x                                                 
Frank         NaN    NaN   NaN   NaN      NaN  0.333333
James         NaN    NaN   0.2   NaN      0.2  0.200000
John          NaN    0.2   NaN   0.5      0.2  0.700000
Paul          NaN    NaN   0.5   NaN      NaN  0.500000
Richard       NaN    0.2   0.2   NaN      NaN  0.200000
Tom      0.333333    0.2   0.7   0.5      0.2       NaN

Then iterate over the rows and convert to lists:

df = pd.DataFrame(columns=['People', 'Times'])
for p, data in collab.iterrows():
    s = data.dropna()
    df.loc[p] = [s.index.tolist(), s.values]

                                      People  \
Frank                                  [Tom]   
James                   [John, Richard, Tom]   
John             [James, Paul, Richard, Tom]   
Paul                             [John, Tom]   
Richard                   [James, John, Tom]   
Tom      [Frank, James, John, Paul, Richard]   

                                        Times  
Frank                        [0.333333333333]  
James                         [0.2, 0.2, 0.2]  
John                     [0.2, 0.5, 0.2, 0.7]  
Paul                               [0.5, 0.5]  
Richard                       [0.2, 0.2, 0.2]  
Tom      [0.333333333333, 0.2, 0.7, 0.5, 0.2]
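For reference, the same technique collapses into one short pipeline: sum 1/Weight_x directly in the groupby and build the list columns with a row-wise apply instead of an explicit loop (a sketch under the same assumptions as the answer above):

```python
import pandas as pd

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
                   'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]})

# Self-join on Item and drop self-pairs.
pairs = pd.merge(df, df, on='Item')
pairs = pairs[pairs['Name_x'] != pairs['Name_y']]

# Each shared item contributes 1/Weight to the pair's score.
pairs['score'] = 1.0 / pairs['Weight_x']
collab = pairs.groupby(['Name_x', 'Name_y'])['score'].sum().unstack()

# Collapse each row of the wide matrix into (People, Times) lists.
result = pd.DataFrame({
    'People': collab.apply(lambda row: row.dropna().index.tolist(), axis=1),
    'Times': collab.apply(lambda row: row.dropna().tolist(), axis=1),
})
print(result)
```

This produces the same frame as the loop-based version, with names in the sorted column order that `unstack()` gives.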