Python: how to count collaborations between pairs in a pandas DataFrame?
I have a dataframe like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
                   'Weight': [4, 4, 4, 3, 3, 5, 5, 5, 5]})
df
Item Name Weight
A Tom 4
A John 4
A Paul 4
B Tom 3
B Frank 3
C Tom 5
C John 5
C Richard 5
C James 5
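For each person, I want the list of people who share an item with them, where each shared item contributes 1/Weight to the pair's total: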
df1
Name People Times
Tom [John, Paul, Frank, Richard, James] [(1/4+1/5),1/4,1/3,1/5,1/5]
John [Tom, Paul, Richard, James] [(1/4+1/5),1/4,1/5,1/5]
Paul [Tom, John] [1/4,1/4]
Frank [Tom] [1/3]
Richard [Tom, John, James] [1/5,1/5,1/5]
James [Tom, John, Richard] [1/5,1/5,1/5]
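To make the weighting rule concrete, here is a minimal sketch that recomputes two of the Times entries by hand. It assumes the weights shown in the printed frame (4, 3, 5); the `items` dictionary and the `pair_score` helper are illustrative names, not part of the original code.

```python
from fractions import Fraction

# Item -> (weight, people), copied from the printed frame above
items = {
    "A": (4, ["Tom", "John", "Paul"]),
    "B": (3, ["Tom", "Frank"]),
    "C": (5, ["Tom", "John", "Richard", "James"]),
}

def pair_score(a, b):
    """Sum 1/weight over every item that both a and b appear in."""
    return sum(
        (Fraction(1, w) for w, people in items.values() if a in people and b in people),
        Fraction(0),
    )

print(pair_score("Tom", "John"))   # 1/4 + 1/5 = 9/20
print(pair_score("Tom", "Frank"))  # 1/3
```

Exact fractions are used only so the sums are easy to compare with the table; the pandas code works with floats.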
To count the collaborations without taking the weight into account, I did:
# many-to-many self-merge on the Item column
df1 = pd.merge(df, df, on=['Item'])
# drop self-pairs, i.e. rows where Name_x == Name_y
df1 = df1[~(df1['Name_x'] == df1['Name_y'])]
# collect each person's collaborators into a list
df1 = df1.groupby('Name_x')['Name_y'].apply(lambda x: x.tolist()).reset_index()
print(df1)
Name_x Name_y
0 Frank [Tom]
1 James [Tom, John, Richard]
2 John [Tom, Paul, Tom, Richard, James]
3 Paul [Tom, John]
4 Richard [Tom, John, James]
5 Tom [John, Paul, Frank, John, Richard, James]
# unique collaborators and how often each one appears, via np.unique
df1['People'] = df1['Name_y'].apply(lambda a: np.unique(a, return_counts=True)[0])
df1['times'] = df1['Name_y'].apply(lambda a: np.unique(a, return_counts=True)[1])
# drop the helper column Name_y
df1 = df1.drop('Name_y', axis=1).rename(columns={'Name_x': 'Name'})
print(df1)
Name People times
0 Frank [Tom] [1]
1 James [John, Richard, Tom] [1, 1, 1]
2 John [James, Paul, Richard, Tom] [1, 1, 1, 2]
3 Paul [John, Tom] [1, 1]
4 Richard [James, John, Tom] [1, 1, 1]
5 Tom [Frank, James, John, Paul, Richard] [1, 1, 2, 1, 1]
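The same unweighted counts can probably be obtained without building intermediate lists. The following is a sketch, not the approach used in the post: it groups the merged frame by the ordered pair and takes the group size. The names `pairs` and `counts` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
})

# Self-merge on Item, then drop rows that pair a person with themselves
pairs = pd.merge(df, df, on='Item')
pairs = pairs[pairs['Name_x'] != pairs['Name_y']]

# One row per ordered pair, holding the number of items the pair shares
counts = pairs.groupby(['Name_x', 'Name_y']).size()

print(counts.loc['Tom'])
```

The result is a Series with a two-level index, so `counts.loc['Tom']` lists Tom's collaborators together with their counts.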
The last dataframe holds the plain collaboration count for every pair, but I would like the weighted count instead.
Starting with:
df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]})
df1 = pd.merge(df, df, on=['Item'])
df1 = df1[~(df1['Name_x'] == df1['Name_y'])].set_index(['Name_x', 'Name_y']).drop(['Item', 'Weight_y'], axis=1)
You can use .apply() to create the values and .unstack() to create a wide format:
collab = df1.groupby(level=['Name_x', 'Name_y']).apply(lambda x: np.sum(1/x)).unstack().loc[:, 'Weight_x']
Name_y Frank James John Paul Richard Tom
Name_x
Frank NaN NaN NaN NaN NaN 0.333333
James NaN NaN 0.2 NaN 0.2 0.200000
John NaN 0.2 NaN 0.5 0.2 0.700000
Paul NaN NaN 0.5 NaN NaN 0.500000
Richard NaN 0.2 0.2 NaN NaN 0.200000
Tom 0.333333 0.2 0.7 0.5 0.2 NaN
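An alternative that avoids .apply() with a lambda is to add the per-row contribution as its own column first and then sum it per pair. This is a sketch under the same data as the answer (weights 2, 3, 5); the column name `score` and the variable `pairs` are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
    'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5],
})

# Self-merge on Item and drop self-pairs
pairs = pd.merge(df, df, on='Item')
pairs = pairs[pairs['Name_x'] != pairs['Name_y']]

# Each shared item contributes 1 / Weight to the pair's score
pairs = pairs.assign(score=1 / pairs['Weight_x'])

# Sum the contributions per ordered pair and pivot Name_y into columns
collab = pairs.groupby(['Name_x', 'Name_y'])['score'].sum().unstack()
print(collab)
```

Pairs that never share an item stay NaN in the wide matrix, as in the output above.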
Then iterate over the rows and convert to lists:
df = pd.DataFrame(columns=['People', 'Times'])
for p, data in collab.iterrows():
    s = data.dropna()  # keep only actual collaborators
    df.loc[p] = [s.index.tolist(), s.values]
People \
Frank [Tom]
James [John, Richard, Tom]
John [James, Paul, Richard, Tom]
Paul [John, Tom]
Richard [James, John, Tom]
Tom [Frank, James, John, Paul, Richard]
Times
Frank [0.333333333333]
James [0.2, 0.2, 0.2]
John [0.2, 0.5, 0.2, 0.7]
Paul [0.5, 0.5]
Richard [0.2, 0.2, 0.2]
Tom [0.333333333333, 0.2, 0.7, 0.5, 0.2]
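If the wide matrix is not needed for anything else, the row loop can likely be skipped by aggregating the long-format pair scores straight into lists. This is a sketch assuming pandas 0.25 or later (for named aggregation); `pairs`, `scores`, `score` and `result` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
    'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5],
})

# Self-merge on Item, drop self-pairs, and score each row as 1 / Weight
pairs = pd.merge(df, df, on='Item')
pairs = pairs[pairs['Name_x'] != pairs['Name_y']]
pairs = pairs.assign(score=1 / pairs['Weight_x'])

# Long format: one row per ordered pair with the summed score
scores = pairs.groupby(['Name_x', 'Name_y'])['score'].sum().reset_index()

# Collapse each person's rows into two parallel lists
result = scores.groupby('Name_x').agg(
    People=('Name_y', list),
    Times=('score', list),
)
print(result)
```

Because the long format only contains pairs that actually occur, no NaN filtering is required before building the lists.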