Pandas distance matrix performance with vector data
Even though I have found a few threads dealing with distance-matrix efficiency, they all work with an int or float matrix. In my case I have to deal with vectors (OrderedDicts of frequencies), and I only ended up with a very slow method that does not work for a large DataFrame (300,000 x 300,000).
How can the process be optimized?
Any help is greatly appreciated, this problem has been bugging me :)
Consider a DataFrame df such as:
>>> df
vectors
id
1 {dict1}
2 {dict2}
3 {dict3}
4 {dict4}
where each {dict#} is an
OrderedDict{event1: 1,
            event2: 5,
            event3: 0,
            ...}
and a function that returns the distance between two vectors:
def vectorDistance(a, b, df_vector):
    # Calculate distance between a & b
    # based on the vector from df_vector.
    return distance
[in]: vectorDistance({dict1}, {dict2})
[out]: distance
The desired output would be:
1 2 3 4
id
1 0 1<->2 1<->3 1<->4
2 1<->2 0 ... ...
3 1<->3 ... 0 ...
4 1<->4 ... ... 0
(where 1<->2 is the float distance between vectors 1 and 2)
The method used so far:
import pandas as pd

matrix = pd.concat([df, df.T], axis=1)
for index in matrix.index:
    for col in matrix.columns:
        # .loc replaces the long-deprecated .ix indexer
        matrix.loc[col, index] = vectorDistance(col, index, df)
>>> matrix
5072142538 5072134420 4716823618 ...
udid
5072142538 0.00000 0.01501 0.06002 ...
5072134420 0.01501 0.00000 0.09037 ...
4716823618 0.06002 0.09037 0.00000 ...
... ... ... ...
Edit:
Minimal example
Note: the events can differ from one {dict} to another, but that is fine once they are passed into the function. My question is really about how to quickly fill each cell with the correct a and b.
I am using the cosine distance, since it suits vectors like mine well.
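For reference, the distance computed throughout is the standard cosine distance between two frequency vectors A and B:
$$d(A, B) \;=\; 1 \;-\; \frac{\sum_k A_k B_k}{\sqrt{\sum_k A_k^2}\,\sqrt{\sum_k B_k^2}}$$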
from collections import Counter
import numpy as np
import pandas as pd
from math import sqrt
raw_data = {'counters_': {4716823618: Counter({51811: 1, 51820: 1, 51833: 56, 51835: 8, 51843: 48, 51848: 2, 51852: 8, 51853: 5, 51854: 4, 51856: 24, 51903: 11, 51904: 12, 51905: 3, 51906: 19, 51908: 230, 51922: 24, 51927: 19, 51931: 2, 106282: 9, 112830: 1, 119453: 1, 165062: 80, 168904: 3, 180354: 19, 180437: 33, 185824: 117, 186171: 14, 187101: 1, 190827: 7, 201629: 1, 209318: 37}), 5072134420: Counter({51811: 1, 51812: 1, 51820: 1, 51833: 56, 51835: 9, 51843: 49, 51848: 2, 51852: 11, 51853: 4, 51854: 4, 51856: 28, 51885: 1, 51903: 17, 51904: 17, 51905: 9, 51906: 14, 51908: 225, 51927: 29, 51931: 2, 106282: 19, 112830: 2, 168904: 9, 180354: 14, 185824: 219, 186171: 7, 187101: 1, 190827: 6, 201629: 2, 209318: 41}), 5072142538: Counter({51811: 4, 51812: 4, 51820: 4, 51833: 56, 51835: 8, 51843: 48, 51848: 2, 51852: 6, 51853: 3, 51854: 3, 51856: 18, 51885: 1, 51903: 17, 51904: 16, 51905: 3, 51906: 24, 51908: 258, 51927: 20, 51931: 8, 106282: 16, 112830: 2, 168904: 3, 180354: 24, 185824: 180, 186171: 10, 187101: 1, 190827: 7, 201629: 2, 209318: 52})}}
def vectorDistance(index, col):
    # Look up the two counters by label and return their cosine distance.
    a = dict(df[df.index == index]["counters_"].values[0])
    b = dict(df[df.index == col]["counters_"].values[0])
    return abs(np.round(1 - similarity(a, b), 5))
def scalar(collection):
    # Euclidean norm of a frequency vector stored as a dict.
    total = 0
    for event, count in collection.items():
        total += count * count
    return sqrt(total)

def similarity(A, B):
    # Cosine similarity between two frequency dicts.
    total = 0
    for kind in A:
        if kind in B:
            total += A[kind] * B[kind]
    return float(total) / (scalar(A) * scalar(B))
df = pd.DataFrame(raw_data)
matrix = pd.concat([df, df.T], axis=1)
matrix.drop("counters_", axis=0, inplace=True)
matrix.drop("counters_", axis=1, inplace=True)
for index in matrix.index:
    for col in matrix.columns:
        matrix.loc[col, index] = vectorDistance(col, index)
matrix
You don't want to store dictionaries inside your DataFrame. Read your data in with the from_dict
method instead:
df = pd.DataFrame.from_dict(raw_data['counters_'], orient='index')
You can then apply vectorized numpy/scipy methods to compute the cosine similarity, as in What's the fastest way in Python to calculate cosine similarity given sparse matrix data?
That will certainly be more efficient and easier to read than nested for
loops.
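A minimal sketch of that vectorized route, assuming scipy is available (the fillna(0), the sparse representation, and the final .toarray() are illustrative choices, not part of the linked answer):
import numpy as np
import pandas as pd
from scipy import sparse

# Events as columns, one row per id; missing events become 0 instead of NaN.
df = pd.DataFrame.from_dict(raw_data['counters_'], orient='index').fillna(0)

# Row-normalize a sparse copy so every row has unit Euclidean norm.
X = sparse.csr_matrix(df.values)
norms = np.sqrt(X.multiply(X).sum(axis=1).A1)
X = sparse.diags(1.0 / norms) @ X

# All pairwise cosine similarities in one product; distance = 1 - similarity.
# For 300,000 ids the dense result will not fit in memory: keep it sparse
# or process it in chunks instead of calling .toarray().
dist = 1 - (X @ X.T).toarray()
m = pd.DataFrame(dist, index=df.index, columns=df.index)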
df = pd.DataFrame([v for v in raw_data['counters_'].values()],
                  index=raw_data['counters_'].keys()).T
>>> df.head()
4716823618 5072134420 5072142538
51811 1 1 4
51812 NaN 1 4
51820 1 1 4
51833 56 56 56
51835 8 9 8
# raw_data no longer needed. Delete to reduce memory footprint.
del raw_data
# Create scalars (the Euclidean norm of each column).
scalars = ((df ** 2).sum()) ** .5
>>> scalars
4716823618 289.679133
5072134420 330.548030
5072142538 331.957829
dtype: float64
def v_dist(col_1, col_2):
    # Cosine distance between two columns of df.
    return 1 - ((df.iloc[:, col_1] * df.iloc[:, col_2]).sum() /
                (scalars.iloc[col_1] * scalars.iloc[col_2]))
>>> v_dist(0, 1)
0.09036665882900885
>>> v_dist(0, 2)
0.060016436804916085
>>> v_dist(1, 2)
0.015009898476505357
m = pd.DataFrame(np.nan, index=df.columns, columns=df.columns)
>>> m
4716823618 5072134420 5072142538
4716823618 NaN NaN NaN
5072134420 NaN NaN NaN
5072142538 NaN NaN NaN
for row in range(m.shape[0]):
    for col in range(row, m.shape[1]):  # Note: m.shape[0] equals m.shape[1]
        if row == col:
            # No need to calculate a value for the diagonal.
            m.iat[row, col] = 0
        else:
            # Do two calculations in one thanks to symmetry.
            m.iat[row, col] = m.iat[col, row] = v_dist(row, col)
>>> m
4716823618 5072134420 5072142538
4716823618 0.000000 0.090367 0.060016
5072134420 0.090367 0.000000 0.015010
5072142538 0.060016 0.015010 0.000000
Wrapping it all up into a function:
def calc_matrix(raw_data):
    df = pd.DataFrame([v for v in raw_data['counters_'].values()],
                      index=raw_data['counters_'].keys()).T
    scalars = ((df ** 2).sum()) ** .5
    m = pd.DataFrame(np.nan, index=df.columns, columns=df.columns)
    for row in range(m.shape[0]):
        for col in range(row, m.shape[1]):
            if row == col:
                m.iat[row, col] = 0
            else:
                m.iat[row, col] = m.iat[col, row] = (
                    1 - (df.iloc[:, row] * df.iloc[:, col]).sum() /
                    (scalars.iloc[row] * scalars.iloc[col]))
    return m
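Called on the example raw_data, the function reproduces the matrix built step by step above:
>>> calc_matrix(raw_data)
            4716823618  5072134420  5072142538
4716823618    0.000000    0.090367    0.060016
5072134420    0.090367    0.000000    0.015010
5072142538    0.060016    0.015010    0.000000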