Pairwise similarity / similarity matrix calculation optimization
Problem definition
Question
How can I optimize the computation of pairwise cosine similarity for a large number of vectors (an approximation is acceptable)?
Formal definition
For two sets of vectors (A, B), I need to generate the pairwise cosine similarity sim(a_i, b_j) for every a and b. (A cosine similarity matrix also works, since it is easy to convert from matrix form to pairwise form.)
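For concreteness (my own illustrative sketch, not part of the original question), sim(a_i, b_j) here is the standard cosine similarity, which for a single pair of vectors could be written as:

import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Hypothetical helper, for illustration only.
    # Cosine similarity: dot product divided by the product of the vector norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))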
Why I am asking for help
This looks like a common problem, because such distances have to be computed in computational biology, recommender systems, and so on, yet I have not found a reasonable solution.
The problem I cannot solve
By definition the complexity of this problem is O(len_A * len_B * O(similarity_function)), so 10^6 vectors in both A and B lead to an enormous runtime.
My hypothesis about a possible direction
It looks like we do a lot of useless work here, because the similarities are not independent (if we have computed the similarity of a_i to a million vectors, and b_j is very similar to a_i, and we have already computed the similarity of b_j to 900k of those vectors, we could estimate the similarity of b_j to the remaining 100k vectors). I assume something like indexing could be used here.
Other details
- A and B do not intersect.
- The vector dimensionality has already been reduced to the smallest reasonable value.
- There is no need for simple for-loop optimization. In short, there are short guides on how to optimize it; the simplest loop just gives a clear illustration of the algorithm.
- I am also interested in algorithms that only give an estimate, so it is fine if the similarities are close enough to the true values without being exactly equal.
- Parallelization is not required.
- I understand that the resulting similarity matrix will be large.
- I am also interested in an algorithm that returns, for each vector in set A, only the most similar vectors from set B.
Thank you for your attention.
Code example
Requirements
python==3.6
pandas==0.25.0
scikit-learn==0.21.3
numpy==1.17.1
Generating dummy data
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

df_1 = pd.DataFrame({'object_id_1': range(10),
                     'feature_0': np.random.uniform(0, 1, 10),
                     'feature_1': np.random.uniform(0, 1, 10),
                     'feature_2': np.random.uniform(0, 1, 10),
                     'feature_3': np.random.uniform(0, 1, 10)})

df_2 = pd.DataFrame({'object_id_2': range(10, 20),
                     'feature_0': np.random.uniform(0, 1, 10),
                     'feature_1': np.random.uniform(0, 1, 10),
                     'feature_2': np.random.uniform(0, 1, 10),
                     'feature_3': np.random.uniform(0, 1, 10)})
Similarity generation function
def get_similarities(df_1: pd.DataFrame, df_2: pd.DataFrame, meaningful_features: list) -> pd.DataFrame:
    '''
    This function generates features based similarity scores, between two groups of objects

    Parameters
    ----------
    df_1: pandas.DataFrame
        DataFrame with features, and id_s of objects
    df_2: pandas.DataFrame
        DataFrame with features, and id_s of objects which has no id_s same to df_1
    meaningful_features: list
        Features columns to calculate similarity on

    Returns
    ----------
    similarities_of_objects: pandas.DataFrame
        DataFrame, with columns 'object_id_1', 'object_id_2', 'similarity',
        where we have features similarity, for each object_1-object_2 pair.
        Similarity - symmetric.
    '''
    objects_1 = []      # list of all objects from df_1
    objects_2 = []      # list of all objects from df_2
    similarities = []   # list of scores for object_1-object_2 pairs

    for object_1 in df_1['object_id_1'].unique():
        features_vector_1 = df_1[df_1['object_id_1'] == object_1][meaningful_features]  # object_1 features vector
        for object_2 in df_2['object_id_2'].unique():
            features_vector_2 = df_2[df_2['object_id_2'] == object_2][meaningful_features]  # object_2 features vector
            objects_1.append(object_1)
            objects_2.append(object_2)
            similarities.append(cosine_similarity(X=np.array(features_vector_1),
                                                  Y=np.array(features_vector_2)).item())  # similarity of the two vectors

    sim_o1_to_o2 = pd.DataFrame()
    sim_o1_to_o2['objects_1'] = objects_1
    sim_o1_to_o2['objects_2'] = objects_2
    sim_o1_to_o2['similarity'] = similarities
    return sim_o1_to_o2
Generating similarities
get_similarities(df_1,df_2, ['feature_0', 'feature_1', 'feature_2'])
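As a point of reference (my own sketch, not part of the original question), the same full matrix can be produced in one vectorized call, because sklearn's cosine_similarity accepts whole matrices; the helper name get_similarities_vectorized is hypothetical. Note this is still the same O(len_A * len_B) amount of work, only without the Python-level loop.

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def get_similarities_vectorized(df_1: pd.DataFrame, df_2: pd.DataFrame, meaningful_features: list) -> pd.DataFrame:
    # Hypothetical helper, for illustration only:
    # compute the full len(df_1) x len(df_2) cosine similarity matrix in a single call.
    sim_matrix = cosine_similarity(df_1[meaningful_features].values,
                                   df_2[meaningful_features].values)
    # Unpivot into the same long 'object_id_1 / object_id_2 / similarity' format as above.
    sim_df = pd.DataFrame(sim_matrix,
                          index=pd.Index(df_1['object_id_1'], name='object_id_1'),
                          columns=pd.Index(df_2['object_id_2'], name='object_id_2'))
    return sim_df.stack().rename('similarity').reset_index()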
Using Faiss
import faiss
import numpy as np

n = 1000                                                       # number of vectors (illustrative value)
dimension = 100                                                # vector dimensionality
value1 = np.random.random((n, dimension)).astype('float32')    # vectors to index
value2 = np.random.random((n, dimension)).astype('float32')    # query vectors

index = faiss.IndexFlatL2(dimension)   # build an exact L2 index
index.add(value1)                      # add vectors to the index

xq = value2
k = len(value1)                        # ask for all neighbours -> full similarity matrix
D, I = index.search(xq, k)             # D: distances, I: indices into value1
Note that here D contains the distances and I the indices of the values.
Also, value1 and value2 are just NumPy arrays.
PS: Install faiss first
pip install faiss
How to get cosine similarity from Euclidean distance
There are also alternative methods for calculating the Euclidean distance, especially in the case where you only need the top similar vectors and not the entire similarity matrix.
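A minimal sketch of that conversion (my own addition, assuming the vectors are L2-normalized): IndexFlatL2 returns squared L2 distances, and for unit vectors ||a - b||^2 = 2 - 2*cos(a, b), so cosine similarity can be recovered as 1 - D/2. Alternatively, an inner-product index on normalized vectors yields cosine similarity directly.

import faiss
import numpy as np

a = np.random.random((1000, 100)).astype('float32')   # database vectors
b = np.random.random((200, 100)).astype('float32')    # query vectors
faiss.normalize_L2(a)                                  # in-place L2 normalization
faiss.normalize_L2(b)

# Option 1: L2 index, then convert squared distances to cosine similarity.
index_l2 = faiss.IndexFlatL2(a.shape[1])
index_l2.add(a)
D, I = index_l2.search(b, 5)           # D holds squared L2 distances
cos_sim = 1 - D / 2                    # valid only because a and b are unit vectors

# Option 2: inner-product index, which gives cosine similarity directly for unit vectors.
index_ip = faiss.IndexFlatIP(a.shape[1])
index_ip.add(a)
S, J = index_ip.search(b, 5)           # S already equals cosine similarity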
Solution using the approach suggested by @Abhik Sarkar
This is a solution to the exact problem I posted, using the approach suggested by @Abhik Sarkar. To get cosine similarity, make sure your vectors are normalized beforehand.
The solution also lets you generate only as many top similarities as you need, rather than the full matrix.
Disclaimer: the solution focuses on readability, not performance.
Requirements
python==3.6
pandas==0.25.0
numpy==1.17.1
faiss==1.5.3
Generating dummy data
import pandas as pd
import numpy as np
import faiss

df_1 = pd.DataFrame({'object_id_1': range(10),
                     'feature_0': np.random.uniform(0, 1, 10),
                     'feature_1': np.random.uniform(0, 1, 10),
                     'feature_2': np.random.uniform(0, 1, 10),
                     'feature_3': np.random.uniform(0, 1, 10)})

df_2 = pd.DataFrame({'object_id_2': range(10, 20),
                     'feature_0': np.random.uniform(0, 1, 10),
                     'feature_1': np.random.uniform(0, 1, 10),
                     'feature_2': np.random.uniform(0, 1, 10),
                     'feature_3': np.random.uniform(0, 1, 10)})
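One detail worth making explicit (my own addition, not in the original answer): the 1 - D/2 conversion inside the function below only equals cosine similarity for unit-length vectors, so the feature columns used for the search should be L2-normalized first, for example:

import numpy as np

feature_cols = ['feature_0', 'feature_1', 'feature_2']   # the columns used in the call below

# L2-normalize each object's feature vector so that 1 - D/2 really is cosine similarity.
for df in (df_1, df_2):
    norms = np.linalg.norm(df[feature_cols].values, axis=1, keepdims=True)
    df[feature_cols] = df[feature_cols].values / norms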
Similarity generation function
def get_similarities(df_1: pd.DataFrame,
                     df_2: pd.DataFrame,
                     meaningful_features: list,
                     n_neighbors: int = None) -> pd.DataFrame:
    '''
    This function generates features based similarity scores, between two groups of objects

    Parameters
    ----------
    df_1: pandas.DataFrame
        DataFrame with features, and id_s of objects
    df_2: pandas.DataFrame
        DataFrame with features, and id_s of objects which has no id_s same to df_1
    meaningful_features: list
        Features columns to calculate similarity on
    n_neighbors: int
        Number of most similar objects_2 for every object_1. By default the full similarity matrix is generated.
        (default = df_2.shape[0])

    Returns
    ----------
    similarities_of_objects: pandas.DataFrame
        DataFrame, with columns 'object_id_1', 'object_id_2', 'similarity',
        where we have features similarity, for each object_1-object_2 pair.
        Similarity - symmetric.
    '''
    if n_neighbors is None:
        n_neighbors = df_2.shape[0]                      # full matrix by default

    d = len(meaningful_features)                         # dimensionality
    res = np.empty(shape=[1, 3])                         # res initialization

    xb = np.float32(df_2[meaningful_features].values)    # database: objects from df_2
    xb = np.ascontiguousarray(xb)
    xq = np.float32(df_1[meaningful_features].values)    # queries: objects from df_1
    xq = np.ascontiguousarray(xq)

    index = faiss.IndexFlatL2(d)                         # build the index
    index.add(xb)                                        # add df_2 vectors to the index
    D, I = index.search(xq, n_neighbors)                 # for every object_1, its n_neighbors nearest objects_2

    for i in range(I.shape[0]):
        object_id_1_v = [df_1["object_id_1"].iloc[i]] * n_neighbors   # the query object, repeated
        object_id_2_v = df_2["object_id_2"].iloc[I[i]]                 # its neighbours in df_2
        similarities = 1 - D[i] / 2                      # squared L2 -> cosine, valid for normalized vectors
        neighbors_scores_for_target = np.stack((object_id_1_v, object_id_2_v, similarities), axis=-1)
        res = np.concatenate((res, neighbors_scores_for_target))

    res = res[1:]                                        # remove the line created during res initialization
    resulting_df = pd.DataFrame({'object_id_1': res[:, 0],
                                 'object_id_2': res[:, 1],
                                 'similarity': res[:, 2]})
    return resulting_df
Generating similarities
get_similarities(df_1,df_2, ['feature_0', 'feature_1', 'feature_2'])
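A short usage note (my own addition): because the index returns neighbours sorted by distance, passing a small n_neighbors yields only the top matches from df_2 for every object in df_1, which covers the "only the most similar vectors" part of the question:

# Only the 3 most similar objects from df_2 for every object in df_1
top_3 = get_similarities(df_1, df_2, ['feature_0', 'feature_1', 'feature_2'], n_neighbors=3)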