Scipy cosine similarity vs sklearn cosine similarity
I noticed that both scipy and sklearn have cosine similarity/cosine distance functions. I wanted to test how fast each one is on pairs of vectors:
setup1 = "import numpy as np; arrs1 = [np.random.rand(400) for _ in range(60)];arrs2 = [np.random.rand(400) for _ in range(60)]"
setup2 = "import numpy as np; arrs1 = [np.random.rand(400) for _ in range(60)];arrs2 = [np.random.rand(400) for _ in range(60)]"
import1 = "from sklearn.metrics.pairwise import cosine_similarity"
stmt1 = "[float(cosine_similarity(arr1.reshape(1,-1), arr2.reshape(1,-1))) for arr1, arr2 in zip(arrs1, arrs2)]"
import2 = "from scipy.spatial.distance import cosine"
stmt2 = "[float(1 - cosine(arr1, arr2)) for arr1, arr2 in zip(arrs1, arrs2)]"
import timeit
print("sklearn: ", timeit.timeit(stmt1, setup=import1 + ";" + setup1, number=1000))
print("scipy: ", timeit.timeit(stmt2, setup=import2 + ";" + setup2, number=1000))
sklearn: 11.072769448000145
scipy: 1.9755544730005568
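sklearn runs roughly 5.6 times slower than scipy in this run (about 11.07 s vs 1.98 s), even if you remove the array reshaping from the sklearn example and generate data that is already in the right shape. Why is one so much slower than the other?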
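As mentioned in the comments, I don't think the comparison is fair, mainly because sklearn.metrics.pairwise.cosine_similarity is designed to compute the pairwise distance/similarity between the samples in the given 2D input arrays. scipy.spatial.distance.cosine, on the other hand, is designed to compute the cosine distance between two 1D arrays.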
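To make the difference in call shape concrete, here is a minimal sketch (the array sizes mirror the question; the variable names are only illustrative). It calls scipy once per pair of 1D vectors and sklearn once on the stacked 2D arrays, then checks that the matched pairs agree:

```python
import numpy as np
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
a = rng.random((60, 400))  # 60 vectors stacked as rows of one 2D array
b = rng.random((60, 400))

# scipy: one call per pair of 1D vectors, each returning a scalar distance.
per_pair = np.array([1.0 - cosine(u, v) for u, v in zip(a, b)])

# sklearn: one call for all rows, returning the full 60 x 60 similarity matrix.
full = cosine_similarity(a, b)

# The matched pairs (row i of a with row i of b) sit on the diagonal.
matched = np.diag(full)
```

Note that the single sklearn call computes all 3600 row combinations, not only the 60 matched pairs, so it does more work than the question strictly needs.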
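Perhaps a fairer comparison is scipy.spatial.distance.cdist against sklearn.metrics.pairwise.cosine_similarity, since both compute the pairwise distances between the samples in the given arrays. To my surprise, however, this shows the sklearn implementation to be much faster than the scipy one (I don't have an explanation for that at the moment!). Here is the experiment: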
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import cdist
x = np.random.rand(1000,1000)
y = np.random.rand(1000,1000)
def sklearn_cosine():
    return cosine_similarity(x, y)
def scipy_cosine():
    return 1. - cdist(x, y, 'cosine')
# Make sure their result is the same.
assert np.allclose(sklearn_cosine(), scipy_cosine())
And here are the timing results:
%timeit sklearn_cosine()
10 loops, best of 3: 74 ms per loop
%timeit scipy_cosine()
1 loop, best of 3: 752 ms per loop
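The gap is left unexplained above. One plausible contributor, offered as an assumption rather than something measured here, is that the whole similarity matrix can be obtained by normalizing the rows and doing a single matrix product, which NumPy hands to optimized BLAS routines. As far as I know, sklearn's cosine_similarity follows this normalize-then-dot approach; whether cdist works pair by pair in your SciPy version is worth checking yourself. The sketch below uses only NumPy, and both function names are hypothetical:

```python
import numpy as np

def cosine_similarity_matmul(x, y):
    """Pairwise cosine similarity via row normalization and one matrix product."""
    # Assumes no row is all zeros; a zero row would cause a division by zero.
    x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    y_unit = y / np.linalg.norm(y, axis=1, keepdims=True)
    return x_unit @ y_unit.T

def cosine_similarity_loop(x, y):
    """The same quantity computed one pair at a time, for comparison."""
    out = np.empty((x.shape[0], y.shape[0]))
    for i, u in enumerate(x):
        for j, v in enumerate(y):
            out[i, j] = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return out

rng = np.random.default_rng(0)
x_small = rng.random((50, 20))
y_small = rng.random((40, 20))
sim_fast = cosine_similarity_matmul(x_small, y_small)
sim_slow = cosine_similarity_loop(x_small, y_small)
```

Timing the two functions on larger inputs with %timeit should show how much a per-pair loop costs compared with one matrix product, although this pure-Python loop is not a stand-in for what cdist actually does internally.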