Efficient (not DataFrame.apply) way of getting cosine distance for mapped values

Here is some data I generated:

import numpy as np
import pandas as pd
import scipy
import scipy.spatial

df = pd.DataFrame(
    {
        "item_1": np.random.randint(low=0, high=10, size=1000),
        "item_2": np.random.randint(low=0, high=10, size=1000),
    }
)
embeddings = {item_id: np.random.randn(100) for item_id in range(0, 10)}


def get_distance(item_1, item_2):
    arr1 = embeddings[item_1]
    arr2 = embeddings[item_2]
    return scipy.spatial.distance.cosine(arr1, arr2)

I want to apply get_distance to every row. I can do this:

df.apply(lambda row: get_distance(row["item_1"], row["item_2"]), axis=1)

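But this will be very slow for large datasets.

Is there a way to compute the cosine distance between the embeddings that correspond to each row without using DataFrame.apply?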

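You can wrap the call to cosine with numpy.vectorize. The speedup is modest (34 ms vs. 53 ms), because numpy.vectorize is essentially a Python-level loop rather than a truly vectorized operation: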

vec_cosine = np.vectorize(scipy.spatial.distance.cosine)
vec_cosine(df['item_1'].map(embeddings),
           df['item_2'].map(embeddings))

Output:

array([0.90680875, 0.90999454, 0.99212814, 1.12455852, 1.06354469,
       0.95542037, 1.07133003, 1.07133003, 0.        , 1.00837058,
       0.        , 0.93961103, 0.8943738 , 1.04872436, 1.21171375,
       1.04621226, 0.90392229, 1.0365102 , 0.        , 0.90180297,
       0.90180297, 1.04516879, 0.94877277, 0.90180297, 0.93713404,
...
       1.17548653, 1.11700641, 0.97926805, 0.8943738 , 0.93961103,
       1.21171375, 0.91817959, 0.91817959, 1.04674315, 0.88210679,
       1.11806218, 1.07816675, 1.00837058, 1.12455852, 1.04516879,
       0.93713404, 0.93713404, 0.95542037, 0.93876964, 0.91817959])

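Using vectorized numpy operations directly is much faster:

item_1_embedded = np.array([embeddings[x] for x in df.item_1])
item_2_embedded = np.array([embeddings[x] for x in df.item_2])
# Row-wise dot product divided by the product of the row norms
norms = np.linalg.norm(item_1_embedded, axis=1) * np.linalg.norm(item_2_embedded, axis=1)
cos_dist = 1 - np.sum(item_1_embedded * item_2_embedded, axis=1) / norms

(This version averages 771 µs on my machine, compared with 37.4 ms for DataFrame.apply, which makes the pure numpy version about 50 times faster.)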
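Because the question has only 10 distinct items, another option is to precompute every item-to-item distance once and then look each row up in that table. This is only a sketch: it regenerates stand-in data with hypothetical names (`emb_matrix`, `pairwise`), and it assumes the item ids are exactly the integers 0..n_items-1 so they can be used directly as row indices.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the question's data: 10 items, 100-dim embeddings.
n_items, dim, n_rows = 10, 100, 1000
embeddings = {item_id: rng.standard_normal(dim) for item_id in range(n_items)}
item_1 = rng.integers(0, n_items, size=n_rows)
item_2 = rng.integers(0, n_items, size=n_rows)

# Stack the dictionary into a matrix whose row i is the embedding of item i.
# Assumption: the ids are exactly 0..n_items-1.
emb_matrix = np.stack([embeddings[i] for i in range(n_items)])

# All pairwise cosine distances between items, shape (n_items, n_items).
pairwise = cdist(emb_matrix, emb_matrix, metric="cosine")

# Each row's distance is now a single table lookup via integer indexing.
cos_dist = pairwise[item_1, item_2]
```

With the question's frame, `item_1` and `item_2` would be `df["item_1"].to_numpy()` and `df["item_2"].to_numpy()`. The table grows quadratically with the number of distinct items, so this is only likely to pay off when there are few items relative to rows.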

For reference, the scipy version with DataFrame.apply:

%%timeit
df.apply(lambda row: get_distance(row["item_1"], row["item_2"]), axis=1)
# 38.3 ms ± 84 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

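For what it's worth, I'm adding a more involved numba version.

Since numpy broadcasting allocates temporary arrays, I use explicit for loops to save memory.

It is also worth thinking about how the arguments are passed; perhaps you can pass arrays instead of a dictionary.

Also, there is a one-time compilation cost (at the first call, or at definition time when an explicit signature is given).

You can also parallelize it with numba.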

import numba as nb
import numpy as np


# Compiled eagerly for two C-contiguous 2-D float64 arrays
# (::1 marks the last axis as contiguous).
# Note: nb.prange behaves like range unless parallel=True is also passed to njit.
@nb.njit((nb.float64[:, ::1], nb.float64[:, ::1]))
def cos(a, b):
    n_dim = a.shape[1]
    norm_a = np.empty((a.shape[0],), dtype=np.float64)
    norm_b = np.empty((b.shape[0],), dtype=np.float64)
    cos_ab = np.empty((a.shape[0],), dtype=np.float64)

    # Euclidean norm of each row of a
    for i in nb.prange(a.shape[0]):
        sq_norm = 0.0
        for j in range(n_dim):
            sq_norm += a[i, j] ** 2
        norm_a[i] = sq_norm ** 0.5

    # Euclidean norm of each row of b
    for i in nb.prange(b.shape[0]):
        sq_norm = 0.0
        for j in range(n_dim):
            sq_norm += b[i, j] ** 2
        norm_b[i] = sq_norm ** 0.5

    # Cosine distance: 1 - (a_i . b_i) / (|a_i| |b_i|)
    for i in nb.prange(a.shape[0]):
        dot = 0.0
        for j in range(n_dim):
            dot += a[i, j] * b[i, j]
        cos_ab[i] = 1 - dot / (norm_a[i] * norm_b[i])
    return cos_ab

%%timeit
cos(item_1_embedded, item_2_embedded)
# 218 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
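The remark above about passing arrays instead of a dictionary could look roughly like the following. It is a sketch under assumptions, not part of the original answer: the name `emb_matrix` is hypothetical, the data is regenerated as a stand-in, and the item ids are assumed to be exactly 0..n_items-1 so that an id doubles as a row index.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the question's data.
n_items, dim, n_rows = 10, 100, 1000
embeddings = {item_id: rng.standard_normal(dim) for item_id in range(n_items)}
item_1 = rng.integers(0, n_items, size=n_rows)
item_2 = rng.integers(0, n_items, size=n_rows)

# One-time conversion of the dictionary into a C-contiguous float64 matrix;
# row i holds the embedding of item i (assumes ids are 0..n_items-1).
emb_matrix = np.ascontiguousarray(
    np.stack([embeddings[i] for i in range(n_items)]), dtype=np.float64
)

# Integer-array indexing gathers the per-row embeddings without a Python loop
# and returns new C-contiguous arrays of shape (n_rows, dim).
item_1_embedded = emb_matrix[item_1]
item_2_embedded = emb_matrix[item_2]
```

Arrays built this way have the dtype and memory layout that a function compiled for contiguous float64 input expects, so they can be passed as `cos(item_1_embedded, item_2_embedded)`.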