混合数值和分类数据观测值之间成对距离计算的有效实现

Question

我正在从事一个数据科学项目，在该项目中我必须计算数据集中每对观测值之间的欧氏距离。

由于我正在处理非常大的数据集，因此我必须使用高效的成对距离计算实现（在内存使用和计算时间方面）。

一种解决方案是使用 Scipy 中的 pdist 函数，它 returns 一维数组中的结果，没有重复实例。

但是，此函数无法处理分类变量。对于这些，我想在值相同时将距离设置为 0，否则设置为 1。

我已尝试在 Python 中使用 Numba 实现此变体。该函数将包含所有观察值的二维 Numpy 数组和包含变量类型（float64 或 category）的一维数组作为输入。

这是代码：

import numpy as np
from numba.decorators import autojit

def pairwise(X, types):
    m = X.shape[0]
    n = X.shape[1]

    D = np.empty((int(m * (m - 1) / 2), 1), dtype=np.float)
    ind = 0

    for i in range(m):
        for j in range(i+1, m):
            d = 0.0

            for k in range(n):
                if types[k] == 'float64':
                    tmp = X[i, k] - X[j, k]
                    d += tmp * tmp
                else:
                    if X[i, k] != X[j, k]:
                        d += 1.

            D[ind] = np.sqrt(d)
            ind += 1

    return D.reshape(1, -1)[0]

pairwise_numba = autojit(pairwise)

vectors = np.random.rand(20000, 100)
types = np.array(['float64']*100)

dists = pairwise_numba(vectors, types)

尽管使用了 Numba，这个实现还是很慢。是否可以改进我的代码以使其更快？

Answer 1

autojit 已弃用，recommended 改为使用 jit。而且几乎总是你应该使用 jit(nopython=True) 如果某些东西不能从 python.

中降低，这将使 numba 失败

在您的代码中使用 nopython 揭示了两个问题。一个是简单的修复——这一行需要引用特定的 numpy 类型而不是 float

 - D = np.empty((int(m * (m - 1) / 2), 1), dtype=np.float)
 + D = np.empty((int(m * (m - 1) / 2), 1), dtype=np.float64)

第二个是您使用字符串来保存类型信息 - numba 对使用字符串的支持有限。您可以改为将类型信息编码为数字数组，例如0 表示数字，1 表示分类。所以一个实现可能是。

@jit(nopython=True)
def pairwise_nopython(X, types):
    m = X.shape[0]
    n = X.shape[1]

    D = np.empty((int(m * (m - 1) / 2), 1), dtype=np.float64)
    ind = 0

    for i in range(m):
        for j in range(i+1, m):
            d = 0.0

            for k in range(n):
                if types[k] == 0: #numeric
                    tmp = X[i, k] - X[j, k]
                    d += tmp * tmp
                else:
                    if X[i, k] != X[j, k]:
                        d += 1.

            D[ind] = np.sqrt(d)
            ind += 1

    return D.reshape(1, -1)[0]

Answer 2

如果你真的想让 numba 快速执行，你需要 jit nopython 模式下的函数，否则 numba 可能会退回到更慢的对象模式（并且可能非常慢） .

但是您的函数无法在 nopython 模式下编译（从 numba 版本 0.43.1 开始），这是因为：

dtype 参数 np.empty。 np.float 只是 Pythons float 并且会被 NumPy（但不是 numba）翻译成 np.float_。如果您使用 numba，则必须使用它。
numba 中缺少字符串支持。所以 types[k] == 'float64' 行不会编译。

第一个问题很简单。关于第二个问题：与其尝试使字符串比较工作，不如提供一个布尔数组。使用布尔数组并评估一个布尔值的真实性也比比较最多 7 个字符要快得多。特别是如果它在最内层的循环中！

所以它可能看起来像这样：

import numpy as np
import numba as nb

@nb.njit
def pairwise_numba(X, is_float_type):
    m = X.shape[0]
    n = X.shape[1]

    D = np.empty((int(m * (m - 1) / 2), 1), dtype=np.float64)  # corrected dtype
    ind = 0

    for i in range(m):
        for j in range(i+1, m):
            d = 0.0

            for k in range(n):
                if is_float_type[k]:
                    tmp = X[i, k] - X[j, k]
                    d += tmp * tmp
                else:
                    if X[i, k] != X[j, k]:
                        d += 1.

            D[ind] = np.sqrt(d)
            ind += 1

    return D.reshape(1, -1)[0]

dists = pairwise_numba(vectors, types == 'float64')  # pass in the boolean array

但是，如果将浮点类型上的 scipy.spatial.distances.pdist 与 numba 逻辑结合使用以计算不相等的类别，则可以简化逻辑：

from scipy.spatial.distance import pdist

@nb.njit
def categorial_sum(X):
    m = X.shape[0]
    n = X.shape[1]
    D = np.zeros(int(m * (m - 1) / 2), dtype=np.float64)  # corrected dtype
    ind = 0

    for i in range(m):
        for j in range(i+1, m):
            d = 0.0
            for k in range(n):
                if X[i, k] != X[j, k]:
                    d += 1.
            D[ind] = d
            ind += 1

    return D

def pdist_with_categorial(vectors, types):
    where_float_type = types == 'float64'
    # calculate the squared distance of the float values
    distances_squared = pdist(vectors[:, where_float_type], metric='sqeuclidean')
    # sum the number of mismatched categorials and add that to the distances 
    # and then take the square root
    return np.sqrt(distances_squared + categorial_sum(vectors[:, ~where_float_type]))

它不会显着加快，但它大大简化了 numba 函数中的逻辑。

然后您还可以通过将平方距离传递给 numba 函数来避免创建额外的数组：

@nb.njit
def add_categorial_sum_and_sqrt(X, D):
    m = X.shape[0]
    n = X.shape[1]
    ind = 0
    for i in range(m):
        for j in range(i+1, m):
            d = 0.0
            for k in range(n):
                if X[i, k] != X[j, k]:
                    d += 1.
            D[ind] = np.sqrt(D[ind] + d)
            ind += 1

    return D

def pdist_with_categorial(vectors, types):
    where_float_type = types == 'float64'
    distances_squared = pdist(vectors[:, where_float_type], metric='sqeuclidean')
    return add_categorial_sum_and_sqrt(vectors[:, ~where_float_type], distances_squared)

混合数值和分类数据观测值之间成对距离计算的有效实现

Efficient implementation of pairwise distances computation between observations for mixed numeric and categorical data

python

numpy

categorical-data

numba

pairwise-distance