
How to efficiently compute pairwise distances among series of different lengths (with NaNs inside)?

Following up on this question: Compute the pairwise distance in scipy with missing values



import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist

a = pd.DataFrame(np.random.rand(10, 4), columns=['a','b','c','d'])
a.loc[0, 'a'] = np.nan
a.loc[1, 'a'] = np.nan
a.loc[0, 'c'] = np.nan
a.loc[1, 'c'] = np.nan

def dropna_on_the_fly(x, y):
    return  np.sqrt(np.nansum(((x-y)**2)))

pdist(a.values, dropna_on_the_fly)

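But I suspect this is very inefficient: the built-in metrics of pdist are optimized internally, whereas a custom function is simply passed through and called once per pair of rows.

I have a hunch there is a vectorized solution in numpy, in which I broadcast the subtraction and then apply np.nansum to get an NaN-resistant sum, but I am not sure how to proceed.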
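The broadcasting idea described above can be sketched roughly as follows. This is only a minimal illustration under assumptions: the array `ar` and its seed are made up to mimic the test case, and the full `(n, n, d)` difference array is materialized, so memory grows quadratically with the number of rows. Note also that a pair with no shared non-NaN coordinates gets distance 0, because np.nansum of an all-NaN slice is 0.

```python
import numpy as np

# Hypothetical stand-in for the test case: 10 rows, 4 columns,
# with NaNs in columns 0 and 2 of the first two rows.
rng = np.random.default_rng(0)
ar = rng.random((10, 4))
ar[[0, 1], 0] = np.nan
ar[[0, 1], 2] = np.nan

# Broadcast the subtraction: every row against every row, shape (n, n, d).
diff = ar[:, None, :] - ar[None, :, :]

# NaN-resistant sum of squares over the coordinate axis, shape (n, n).
full = np.sqrt(np.nansum(diff ** 2, axis=2))

# Keep the strict upper triangle, in the same order pdist uses.
r, c = np.triu_indices(ar.shape[0], k=1)
condensed = full[r, c]
```

Here `condensed` has n*(n-1)/2 entries, one per unordered pair of rows.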



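Approach #1: A vectorized solution that indexes all row pairs at once:

ar = a.values
# Row indices (r, c) of every pair with r < c, in pdist order
r,c = np.triu_indices(ar.shape[0],1)
# NaN differences are ignored by nansum, so only shared coordinates count
out = np.sqrt(np.nansum((ar[r] - ar[c])**2,1))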
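The result is a condensed vector, like the one pdist returns. As a small usage sketch (the array `ar` below is an assumed stand-in for `a.values`), scipy's `squareform` can expand it into a square matrix, which makes it easier to look up the distance between two given rows:

```python
import numpy as np
from scipy.spatial.distance import squareform

# Hypothetical stand-in for a.values, with the same NaN pattern.
rng = np.random.default_rng(0)
ar = rng.random((10, 4))
ar[[0, 1], 0] = np.nan
ar[[0, 1], 2] = np.nan

# Condensed NaN-aware distances, one entry per pair of rows.
r, c = np.triu_indices(ar.shape[0], 1)
out = np.sqrt(np.nansum((ar[r] - ar[c]) ** 2, 1))

# Expand to a symmetric (n, n) matrix with a zero diagonal.
dist = squareform(out)

# Distance between rows 2 and 5.
d_2_5 = dist[2, 5]
```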

Approach #2: For large arrays, a more memory-efficient and faster method would be:

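ar = a.values
# Replace NaNs with 0 so the arithmetic below never produces NaN
b = np.where(np.isnan(ar),0,ar)

# True where a coordinate is present
mask = ~np.isnan(ar)
n = b.shape[0]
N = n*(n-1)//2  # number of unordered row pairs
# Offsets of each row's block within the condensed output
idx = np.concatenate(( [0], np.arange(n-1,0,-1).cumsum() ))
start, stop = idx[:-1], idx[1:]
out = np.empty((N),dtype=b.dtype)
for i in range(n-1):
    # Differences between row i and every later row
    dif = b[i,None] - b[i+1:]
    # Keep only the coordinates present in both rows of each pair
    mask_i = (mask[i] & mask[i+1:])
    masked_vals = mask_i * dif
    # Row-wise sum of squares
    out[start[i]:stop[i]] = np.einsum('ij,ij->i',masked_vals, masked_vals)
    # or simply : ((mask_i * dif)**2).sum(1)

out = np.sqrt(out)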
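To check that the two approaches agree, one could wrap each in a function and compare the outputs on random data. The function names `nan_pdist_loop` and `nan_pdist_vectorized`, the array size, and the 20% missing rate are all assumptions made for this sketch:

```python
import numpy as np

def nan_pdist_loop(ar):
    """Approach #2: row-by-row loop with zero-filled values and a validity mask."""
    b = np.where(np.isnan(ar), 0, ar)
    mask = ~np.isnan(ar)
    n = b.shape[0]
    out = np.empty(n * (n - 1) // 2, dtype=b.dtype)
    start = 0
    for i in range(n - 1):
        # Differences to all later rows, zeroed where either value is missing.
        dif = (b[i] - b[i + 1:]) * (mask[i] & mask[i + 1:])
        stop = start + dif.shape[0]
        out[start:stop] = np.einsum('ij,ij->i', dif, dif)
        start = stop
    return np.sqrt(out)

def nan_pdist_vectorized(ar):
    """Approach #1: index all row pairs at once and use nansum."""
    r, c = np.triu_indices(ar.shape[0], 1)
    return np.sqrt(np.nansum((ar[r] - ar[c]) ** 2, 1))

# Hypothetical test data: 200 rows, 6 columns, roughly 20% missing.
rng = np.random.default_rng(0)
ar = rng.random((200, 6))
ar[rng.random(ar.shape) < 0.2] = np.nan

same = np.allclose(nan_pdist_loop(ar), nan_pdist_vectorized(ar))
print(same)
```

The loop version only ever holds one `(n-i-1, d)` block of differences at a time, whereas the vectorized version builds an `(n*(n-1)/2, d)` array, which is where the memory saving comes from.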