Problems computing cdist of two columns in two different dataframes

Question

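I am trying to use cdist from scipy.spatial.distance to compute the distances between the vectors stored in two pandas dataframes, but the output is wrong and I cannot figure out where it fails.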

My original dataframes look like this:

df_sample = 
                                             Fingerprint
1272    [0.0, 4.0, 8.0, 15.0, 10.0, 8.0, 2.54, 2.0, 4.91, 0.0, 0.0, 0.0, 0.0, 3.59, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 8.0]
657    [1.44, 12.0, 10.0, 5.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 8.23, 4.36, 15.0]
806   [4.58, 13.09, 15.46, 3.59, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 6.31]

and

DF = 
  barcode  \
4538   A4060462000516278   
5043   A4050494272716275   
11663  A4070271111316245   
2701   A4060462848716270   
825    A4060454573516274   
8679   A4060462010016274   
11700  A4060462080916270   
8594   A4060461067716272   
8707   A4060454363916275   
1071   A4060463723916275   

                                                                                                                                    Geopos Ack  
4538     [0.0, 0.0, 0.0, 0.0, 6.0, 15.0, 16.0, 0.0, 0.0, 5.0, 0.0, 15.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.5, 0.0, 3.0]  
5043   [0.0, 0.0, 0.0, 0.0, 0.0, 16.0, 12.0, 0.0, 13.0, 15.0, 0.0, 15.0, 0.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 3.0, 0.0]  
11663      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 15.0, 0.0, 0.0, 0.0, 6.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  
2701      [0.0, 0.0, 0.0, 8.0, 13.0, 16.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0, 0.0, 7.0]  
825     [0.0, 0.0, 0.0, 0.0, 0.0, 11.0, 15.0, 0.0, 13.0, 16.0, 0.0, 9.0, 3.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  
8679      [0.0, 4.0, 9.0, 15.0, 10.0, 3.0, 2.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 9.0]  
11700     [0.0, 0.0, 6.0, 0.0, 15.0, 8.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 16.0, 0.0, 6.0]  
8594     [12.0, 16.0, 16.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.0, 0.0, 5.0]  
8707       [7.0, 5.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 8.0, 15.0]  
1071      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.0, 15.5, 6.0, 3.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

(I have provided both as dictionaries at the end of the question.)

As you can see, they have different dimensions (although the vectors belong to the same space). To work around that, I padded df_sample with zero vectors like this:

Number_AP = 26
number_zero_vectors = len(DF)-len(df_sample)
df =pd.DataFrame(columns = ['Fingerprint'])
for k in range(number_zero_vectors):
    a = zerolistmaker(Number_AP)
    df = df.append({'Fingerprint':a},ignore_index=True)

df_sample_ = pd.concat([df_sample, df])

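So DF and df_sample_ now have the same shape. However, the dtype of both df_sample_['Fingerprint'] and DF['Geopos Ack'] is object, i.e. every cell holds a list. So I need to turn them into arrays. The result is an array of arrays: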

Ax = df_sample_['Fingerprint'] = df_sample_['Fingerprint'].apply(lambda x: np.array(x))
Bx = DF['Geopos Ack'] = DF['Geopos Ack'].apply(lambda x: np.array(x))

So I need to 1) turn them into arrays (of vectors) and 2) make sure they have the same shape in order to use cdist:

A = Ax.to_numpy()
B = Bx.to_numpy()
AA = np.concatenate(A, axis=0).reshape(-1,1)
BB = np.concatenate(B, axis=0).reshape(-1,1)

In short, I want to compute the distance between every pair of vectors (a, b), where a is a vector in A and b is a vector in B.

For example:

A = [[1, 0], [0, 1]];
B = [[1, 1], [1, 2], [2, 1]];
D = [[1, 2, 2^0.5], [1, 2^0.5, 2]]

So, to compute the distances, I use the following complete code:

import numpy as np
import scipy.spatial.distance as sp

Ax = df_sample_['Fingerprint'] = df_sample_['Fingerprint'].apply(lambda x: np.array(x))
Bx = DF['Geopos Ack'] = DF['Geopos Ack'].apply(lambda x: np.array(x))

A = Ax.to_numpy()
B = Bx.to_numpy()
AA = np.concatenate(A, axis=0).reshape(-1,1)
BB = np.concatenate(B, axis=0).reshape(-1,1)


d = sp.cdist(AA,BB, 'euclidean')

But this returns

array([[0., 0., 0., ..., 0., 0., 0.],
       [4., 4., 4., ..., 4., 4., 4.],
       [8., 8., 8., ..., 8., 8., 8.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

which is just the concatenation of all the arrays in df_sample_.

Where am I going wrong? I know another option is to use pairwise_distances from sklearn, but I have not managed to apply it to my dataframes.

Any help would be appreciated.

Data:

df_sample = 
{'Fingerprint': {1272: [0.0,
   4.0,
   8.0,
   15.0,
   10.0,
   8.0,
   2.54,
   2.0,
   4.91,
   0.0,
   0.0,
   0.0,
   0.0,
   3.59,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   2.0,
   8.0],
  657: [1.44,
   12.0,
   10.0,
   5.0,
   6.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   2.0,
   8.23,
   4.36,
   15.0],
  806: [4.58,
   13.09,
   15.46,
   3.59,
   3.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   2.0,
   0.0,
   6.31]}}

and

DF = 
{'barcode': {4538: 'A4060462000516278',
  5043: 'A4050494272716275',
  11663: 'A4070271111316245',
  2701: 'A4060462848716270',
  825: 'A4060454573516274',
  8679: 'A4060462010016274',
  11700: 'A4060462080916270',
  8594: 'A4060461067716272',
  8707: 'A4060454363916275',
  1071: 'A4060463723916275'},
 'Geopos Ack': {4538: [0.0,
   0.0,
   0.0,
   0.0,
   6.0,
   15.0,
   16.0,
   0.0,
   0.0,
   5.0,
   0.0,
   15.0,
   0.0,
   0.0,
   0.0,
   0.0,
   2.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   3.5,
   0.0,
   3.0],
  5043: [0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   16.0,
   12.0,
   0.0,
   13.0,
   15.0,
   0.0,
   15.0,
   0.0,
   0.0,
   6.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   3.0,
   3.0,
   0.0],
  11663: [0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   5.0,
   15.0,
   0.0,
   0.0,
   0.0,
   6.0,
   2.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0],
  2701: [0.0,
   0.0,
   0.0,
   8.0,
   13.0,
   16.0,
   6.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   6.0,
   0.0,
   7.0],
  825: [0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   11.0,
   15.0,
   0.0,
   13.0,
   16.0,
   0.0,
   9.0,
   3.0,
   0.0,
   6.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0],
  8679: [0.0,
   4.0,
   9.0,
   15.0,
   10.0,
   3.0,
   2.0,
   0.0,
   2.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   2.0,
   9.0],
  11700: [0.0,
   0.0,
   6.0,
   0.0,
   15.0,
   8.0,
   2.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   16.0,
   0.0,
   6.0],
  8594: [12.0,
   16.0,
   16.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   8.0,
   0.0,
   5.0],
  8707: [7.0,
   5.0,
   2.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   2.0,
   8.0,
   15.0],
  1071: [0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   12.0,
   15.5,
   6.0,
   3.5,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0]}}

Answer 1

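As stated in the scipy.spatial.distance docs, XA and XB should each be a collection of vectors (a 2-D array with one vector per row); cdist then computes the distance from every vector in one to every vector in the other. What your code does is build one long vector out of all the vectors and compare those, whereas I think what you need to do is stack them. With reshape(-1, 1), every scalar becomes its own 1-dimensional point, so cdist returns absolute differences between individual coordinates rather than distances between 26-dimensional vectors. Your exact intent is not entirely clear from the question, though, so I may be wrong.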
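To make the difference concrete, here is a small sketch with made-up 3-dimensional vectors (the names `a_rows` and `b_rows` and their values are illustrative, not the OP's data). It compares the shapes that `np.concatenate(...).reshape(-1, 1)` and `np.stack` hand to `cdist`:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Lists of 1-D arrays stand in for the object columns of the two dataframes.
# The values are made up for illustration.
a_rows = [np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 0.0])]
b_rows = [
    np.array([1.0, 1.0, 0.0]),
    np.array([1.0, 2.0, 2.0]),
    np.array([2.0, 1.0, 1.0]),
]

# Flattening: every scalar becomes its own 1-dimensional point.
a_flat = np.concatenate(a_rows, axis=0).reshape(-1, 1)  # shape (6, 1)
b_flat = np.concatenate(b_rows, axis=0).reshape(-1, 1)  # shape (9, 1)
d_flat = cdist(a_flat, b_flat, "euclidean")             # shape (6, 9)

# Stacking: every row is one 3-dimensional vector.
a_stacked = np.stack(a_rows)                            # shape (2, 3)
b_stacked = np.stack(b_rows)                            # shape (3, 3)
d_stacked = cdist(a_stacked, b_stacked, "euclidean")    # shape (2, 3)

print(d_flat.shape, d_stacked.shape)
```

Applied to the data from the question, the same stacking looks like this: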

import pandas as pd
import numpy as np
import scipy.spatial.distance as sp

# df_sample and DF are OP's dictionaries
df_sample_df = pd.DataFrame(df_sample)
DF_df = pd.DataFrame(DF)

Ax = df_sample_df['Fingerprint'] = df_sample_df['Fingerprint'].apply(lambda x: np.array(x))
Bx = DF_df['Geopos Ack'] = DF_df['Geopos Ack'].apply(lambda x: np.array(x))

A = Ax.to_numpy()
B = Bx.to_numpy()
AA = np.stack(A)
BB = np.stack(B)


d = sp.cdist(AA,BB, 'euclidean')
print(f'd.shape = {d.shape}')
print(f'd[0, 0] = {d[0, 0]}')
print(f'L2(AA[0],BB[0]) = {np.sum((AA[0] - BB[0])**2)**0.5}')

Output:

d.shape = (3, 10)
d[0, 0] = 34.57536840006191
L2(AA[0],BB[0]) = 34.57536840006192
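Two practical notes follow from the shape of `d`. First, `cdist` only requires XA and XB to have the same number of columns (26 here), not the same number of rows, so the zero-vector padding from the question is not needed. Second, row i of `d` corresponds to row i of the first dataframe and column j to row j of the second, so the result can be wrapped in a labelled DataFrame. The sketch below uses small made-up frames (`left`, `right`, the barcodes, and all values are hypothetical) and `np.array(series.tolist())` as one possible alternative to `apply` followed by `np.stack`:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Hypothetical stand-ins for df_sample and DF: one list-valued column each.
left = pd.DataFrame(
    {"Fingerprint": [[0.0, 0.0], [3.0, 4.0]]},
    index=[1272, 657],
)
right = pd.DataFrame(
    {
        "barcode": ["b1", "b2", "b3"],
        "Geopos Ack": [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]],
    },
    index=[4538, 5043, 11663],
)

# Turn each column of equal-length lists into a 2-D float array (rows = vectors).
XA = np.array(left["Fingerprint"].tolist(), dtype=float)   # shape (2, 2)
XB = np.array(right["Geopos Ack"].tolist(), dtype=float)   # shape (3, 2)

# Different row counts are fine; only the number of columns must match.
dist = pd.DataFrame(
    cdist(XA, XB, metric="euclidean"),
    index=left.index,                      # rows labelled by the left index
    columns=right["barcode"].tolist(),     # columns labelled by barcode
)
print(dist)
```

Labelling the result this way makes it harder to mix up which row of one frame was compared with which row of the other.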

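To make your question clearer, you could explain which distance you want to compute and add a MINIMAL reproducible example, such as:

"I want to find the distance between every pair of vectors (a, b), where a is a vector in A and b is a vector in B.
A = [[1, 0], [0, 1]];
B = [[1, 1], [1, 2], [2, 1]];
D = [[1, 2, 2^0.5], [1, 2^0.5, 2]]"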

Or:

"I want the Frobenius norm of the difference between the zero-padded matrix A and the matrix B.
A = [[1, 0], [0, 1]];
B = [[1, 1], [1, 2], [2, 1]];
D = 8^0.5"
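Both readings can be checked numerically on the toy matrices above. The following is a minimal sketch; the variable names `pairwise`, `A_padded`, and `frobenius` are chosen here for illustration, and padding A with zero rows at the bottom is an assumption about what "padded" means:

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[1, 0], [0, 1]], dtype=float)
B = np.array([[1, 1], [1, 2], [2, 1]], dtype=float)

# Reading 1: Euclidean distance between every row of A and every row of B.
pairwise = cdist(A, B, metric="euclidean")   # shape (2, 3)

# Reading 2: pad A with zero rows until it has as many rows as B,
# then take the Frobenius norm of the element-wise difference.
A_padded = np.pad(A, ((0, len(B) - len(A)), (0, 0)), mode="constant")
frobenius = np.linalg.norm(A_padded - B, ord="fro")

print(pairwise)
print(frobenius)
```

The first reading yields a matrix with one entry per (a, b) pair, while the second collapses everything into a single number, which is why stating the intended result matters.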

python

numpy

scipy

pandas