Measuring the distance between rows of a data frame

Question

I have a data frame with 472 rows and 32 columns that looks like this:

2   3   0   4   2   0   0   5   2   3   3   3   2   0   5   5   3   3   3   2   2   0   2   5   3   3   3   2   2   2   0   5
2   3   0   4   2   0   0   5   2   3   3   3   2   0   5   5   3   3   3   2   2   0   2   5   3   3   3   2   2   2   0   5
2   3   0   4   2   0   0   5   2   3   3   3   2   0   5   5   3   3   3   2   2   0   2   5   3   3   3   2   2   2   0   5

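Here, each row represents one person's 32 teeth, and each number from 0 to 5 denotes a different tooth category. I want to measure the distance between any two rows using different distance metrics (e.g. Manhattan, Euclidean, Minkowski). The smaller the distance, the more likely it is that the two rows belong to the same person.

* If I apply one-hot encoding before computing these metrics, each row ends up with more than 32 columns, which is of no use to me.

* I also found cdist and pdist, but these functions gave me element-wise distance results. What I want is a single result for any two rows.

Am I trying to do something meaningless, or what should I do to compute these distances?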

Answer 1

The distance function you are looking for appears to be this one:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html

You can set its metric parameter to any metric accepted by scipy.spatial.distance.pdist.
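As a minimal sketch of what choosing a metric looks like, the snippet below computes the distances named in the question for two rows. The rows `x` and `y` are hypothetical 8-tooth examples, not data from the question. The Hamming metric is an addition that the question does not mention: it reports the fraction of positions that differ, which may suit category codes better than numeric differences, and it needs no one-hot encoding.

```python
from sklearn.metrics import pairwise_distances

# Hypothetical rows: two people, 8 teeth each, categories coded 0-5.
x = [[2, 3, 0, 4, 2, 0, 0, 5]]
y = [[2, 3, 0, 4, 2, 5, 0, 3]]

# Each call returns a 1x1 array; [0, 0] extracts the single number.
manhattan = pairwise_distances(x, y, metric="manhattan")[0, 0]
euclidean = pairwise_distances(x, y, metric="euclidean")[0, 0]
# Extra keyword arguments such as p are forwarded to the metric.
minkowski3 = pairwise_distances(x, y, metric="minkowski", p=3)[0, 0]
# Fraction of positions whose categories differ (assumed useful here).
hamming = pairwise_distances(x, y, metric="hamming")[0, 0]

print(manhattan, euclidean, minkowski3, hamming)
```

The rows differ in two of eight positions, so the Hamming distance is 0.25, while the other three metrics also depend on how far apart the category codes are numerically.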

An example of how it works:

a = [[1,2,3,4,5,6,7,8,10]]
b = [[2,4,1,3,4,5,6,7,8]]
c = [[4,2,1,54,7,85,89,1,2]]

from sklearn.metrics import pairwise_distances

pairwise_distances(a,b)

The output will be:

array([[4.24264069]])

Similarly, the output of

pairwise_distances(a,c)

will be:

array([[124.87994234]])

So c is farther from a than b is.
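As a cross-check, scipy's pdist, which the question mentions, also gives one number per pair of rows when the rows are stacked into a single 2-D array. The sketch below reuses the same three example rows. The variable names `rows`, `condensed` and `square` are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# The three example rows stacked into one (3, 9) array.
rows = np.array([
    [1, 2, 3, 4, 5, 6, 7, 8, 10],    # a
    [2, 4, 1, 3, 4, 5, 6, 7, 8],     # b
    [4, 2, 1, 54, 7, 85, 89, 1, 2],  # c
])

# Condensed form: one distance per pair, ordered (a,b), (a,c), (b,c).
condensed = pdist(rows, metric="euclidean")

# Square form: square[i, j] is the distance between row i and row j.
square = squareform(condensed)

print(condensed)
print(square)
```

The first two entries of `condensed` match the two pairwise_distances results above, so each pair of rows yields a single distance rather than an element-wise result.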

You can apply this logic to your problem. In your case, the following snippet should do the job:

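import pandas as pd
import numpy as np
from sklearn.metrics import pairwise_distances

# Pass header=None to read_csv if the file has no header row
df = pd.read_csv('your_file.csv')
for i, row in df.iterrows():
    # pairwise_distances expects 2-D input, so reshape each row to (1, n_columns)
    row = np.array(row).reshape(1, -1)
    for j, other_row in df.iterrows():
        other_row = np.array(other_row).reshape(1, -1)
        # distance is a 1x1 array holding the single distance between rows i and j
        distance = pairwise_distances(row, other_row)
        print("Distance between row {} and {} : {}".format(i, j, distance))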
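The nested loop calls pairwise_distances once per pair of rows. A possibly simpler alternative is to pass the whole table in a single call, which returns a square matrix with one distance per pair. The sketch below uses a small hypothetical DataFrame in place of the real 472 x 32 file, and the Manhattan metric is chosen only as an example.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

# Hypothetical stand-in for the real data: 4 people, 6 teeth each.
df = pd.DataFrame([
    [2, 3, 0, 4, 2, 0],
    [2, 3, 0, 4, 2, 0],
    [2, 3, 1, 4, 2, 5],
    [5, 0, 0, 1, 2, 3],
])

# dist[i, j] is the single distance between row i and row j.
dist = pairwise_distances(df.to_numpy(), metric="manhattan")
dist_df = pd.DataFrame(dist, index=df.index, columns=df.index)

# For each row, find the closest other row by ignoring the zero diagonal.
masked = dist.copy()
np.fill_diagonal(masked, np.inf)
nearest = masked.argmin(axis=1)

print(dist_df)
print(nearest)
```

Rows 0 and 1 are identical here, so their distance is 0 and each is the other's nearest row. For 472 rows the result is a 472 x 472 matrix, which is small enough to hold in memory.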

