Sklearn: Nearest Neighbor with String Values and Custom Metric

Question

My data looks like this (all values are strings):

>>> all_states[0:3]
[['A','B','Empty'],
 ['A', 'B', 'Empty'],
 ['C', 'D', 'Empty']]

I want to use a custom distance metric:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def mydist(x, y):
    return 1
neigh = NearestNeighbors(n_neighbors=5, metric=mydist)

However, when I call

neigh.fit(np.array(all_states))

I get the error

ValueError: Unable to convert array of bytes/strings into decimal numbers with dtype='numeric'

I know I could use OneHotEncoder or LabelEncoder, but can I also skip encoding the data, since I have my own distance metric?

Answer 1

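As far as I know, ML models need to be trained on numeric data. If your distance metric has a way of converting the strings to numbers, then it will work.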
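One way to read this suggestion is to turn the strings into numeric codes before fitting, and let the custom metric compare the codes. The sketch below is only an illustration under that assumption: `mismatch_count` is a hypothetical metric (it is not the `mydist` from the question), `OrdinalEncoder` is used instead of the encoders named in the question, and `n_neighbors=2` is chosen because the sample has only three rows.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import OrdinalEncoder

all_states = [['A', 'B', 'Empty'],
              ['A', 'B', 'Empty'],
              ['C', 'D', 'Empty']]

# Map each distinct string in a column to a numeric code, column by column.
encoder = OrdinalEncoder()
codes = encoder.fit_transform(np.array(all_states))

def mismatch_count(x, y):
    # Hypothetical metric: x and y are 1-D numeric rows;
    # count the positions where their codes differ.
    return float(np.sum(x != y))

# Brute-force search calls the Python metric on pairs of encoded rows.
neigh = NearestNeighbors(n_neighbors=2, algorithm='brute', metric=mismatch_count)
neigh.fit(codes)
distances, indices = neigh.kneighbors(codes)
```

The codes carry no ordering meaning here; the metric only tests them for equality, so it behaves as if it compared the original strings.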

Answer 2

From the help page:

metric : str or callable, default=’minkowski’

The distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric. See the documentation of DistanceMetric for a list of available metrics. If metric is “precomputed”, X is assumed to be a distance matrix and must be square during fit. X may be a sparse graph, in which case only “nonzero” elements may be considered neighbors.

You can use pdist (see its documentation) and convert the result to a square matrix, as the input requires:

all_states = [['A','B','Empty'],
 ['A', 'B', 'Empty'],
 ['C', 'D', 'Empty']]

from scipy.spatial.distance import pdist,squareform
from sklearn.neighbors import NearestNeighbors

# mydist is the custom metric defined in the question
dm = squareform(pdist(all_states, mydist))
dm

array([[0., 1., 1.],
       [1., 0., 1.],
       [1., 1., 0.]])

# fit succeeds, but a later kneighbors query for 5 neighbors needs at least 5 fitted rows
neigh = NearestNeighbors(n_neighbors=5, metric="precomputed")
neigh.fit(dm)
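To show the precomputed approach end to end, including a query, here is a minimal sketch. It makes several assumptions: `string_mismatches` is a hypothetical metric standing in for `mydist` (which always returns 1 and so cannot rank neighbors), the distance matrix is built with a plain loop instead of pdist, `new_state` is an invented query row, and `n_neighbors=2` is used because asking for more neighbors than there are fitted rows raises an error at query time.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

all_states = [['A', 'B', 'Empty'],
              ['A', 'B', 'Empty'],
              ['C', 'D', 'Empty']]

def string_mismatches(x, y):
    # Hypothetical metric: number of positions where the strings differ.
    return sum(a != b for a, b in zip(x, y))

# Build the square, symmetric distance matrix directly on the string rows.
n = len(all_states)
dm = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dm[i, j] = dm[j, i] = string_mismatches(all_states[i], all_states[j])

neigh = NearestNeighbors(n_neighbors=2, metric="precomputed")
neigh.fit(dm)

# Neighbors of the fitted rows themselves (each row is excluded from its own result).
distances, indices = neigh.kneighbors()

# For a new row, pass its distances to every fitted row: shape (1, n).
new_state = ['C', 'D', 'X']  # invented example row
query = np.array([[string_mismatches(new_state, s) for s in all_states]], dtype=float)
q_dist, q_ind = neigh.kneighbors(query, n_neighbors=1)
```

With metric="precomputed" the estimator never sees the strings, so no encoding is needed; the cost is that every query must arrive as a row of distances to the fitted samples.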
