Issue with sklearn k nearest neighbors
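I would like to know whether there is a way to force the sklearn NearestNeighbors algorithm to take into account the order of the points in the input array when there are duplicate points.
To illustrate: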
>>> from sklearn.neighbors import NearestNeighbors
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
>>> distances, indices = nbrs.kneighbors(X)
>>> indices
array([[0, 1],
       [1, 0],
       [2, 1],
       [3, 4],
       [4, 3],
       [5, 4]])
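Because the query set matches the training set, the nearest neighbor of each point is the point itself, at a distance of zero. However, if I allow duplicate points in X, the algorithm, understandably, does not distinguish between the duplicates: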
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2], [-1, -1], [-1, -1]])
>>> nbrs = NearestNeighbors(n_neighbors=2, algorithm='auto').fit(X)
>>> distances, indices = nbrs.kneighbors(X)
>>> indices
array([[6, 0],
       [1, 0],
       [2, 1],
       [3, 4],
       [4, 3],
       [5, 4],
       [6, 0],
       [6, 0]])
Ideally, I would like the last output to be something like:
array([[0, 6],
       [1, 0],
       [2, 1],
       [3, 4],
       [4, 3],
       [5, 4],
       [6, 0],
       [7, 6]])
I don't think you can do that, because the ref says:
Warning: Regarding the Nearest Neighbors algorithms, if two neighbors,
neighbor k+1 and k, have identical distances but different labels, the
results will depend on the ordering of the training data.
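If the dataset is small enough, one possible workaround is to skip the tree-based search and break ties yourself. The sketch below is not part of sklearn: `ordered_kneighbors` is a hypothetical helper that builds the full pairwise distance matrix with plain NumPy, forces each point to rank first in its own row, and relies on a stable sort so that equal distances keep their input order. It assumes Euclidean distance and needs O(n^2) memory, so treat it as an illustration rather than a drop-in replacement.

```python
import numpy as np


def ordered_kneighbors(X, n_neighbors):
    """Brute-force k nearest neighbors with ties broken by input order.

    Hypothetical helper, not an sklearn API. Assumes Euclidean distance.
    """
    X = np.asarray(X, dtype=float)
    # Full pairwise Euclidean distance matrix, shape (n, n).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Sort key: give the diagonal a negative value so that each point
    # always ranks first in its own row, ahead of its duplicates.
    key = dist.copy()
    np.fill_diagonal(key, -1.0)
    # A stable sort keeps equal keys in their original (index) order.
    order = np.argsort(key, axis=1, kind="stable")[:, :n_neighbors]
    return np.take_along_axis(dist, order, axis=1), order


X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2],
              [-1, -1], [-1, -1]])
distances, indices = ordered_kneighbors(X, n_neighbors=2)
print(indices)
```

With the array from the question, every row starts with its own index and the remaining ties go to the lowest index. The last row therefore comes out as [7, 0] rather than the [7, 6] in the desired output; if you need a different tie rule, the sort key is the place to change it.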