Example from LSHForest: results not convincing
The library and the corresponding documentation are below - and yes, I read everything and was able to "run" my own code.
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LSHForest.html
But the results don't really make sense to me, so I went through the example (also contained in the page above):
>>> from sklearn.neighbors import LSHForest
>>> X_train = [[5, 5, 2], [21, 5, 5], [1, 1, 1], [8, 9, 1], [6, 10, 2]]
>>> X_test = [[9, 1, 6], [3, 1, 10], [7, 10, 3]]
>>> lshf = LSHForest()
>>> lshf.fit(X_train)
LSHForest(min_hash_match=4, n_candidates=50, n_estimators=10,
          n_neighbors=5, radius=1.0, radius_cutoff_ratio=0.9,
          random_state=None)
>>> distances, indices = lshf.kneighbors(X_test, n_neighbors=2)
>>> distances
array([[ 0.069..., 0.149...],
       [ 0.229..., 0.481...],
       [ 0.004..., 0.014...]])
>>> indices
array([[1, 2],
       [2, 0],
       [4, 0]])
So I simply tried to verify the example by finding the nearest neighbours for the three test points [9, 1, 6], [3, 1, 10], [7, 10, 3].
Take the search for the nearest neighbours of [9, 1, 6] (using Euclidean distance): the closest training points are [5, 5, 2] and [6, 10, 2] (so I think the indices would be [0, 4]) -- which is quite different from the result [1, 2].
The distances are also completely off; my Excel sheet is attached.
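(For reference, the same by-hand check can be scripted instead of done in a spreadsheet; a minimal numpy sketch that prints the exact Euclidean distance matrix and the two nearest training indices per test point:)

import numpy as np

X_train = np.array([[5, 5, 2], [21, 5, 5], [1, 1, 1], [8, 9, 1], [6, 10, 2]])
X_test = np.array([[9, 1, 6], [3, 1, 10], [7, 10, 3]])

# Pairwise Euclidean distances: one row per test point,
# one column per training point.
dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)

# Indices of the two closest training points for each test point.
nearest = np.argsort(dists, axis=1)[:, :2]
print(dists)
print(nearest)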
Thanks again for your valuable time and help.
That's right: because LSHForest implements ANN (approximate nearest neighbours), perhaps that is the difference we need to account for. An ANN result is not the set of true nearest neighbours, but an approximation of what the nearest neighbours should be.
For example, the exact 2-nearest-neighbour result looks like this:
from sklearn.neighbors import NearestNeighbors

X_train = [[5, 5, 2], [21, 5, 5], [1, 1, 1], [8, 9, 1], [6, 10, 2]]
X_test = [[9, 1, 6], [3, 1, 10], [7, 10, 3]]

# Exact (non-approximate) neighbour search using a ball tree.
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X_train)
distances, indices = nbrs.kneighbors(X_test)
and it returns:
indices
Out[2]:
array([[0, 2],
       [0, 2],
       [4, 3]], dtype=int64)
distances
Out[3]:
array([[ 6.92820323,  9.43398113],
       [ 9.16515139,  9.21954446],
       [ 1.41421356,  2.44948974]])
If it helps, have a look at this and note that it mentions:
given a query point q, if there exists a point within distance r from q, then it reports a point within distance cr from q. Here c is the approximation factor of the algorithm.
The point within distance 'r' and the point that gets returned do not have to be the same one.
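One more detail: if I read the linked LSHForest documentation correctly, the distances it returns are cosine distances rather than Euclidean ones, which would also explain why they are on such a different scale than the ball-tree output above. A minimal sketch (my addition; it assumes an older scikit-learn release where LSHForest still exists, and uses sklearn's cosine_distances) that compares the approximate distances against the exact cosine distances, so the ratio plays the role of the approximation factor 'c' from the quote:

import numpy as np
from sklearn.neighbors import LSHForest
from sklearn.metrics.pairwise import cosine_distances

X_train = np.array([[5, 5, 2], [21, 5, 5], [1, 1, 1], [8, 9, 1], [6, 10, 2]])
X_test = np.array([[9, 1, 6], [3, 1, 10], [7, 10, 3]])

# Approximate neighbours; LSHForest reports cosine distances.
lshf = LSHForest(random_state=42).fit(X_train)
approx_dist, approx_idx = lshf.kneighbors(X_test, n_neighbors=2)

# Exact cosine distances to all training points, keeping the two smallest.
exact_dist = np.sort(cosine_distances(X_test, X_train), axis=1)[:, :2]

# Ratios near 1 mean the forest found (nearly) the true neighbours;
# anything above 1 is the approximation error the quote allows for.
print(approx_dist / exact_dist)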
Hope this helps.