Fast nearest-neighbour search in 4D space in Python (vectorization)
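For each observation in X (there are 20), I want to get the k (3) nearest neighbours.
How can I make this fast enough to support up to 3 to 4 million rows?
Is it possible to speed up the loop over the elements? Perhaps with numpy, numba, or some kind of vectorization?
A naive loop in Python: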
import numpy as np
from sklearn.neighbors import KDTree
n_points = 20
d_dimensions = 4
k_neighbours = 3
rng = np.random.RandomState(0)
X = rng.random_sample((n_points, d_dimensions))
print(X)
tree = KDTree(X, leaf_size=2, metric='euclidean')
for element in X:
    print('********')
    print(element)
    # when simply using the first row
    #element = X[:1]
    #print(element)
    # potential optimization: query_radius https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree.query_radius
    dist, ind = tree.query([element], k=k_neighbours, return_distance=True, dualtree=False, breadth_first=False, sort_results=True)
    # indices of 3 closest neighbors
    print(ind)
    #[[0 9 1]] !! includes self (element that was searched for)
    print(dist) # distances to 3 closest neighbors
    #[[0. 0.38559188 0.40997835]] !! includes self (element that was searched for)
    # actual returned elements for index:
    print(X[ind])
    ## after removing self
    print(X[ind][0][1:])
The ideal output is a pandas.DataFrame with the following structure:
lat_1,long_1,lat_2,long_2,neighbours_list
0.5488135,0.71518937,0.60276338,0.54488318, [[0.61209572 0.616934 0.94374808 0.6818203 ][0.4236548 0.64589411 0.43758721 0.891773]
Edit
For now, I have a pandas-based implementation:
df = df.dropna() # there are sometimes only parts of the tuple (either left or right) defined
X = df[['lat1', 'long1', 'lat2', 'long2']]
tree = KDTree(X, leaf_size=4, metric='euclidean')
k_neighbours = 3
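def neighbors_as_list(row, index, complete_list):
    # one tree query per row; this per-row call is what makes the approach slow
    dist, ind = index.query([[row['lat1'], row['long1'], row['lat2'], row['long2']]], k=k_neighbours, return_distance=True, dualtree=False, breadth_first=False, sort_results=True)
    # drop the first hit (the row itself) and return the neighbours' coordinates
    return complete_list.values[ind][0][1:]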
df['neighbors'] = df.apply(neighbors_as_list, index=tree, complete_list=X, axis=1)
df.head()
But this is very slow.
Edit 2
Sure, here is a pandas version:
import numpy as np
import pandas as pd
from sklearn.neighbors import KDTree
from scipy.spatial import cKDTree
rng = np.random.RandomState(0)
#n_points = 4_000_000
n_points = 20
d_dimensions = 4
k_neighbours = 3
X = rng.random_sample((n_points, d_dimensions))
X
df = pd.DataFrame(X)
df = df.reset_index(drop=False)
df.columns = ['id_str', 'lat_1', 'long_1', 'lat_2', 'long_2']
df.id_str = df.id_str.astype(object)
display(df.head())
tree = cKDTree(df[['lat_1', 'long_1', 'lat_2', 'long_2']])
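dist, ind = tree.query(X, k=k_neighbours, workers=-1)  # SciPy >= 1.6; older releases used n_jobs=-1
ind_out = ind[:, 1:]  # drop column 0, which is each point itself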
display(dist)
print(df[['lat_1', 'long_1', 'lat_2', 'long_2']].shape)
print(X[ind_out].shape)
X[ind_out]
# fails with
# AssertionError: Shape of new values must be compatible with manager shape
df['neighbors'] = X[ind_out]
df
But it fails, because I cannot assign the result back to the DataFrame.
You can use scipy's cKDTree.
Example
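import numpy as np
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
n_points = 4_000_000
d_dimensions = 4
k_neighbours = 3
X = rng.random_sample((n_points, d_dimensions))
# build the tree once
tree = cKDTree(X)
#%timeit tree = cKDTree(X)
#3.74 s ± 29.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#%%timeit
# query all points in a single call; workers=-1 uses all cores (n_jobs=-1 in SciPy < 1.6)
_, ind = tree.query(X, k=k_neighbours, workers=-1)
# drop column 0 (each point is its own nearest neighbour); shape=(4000000, 2)
ind_out = ind[:, 1:]
# neighbour coordinates via fancy indexing; shape=(4000000, 2, 4)
coords_out = X[ind_out]
#7.13 s ± 87.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)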
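For a problem of this size, about 11 s in total (roughly 3.7 s to build the tree plus 7.1 s for the query and indexing) is already quite good.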
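To get from coords_out to the DataFrame layout asked for in the question, one possible approach is sketched below. The assignment in Edit 2 fails because a 3-D array of shape (n, 2, 4) cannot be stored in a single column directly; converting it to a nested list first gives one Python list per row. The column name 'neighbors' and the small n_points are taken from the question. Calling tolist() here is an assumption about what is acceptable as cell content, not something the original answer shows.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
n_points = 20            # small size for illustration only
k_neighbours = 3
cols = ['lat_1', 'long_1', 'lat_2', 'long_2']
df = pd.DataFrame(rng.random_sample((n_points, 4)), columns=cols)

X = df[cols].to_numpy()
tree = cKDTree(X)

# One batched query for all rows; column 0 of ind is each point itself.
_, ind = tree.query(X, k=k_neighbours)
ind_out = ind[:, 1:]          # shape (n_points, 2)
coords_out = X[ind_out]       # shape (n_points, 2, 4)

# tolist() yields one nested list per row, so every cell of the new
# column holds the coordinates of that row's two nearest neighbours.
df['neighbors'] = coords_out.tolist()
print(df.head())
```

With several million rows, a column of Python lists uses far more memory than the NumPy array itself, so it may be preferable to keep coords_out as an array and only build the list column when it is really needed.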