如何(巧妙地)遍历 GeoDataframe 中的所有点并查看最近的邻居

How to (smartly) loop over all points in a GeoDataframe and look at nearest neighbours

我有一个大的(O(10^6) 行)数据集(带值的点),我需要对所有点执行以下操作:

"non-vectorised" 方法是简单地遍历所有点...对于所有点,然后应用逻辑。然而,这种扩展性很差。

我已经包含了一个玩具示例,它可以满足我的要求。我已经考虑过的想法是:

这是我要实现的逻辑的玩具示例:

import pandas as pd
import numpy as np
from shapely.wkt import loads
import geopandas as gp

points=[
    'POINT (1 1.1)', 'POINT (1 1.9)', 'POINT (1 3.1)',
    'POINT (2 1)', 'POINT (2 2.1)', 'POINT (2 2.9)',
    'POINT (3 0.8)', 'POINT (3 2)', 'POINT (3 3)'
]
values=[9,8,7,6,5,4,3,2,1]

df=pd.DataFrame({'points':points,'values':values})
gdf=gp.GeoDataFrame(df,geometry=[loads(x) for x in df.points], crs={'init': 'epsg:' + str(25832)})

for index,row in gdf.iterrows(): # Looping over all points
    gdf['dist'] = np.nan
    for index2,row2 in gdf.iterrows(): # Looping over all the other points
        if index==index2: continue
        d=row['geometry'].distance(row2['geometry']) # Calculate distance
        if d<3: gdf.at[index2,'dist']=d # If within cutoff: Store
        else: gdf.at[index2,'dist']=np.nan # Otherwise, be paranoid and leave NAN
    # Calculating mean of values for the 3 nearest points and storing 
    gdf.at[index,'mean']=np.mean(gdf.sort_values('dist').head(3)['values'].tolist())

print(gdf)

生成的 GeoDataframe 在这里:

          points  values       geometry      dist      mean
0  POINT (1 1.1)       9  POINT (1 1.1)  2.758623  6.333333
1  POINT (1 1.9)       8  POINT (1 1.9)  2.282542  7.000000
2  POINT (1 3.1)       7  POINT (1 3.1)  2.002498  5.666667
3    POINT (2 1)       6    POINT (2 1)  2.236068  5.666667
4  POINT (2 2.1)       5  POINT (2 2.1)  1.345362  4.666667
5  POINT (2 2.9)       4  POINT (2 2.9)  1.004988  4.333333
6  POINT (3 0.8)       3  POINT (3 0.8)  2.200000  4.333333
7    POINT (3 2)       2    POINT (3 2)  1.000000  3.000000
8    POINT (3 3)       1    POINT (3 3)       NaN  3.666667

可以看到上次迭代的状态

如何以更具可扩展性的方式执行此操作?

我会为此使用空间索引。您可以使用 libpysal 的功能,它在后台使用 KDTree。对于 2000 个随机点,以下代码运行 3.5 秒,而你的代码运行了很长时间(第一分钟后我就失去了耐心)。将值保存到列表,然后将列表转换到 DF 的列中,也可以节省一些时间。

import pandas as pd
import numpy as np
from shapely.wkt import loads
import geopandas as gp
import libpysal

points=[
    'POINT (1 1.1)', 'POINT (1 1.9)', 'POINT (1 3.1)',
    'POINT (2 1)', 'POINT (2 2.1)', 'POINT (2 2.9)',
    'POINT (3 0.8)', 'POINT (3 2)', 'POINT (3 3)'
]
values=[9,8,7,6,5,4,3,2,1]

df=pd.DataFrame({'points':points,'values':values})
gdf=gp.GeoDataFrame(df,geometry=[loads(x) for x in df.points], crs={'init': 'epsg:' + str(25832)})

knn3 = libpysal.weights.KNN.from_dataframe(gdf, k=3)

means = []
for index, row in gdf.iterrows(): # Looping over all points
    knn_neighbors = knn3.neighbors[index]
    knnsubset = gdf.iloc[knn_neighbors]
    neighbors = []
    for ix, r in knnsubset.iterrows():
        if r.geometry.distance(row.geometry) < 3: # max distance here
            neighbors.append(ix)

    subset = gdf.iloc[list(neighbors)]
    means.append(np.mean(subset['values']))
gdf['mean'] = means

这是结果:

          points  values       geometry      mean
0  POINT (1 1.1)       9  POINT (1 1.1)  6.333333
1  POINT (1 1.9)       8  POINT (1 1.9)  7.000000
2  POINT (1 3.1)       7  POINT (1 3.1)  5.666667
3    POINT (2 1)       6    POINT (2 1)  5.666667
4  POINT (2 2.1)       5  POINT (2 2.1)  4.666667
5  POINT (2 2.9)       4  POINT (2 2.9)  4.333333
6  POINT (3 0.8)       3  POINT (3 0.8)  4.333333
7    POINT (3 2)       2    POINT (3 2)  3.000000
8    POINT (3 3)       1    POINT (3 3)  3.666667