Efficient nearest neighbor matching for two GeoDataFrames
I'm doing some distance processing on two dataframes (100k rows and 1M rows). My processing currently takes 20 days, and I'd like to see whether I can improve my code to speed things up. Following earlier advice here I switched to geopandas, which greatly sped up the sorting step inside my loop, but I'm wondering whether I should code this differently or whether there are best practices I'm missing.
Here are my tables:
dfb
osm_id name type ... surf centro valeurseuil
0 579194418 <NA> yes ... 31.698915 [450755.82645944477, 1751747.6864094366] 9.529465
1 579207356 <NA> yes ... 13.176170 [451636.3912408436, 1749870.2332154014] 6.143854
2 579211060 <NA> ruins ... 38.139254 [452363.0241478424, 1748257.9797045246] 10.452795
3 369649074 <NA> yes ... 48.625159 [451683.6453331233, 1748501.9621812948] 11.802577
4 267028254 REFUGE DU MAUPAS yes ... 56.936793 [453183.133164252, 1747379.3537066604] 12.771527
... ... ... ... ... ... ... ...
935472 252424617 NaN yes ... 20.135543 [562867.4086043949, 1835680.070006147] 7.595004
935473 252424662 NaN yes ... 26.000756 [562843.7024093983, 1835627.7568757713] 8.630567
935474 252424658 NaN yes ... 26.933184 [562823.0171366152, 1835635.2213349422] 8.783956
935475 252424810 NaN yes ... 40.663507 [562827.2984292071, 1835603.3674661077] 10.793163
935476 252424878 NaN yes ... 49.335093 [562822.9908286823, 1835585.4379250652] 11.888424
[935477 rows x 9 columns]
gdfo
ogc_fid id code_cs ... coordpoints surf centro
0 3 OCSGE0000000000000741775 CS2.2.1 ... [[495777.7174760493, 1805519.5430061778], [495... 1174.564617 [495632.1007243945, 1805463.0834386032]
1 16 OCSGE0000000000000947172 CS2.2.1 ... [[544263.7691919031, 1824731.9681054198], [544... 20054.686775 [544368.7095351141, 1824617.8708354477]
2 25 OCSGE0000000000000949293 CS2.2.1 ... [[535161.228444845, 1844915.1013712562], [5351... 295.911740 [535212.0861638768, 1844894.4575003278]
3 29 OCSGE0000000000000947839 CS2.2.1 ... [[533186.6035670156, 1837867.7088815654], [533... 3466.870293 [533173.6347083747, 1837936.4649687177]
4 193406 OCSGE0000000000000739484 CS2.2.1 ... [[458053.7764636817, 1757545.0501438894], [458... 4424.495046 [457942.74975193664, 1757488.5605310446]
... ... ... ... ... ... ... ...
83870 393015 OCSGE0000000000000807891 CS2.2.1 ... [[513245.68605544185, 1819995.2010655974], [51... 3416.327411 [513269.2562117624, 1819960.69636371]
83871 393050 OCSGE0000000000000176585 CS2.2.1 ... [[483728.63284117245, 1781422.0428754487], [48... 0.032123 [483713.9059494421, 1781392.2606697257]
83872 393057 OCSGE0000000000000813649 CS2.2.1 ... [[516662.97000782896, 1860487.2357337545], [51... 722.841230 [516719.98876274703, 1860521.6746725072]
83873 393062 OCSGE0000000000000954112 CS2.2.1 ... [[543018.616240293, 1832845.9711751717], [5430... 481.191268 [543013.2243556273, 1832823.7731046807]
83874 393071 OCSGE0000000000001016440 CS2.2.1 ... [[530307.8027104639, 1843478.1113854842], [530... 88.841634 [530310.0204813549, 1843428.549428356]
[83875 rows x 9 columns]
Here is my code:
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

dfb = pd.read_csv(building, sep='#')
dfo = pd.read_csv(occsol, sep='#')
dfb['geot'] = 'non'
gs = gpd.GeoSeries.from_wkt(dfo['geometry'], crs='EPSG:27572')
gdfo = gpd.GeoDataFrame(dfo, geometry=gs)
dfb['valeurseuil'] = 3 * ((dfb['surf'] / 3.141592653589793) ** (1 / 2))  # this is a threshold
m = 0
fin = len(dfb)
for i in range(len(dfb)):
    gdfo['dist'] = gdfo['geometry'].distance(Point(dfb.iloc[i]['centro'][0], dfb.iloc[i]['centro'][1]))
    gdfo = gdfo.sort_values(by='dist')
    for j in range(2):  # closest polygons, sorted by ascending distance
        XYPtj = gdfo.iloc[j]['coordpoints']
        compteur = 0
        temp = []
        for l in XYPtj:
            dist = self.distancepoint([dfb.iloc[i]['centro'], l])
            temp.append(dist)
        for d in temp:
            if d < dfb.iloc[i]['valeurseuil']:  # threshold
                compteur += 1
        if compteur >= 2 and gdfo.iloc[j]['surf'] >= 1.5 * dfb.iloc[i]['surf']:
            gdfo.iloc[j]['surf'] -= 1.5 * dfb.iloc[i]['surf']  # remove capacity under a 1:1.5 assumption
            dfb.iloc[i]['geot'] = 'oui'
    m += 1
    print('avancement : ' + str(m) + '/ ' + str(fin))
dfb.to_csv('buildingeot', sep='#', index=False)

def distancepoint(self, xy):
    """Euclidean distance; the coordinate system is not specified."""
    if self.valeursabs(xy[0][0] - xy[1][0]) < 100000 and self.valeursabs(
            xy[0][1] - xy[1][1]) < 100000:  # check limit
        d = ((xy[1][1] - xy[0][1]) ** 2 + (xy[1][0] - xy[0][0]) ** 2) ** (1 / 2)
    else:
        d = 666666
    return d
Especially for a problem of this size, it's worth looking for a vectorized algorithm. For many-to-many matching problems like this one, numpy and scipy offer algorithms that outperform pandas groupby or looping approaches by such a wide margin that the extra effort of managing the indices yourself is usually worth the trouble.
There are many ways to tackle this, but if your goal is simply to find the nearest point using a Euclidean approximation, nothing is simpler than scipy.spatial.cKDTree. The following code finds, for each of the 100k points in the first dataset, the positional index of the nearest point in the second dataset (1M rows), and runs in roughly 20 seconds:
In [1]: import geopandas as gpd, numpy as np, scipy.spatial, pandas as pd

In [2]: # set up GeoDataFrame with 100k random points
   ...: gdf1 = gpd.GeoDataFrame(geometry=gpd.points_from_xy(
   ...:     (np.random.random(size=int(1e5)) * 360 - 180),
   ...:     (np.random.random(size=int(1e5)) * 180 - 90),
   ...: ))

In [3]: # set up GeoDataFrame with 1M random points
   ...: gdf2 = gpd.GeoDataFrame(geometry=gpd.points_from_xy(
   ...:     (np.random.random(size=int(1e6)) * 360 - 180),
   ...:     (np.random.random(size=int(1e6)) * 180 - 90),
   ...: ))

In [4]: %%time
   ...: known_xy = np.stack([gdf2.geometry.x, gdf2.geometry.y], -1)
   ...: tree = scipy.spatial.cKDTree(known_xy)
   ...:
   ...: query_xy = np.stack([gdf1.geometry.x, gdf1.geometry.y], -1)
   ...: distances, indices = tree.query(query_xy)
   ...:
CPU times: user 21 s, sys: 240 ms, total: 21.2 s
Wall time: 22.1 s
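The indices array returned by tree.query holds positional row numbers into known_xy (and therefore into gdf2), and distances is in the same units as the input coordinates. As a small illustrative follow-up to the session above, the matches can be attached back to gdf1 with a plain iloc lookup:

nearest = gdf2.iloc[indices].reset_index(drop=True)   # nearest gdf2 row for each gdf1 row
gdf1['nearest_geometry'] = nearest.geometry.values    # aligned by position, not by index label
gdf1['nearest_distance'] = distances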
Note that this approach does not work well near the poles or near the international dateline. If you can tolerate some error, your points are reasonably dense (so the distances involved are small), and the points fall in the mid-latitudes away from the dateline (assuming your data is in -180/180 longitude), then this approach is reasonable.
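In your case both tables are already in a projected CRS (EPSG:27572), so plain Euclidean distances are in metres and the pole/dateline caveat above does not apply. Below is a rough sketch of how the same idea could replace the per-building distance() + sort_values pass over all of gdfo; it assumes the centro columns have already been parsed into numeric [x, y] pairs (after read_csv they may still be strings), and the vertex-counting and surf bookkeeping would still run afterwards, but only against a few candidates per building:

import numpy as np
import scipy.spatial

# Assumption: dfb['centro'] and gdfo['centro'] contain [x, y] float pairs.
building_xy = np.array(dfb['centro'].tolist())   # shape (n_buildings, 2)
polygon_xy = np.array(gdfo['centro'].tolist())   # shape (n_polygons, 2)

tree = scipy.spatial.cKDTree(polygon_xy)

# k=3 returns, for every building at once, the 3 nearest polygon centroids,
# so no per-row sorting of the whole gdfo frame is needed.
distances, indices = tree.query(building_xy, k=3)
# distances and indices both have shape (n_buildings, 3)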