Efficient nearest neighbor matching for two GeoDataFrames
I'm doing some distance processing on two dataframes (100k rows and 1M rows). My processing currently takes 20 days, and I'd like to see whether I can improve my code to speed things up. Following earlier advice here I switched to geopandas, which greatly sped up the sorting step inside my loop, but I'm wondering whether I should code this differently or whether there are best practices I'm missing.
Here are my tables:
dfb
osm_id name type ... surf centro valeurseuil
0 579194418 <NA> yes ... 31.698915 [450755.82645944477, 1751747.6864094366] 9.529465
1 579207356 <NA> yes ... 13.176170 [451636.3912408436, 1749870.2332154014] 6.143854
2 579211060 <NA> ruins ... 38.139254 [452363.0241478424, 1748257.9797045246] 10.452795
3 369649074 <NA> yes ... 48.625159 [451683.6453331233, 1748501.9621812948] 11.802577
4 267028254 REFUGE DU MAUPAS yes ... 56.936793 [453183.133164252, 1747379.3537066604] 12.771527
... ... ... ... ... ... ... ...
935472 252424617 NaN yes ... 20.135543 [562867.4086043949, 1835680.070006147] 7.595004
935473 252424662 NaN yes ... 26.000756 [562843.7024093983, 1835627.7568757713] 8.630567
935474 252424658 NaN yes ... 26.933184 [562823.0171366152, 1835635.2213349422] 8.783956
935475 252424810 NaN yes ... 40.663507 [562827.2984292071, 1835603.3674661077] 10.793163
935476 252424878 NaN yes ... 49.335093 [562822.9908286823, 1835585.4379250652] 11.888424
[935477 rows x 9 columns]
gdfo
ogc_fid id code_cs ... coordpoints surf centro
0 3 OCSGE0000000000000741775 CS2.2.1 ... [[495777.7174760493, 1805519.5430061778], [495... 1174.564617 [495632.1007243945, 1805463.0834386032]
1 16 OCSGE0000000000000947172 CS2.2.1 ... [[544263.7691919031, 1824731.9681054198], [544... 20054.686775 [544368.7095351141, 1824617.8708354477]
2 25 OCSGE0000000000000949293 CS2.2.1 ... [[535161.228444845, 1844915.1013712562], [5351... 295.911740 [535212.0861638768, 1844894.4575003278]
3 29 OCSGE0000000000000947839 CS2.2.1 ... [[533186.6035670156, 1837867.7088815654], [533... 3466.870293 [533173.6347083747, 1837936.4649687177]
4 193406 OCSGE0000000000000739484 CS2.2.1 ... [[458053.7764636817, 1757545.0501438894], [458... 4424.495046 [457942.74975193664, 1757488.5605310446]
... ... ... ... ... ... ... ...
83870 393015 OCSGE0000000000000807891 CS2.2.1 ... [[513245.68605544185, 1819995.2010655974], [51... 3416.327411 [513269.2562117624, 1819960.69636371]
83871 393050 OCSGE0000000000000176585 CS2.2.1 ... [[483728.63284117245, 1781422.0428754487], [48... 0.032123 [483713.9059494421, 1781392.2606697257]
83872 393057 OCSGE0000000000000813649 CS2.2.1 ... [[516662.97000782896, 1860487.2357337545], [51... 722.841230 [516719.98876274703, 1860521.6746725072]
83873 393062 OCSGE0000000000000954112 CS2.2.1 ... [[543018.616240293, 1832845.9711751717], [5430... 481.191268 [543013.2243556273, 1832823.7731046807]
83874 393071 OCSGE0000000000001016440 CS2.2.1 ... [[530307.8027104639, 1843478.1113854842], [530... 88.841634 [530310.0204813549, 1843428.549428356]
[83875 rows x 9 columns]
Here is my code:
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

dfb = pd.read_csv(building, sep='#')
dfo = pd.read_csv(occsol, sep='#')
dfb['geot'] = 'non'
gs = gpd.GeoSeries.from_wkt(dfo['geometry'], crs='EPSG:27572')
gdfo = gpd.GeoDataFrame(dfo, geometry=gs)
dfb['valeurseuil'] = 3 * ((dfb['surf'] / 3.141592653589793) ** (1 / 2))  # this is a threshold
m = 0
fin = len(dfb)
for i in range(len(dfb)):
    gdfo['dist'] = gdfo['geometry'].distance(Point(dfb.iloc[i]['centro'][0], dfb.iloc[i]['centro'][1]))
    gdfo = gdfo.sort_values(by='dist')
    for j in range(2):  # closest polygons, sorted by ascending distance
        XYPtj = gdfo.iloc[j]['coordpoints']
        compteur = 0
        temp = []
        for l in XYPtj:
            dist = self.distancepoint([dfb.iloc[i]['centro'], l])
            temp.append(dist)
        for d in temp:
            if d < dfb.iloc[i]['valeurseuil']:  # threshold
                compteur += 1
        if compteur >= 2 and gdfo.iloc[j]['surf'] >= 1.5 * dfb.iloc[i]['surf']:
            gdfo.iloc[j]['surf'] -= 1.5 * dfb.iloc[i]['surf']  # remove capacity under a 1:1.5 assumption
            dfb.iloc[i]['geot'] = 'oui'
    m += 1
    print('avancement : ' + str(m) + '/ ' + str(fin))
dfb.to_csv('buildingeot', sep='#', index=False)

def distancepoint(self, xy):
    """Euclidean distance; the coordinate system is not specified."""
    if self.valeursabs(xy[0][0] - xy[1][0]) < 100000 and self.valeursabs(
            xy[0][1] - xy[1][1]) < 100000:  # check limit
        d = ((xy[1][1] - xy[0][1]) ** 2 + (xy[1][0] - xy[0][0]) ** 2) ** (1 / 2)
    else:
        d = 666666
    return d
Especially for a problem of this size, it's worth looking for a vectorized algorithm. For many-to-many matching problems like this one, numpy and scipy offer algorithms that outperform pandas groupby or looping approaches by such a wide margin that the extra effort of managing the indices yourself is usually worth the trouble.
There are many ways to tackle this, but if your goal is simply to find the nearest point using a Euclidean approximation, nothing is simpler than scipy.spatial.cKDTree. The following code finds, for each of the 100k points in the first dataset, the positional index of the nearest point in the second dataset (1M rows), and runs in roughly 20 seconds:
In [1]: import geopandas as gpd, numpy as np, scipy.spatial, pandas as pd

In [2]: # set up GeoDataFrame with 100k random points
   ...: gdf1 = gpd.GeoDataFrame(geometry=gpd.points_from_xy(
   ...:     (np.random.random(size=int(1e5)) * 360 - 180),
   ...:     (np.random.random(size=int(1e5)) * 180 - 90),
   ...: ))

In [3]: # set up GeoDataFrame with 1M random points
   ...: gdf2 = gpd.GeoDataFrame(geometry=gpd.points_from_xy(
   ...:     (np.random.random(size=int(1e6)) * 360 - 180),
   ...:     (np.random.random(size=int(1e6)) * 180 - 90),
   ...: ))

In [4]: %%time
   ...: known_xy = np.stack([gdf2.geometry.x, gdf2.geometry.y], -1)
   ...: tree = scipy.spatial.cKDTree(known_xy)
   ...:
   ...: query_xy = np.stack([gdf1.geometry.x, gdf1.geometry.y], -1)
   ...: distances, indices = tree.query(query_xy)
   ...:
CPU times: user 21 s, sys: 240 ms, total: 21.2 s
Wall time: 22.1 s
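The indices array returned by tree.query holds positional row numbers into known_xy (and therefore into gdf2), and distances is in the same units as the input coordinates. As a small illustrative follow-up to the session above, the matches can be attached back to gdf1 with a plain iloc lookup:

nearest = gdf2.iloc[indices].reset_index(drop=True)   # nearest gdf2 row for each gdf1 row
gdf1['nearest_geometry'] = nearest.geometry.values    # aligned by position, not by index label
gdf1['nearest_distance'] = distances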
Note that this approach does not work well near the poles or near the international dateline. If you can tolerate some error, your points are reasonably dense (so the distances involved are small), and the points fall in the mid-latitudes away from the dateline (assuming your data is in -180/180 longitude), then this approach is reasonable.
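In your case both tables are already in a projected CRS (EPSG:27572), so plain Euclidean distances are in metres and the pole/dateline caveat above does not apply. Below is a rough sketch of how the same idea could replace the per-building distance() + sort_values pass over all of gdfo; it assumes the centro columns have already been parsed into numeric [x, y] pairs (after read_csv they may still be strings), and the vertex-counting and surf bookkeeping would still run afterwards, but only against a few candidates per building:

import numpy as np
import scipy.spatial

# Assumption: dfb['centro'] and gdfo['centro'] contain [x, y] float pairs.
building_xy = np.array(dfb['centro'].tolist())   # shape (n_buildings, 2)
polygon_xy = np.array(gdfo['centro'].tolist())   # shape (n_polygons, 2)

tree = scipy.spatial.cKDTree(polygon_xy)

# k=3 returns, for every building at once, the 3 nearest polygon centroids,
# so no per-row sorting of the whole gdfo frame is needed.
distances, indices = tree.query(building_xy, k=3)
# distances and indices both have shape (n_buildings, 3)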