Nearest member in 2 similarly gridded dataframes with sklearn

Question

I have 2 dataframes:

df1:

                    x             y        c0
2       468958.147443  4.633810e+06  1.253041
43      475516.484948  4.634928e+06  1.423767
72      475802.708042  4.635308e+06  1.294299
106     476658.696529  4.635686e+06  1.338760
133     472671.587615  4.636082e+06  1.325560
              ...           ...       ...
707923  394329.199687  5.006761e+06  1.155477
707980  409697.377813  5.006524e+06  1.223895
708570  411859.618686  5.006875e+06  1.093296
708576  413477.224756  5.006853e+06  1.161713
708695  445559.757010  5.006496e+06  1.149282

[12880 rows x 3 columns]

df2:

         kat    z0     kr             xx            yy
0        1.0  0.01  0.169  468526.696610  4.633654e+06
1        3.0  0.30  0.214  468757.270633  4.633653e+06
2        1.0  0.01  0.169  468066.930344  4.633965e+06
3        1.0  0.01  0.169  468297.494406  4.633964e+06
4        1.0  0.01  0.169  468528.058460  4.633963e+06
     ...   ...    ...            ...           ...
1287962  3.0  0.30  0.214  399566.653186  5.115395e+06
1287963  3.0  0.30  0.214  399781.023856  5.115391e+06
1287964  1.0  0.01  0.169  396570.675453  5.115753e+06
1287965  1.0  0.01  0.169  396785.035186  5.115750e+06
1287966  1.0  0.01  0.169  399571.712593  5.115703e+06

[1287967 rows x 5 columns]

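For each point in df2, I want to find the nearest df1 member within a certain radius, say radius=500, and put that nearest point's c0 value into df2. If there is no df1 point within radius=500, I want to set c0 to 1.0 in df2. (x, y) and (xx, yy) are the planar coordinates of df1 and df2 respectively.

Desired output (only the first 5 rows are filled in as an example):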

         kat    z0     kr             xx            yy  c0
0        1.0  0.01  0.169  468526.696610  4.633654e+06  1.253041
1        3.0  0.30  0.214  468757.270633  4.633653e+06  1.253041
2        1.0  0.01  0.169  468066.930344  4.633965e+06  1.0
3        1.0  0.01  0.169  468297.494406  4.633964e+06  1.0
4        1.0  0.01  0.169  468528.058460  4.633963e+06  1.0
     ...   ...    ...            ...           ...
1287962  3.0  0.30  0.214  399566.653186  5.115395e+06  ...
1287963  3.0  0.30  0.214  399781.023856  5.115391e+06  ...
1287964  1.0  0.01  0.169  396570.675453  5.115753e+06  ...
1287965  1.0  0.01  0.169  396785.035186  5.115750e+06  ...
1287966  1.0  0.01  0.169  399571.712593  5.115703e+06  ...

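I was thinking of converting these to shapefiles and working in some spatial query software. But I believe an efficient solution can be found here with sklearn. Thanks in advance!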

Answer 1

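If I read your requirements correctly, you can use scipy's cKDTree. Thanks to its C/Cython implementation, it has a reputation for being quite fast. Try it and see whether it helps.

I use only the first 5 rows of your df2 as my df2. My df1 is the same as your sample df1. I also assume that column c0 is the last column in df1 and that the distance is Euclidean.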

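from scipy.spatial import cKDTree

# Build a KD-tree on the df1 coordinates
df1_cTree = cKDTree(df1[['x','y']])
# Index of the nearest df1 point within 500; len(df1) is returned when none is in range
ix_arr = df1_cTree.query(df2[['xx','yy']], k=1, distance_upper_bound=500)[1]

# Take c0 (last column of df1) for matched rows, otherwise fall back to 1
df2['c0'] = [df1.iloc[x, -1] if x < len(df1) else 1 for x in ix_arr]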

Out[438]:
   kat    z0     kr             xx         yy        c0
0  1.0  0.01  0.169  468526.696610  4633654.0  1.253041
1  3.0  0.30  0.214  468757.270633  4633653.0  1.253041
2  1.0  0.01  0.169  468066.930344  4633965.0  1.000000
3  1.0  0.01  0.169  468297.494406  4633964.0  1.000000
4  1.0  0.01  0.169  468528.058460  4633963.0  1.253041
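
With about 1.3 million rows in the real df2, the list comprehension may become the slow part. Below is a minimal vectorized sketch of the same idea. The small df1 and df2 frames are stand-ins rebuilt from the sample rows in the question, and the names `tree`, `found`, and `c0` are only illustrative.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

# Stand-in frames built from the sample rows shown in the question.
df1 = pd.DataFrame({
    "x": [468958.147443, 475516.484948, 475802.708042],
    "y": [4633810.0, 4634928.0, 4635308.0],
    "c0": [1.253041, 1.423767, 1.294299],
})
df2 = pd.DataFrame({
    "xx": [468526.696610, 468757.270633, 468066.930344,
           468297.494406, 468528.058460],
    "yy": [4633654.0, 4633653.0, 4633965.0, 4633964.0, 4633963.0],
})

tree = cKDTree(df1[["x", "y"]].to_numpy())

# With distance_upper_bound set, query points that have no neighbour in
# range get an infinite distance and the index tree.n (one past the end).
dist, idx = tree.query(df2[["xx", "yy"]].to_numpy(), k=1,
                       distance_upper_bound=500)

found = idx < tree.n                 # True where a df1 point is in range
c0 = np.full(len(df2), 1.0)          # default value when nothing is in range
c0[found] = df1["c0"].to_numpy()[idx[found]]
df2["c0"] = c0
```

Looking up `c0` by column name also removes the assumption that it is the last column of df1.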

Note: the distance from row index 4 of df2, [468528.058460, 4633963.0], to row 0 of df1, [468958.147443, 4633810], is 456.4926432, so it satisfies the within-500 condition. Therefore its c0 should not be 1 as in your desired output.
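
Since the question asks about sklearn specifically, here is a rough sketch of the same lookup with `sklearn.neighbors.NearestNeighbors`. It is an alternative to the cKDTree code, not part of the original answer. The frames are stand-ins rebuilt from the sample rows, and the extra `dist` column is added only so the distance quoted in the note can be checked.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Stand-in frames built from the sample rows shown in the question.
df1 = pd.DataFrame({
    "x": [468958.147443, 475516.484948, 475802.708042],
    "y": [4633810.0, 4634928.0, 4635308.0],
    "c0": [1.253041, 1.423767, 1.294299],
})
df2 = pd.DataFrame({
    "xx": [468526.696610, 468757.270633, 468066.930344,
           468297.494406, 468528.058460],
    "yy": [4633654.0, 4633653.0, 4633965.0, 4633964.0, 4633963.0],
})

# Fit on the df1 coordinates; the default metric is Euclidean.
nn = NearestNeighbors(n_neighbors=1, algorithm="kd_tree")
nn.fit(df1[["x", "y"]].to_numpy())

# kneighbors returns arrays of shape (n_queries, n_neighbors).
dist, idx = nn.kneighbors(df2[["xx", "yy"]].to_numpy())
dist, idx = dist.ravel(), idx.ravel()

# Keep the neighbour's c0 only when it lies inside the radius of 500.
within = dist <= 500
df2["c0"] = np.where(within, df1["c0"].to_numpy()[idx], 1.0)
df2["dist"] = dist
```

Unlike the cKDTree query with `distance_upper_bound`, this always finds the nearest neighbour first and applies the radius afterwards, so it may do more work on queries that are far from every df1 point.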

python

dataframe

pandas

sklearn-pandas