Add rows from one dataframe to another based on distances between their geographic coordinates

Question

I have a problem similar to the one here, with two dataframes:

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3],
                    'lat': [-23.48, -22.94, -23.22],
                    'long': [-46.36, -45.40, -45.80]})

df2 = pd.DataFrame({'id': [100, 200, 300],
                    'lat': [-28.48, -22.94, -23.22],
                    'long': [-46.36, -46.40, -45.80]})

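My question is: using the solution suggested there by Ben.T, how can I add rows from df2 to df1 when a point in df2 is not near any point in df1? I think it would be based on the distance matrix: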

import numpy as np
from sklearn.metrics.pairwise import haversine_distances

# distance threshold; change as needed
threshold = 100  # meters

# mean Earth radius, used to convert radians to meters
earth_radius = 6371000  # meters

distance_matrix = (
    # get the distance between all points of each DF
    haversine_distances(
        # note that you need to convert degrees to radians with *np.pi/180
        X=df1[['lat','long']].to_numpy()*np.pi/180,
        Y=df2[['lat','long']].to_numpy()*np.pi/180)
    # get the distance in meters
    *earth_radius
    # compare to your threshold
    < threshold
    # **here I want to add rows from df2 to df1 if point from df2 is NOT near df1**
    )

For example, the output would look like this:

   id     lat    long
    1  -23.48  -46.36
    2  -22.94  -45.40
    3  -23.22  -45.80
    4  -28.48  -46.36
    5  -22.94  -46.40

Answer 1

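The distance matrix gives you a boolean array of shape (len(df1), len(df2)), where True means the two points are "close". You can find out whether any point in df1 is close enough to each element of df2 by reducing the matrix with any across axis 0: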

In [33]: df2_has_close_point_in_df1 = distance_matrix.any(axis=0)

In [34]: df2_has_close_point_in_df1
Out[34]: array([False, False,  True])
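
To make the choice of axis concrete, here is a minimal sketch on a made-up 2 x 3 boolean matrix (the values are hypothetical, not computed from df1 and df2). Rows stand for df1 points and columns for df2 points, so reducing over axis 0 leaves one flag per df2 point:

```python
import numpy as np

# Hypothetical "is close" matrix: 2 rows (df1 points) x 3 columns (df2 points)
is_close = np.array([
    [True,  False, False],
    [False, False, True],
])

# axis=0 collapses the rows: one flag per column, i.e. per df2 point
per_df2_point = is_close.any(axis=0)

# axis=1 collapses the columns: one flag per row, i.e. per df1 point
per_df1_point = is_close.any(axis=1)

print(per_df2_point.tolist())  # [True, False, True]
print(per_df1_point.tolist())  # [True, True]
```

Since the mask has to line up with the rows of df2, its length must be len(df2), which is what axis=0 gives.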

You can then use this as a mask to filter df2. Invert the True/False values with the bitwise negation operator ~ to get only the rows of df2 that are not close:

In [35]: df2.iloc[~df2_has_close_point_in_df1]
Out[35]:
    id    lat   long
0  100 -28.48 -46.36
1  200 -22.94 -46.40

This can now be concatenated with df1 to get the combined dataset:

In [36]: combined = pd.concat([df1, df2.iloc[~df2_has_close_point_in_df1]], axis=0)

In [37]: combined
Out[37]:
    id    lat   long
0    1 -23.48 -46.36
1    2 -22.94 -45.40
2    3 -23.22 -45.80
0  100 -28.48 -46.36
1  200 -22.94 -46.40
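
The combined frame above keeps the original index (0, 1, 2, 0, 1) and the ids from df2 (100, 200), whereas the expected output in the question shows ids 4 and 5. If that renumbering is what you want, the steps could be wrapped up roughly as follows. This is only a sketch under some assumptions: the helper names haversine_matrix_m and append_far_rows are made up for illustration, a plain NumPy haversine is swapped in for sklearn's haversine_distances so the snippet has no sklearn dependency, and the new ids are assumed to continue from the largest id in df1.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters, same value as in the question


def haversine_matrix_m(a_deg, b_deg):
    """Pairwise great-circle distances in meters between two arrays of [lat, long] in degrees."""
    a = np.radians(a_deg)[:, None, :]  # shape (len(a), 1, 2)
    b = np.radians(b_deg)[None, :, :]  # shape (1, len(b), 2)
    dlat = b[..., 0] - a[..., 0]
    dlon = b[..., 1] - a[..., 1]
    h = np.sin(dlat / 2) ** 2 + np.cos(a[..., 0]) * np.cos(b[..., 0]) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(h))


def append_far_rows(df1, df2, threshold_m=100):
    """Return df1 followed by the rows of df2 that have no df1 point within threshold_m."""
    dist = haversine_matrix_m(df1[['lat', 'long']].to_numpy(),
                              df2[['lat', 'long']].to_numpy())
    has_close_point = (dist < threshold_m).any(axis=0)  # one flag per df2 row
    far = df2.loc[~has_close_point].copy()
    # Assumption: appended rows get ids continuing after the largest id in df1
    far['id'] = np.arange(len(far)) + df1['id'].max() + 1
    # ignore_index=True gives the result a fresh 0..n-1 index
    return pd.concat([df1, far], ignore_index=True)


df1 = pd.DataFrame({'id': [1, 2, 3],
                    'lat': [-23.48, -22.94, -23.22],
                    'long': [-46.36, -45.40, -45.80]})
df2 = pd.DataFrame({'id': [100, 200, 300],
                    'lat': [-28.48, -22.94, -23.22],
                    'long': [-46.36, -46.40, -45.80]})

combined = append_far_rows(df1, df2)
print(combined)
```

With the sample data this should give five rows with ids 1 to 5. Note that the full pairwise matrix needs len(df1) * len(df2) entries, so for large frames a spatial index may be a better fit than this approach.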
