SciKits BallTree method gives me incorrect "nearest neighbor"

Question

I am using the code from the source given below to find the nearest "site".

Source: https://automating-gis-processes.github.io/site/notebooks/L3/nearest-neighbor-faster.html

My code:

import numpy as np
import pandas as pd
import geopandas
from sklearn.neighbors import BallTree

# Read data from a DB
test_df = pd.read_sql_query(sql, conn)

# Calculates distance between 2 points on a map using lat and long 
# (Source: https://towardsdatascience.com/heres-how-to-calculate-distance-between-2-geolocations-in-python-93ecab5bbba4)
def haversine_distance(lat1, lon1, lat2, lon2):
   r = 6371
   phi1 = np.radians(float(lat1))
   phi2 = np.radians(float(lat2))
   delta_phi = np.radians(lat2 - lat1)
   delta_lambda = np.radians(lon2- lon1)
   a = np.sin(delta_phi / 2)**2 + np.cos(phi1) * np.cos(phi2) *   np.sin(delta_lambda / 2)**2
   res = r * (2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a)))
   return np.round(res, 2)

test_df["actualDistance (km)"] = test_df.apply(lambda row: haversine_distance(row['ClientLat'],row['ClientLong'],row['actual_SLa'],row['actual_SLo']), axis=1)

test_gdf = geopandas.GeoDataFrame(test_df, geometry=geopandas.points_from_xy(test_df.ClientLong, test_df.ClientLat))
site_gdf = geopandas.GeoDataFrame(site_df, geometry=geopandas.points_from_xy(site_df.SiteLong, site_df.SiteLat))

#-------Set up the functions as shown in the tutorial-------

def get_nearest(src_points, candidates, k_neighbors=1):
    """Find nearest neighbors for all source points from a set of candidate points"""

    # Create tree from the candidate points
    tree = BallTree(candidates, leaf_size=15, metric='haversine')

    # Find closest points and distances
    distances, indices = tree.query(src_points, k=k_neighbors)

    # Transpose to get distances and indices into arrays
    distances = distances.transpose()
    indices = indices.transpose()

    # Get closest indices and distances (i.e. array at index 0)
    # note: for the second closest points, you would take index 1, etc.
    closest = indices[0]
    closest_dist = distances[0]

    # Return indices and distances
    return (closest, closest_dist)


def nearest_neighbor(left_gdf, right_gdf, return_dist=False):
    """
    For each point in left_gdf, find closest point in right GeoDataFrame and return them.

    NOTICE: Assumes that the input Points are in WGS84 projection (lat/lon).
    """

    left_geom_col = left_gdf.geometry.name
    right_geom_col = right_gdf.geometry.name

    # Ensure that index in right gdf is formed of sequential numbers
    right = right_gdf.copy().reset_index(drop=True)

    # Parse coordinates from points and insert them into a numpy array as RADIANS
    left_radians = np.array(left_gdf[left_geom_col].apply(lambda geom: (geom.x * np.pi / 180, geom.y * np.pi / 180)).to_list())
    right_radians = np.array(right[right_geom_col].apply(lambda geom: (geom.x * np.pi / 180, geom.y * np.pi / 180)).to_list())

    # Find the nearest points
    # -----------------------
    # closest ==> index in right_gdf that corresponds to the closest point
    # dist ==> distance between the nearest neighbors (in meters)

    closest, dist = get_nearest(src_points=left_radians, candidates=right_radians)

    # Return points from right GeoDataFrame that are closest to points in left GeoDataFrame
    closest_points = right.loc[closest]

    # Ensure that the index corresponds the one in left_gdf
    closest_points = closest_points.reset_index(drop=True)

    # Add distance if requested
    if return_dist:
        # Convert to meters from radians
        earth_radius = 6371000  # meters
        closest_points['distance'] = dist * earth_radius

    return closest_points

closest_sites = nearest_neighbor(test_gdf, site_gdf, return_dist=True)

# Rename the geometry of closest sites gdf so that we can easily identify it
closest_sites = closest_sites.rename(columns={'geometry': 'closest_site_geom'})

# Merge the datasets by index (for this, it is good to use '.join()' -function)
test_gdf = test_gdf.join(closest_sites)

#Extracted closest site latitude and longitude for data analysis
test_gdf['CS_lo'] = test_gdf.closest_site_geom.apply(lambda p: p.x)
test_gdf['CS_la'] = test_gdf.closest_site_geom.apply(lambda p: p.y)

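This code is a copy of the tutorial I linked. According to their explanation, it should work.

To validate the data, I got some statistics using .describe(). They tell me that the tutorial method does give me a mean distance that is closer than the distance in the actual data (792 m versus an actual distance of 1.80 km).

[Image: Closest distance generated using the BallTree method]
[Image: Actual distance in the data]

However, when I plotted them on a map using plotly, I noticed that the output of the BallTree method is not closer than the "actual" distance.

[Image: This is generally what the plotted data looks like (blue: predetermined site, red: site predicted using the BallTree method)]

Can anyone help me figure out the discrepancy?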

Answer 1

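I'm not sure why this happens, but it does. I decided to write the code directly from the documentation instead of following the tutorial, and this works: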

# Build BallTree with haversine distance metric, which expects (lat, lon) in radians and returns distances in radians
dist = DistanceMetric.get_metric('haversine')
tree = BallTree(np.radians(site_df[['SiteLat', 'SiteLong']]), metric=dist)

test_coords = np.radians(test_df[['ClientLat', 'ClientLong']])
dists, ilocs = tree.query(test_coords)
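
The distances returned by the query are in radians, so they still need to be multiplied by the Earth's radius to get kilometers. Here is a minimal, self-contained sketch of that step. The coordinates are made up for illustration (they are not the asker's data), the column names follow the question, and the metric is passed as the string 'haversine' rather than as a DistanceMetric object:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

# Hypothetical sample data; only the column names come from the question.
site_df = pd.DataFrame({
    "SiteLat": [60.1699, 61.4978, 60.4518],
    "SiteLong": [24.9384, 23.7610, 22.2666],
})
test_df = pd.DataFrame({
    "ClientLat": [60.2055, 61.0],
    "ClientLong": [24.6559, 23.5],
})

EARTH_RADIUS_KM = 6371.0

# The haversine metric expects (lat, lon) pairs in radians.
site_coords = np.radians(site_df[["SiteLat", "SiteLong"]].to_numpy())
test_coords = np.radians(test_df[["ClientLat", "ClientLong"]].to_numpy())

tree = BallTree(site_coords, metric="haversine")
dists, ilocs = tree.query(test_coords, k=1)

# Both results have shape (n_clients, 1); take the single neighbor column.
test_df["nearest_site_iloc"] = ilocs[:, 0]
test_df["nearest_site_km"] = dists[:, 0] * EARTH_RADIUS_KM
print(test_df)
```

The nearest_site_iloc values are positional, so site_df.iloc[...] (not .loc) is the way to look the matching sites up if site_df does not have a default index.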

Answer 2

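The problem is that the tutorial code supplies the coordinates in (longitude, latitude) order rather than the (latitude, longitude) order that BallTree's haversine metric expects. As a result, you are measuring distances between points whose coordinates have been swapped.

If you switch the order of geom.x and geom.y in the coordinate-parsing code, you will get the correct measurements.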

    # Parse coordinates from points and insert them into a numpy array as RADIANS
    left_radians = np.array(left_gdf[left_geom_col].apply(lambda geom: (geom.y * np.pi / 180, geom.x * np.pi / 180)).to_list())
    right_radians = np.array(right[right_geom_col].apply(lambda geom: (geom.y * np.pi / 180, geom.x * np.pi / 180)).to_list())
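
To see the effect of the ordering in isolation, here is a small sketch with made-up points (the coordinates and the nearest helper are hypothetical, not from the question). It runs the same query twice, once with (lat, lon) columns and once with the columns swapped the way the tutorial code does:

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0

# Hypothetical points as (lat, lon) in degrees.
client = np.array([[60.0, 24.0]])
sites = np.array([
    [60.0, 24.3],  # site 0: 0.3 degrees east of the client
    [60.2, 24.0],  # site 1: 0.2 degrees north of the client
])


def nearest(src_deg, cand_deg):
    """Return (index, distance in km) of the nearest candidate for each source point."""
    tree = BallTree(np.radians(cand_deg), metric="haversine")
    dist, idx = tree.query(np.radians(src_deg), k=1)
    return idx[:, 0], dist[:, 0] * EARTH_RADIUS_KM


# Correct order: (lat, lon)
idx_ok, km_ok = nearest(client, sites)

# Swapped order: (lon, lat), as produced by (geom.x, geom.y)
idx_bad, km_bad = nearest(client[:, [1, 0]], sites[:, [1, 0]])

print("lat/lon order:", idx_ok, km_ok.round(2))
print("lon/lat order:", idx_bad, km_bad.round(2))
```

With these points the correct order picks site 0, while the swapped order picks site 1 and reports a distance that matches neither real separation. At 60 degrees north a degree of longitude covers about half the ground distance of a degree of latitude, and swapping the columns applies that scaling to the wrong axis.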

python

data-analysis

nearest-neighbor

scikit-learn