Match two datasets by minimum geospatial distance (R)

I have the following two datasets:

library(data.table)

houses <- data.table(house_number = c(1:3),
                     lat_decimal = seq(1.1, 1.3, by = 0.1),
                     lon_decimal = seq(1.4, 1.6, by = 0.1))
stations <- data.table(station_numer = c(1:11),
                       lat_decimal = seq(1, 2, by = 0.1),
                       lon_decimal = seq(2, 3, by = 0.1))

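I want to merge houses and stations so that the resulting station_number is the station closest to the corresponding house_number.

There is a similar question, but I am not sure whether it works with latitude and longitude. I also do not know how to calculate distances from longitude and latitude, which is why I would prefer to simply use distm from the geosphere package.

I have never used the outer function. If the answer to the question above works, how can I adapt it to use the distm function instead of the sqrt function?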

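Your problem is a bit more complicated than a simple merge, and outer is not well suited to it. To be as thorough as possible, we will compute the distance between every combination of house and station, then keep only the nearest station for each house. We need two packages: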

library(tidyverse)
library(geosphere)

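First, some preparation. distm expects coordinates ordered with longitude first and latitude second (you have them the other way round), so let's fix that, give the columns better names, and correct the typo while we are at it: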

houses <- data.frame(house_number = c(1:3),
                     lon_house = seq(1.4, 1.6, by = 0.1),
                     lat_house = seq(1.1, 1.3, by = 0.1)
                     )
stations <- data.frame(station_number = c(1:11),
                       lon_station = seq(2, 3, by = 0.1),
                       lat_station = seq(1, 2, by = 0.1)
                       )

We will create "nested" data frames to make it easier to keep the coordinates together:

house_nest <- nest(houses, -house_number, .key = 'house_coords')
station_nest <- nest(stations, -station_number, .key = 'station_coords')

  house_number house_coords        
         <int> <list>              
1            1 <data.frame [1 × 2]>
2            2 <data.frame [1 × 2]>
3            3 <data.frame [1 × 2]>

   station_number station_coords      
            <int> <list>              
 1              1 <data.frame [1 × 2]>
 2              2 <data.frame [1 × 2]>
 3              3 <data.frame [1 × 2]>
 4              4 <data.frame [1 × 2]>
 5              5 <data.frame [1 × 2]>
 6              6 <data.frame [1 × 2]>
 7              7 <data.frame [1 × 2]>
 8              8 <data.frame [1 × 2]>
 9              9 <data.frame [1 × 2]>
10             10 <data.frame [1 × 2]>
11             11 <data.frame [1 × 2]>

Use tidyr::crossing to combine every row of the two data frames:

data.master <- crossing(house_nest, station_nest)

   house_number house_coords         station_number station_coords      
          <int> <list>                        <int> <list>              
 1            1 <data.frame [1 × 2]>              1 <data.frame [1 × 2]>
 2            1 <data.frame [1 × 2]>              2 <data.frame [1 × 2]>
 3            1 <data.frame [1 × 2]>              3 <data.frame [1 × 2]>
 4            1 <data.frame [1 × 2]>              4 <data.frame [1 × 2]>
 5            1 <data.frame [1 × 2]>              5 <data.frame [1 × 2]>
 6            1 <data.frame [1 × 2]>              6 <data.frame [1 × 2]>
 7            1 <data.frame [1 × 2]>              7 <data.frame [1 × 2]>
 8            1 <data.frame [1 × 2]>              8 <data.frame [1 × 2]>
 9            1 <data.frame [1 × 2]>              9 <data.frame [1 × 2]>
10            1 <data.frame [1 × 2]>             10 <data.frame [1 × 2]>
# ... with 23 more rows

With all this in place, we can call distm on each row to compute the distance, and keep the shortest distance for each house:

data.dist <- data.master %>% 
  mutate(dist = map2_dbl(house_coords, station_coords, distm)) %>% 
  group_by(house_number) %>% 
  filter(dist == min(dist))

  house_number house_coords         station_number station_coords         dist
         <int> <list>                        <int> <list>                <dbl>
1            1 <data.frame [1 × 2]>              1 <data.frame [1 × 2]> 67690.
2            2 <data.frame [1 × 2]>              1 <data.frame [1 × 2]> 59883.
3            3 <data.frame [1 × 2]>              1 <data.frame [1 × 2]> 55519.
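For comparison, the cross-and-filter steps above can also be sketched with a single distm call on whole coordinate matrices, which is closer to what the question asked for. This is a minimal sketch rather than a replacement for the pipeline: it assumes the renamed columns from the preparation step, and the names dist_matrix, nearest and nearest_station are made up for illustration.

```r
library(geosphere)

# Same sample data as above; distm expects longitude first, then latitude.
houses <- data.frame(house_number = 1:3,
                     lon_house = seq(1.4, 1.6, by = 0.1),
                     lat_house = seq(1.1, 1.3, by = 0.1))
stations <- data.frame(station_number = 1:11,
                       lon_station = seq(2, 3, by = 0.1),
                       lat_station = seq(1, 2, by = 0.1))

# Rows are houses, columns are stations; distances are in metres.
dist_matrix <- distm(as.matrix(houses[, c("lon_house", "lat_house")]),
                     as.matrix(stations[, c("lon_station", "lat_station")]))

# Column index of the nearest station for each house.
nearest <- apply(dist_matrix, 1, which.min)

# Hypothetical result table: one row per house with its nearest station.
nearest_station <- data.frame(
  house_number   = houses$house_number,
  station_number = stations$station_number[nearest],
  dist           = dist_matrix[cbind(seq_len(nrow(houses)), nearest)]
)
nearest_station
```

The full distance matrix has one cell per house-station pair, so this approach is only practical while that matrix fits in memory.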

Using match_nrst_haversine from hutilscpp:

library(hutilscpp)
houses[, c("station_number", "dist") := match_nrst_haversine(lat_decimal,
                                                             lon_decimal,
                                                             addresses_lat = stations$lat_decimal,
                                                             addresses_lon = stations$lon_decimal,
                                                             Index = stations$station_numer,
                                                             close_enough = 0,
                                                             cartesian_R = 5)]

houses
#>    house_number lat_decimal lon_decimal station_number     dist
#> 1:            1         1.1         1.4              1 67.62617
#> 2:            2         1.2         1.5              1 59.87076
#> 3:            3         1.3         1.6              1 55.59026
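The dist values here appear to be in kilometres, whereas distm reports metres, and the small remaining gap is consistent with a spherical haversine formula versus distm's default ellipsoidal method. As a rough check, here is a base R haversine sketch. The function name haversine_km and the Earth radius of 6371 km are assumptions of this example, not taken from hutilscpp.

```r
# Great-circle distance in kilometres via the haversine formula.
# The default radius of 6371 km is an assumption of this sketch.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1.
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# House 1 (lat 1.1, lon 1.4) to station 1 (lat 1, lon 2).
d_km <- haversine_km(1.1, 1.4, 1, 2)
d_km          # kilometres
d_km * 1000   # metres, for comparison with the distm result
```

With these inputs the result should land close to the 67.6 shown for house 1 above.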

If you have a lot of data (i.e. more than a million points to match), you may need to tune close_enough and cartesian_R to improve performance.

`cartesian_R`

The maximum radius of any address from the points to be geocoded. Used to accelerate the detection of minimum distances. Note, as the argument name suggests, the distance is in cartesian coordinates, so a small number is likely.

`close_enough`    

The distance, in metres, below which a match will be considered to have occurred. (The distance that is considered "close enough" to be a match.)

For example, close_enough = 10 means the first location within ten metres will be matched, even if a closer match occurs later.

May be provided as a string to emphasize the units, e.g. close_enough = "0.25km". Only km and m are permitted.
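The trade-off described for close_enough can be illustrated with a small base R sketch. This only mimics the documented behaviour (accept the first candidate under the threshold, otherwise take the true nearest). It is not how hutilscpp implements it, and first_close_enough and the distances are hypothetical.

```r
# Hypothetical helper: index of the first candidate closer than `close_enough`,
# falling back to the overall nearest candidate when none qualifies.
first_close_enough <- function(dists, close_enough = 0) {
  hit <- which(dists < close_enough)
  if (length(hit) > 0) hit[1] else which.min(dists)
}

# Made-up distances in metres from one point to four candidate addresses.
dists_m <- c(120, 8, 3, 45)

# Accepts the 8 m candidate even though a 3 m candidate comes later.
first_close_enough(dists_m, close_enough = 10)

# With close_enough = 0 nothing qualifies early, so the true nearest is used.
first_close_enough(dists_m, close_enough = 0)
```

A larger threshold lets the search stop sooner at the cost of possibly missing the true nearest point, which is why close_enough = 0 was used in the call above.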