成对距离计算嵌套数据框
Pairwise distance calculation nested data frame
我正在寻找一种方法来以成对方式计算点之间的分隔距离,并将每个点的结果存储在随附的嵌套数据框中。
例如,我有这个数据框(来自地图包),其中包含有关我们城市的信息,包括它们的物理位置。我丢弃了其余信息并将坐标嵌套在嵌套数据框中。我打算使用 geosphere
包中的 distHaversine()
来计算这些距离。
library(tidyverse)
df <- maps::us.cities %>%
slice(1:20) %>%
group_by(name) %>%
nest(long, lat, .key = coords)
name coords
<chr> <list>
1 Abilene TX <tibble [1 x 2]>
2 Akron OH <tibble [1 x 2]>
3 Alameda CA <tibble [1 x 2]>
4 Albany GA <tibble [1 x 2]>
5 Albany NY <tibble [1 x 2]>
...(With 15 more rows)
我研究过将 map 函数族与 mutate 结合使用,但我遇到了困难。所需结果的形式如下:
name coords sep_dist
<chr> <list> <list>
1 Abilene TX <tibble [1 x 2]> <tibble [19 x 2]>
2 Akron OH <tibble [1 x 2]> <tibble [19 x 2]>
3 Alameda CA <tibble [1 x 2]> <tibble [19 x 2]>
4 Albany GA <tibble [1 x 2]> <tibble [19 x 2]>
5 Albany NY <tibble [1 x 2]> <tibble [19 x 2]>
...(With 15 more rows)
sep_dist tibbles 看起来像这样:
location distance
<chr> <dbl>
1 Akron OH 1003
2 Alameda CA 428
3 Albany GA 3218
4 Albany NY 3627
5 Albany OR 97
...(With 14 more rows) -distances completely made up
位置是与名称(在本例中为阿比林)进行比较的点。
我们可以用位置名称和坐标的所有组合扩展一个"grid",但删除具有相同位置名称的组合。之后,使用map2_dbl
应用distHaversine
函数。
library(tidyverse)
library(geosphere)
df2 <- df %>%
# Create the grid
mutate(name1 = name) %>%
select(starts_with("name")) %>%
complete(name, name1) %>%
filter(name != name1) %>%
left_join(df, by = "name") %>%
left_join(df, by = c("name1" = "name")) %>%
# Grid completed. Calcualte the distance by distHaversine
mutate(distance = map2_dbl(coords.x, coords.y, distHaversine))
df2
# A tibble: 380 x 5
name name1 coords.x coords.y distance
<chr> <chr> <list> <list> <dbl>
1 Abilene TX Akron OH <tibble [1 x 2]> <tibble [1 x 2]> 1881904.4
2 Abilene TX Alameda CA <tibble [1 x 2]> <tibble [1 x 2]> 2128576.9
3 Abilene TX Albany GA <tibble [1 x 2]> <tibble [1 x 2]> 1470577.2
4 Abilene TX Albany NY <tibble [1 x 2]> <tibble [1 x 2]> 2542025.1
5 Abilene TX Albany OR <tibble [1 x 2]> <tibble [1 x 2]> 2429367.3
6 Abilene TX Albuquerque NM <tibble [1 x 2]> <tibble [1 x 2]> 702287.5
7 Abilene TX Alexandria LA <tibble [1 x 2]> <tibble [1 x 2]> 700093.2
8 Abilene TX Alexandria VA <tibble [1 x 2]> <tibble [1 x 2]> 2161594.6
9 Abilene TX Alhambra CA <tibble [1 x 2]> <tibble [1 x 2]> 1718967.5
10 Abilene TX Aliso Viejo CA <tibble [1 x 2]> <tibble [1 x 2]> 1681868.8
# ... with 370 more rows
要创建最终输出,我们可以 group_by
基于名称和 nest
所有其他所需的列。
df3 <- df2 %>%
select(-starts_with("coord")) %>%
group_by(name) %>%
nest()
df3
# A tibble: 20 x 2
name data
<chr> <list>
1 Abilene TX <tibble [19 x 2]>
2 Akron OH <tibble [19 x 2]>
3 Alameda CA <tibble [19 x 2]>
4 Albany GA <tibble [19 x 2]>
5 Albany NY <tibble [19 x 2]>
6 Albany OR <tibble [19 x 2]>
7 Albuquerque NM <tibble [19 x 2]>
8 Alexandria LA <tibble [19 x 2]>
9 Alexandria VA <tibble [19 x 2]>
10 Alhambra CA <tibble [19 x 2]>
11 Aliso Viejo CA <tibble [19 x 2]>
12 Allen TX <tibble [19 x 2]>
13 Allentown PA <tibble [19 x 2]>
14 Aloha OR <tibble [19 x 2]>
15 Altadena CA <tibble [19 x 2]>
16 Altamonte Springs FL <tibble [19 x 2]>
17 Altoona PA <tibble [19 x 2]>
18 Amarillo TX <tibble [19 x 2]>
19 Ames IA <tibble [19 x 2]>
20 Anaheim CA <tibble [19 x 2]>
data
中的每个数据框现在看起来像这样。
df3$data[[1]]
# A tibble: 19 x 2
name1 distance
<chr> <dbl>
1 Akron OH 1881904.4
2 Alameda CA 2128576.9
3 Albany GA 1470577.2
4 Albany NY 2542025.1
5 Albany OR 2429367.3
6 Albuquerque NM 702287.5
7 Alexandria LA 700093.2
8 Alexandria VA 2161594.6
9 Alhambra CA 1718967.5
10 Aliso Viejo CA 1681868.8
11 Allen TX 296560.4
12 Allentown PA 2342363.5
13 Aloha OR 2457938.8
14 Altadena CA 1719207.6
15 Altamonte Springs FL 1805480.9
16 Altoona PA 2102993.0
17 Amarillo TX 361520.0
18 Ames IA 1194234.7
19 Anaheim CA 1694698.9
geosphere
提供了将所有距离与 distm
进行比较的能力
可重现的数据
set.seed(1)
df <- data.frame(name=letters[1:4],
lon=runif(4)*10,
lat=runif(4)*10)
distm
library(geosphere)
ans <- as.data.frame(distm(df[,2:3], df[,2:3], fun=distHaversine))
# a b c d
# 1 0.0 784506.1 894320.6 877440.5
# 2 784506.1 0.0 226504.3 647666.7
# 3 894320.6 226504.3 0.0 486290.8
# 4 877440.5 647666.7 486290.8 0.0
整理成想要的格式
colnames(ans) <- df$name
library(dplyr)
library(tidyr)
desired <- ans %>%
gather(pos1, distance) %>%
mutate(pos2 = rep(df$name, nrow(df))) %>%
filter(pos1!=pos2) %>%
select(pos1, pos2, distance)
# pos1 pos2 distance
# 1 a b 784506.1
# 2 a c 894320.6
# 3 a d 877440.5
# 4 b a 784506.1
# 5 b c 226504.3
# 6 b d 647666.7
# 7 c a 894320.6
# 8 c b 226504.3
# 9 c d 486290.8
# 10 d a 877440.5
# 11 d b 647666.7
# 12 d c 486290.8
我正在寻找一种方法来以成对方式计算点之间的分隔距离,并将每个点的结果存储在随附的嵌套数据框中。
例如,我有这个数据框(来自地图包),其中包含有关我们城市的信息,包括它们的物理位置。我丢弃了其余信息并将坐标嵌套在嵌套数据框中。我打算使用 geosphere
包中的 distHaversine()
来计算这些距离。
library(tidyverse)
df <- maps::us.cities %>%
slice(1:20) %>%
group_by(name) %>%
nest(long, lat, .key = coords)
name coords
<chr> <list>
1 Abilene TX <tibble [1 x 2]>
2 Akron OH <tibble [1 x 2]>
3 Alameda CA <tibble [1 x 2]>
4 Albany GA <tibble [1 x 2]>
5 Albany NY <tibble [1 x 2]>
...(With 15 more rows)
我研究过将 map 函数族与 mutate 结合使用,但我遇到了困难。所需结果的形式如下:
name coords sep_dist
<chr> <list> <list>
1 Abilene TX <tibble [1 x 2]> <tibble [19 x 2]>
2 Akron OH <tibble [1 x 2]> <tibble [19 x 2]>
3 Alameda CA <tibble [1 x 2]> <tibble [19 x 2]>
4 Albany GA <tibble [1 x 2]> <tibble [19 x 2]>
5 Albany NY <tibble [1 x 2]> <tibble [19 x 2]>
...(With 15 more rows)
sep_dist tibbles 看起来像这样:
location distance
<chr> <dbl>
1 Akron OH 1003
2 Alameda CA 428
3 Albany GA 3218
4 Albany NY 3627
5 Albany OR 97
...(With 14 more rows) -distances completely made up
位置是与名称(在本例中为阿比林)进行比较的点。
我们可以用位置名称和坐标的所有组合扩展一个"grid",但删除具有相同位置名称的组合。之后,使用map2_dbl
应用distHaversine
函数。
library(tidyverse)
library(geosphere)
df2 <- df %>%
# Create the grid
mutate(name1 = name) %>%
select(starts_with("name")) %>%
complete(name, name1) %>%
filter(name != name1) %>%
left_join(df, by = "name") %>%
left_join(df, by = c("name1" = "name")) %>%
# Grid completed. Calcualte the distance by distHaversine
mutate(distance = map2_dbl(coords.x, coords.y, distHaversine))
df2
# A tibble: 380 x 5
name name1 coords.x coords.y distance
<chr> <chr> <list> <list> <dbl>
1 Abilene TX Akron OH <tibble [1 x 2]> <tibble [1 x 2]> 1881904.4
2 Abilene TX Alameda CA <tibble [1 x 2]> <tibble [1 x 2]> 2128576.9
3 Abilene TX Albany GA <tibble [1 x 2]> <tibble [1 x 2]> 1470577.2
4 Abilene TX Albany NY <tibble [1 x 2]> <tibble [1 x 2]> 2542025.1
5 Abilene TX Albany OR <tibble [1 x 2]> <tibble [1 x 2]> 2429367.3
6 Abilene TX Albuquerque NM <tibble [1 x 2]> <tibble [1 x 2]> 702287.5
7 Abilene TX Alexandria LA <tibble [1 x 2]> <tibble [1 x 2]> 700093.2
8 Abilene TX Alexandria VA <tibble [1 x 2]> <tibble [1 x 2]> 2161594.6
9 Abilene TX Alhambra CA <tibble [1 x 2]> <tibble [1 x 2]> 1718967.5
10 Abilene TX Aliso Viejo CA <tibble [1 x 2]> <tibble [1 x 2]> 1681868.8
# ... with 370 more rows
要创建最终输出,我们可以 group_by
基于名称和 nest
所有其他所需的列。
df3 <- df2 %>%
select(-starts_with("coord")) %>%
group_by(name) %>%
nest()
df3
# A tibble: 20 x 2
name data
<chr> <list>
1 Abilene TX <tibble [19 x 2]>
2 Akron OH <tibble [19 x 2]>
3 Alameda CA <tibble [19 x 2]>
4 Albany GA <tibble [19 x 2]>
5 Albany NY <tibble [19 x 2]>
6 Albany OR <tibble [19 x 2]>
7 Albuquerque NM <tibble [19 x 2]>
8 Alexandria LA <tibble [19 x 2]>
9 Alexandria VA <tibble [19 x 2]>
10 Alhambra CA <tibble [19 x 2]>
11 Aliso Viejo CA <tibble [19 x 2]>
12 Allen TX <tibble [19 x 2]>
13 Allentown PA <tibble [19 x 2]>
14 Aloha OR <tibble [19 x 2]>
15 Altadena CA <tibble [19 x 2]>
16 Altamonte Springs FL <tibble [19 x 2]>
17 Altoona PA <tibble [19 x 2]>
18 Amarillo TX <tibble [19 x 2]>
19 Ames IA <tibble [19 x 2]>
20 Anaheim CA <tibble [19 x 2]>
data
中的每个数据框现在看起来像这样。
df3$data[[1]]
# A tibble: 19 x 2
name1 distance
<chr> <dbl>
1 Akron OH 1881904.4
2 Alameda CA 2128576.9
3 Albany GA 1470577.2
4 Albany NY 2542025.1
5 Albany OR 2429367.3
6 Albuquerque NM 702287.5
7 Alexandria LA 700093.2
8 Alexandria VA 2161594.6
9 Alhambra CA 1718967.5
10 Aliso Viejo CA 1681868.8
11 Allen TX 296560.4
12 Allentown PA 2342363.5
13 Aloha OR 2457938.8
14 Altadena CA 1719207.6
15 Altamonte Springs FL 1805480.9
16 Altoona PA 2102993.0
17 Amarillo TX 361520.0
18 Ames IA 1194234.7
19 Anaheim CA 1694698.9
geosphere
提供了将所有距离与 distm
可重现的数据
set.seed(1)
df <- data.frame(name=letters[1:4],
lon=runif(4)*10,
lat=runif(4)*10)
distm
library(geosphere)
ans <- as.data.frame(distm(df[,2:3], df[,2:3], fun=distHaversine))
# a b c d
# 1 0.0 784506.1 894320.6 877440.5
# 2 784506.1 0.0 226504.3 647666.7
# 3 894320.6 226504.3 0.0 486290.8
# 4 877440.5 647666.7 486290.8 0.0
整理成想要的格式
colnames(ans) <- df$name
library(dplyr)
library(tidyr)
desired <- ans %>%
gather(pos1, distance) %>%
mutate(pos2 = rep(df$name, nrow(df))) %>%
filter(pos1!=pos2) %>%
select(pos1, pos2, distance)
# pos1 pos2 distance
# 1 a b 784506.1
# 2 a c 894320.6
# 3 a d 877440.5
# 4 b a 784506.1
# 5 b c 226504.3
# 6 b d 647666.7
# 7 c a 894320.6
# 8 c b 226504.3
# 9 c d 486290.8
# 10 d a 877440.5
# 11 d b 647666.7
# 12 d c 486290.8