How to compute a big geographic distance matrix
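I have a data frame with IDs and coordinates. I need to compute the geographic distance between all of my ids, drop the pairs that are too far apart, and then carry on with my analysis.
I have 30k ids, which would produce a 30k x 30k matrix. Here is a sample: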
latitude longitude id
-23.52472 -46.47785 917_62346
-23.62010 -46.69345 244_42975
-23.61636 -46.48148 302_75289
-23.53826 -46.46756 917_96304
-23.58266 -46.54495 302_84126
-23.47005 -46.70921 233_97098
-23.49235 -46.49342 917_62953
-23.52226 -46.72710 244_42245
-23.64853 -46.72237 635_90928
-23.49640 -46.61215 244_2662
x2 = structure(list(latitude = c(-23.5247247, -23.6200954, -23.6163624,
-23.5382557, -23.5826609, -23.4700519, -23.4923465, -23.5222581,
-23.6485288, -23.4964047), longitude = c(-46.4778499, -46.6934512,
-46.4814794, -46.4675563, -46.5449536, -46.7092093, -46.4934192,
-46.7270957, -46.7223717, -46.6121477), id = c("917_62346", "244_42975",
"302_75289", "917_96304", "302_84126", "233_97098", "917_62953",
"244_42245", "635_90928", "244_2662")), .Names = c("latitude",
"longitude", "id"), row.names = c(12041L, 18549L, 13641L, 28386L,
9380L, 6064L, 12724L, 21671L, 18939L, 3396L), class = "data.frame")
First I tried using the geosphere package directly:
library(geosphere)
library(data.table)
d.matrix <- distm(cbind(x2$longitude, x2$latitude))
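This fails for lack of memory: Error: cannot allocate vector of size 15.4 Gb.
My second attempt was to generate all pairwise combinations first, merge them with the original dataset to get the latitudes and longitudes, and then compute the distances, for example: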
dis.long <- expand.grid(x2$id, x2$id)
dis.long <- merge(dis.long, x2, by.x = "Var1", by.y = "id")
dis.long <- merge(dis.long, x2, by.x = "Var2", by.y = "id")
# merge() returns a data.frame here; := needs a data.table
setDT(dis.long)
# distGeo() returns metres, so divide by 1000 for kilometres
dis.long[, dist_km2 := distGeo(cbind(longitude.x, latitude.x),
                               cbind(longitude.y, latitude.y)) / 1000]
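However, expand.grid runs out of memory. This seems strange to me, because the result would be 900 million rows by 2 columns, and I have already worked with larger datasets (such as a 200 million x 50 matrix).
Another observation: I also tried integer ids such as new_id = seq(1L, 30000L, 1L) to see whether integers would solve it, but I hit the same memory problem when I try to expand.
My setup is the following, on a desktop with 16 GB of RAM: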
> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] xlsx_0.5.7 xlsxjars_0.6.1 rJava_0.9-8 geosphere_1.5-5 sp_1.2-5 haven_1.0.0
[7] stringr_1.2.0 data.table_1.10.4
Can anyone tell me how to calculate these distances? And why can't I generate this particular expand.grid when I am able to build bigger objects?
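You do not need an all-vs-all comparison, which includes self-comparisons and directional comparisons (A-B != B-A, so each pair is counted twice); you should therefore try combn instead of expand.grid.
Your data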
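To make the size difference concrete, here is a rough back-of-the-envelope sketch in base R. It assumes 8 bytes per double and 4 bytes per integer or factor code, and it ignores R's per-object overhead and the temporary copies that merge() makes, so the real footprint will be higher. The variable names are illustrative only.

```r
n <- 30000

# expand.grid() yields every ordered pair, self-pairs included
pairs_all <- n^2                 # 900,000,000 rows

# combn(ids, 2) yields each unordered pair once, without self-pairs
pairs_unique <- choose(n, 2)     # 449,985,000 rows

gib <- 1024^3

# Dense n x n matrix of doubles, as distm() would build
dense_gib <- n^2 * 8 / gib

# Two id columns stored as integers or factor codes (4 bytes each)
grid_gib  <- pairs_all    * 2 * 4 / gib
combn_gib <- pairs_unique * 2 * 4 / gib

# After both merges: two id columns plus four double coordinate columns
merged_grid_gib  <- pairs_all    * (2 * 4 + 4 * 8) / gib
merged_combn_gib <- pairs_unique * (2 * 4 + 4 * 8) / gib

round(c(dense = dense_gib, grid = grid_gib, combn = combn_gib,
        merged_grid = merged_grid_gib, merged_combn = merged_combn_gib), 2)
```

Under these assumptions the merged pair table is the expensive part, not the two id columns alone, which may be why the job fails on a 16 GB machine even though the bare grid looks manageable.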
x2 = structure(list(latitude = c(-23.5247247, -23.6200954, -23.6163624,
-23.5382557, -23.5826609, -23.4700519, -23.4923465, -23.5222581,
-23.6485288, -23.4964047), longitude = c(-46.4778499, -46.6934512,
-46.4814794, -46.4675563, -46.5449536, -46.7092093, -46.4934192,
-46.7270957, -46.7223717, -46.6121477), id = c("917_62346", "244_42975",
"302_75289", "917_96304", "302_84126", "233_97098", "917_62953",
"244_42245", "635_90928", "244_2662")), .Names = c("latitude",
"longitude", "id"), row.names = c(12041L, 18549L, 13641L, 28386L,
9380L, 6064L, 12724L, 21671L, 18939L, 3396L), class = "data.frame")
expand.grid
OP <- function(df) {
x3 = expand.grid(df$id, df$id)
Y <- merge(x3, df, by.x = "Var1", by.y = "id")
Y <- merge(Y, df, by.x = "Var2", by.y = "id")
return(Y)
}
versus combn
CP <- function(df) {
Did = as.data.frame(t(combn(df$id, 2)))
Z <- merge(Did, df, by.x = "V1", by.y = "id")
Z <- merge(Z, df, by.x = "V2", by.y = "id")
return(Z)
}
Comparison
dis.long <- OP(x2)
object.size(dis.long)
# 7320 bytes
new <- CP(x2)
object.size(new)
# 5016 bytes
A larger example
num <- 5e2
bigx <- data.frame(latitude=rnorm(num)*-23, longitude=rnorm(num)*-46, id=1:num)
bigdl <- OP(bigx)
object.size(bigdl)
# 10001224 bytes
bignew <- CP(bigx)
object.size(bignew)
# 4991224 bytes
About half the size.
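Even at half the size, a full pair table for 30k ids is large. Since the goal is to drop pairs that are too far apart, one option is to never build the full table: compute distances one block of rows at a time and keep only the pairs under a cutoff. The following is a minimal sketch, not a tested solution for the real data. The names haversine_km, near_pairs, max_km and block are hypothetical, and the distance is a base-R haversine on a sphere of assumed radius 6371 km, used here as a stand-in for geosphere's ellipsoidal distGeo.

```r
# Great-circle distance in km between every row of a and every row of b.
# a, b: numeric matrices with columns (longitude, latitude) in degrees.
# Assumption: spherical Earth with mean radius r = 6371 km.
haversine_km <- function(a, b, r = 6371) {
  to_rad <- pi / 180
  lon1 <- a[, 1] * to_rad; lat1 <- a[, 2] * to_rad
  lon2 <- b[, 1] * to_rad; lat2 <- b[, 2] * to_rad
  dlat <- outer(lat1, lat2, "-")
  dlon <- outer(lon1, lon2, "-")
  h <- sin(dlat / 2)^2 + outer(cos(lat1), cos(lat2)) * sin(dlon / 2)^2
  2 * r * asin(pmin(sqrt(h), 1))
}

# Return the unordered pairs (id1, id2) whose distance is <= max_km.
# Only a block x n slice of the distance matrix exists at any time.
near_pairs <- function(df, max_km, block = 500L) {
  coords <- cbind(df$longitude, df$latitude)
  n <- nrow(coords)
  starts <- seq(1L, n, by = block)
  out <- vector("list", length(starts))
  for (k in seq_along(starts)) {
    rows <- starts[k]:min(starts[k] + block - 1L, n)
    d <- haversine_km(coords[rows, , drop = FALSE], coords)
    hit <- which(d <= max_km, arr.ind = TRUE)
    i <- rows[hit[, "row"]]   # row index in the full data frame
    j <- hit[, "col"]
    keep <- i < j             # drop self-pairs and mirrored pairs
    out[[k]] <- data.frame(id1 = df$id[i[keep]],
                           id2 = df$id[j[keep]],
                           dist_km = d[hit][keep],
                           stringsAsFactors = FALSE)
  }
  do.call(rbind, out)
}

# Usage on the first four rows of the sample data
sample_df <- data.frame(
  latitude  = c(-23.5247247, -23.6200954, -23.6163624, -23.5382557),
  longitude = c(-46.4778499, -46.6934512, -46.4814794, -46.4675563),
  id = c("917_62346", "244_42975", "302_75289", "917_96304"),
  stringsAsFactors = FALSE
)
res <- near_pairs(sample_df, max_km = 10, block = 2L)
res
```

Memory per step scales with block x n rather than n x n, so block can be tuned to the available RAM. If ellipsoidal distances are needed, the call to haversine_km could be replaced with geosphere's distm(x, y, fun = distGeo) on the same two coordinate matrices, dividing by 1000 to convert metres to kilometres.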