Finding closest coordinates between two large data sets
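My aim is to identify, for each entry in dataset 1, the closest entry in dataset 2, based on the coordinates in both datasets. Dataset 1 contains 180,000 rows (only 1,800 unique coordinates) and dataset 2 contains 4,500 rows (all 4,500 coordinates unique).
I have tried to replicate the answers to similar questions on Stack Overflow. For example: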
R - Finding closest neighboring point and number of neighbors within a given radius, coordinates lat-long
Calculating the distance between points in different data frames
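However, these do not address the problem in the way I want (they either join the data frames or check distances within a single data frame).
The solutions in a related post and similar posts are the closest I have found so far.
My problem with that post is that it works out the distances between coordinates within a single data frame, and I have been unable to work out which parameters of RANN::nn2 to change to do this across two data frames.
Proposed code that does not work: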
library(RANN)
dataset1[,4]<- nn2(data=dataset1, query=dataset2, k=2)
Notes/Questions:
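1) Which dataset should be supplied as the query to find, for a given value in dataset 1, the closest value in dataset 2?
2) Is there any way to avoid the problem that the datasets seemingly need to have the same width (number of columns)?
3) How can the outputs (SRC_ID and distance) be added to the relevant entry in dataset 1?
4) What is the eps parameter of the RANN::nn2 function for?
The aim is to populate the SRC_ID and distance columns in dataset 1 with the ID of the nearest station in dataset 2 and the distance between each entry in dataset 1 and its nearest entry in dataset 2.
Below is a table demonstrating the expected results. Note: the SRC_ID and distance values are example values I added manually; they are almost certainly incorrect and are unlikely to be reproduced by the code.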
id HIGH_PRCN_LAT HIGH_PRCN_LON SRC_ID distance
1 3797987 52.88121 -2.873734 55 350
2 3798045 53.80945 -2.439163 76 2100
Data:
R details
platform x86_64-w64-mingw32
version.string R version 3.5.3 (2019-03-11)
Dataset 1 input (not reduced to unique coordinates)
structure(list(id = c(1L, 2L, 4L, 5L,
6L, 7L, 8L, 9, 10L, 3L),
HIGH_PRCN_LAT = c(52.881442267773, 57.8094538200198, 34.0233529,
63.8087900198, 53.6888144440184, 63.4462810678651, 21.6075544376207,
78.324442654172, 66.85532539759495, 51.623544596), HIGH_PRCN_LON = c(-2.87377812157822,
-2.23454414781635, -3.0984448341, -2.439163178635, -7.396111601421454,
-5.162345043546359, -8.63311254098095, 3.813289888829932,
-3.994325961186105, -8.9065532453272409), SRC_ID = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), distance = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, 10L), class = "data.frame")
Dataset 2 input
structure(list(SRC_ID = c(55L, 54L, 23L, 11L, 44L, 21L, 76L,
5688L, 440L, 61114L), HIGH_PRCN_LAT = c(68.46506, 50.34127, 61.16432,
42.57807, 52.29879, 68.52132, 87.83912, 55.67825, 29.74444, 34.33228
), HIGH_PRCN_LON = c(-5.0584, -5.95506, -5.75546, -5.47801, -3.42062,
-6.99441, -2.63457, -2.63057, -7.52216, -1.65532)), row.names = c(NA,
10L), class = "data.frame")
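I wrote an answer based on this thread, modifying the function so that it also reports the distance and avoids hard-coding. Note that it computes the Euclidean distance on the raw latitude/longitude values, so the result is in degrees rather than metres.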
library(data.table)
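# Euclidean distance (in coordinate degrees) from the point (a, b) to every
# row of the lookup table passed as `df1` (dataset 2 in the call below);
# x and y are the names of its coordinate columns
mydist <- function(a, b, df1, x, y){
  dt <- data.table(sqrt((df1[[x]]-a)^2 + (df1[[y]]-b)^2))
  # Row index of the nearest point and the distance to it
  return(data.table(Closest.V1 = which.min(dt$V1),
                    Distance = dt[which.min(dt$V1)]))
}
# Here df1 is dataset 1 and df2 is dataset 2; one lookup per (id, coordinate) group
setDT(df1)[, j = mydist(HIGH_PRCN_LAT, HIGH_PRCN_LON, setDT(df2),
                        "HIGH_PRCN_LAT", "HIGH_PRCN_LON"),
           by = list(id, HIGH_PRCN_LAT, HIGH_PRCN_LON)]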
# id HIGH_PRCN_LAT HIGH_PRCN_LON Closest.V1 Distance.V1
# 1: 1 52.88144 -2.873778 5 0.7990743
# 2: 2 57.80945 -2.234544 8 2.1676868
# 3: 4 34.02335 -3.098445 10 1.4758202
# 4: 5 63.80879 -2.439163 3 4.2415854
# 5: 6 53.68881 -7.396112 2 3.6445416
# 6: 7 63.44628 -5.162345 3 2.3577811
# 7: 8 21.60755 -8.633113 9 8.2123762
# 8: 9 78.32444 3.813290 7 11.4936496
# 9: 10 66.85533 -3.994326 1 1.9296370
# 10: 3 51.62354 -8.906553 2 3.2180026
You can use RANN::nn2, but you need to make sure you use the right syntax. The following works:
as.data.frame(RANN::nn2(df2[,c(2,3)],df1[,c(2,3)],k=1))
# nn.idx nn.dists
# 1 5 0.7990743
# 2 8 2.1676868
# 3 10 1.4758202
# 4 3 4.2415854
# 5 2 3.6445416
# 6 3 2.3577811
# 7 9 8.2123762
# 8 7 11.4936496
# 9 1 1.9296370
# 10 2 3.2180026
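The call above returns row indices into dataset 2, not station IDs, so one more step is needed to fill in SRC_ID and distance. The following is a minimal sketch of that step, assuming the data frames are named df1 (dataset 1) and df2 (dataset 2) as in the answer above; the helper name coord_cols is introduced here for illustration and only the first three rows of dataset 1 are used. It also touches on the numbered questions: `data` is the set being searched (dataset 2), `query` is the set of points to look up (dataset 1), and passing only the coordinate columns avoids the column-count mismatch. As far as the RANN documentation describes it, `eps` is an error bound for approximate search, and the default of 0 gives an exact search.

```r
library(RANN)

# Small stand-ins for dataset 1 (df1) and dataset 2 (df2), values from the question
df1 <- data.frame(
  id = c(1L, 2L, 4L),
  HIGH_PRCN_LAT = c(52.881442267773, 57.8094538200198, 34.0233529),
  HIGH_PRCN_LON = c(-2.87377812157822, -2.23454414781635, -3.0984448341)
)
df2 <- data.frame(
  SRC_ID = c(55L, 54L, 23L, 11L, 44L, 21L, 76L, 5688L, 440L, 61114L),
  HIGH_PRCN_LAT = c(68.46506, 50.34127, 61.16432, 42.57807, 52.29879,
                    68.52132, 87.83912, 55.67825, 29.74444, 34.33228),
  HIGH_PRCN_LON = c(-5.0584, -5.95506, -5.75546, -5.47801, -3.42062,
                    -6.99441, -2.63457, -2.63057, -7.52216, -1.65532)
)

# Illustrative name: the columns that both data frames share
coord_cols <- c("HIGH_PRCN_LAT", "HIGH_PRCN_LON")

# data  = points to search in (dataset 2)
# query = points to look up (dataset 1)
# Only the coordinate columns are passed, so the other columns may differ.
nn <- nn2(data = df2[, coord_cols], query = df1[, coord_cols], k = 1)

# nn$nn.idx holds row numbers of df2; translate them into station IDs
df1$SRC_ID   <- df2$SRC_ID[nn$nn.idx[, 1]]
df1$distance <- nn$nn.dists[, 1]  # Euclidean distance in degrees, not metres

df1
```

Because the distances are Euclidean in degrees, this is only a rough proxy for geographic closeness; a degree of longitude covers less ground at higher latitudes.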
Data
x = structure(list(id = c(1L, 2L, 4L, 5L,
6L, 7L, 8L, 9, 10L, 3L),
HIGH_PRCN_LAT = c(52.881442267773, 57.8094538200198, 34.0233529,
63.8087900198, 53.6888144440184, 63.4462810678651, 21.6075544376207,
78.324442654172, 66.85532539759495, 51.623544596), HIGH_PRCN_LON = c(-2.87377812157822,
-2.23454414781635, -3.0984448341, -2.439163178635, -7.396111601421454,
-5.162345043546359, -8.63311254098095, 3.813289888829932,
-3.994325961186105, -8.9065532453272409), SRC_ID = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), distance = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, 10L), class = "data.frame")
y = structure(list(SRC_ID = c(55L, 54L, 23L, 11L, 44L, 21L, 76L,
5688L, 440L, 61114L), HIGH_PRCN_LAT = c(68.46506, 50.34127, 61.16432,
42.57807, 52.29879, 68.52132, 87.83912, 55.67825, 29.74444, 34.33228
), HIGH_PRCN_LON = c(-5.0584, -5.95506, -5.75546, -5.47801, -3.42062,
-6.99441, -2.63457, -2.63057, -7.52216, -1.65532)), row.names = c(NA,
10L), class = "data.frame")
A solution. Note the "3:2", which selects the columns in "longitude / latitude" order.
library(raster)
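# Distances in metres between every row of x and every row of y
d <- pointDistance(x[,3:2], y[,3:2], lonlat=TRUE, allpairs=TRUE)
# Column index of the nearest y point for each row of x
i <- apply(d, 1, which.min)
x$SRC_ID <- y$SRC_ID[i]
x$distance <- d[cbind(1:nrow(d), i)]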
x
# id HIGH_PRCN_LAT HIGH_PRCN_LON SRC_ID distance
#1 1 52.88144 -2.873778 44 74680.48
#2 2 57.80945 -2.234544 5688 238553.51
#3 4 34.02335 -3.098445 61114 137385.18
#4 5 63.80879 -2.439163 23 340642.70
#5 6 53.68881 -7.396112 44 308458.73
#6 7 63.44628 -5.162345 23 256176.88
#7 8 21.60755 -8.633113 440 908292.28
#8 9 78.32444 3.813290 76 1064419.47
#9 10 66.85533 -3.994326 55 185119.29
#10 3 51.62354 -8.906553 54 251580.45
To illustrate
plot(x[,3:2], ylim=c(0,90), col="blue", pch=20)
points(y[,3:2], col="red", pch=20)
for (i in 1:nrow(x)) {
j <- y$SRC_ID==x$SRC_ID[i]
arrows(x[i,3], x[i,2], y[j,3], y[j,2],length=.1)
}
text(x[,3:2], labels=x$id, pos=1, cex=.75)
text(y[,3:2], labels=y$SRC_ID, pos=3, cex=.75)
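The question mentions that dataset 1 has 180,000 rows but only 1,800 unique coordinates, so an all-pairs distance matrix is much smaller if it is built on the unique coordinates and the result is then mapped back to every row. The following is a rough sketch of that idea in base R, using a few values rounded from the question. The haversine_m helper is hypothetical and assumes a spherical Earth of radius 6371 km; it stands in for raster::pointDistance and is not expected to reproduce its numbers exactly.

```r
# Hypothetical helper: great-circle (haversine) distance in metres on a sphere.
# This is a stand-in, not what raster::pointDistance computes.
haversine_m <- function(lat1, lon1, lat2, lon2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Dataset 1 stand-in with repeated coordinates (rows 1 and 3 are the same place)
x <- data.frame(
  id = 1:4,
  HIGH_PRCN_LAT = c(52.88144, 57.80945, 52.88144, 34.02335),
  HIGH_PRCN_LON = c(-2.873778, -2.234544, -2.873778, -3.098445)
)
# Dataset 2 stand-in
y <- data.frame(
  SRC_ID = c(44L, 5688L, 61114L),
  HIGH_PRCN_LAT = c(52.29879, 55.67825, 34.33228),
  HIGH_PRCN_LON = c(-3.42062, -2.63057, -1.65532)
)

# 1. Reduce dataset 1 to its unique coordinate pairs
ux <- unique(x[, c("HIGH_PRCN_LAT", "HIGH_PRCN_LON")])

# 2. Distance matrix: one row per unique point, one column per station
d <- outer(seq_len(nrow(ux)), seq_len(nrow(y)), function(i, j)
  haversine_m(ux$HIGH_PRCN_LAT[i], ux$HIGH_PRCN_LON[i],
              y$HIGH_PRCN_LAT[j], y$HIGH_PRCN_LON[j]))

nearest <- apply(d, 1, which.min)
ux$SRC_ID   <- y$SRC_ID[nearest]
ux$distance <- d[cbind(seq_len(nrow(d)), nearest)]

# 3. Map the results back onto every row of dataset 1
key_x  <- paste(x$HIGH_PRCN_LAT, x$HIGH_PRCN_LON)
key_ux <- paste(ux$HIGH_PRCN_LAT, ux$HIGH_PRCN_LON)
m <- match(key_x, key_ux)
x$SRC_ID   <- ux$SRC_ID[m]
x$distance <- ux$distance[m]

x
```

With the sizes given in the question, this would shrink the distance matrix from 180,000 x 4,500 entries to 1,800 x 4,500. The same deduplicate-then-match pattern should work with pointDistance or RANN::nn2 in place of the hypothetical helper.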