R:用于大数据的 Distm?计算两个矩阵之间的最小距离
R: Distm for big data? Calculating minimum distances between two matrices
我有两个矩阵,一个是 200K 行,另一个是 20K。对于第一个矩阵中的每一行(这是一个点),我试图找到第二个矩阵中的哪一行(也是一个点)最接近第一个矩阵中的点。这是我在示例数据集上尝试的第一种方法:
#Test dataset
pixels.latlon=cbind(runif(200000,min=-180, max=-120), runif(200000, min=50, max=85))
grwl.latlon=cbind(runif(20000,min=-180, max=-120), runif(20000, min=50, max=85))
#calculate the distance matrix
library(geosphere)
dist.matrix=distm(pixels.latlon, grwl.latlon, fun=distHaversine)
#Pick out the indices of the minimum distance
rnum=apply(dist.matrix, 1, which.min)
但是,当我使用 distm
函数时出现 Error: cannot allocate vector of size 30.1 Gb
错误。
关于这个话题有几篇帖子:
这个使用 bigmemory
来计算 SAME 数据帧中点之间的距离,但我不确定如何调整它来计算两个不同矩阵中点之间的距离...https://stevemosher.wordpress.com/2012/04/12/nick-stokes-distance-code-now-with-big-memory/
这也适用于计算 SAME 矩阵中各点之间的距离矩阵...Efficient (memory-wise) function for repeated distance matrix calculations AND chunking of extra large distance matrices
这个和我想做的几乎一模一样,但他们实际上并没有想出一个适用于大数据的解决方案:我试过这个方法,它使用 bigmemory
,但出现 Error in CreateFileBackedBigMatrix(as.character(backingfile), as.character(backingpath), :
Problem creating filebacked matrix.
错误,我认为是因为数据帧太大。
有没有人想出好的办法来解决这个问题?我对其他包装创意持开放态度!
修复问题的更新代码
pixels.latlon=cbind(runif(200000,min=-180, max=-120), runif(200000, min=50, max=85))
grwl.tibble = tibble(long=runif(20000,min=-180, max=-120), lat=runif(20000, min=50, max=85), id=runif(20000, min=0, max=20000))
rnum <- apply(pixels.latlon, 1, function(x) {
xlon=x[1]
xlat=x[2]
grwl.filt = grwl.tibble %>%
filter(long < (xlon+0.3) & long >(xlon-0.3) & lat < (xlat+0.3)&lat >(xlat-.3))
grwl.latlon.filt = cbind(grwl.filt$long, grwl.filt$lat)
dm <- distm(x, grwl.latlon.filt, fun=distHaversine)
rnum=apply(dm, 1, which.min)
id = grwl.filt$id[rnum]
return(id)
})
这会使用更少的内存,因为它一次处理一行,而不是创建完整的距离矩阵(尽管它会更慢)
library(geosphere)
rnum <- apply(pixels.latlon, 1, function(x) {
dm <- distm(x, grwl.latlon, fun=distHaversine)
return(which.min(dm))
})
大部分时间都花在了复杂的 Haversine 公式上。由于您真的只对找到最近的点感兴趣,而不是精确的距离,我们可以使用更简单的距离度量。这里有一个替代方案,使用基于这篇文章 http://jonisalonen.com/2014/computing-distance-between-coordinates-can-be-simple-and-fast/ 的公式,并且还使用余弦的二次近似(它本身计算起来很昂贵)...
#quadratic cosine approximation using lm (run once)
qcos <- lm(y~x+I(x^2), data.frame(x=0:90, y=cos((0:90)*2*pi/360)))$coefficients
cosadj <- function(lat) qcos[1]+lat*(qcos[2]+qcos[3]*lat)
#define rough dist function
roughDist <- function(x,y){#x should be a single (lon,lat), y a (n*2) matrix of (lon,lat)
latDev <- x[2]-y[,2]
lonDev <- (x[1]-y[,1])*cosadj(abs(x[2]))
return(latDev*latDev+lonDev*lonDev) #don't need the usual square root or any scaling parameters
}
然后你可以用这个新函数替换 Haversine...
rnum <- apply(pixels.latlon, 1, function(x) {
dm <- distm(x, grwl.latlon, fun=roughDist)
return(which.min(dm))
})
在我的机器上,它的运行速度比 Haversine 版本快三倍。
您可以使用这个 R(cpp) 函数:
#include <Rcpp.h>
using namespace Rcpp;
double compute_a(double lat1, double long1, double lat2, double long2) {
double sin_dLat = ::sin((lat2 - lat1) / 2);
double sin_dLon = ::sin((long2 - long1) / 2);
return sin_dLat * sin_dLat + ::cos(lat1) * ::cos(lat2) * sin_dLon * sin_dLon;
}
int find_min(double lat1, double long1,
const NumericVector& lat2,
const NumericVector& long2,
int current0) {
int m = lat2.size();
double lat_k, lat_min, lat_max, a, a0;
int k, current = current0;
a0 = compute_a(lat1, long1, lat2[current], long2[current]);
// Search before current0
lat_min = lat1 - 2 * ::asin(::sqrt(a0));
for (k = current0 - 1; k >= 0; k--) {
lat_k = lat2[k];
if (lat_k > lat_min) {
a = compute_a(lat1, long1, lat_k, long2[k]);
if (a < a0) {
a0 = a;
current = k;
lat_min = lat1 - 2 * ::asin(::sqrt(a0));
}
} else {
// No need to search further
break;
}
}
// Search after current0
lat_max = lat1 + 2 * ::asin(::sqrt(a0));
for (k = current0 + 1; k < m; k++) {
lat_k = lat2[k];
if (lat_k < lat_max) {
a = compute_a(lat1, long1, lat_k, long2[k]);
if (a < a0) {
a0 = a;
current = k;
lat_max = lat1 + 2 * ::asin(::sqrt(a0));
}
} else {
// No need to search further
break;
}
}
return current;
}
// [[Rcpp::export]]
IntegerVector find_closest_point(const NumericVector& lat1,
const NumericVector& long1,
const NumericVector& lat2,
const NumericVector& long2) {
int n = lat1.size();
IntegerVector res(n);
int current = 0;
for (int i = 0; i < n; i++) {
res[i] = current = find_min(lat1[i], long1[i], lat2, long2, current);
}
return res; // need +1
}
/*** R
N <- 2000 # 2e6
M <- 500 # 2e4
pixels.latlon=cbind(runif(N,min=-180, max=-120), runif(N, min=50, max=85))
grwl.latlon=cbind(runif(M,min=-180, max=-120), runif(M, min=50, max=85))
# grwl.latlon <- grwl.latlon[order(grwl.latlon[, 2]), ]
library(geosphere)
system.time({
#calculate the distance matrix
dist.matrix = distm(pixels.latlon, grwl.latlon, fun=distHaversine)
#Pick out the indices of the minimum distance
rnum=apply(dist.matrix, 1, which.min)
})
find_closest <- function(lat1, long1, lat2, long2) {
toRad <- pi / 180
lat1 <- lat1 * toRad
long1 <- long1 * toRad
lat2 <- lat2 * toRad
long2 <- long2 * toRad
ord1 <- order(lat1)
rank1 <- match(seq_along(lat1), ord1)
ord2 <- order(lat2)
ind <- find_closest_point(lat1[ord1], long1[ord1], lat2[ord2], long2[ord2])
ord2[ind + 1][rank1]
}
system.time(
test <- find_closest(pixels.latlon[, 2], pixels.latlon[, 1],
grwl.latlon[, 2], grwl.latlon[, 1])
)
all.equal(test, rnum)
N <- 2e4
M <- 2e4
pixels.latlon=cbind(runif(N,min=-180, max=-120), runif(N, min=50, max=85))
grwl.latlon=cbind(long = runif(M,min=-180, max=-120), lat = runif(M, min=50, max=85))
system.time(
test <- find_closest(pixels.latlon[, 2], pixels.latlon[, 1],
grwl.latlon[, 2], grwl.latlon[, 1])
)
*/
N = 2e4
需要 0.5 秒,N = 2e5
需要 4.2 秒。
我无法让您的代码正常工作以进行比较。
我有两个矩阵,一个是 200K 行,另一个是 20K。对于第一个矩阵中的每一行(这是一个点),我试图找到第二个矩阵中的哪一行(也是一个点)最接近第一个矩阵中的点。这是我在示例数据集上尝试的第一种方法:
#Test dataset
pixels.latlon=cbind(runif(200000,min=-180, max=-120), runif(200000, min=50, max=85))
grwl.latlon=cbind(runif(20000,min=-180, max=-120), runif(20000, min=50, max=85))
#calculate the distance matrix
library(geosphere)
dist.matrix=distm(pixels.latlon, grwl.latlon, fun=distHaversine)
#Pick out the indices of the minimum distance
rnum=apply(dist.matrix, 1, which.min)
但是,当我使用 distm
函数时出现 Error: cannot allocate vector of size 30.1 Gb
错误。
关于这个话题有几篇帖子:
这个使用 bigmemory
来计算 SAME 数据帧中点之间的距离,但我不确定如何调整它来计算两个不同矩阵中点之间的距离...https://stevemosher.wordpress.com/2012/04/12/nick-stokes-distance-code-now-with-big-memory/
这也适用于计算 SAME 矩阵中各点之间的距离矩阵...Efficient (memory-wise) function for repeated distance matrix calculations AND chunking of extra large distance matrices
这个和我想做的几乎一模一样,但他们实际上并没有想出一个适用于大数据的解决方案:bigmemory
,但出现 Error in CreateFileBackedBigMatrix(as.character(backingfile), as.character(backingpath), :
Problem creating filebacked matrix.
错误,我认为是因为数据帧太大。
有没有人想出好的办法来解决这个问题?我对其他包装创意持开放态度!
修复问题的更新代码
pixels.latlon=cbind(runif(200000,min=-180, max=-120), runif(200000, min=50, max=85))
grwl.tibble = tibble(long=runif(20000,min=-180, max=-120), lat=runif(20000, min=50, max=85), id=runif(20000, min=0, max=20000))
rnum <- apply(pixels.latlon, 1, function(x) {
xlon=x[1]
xlat=x[2]
grwl.filt = grwl.tibble %>%
filter(long < (xlon+0.3) & long >(xlon-0.3) & lat < (xlat+0.3)&lat >(xlat-.3))
grwl.latlon.filt = cbind(grwl.filt$long, grwl.filt$lat)
dm <- distm(x, grwl.latlon.filt, fun=distHaversine)
rnum=apply(dm, 1, which.min)
id = grwl.filt$id[rnum]
return(id)
})
这会使用更少的内存,因为它一次处理一行,而不是创建完整的距离矩阵(尽管它会更慢)
library(geosphere)
rnum <- apply(pixels.latlon, 1, function(x) {
dm <- distm(x, grwl.latlon, fun=distHaversine)
return(which.min(dm))
})
大部分时间都花在了复杂的 Haversine 公式上。由于您真的只对找到最近的点感兴趣,而不是精确的距离,我们可以使用更简单的距离度量。这里有一个替代方案,使用基于这篇文章 http://jonisalonen.com/2014/computing-distance-between-coordinates-can-be-simple-and-fast/ 的公式,并且还使用余弦的二次近似(它本身计算起来很昂贵)...
#quadratic cosine approximation using lm (run once)
qcos <- lm(y~x+I(x^2), data.frame(x=0:90, y=cos((0:90)*2*pi/360)))$coefficients
cosadj <- function(lat) qcos[1]+lat*(qcos[2]+qcos[3]*lat)
#define rough dist function
roughDist <- function(x,y){#x should be a single (lon,lat), y a (n*2) matrix of (lon,lat)
latDev <- x[2]-y[,2]
lonDev <- (x[1]-y[,1])*cosadj(abs(x[2]))
return(latDev*latDev+lonDev*lonDev) #don't need the usual square root or any scaling parameters
}
然后你可以用这个新函数替换 Haversine...
rnum <- apply(pixels.latlon, 1, function(x) {
dm <- distm(x, grwl.latlon, fun=roughDist)
return(which.min(dm))
})
在我的机器上,它的运行速度比 Haversine 版本快三倍。
您可以使用这个 R(cpp) 函数:
#include <Rcpp.h>
using namespace Rcpp;
double compute_a(double lat1, double long1, double lat2, double long2) {
double sin_dLat = ::sin((lat2 - lat1) / 2);
double sin_dLon = ::sin((long2 - long1) / 2);
return sin_dLat * sin_dLat + ::cos(lat1) * ::cos(lat2) * sin_dLon * sin_dLon;
}
int find_min(double lat1, double long1,
const NumericVector& lat2,
const NumericVector& long2,
int current0) {
int m = lat2.size();
double lat_k, lat_min, lat_max, a, a0;
int k, current = current0;
a0 = compute_a(lat1, long1, lat2[current], long2[current]);
// Search before current0
lat_min = lat1 - 2 * ::asin(::sqrt(a0));
for (k = current0 - 1; k >= 0; k--) {
lat_k = lat2[k];
if (lat_k > lat_min) {
a = compute_a(lat1, long1, lat_k, long2[k]);
if (a < a0) {
a0 = a;
current = k;
lat_min = lat1 - 2 * ::asin(::sqrt(a0));
}
} else {
// No need to search further
break;
}
}
// Search after current0
lat_max = lat1 + 2 * ::asin(::sqrt(a0));
for (k = current0 + 1; k < m; k++) {
lat_k = lat2[k];
if (lat_k < lat_max) {
a = compute_a(lat1, long1, lat_k, long2[k]);
if (a < a0) {
a0 = a;
current = k;
lat_max = lat1 + 2 * ::asin(::sqrt(a0));
}
} else {
// No need to search further
break;
}
}
return current;
}
// [[Rcpp::export]]
IntegerVector find_closest_point(const NumericVector& lat1,
const NumericVector& long1,
const NumericVector& lat2,
const NumericVector& long2) {
int n = lat1.size();
IntegerVector res(n);
int current = 0;
for (int i = 0; i < n; i++) {
res[i] = current = find_min(lat1[i], long1[i], lat2, long2, current);
}
return res; // need +1
}
/*** R
N <- 2000 # 2e6
M <- 500 # 2e4
pixels.latlon=cbind(runif(N,min=-180, max=-120), runif(N, min=50, max=85))
grwl.latlon=cbind(runif(M,min=-180, max=-120), runif(M, min=50, max=85))
# grwl.latlon <- grwl.latlon[order(grwl.latlon[, 2]), ]
library(geosphere)
system.time({
#calculate the distance matrix
dist.matrix = distm(pixels.latlon, grwl.latlon, fun=distHaversine)
#Pick out the indices of the minimum distance
rnum=apply(dist.matrix, 1, which.min)
})
find_closest <- function(lat1, long1, lat2, long2) {
toRad <- pi / 180
lat1 <- lat1 * toRad
long1 <- long1 * toRad
lat2 <- lat2 * toRad
long2 <- long2 * toRad
ord1 <- order(lat1)
rank1 <- match(seq_along(lat1), ord1)
ord2 <- order(lat2)
ind <- find_closest_point(lat1[ord1], long1[ord1], lat2[ord2], long2[ord2])
ord2[ind + 1][rank1]
}
system.time(
test <- find_closest(pixels.latlon[, 2], pixels.latlon[, 1],
grwl.latlon[, 2], grwl.latlon[, 1])
)
all.equal(test, rnum)
N <- 2e4
M <- 2e4
pixels.latlon=cbind(runif(N,min=-180, max=-120), runif(N, min=50, max=85))
grwl.latlon=cbind(long = runif(M,min=-180, max=-120), lat = runif(M, min=50, max=85))
system.time(
test <- find_closest(pixels.latlon[, 2], pixels.latlon[, 1],
grwl.latlon[, 2], grwl.latlon[, 1])
)
*/
N = 2e4
需要 0.5 秒,N = 2e5
需要 4.2 秒。
我无法让您的代码正常工作以进行比较。