Calculate Euclidean distance in a faster way
I want to calculate the Euclidean distance between the rows of a data frame with 30,000 observations. A simple way to do this is the dist function (e.g. dist(data)). However, since my data frame is large, this takes too much time.
Some of the rows contain missing values. I do not need the distances between rows where both contain missing values, nor between rows where neither contains missing values.
In a for loop, I tried to exclude the combinations that I do not need. Unfortunately, my solution takes even more time:
# Some example data
data <- data.frame(
  x1 = c(1, 22, NA, NA, 15, 7, 10, 8, NA, 5),
  x2 = c(11, 2, 7, 15, 1, 17, 11, 18, 5, 5),
  x3 = c(21, 5, 6, NA, 10, 22, 12, 2, 12, 3),
  x4 = c(13, NA, NA, 20, 12, 5, 1, 8, 7, 14)
)
# Measure speed of dist() function
start_time_dist <- Sys.time()
# Calculate euclidean distance with dist() function for complete dataset
dist_results <- dist(data)
end_time_dist <- Sys.time()
time_taken_dist <- end_time_dist - start_time_dist
# Measure speed of my own loop
start_time_own <- Sys.time()
# Calculate euclidean distance with my own loop only for specific cases
# # #
# The following code should be faster!
# # #
data_cc <- data[complete.cases(data), ]
data_miss <- data[!complete.cases(data), ]
distance_list <- list()
for (i in 1:nrow(data_miss)) {
  distances <- numeric()
  for (j in 1:nrow(data_cc)) {
    distances <- c(distances, dist(rbind(data_miss[i, ], data_cc[j, ]), method = "euclidean"))
  }
  distance_list[[i]] <- distances
}
end_time_own <- Sys.time()
time_taken_own <- end_time_own - start_time_own
# Compare speed of both calculations
time_taken_dist # 0.002001047 secs
time_taken_own # 0.01562881 secs
Is there a faster way to calculate the Euclidean distances I need?
I recommend using parallel computation: put all the code in one function and execute it in parallel.
By default, R does all of its computation in a single thread; you have to add parallel threads manually. Starting a cluster in R takes time, but if you have a large data frame, the main work will run about (number_of_processors - 1) times faster.
These links may also help: How-to go parallel in R – basics + tips and A gentle introduction to parallel computing in R.
A good option is to split your job into smaller chunks and compute each of them in its own thread. Create the threads only once, since doing so is time-consuming in R.
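As a minimal sketch of that chunking step (the row and chunk counts are placeholders; in practice the chunk count would be the number of workers):

```r
# Hypothetical sketch: divide 30000 row indices into one chunk per worker
n_rows   <- 30000
n_chunks <- 7  # e.g. detectCores() - 1
chunks   <- split(seq_len(n_rows),
                  cut(seq_len(n_rows), n_chunks, labels = FALSE))
# Each worker p would then process only the rows in chunks[[p]]
```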
library(parallel)
library(foreach)
library(doParallel)

# Leave one core free for the system
no_cores <- detectCores() - 1

start_time_total <- Sys.time()
print(start_time_total)

# Start the cluster; worker output is written to the debug file
cl <- makeCluster(no_cores, outfile = "mycalculation_debug.txt")
registerDoParallel(cl)

# Results from all threads are combined row-wise into one data frame
out.df <- foreach(p = 1:no_cores,
                  .combine = rbind,  # stack the threads' results into one table
                  .packages = c(),   # list every package your function needs here
                  .inorder = TRUE) %dopar% {
  tryCatch({
    #
    # put your function here and process chunk p in parallel
    #
    print(Sys.time() - start_time_total)
  }, error = function(e) {
    paste0("Chunk ", p, " caused the error: '", conditionMessage(e), "'")
  })
}

stopCluster(cl)
gc()  # free memory held by the finished workers
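Independent of parallelization, the slow inner loop in the question can be vectorized: each incomplete row can be compared against all complete rows at once with matrix arithmetic. A sketch under the assumption that the goal is to reproduce dist()'s NA handling (sum of squared differences over the observed columns, scaled up by ncol/n_observed); `dist_to_cc` is a hypothetical helper, not part of any package:

```r
# Example data from the question
data <- data.frame(
  x1 = c(1, 22, NA, NA, 15, 7, 10, 8, NA, 5),
  x2 = c(11, 2, 7, 15, 1, 17, 11, 18, 5, 5),
  x3 = c(21, 5, 6, NA, 10, 22, 12, 2, 12, 3),
  x4 = c(13, NA, NA, 20, 12, 5, 1, 8, 7, 14)
)

cc     <- complete.cases(data)
m_cc   <- as.matrix(data[cc, ])   # complete rows
m_miss <- as.matrix(data[!cc, ])  # rows with missing values
p      <- ncol(data)

# Distances from one incomplete row to every complete row at once;
# only the columns observed in `row` are used, rescaled as in dist()
dist_to_cc <- function(row) {
  obs <- !is.na(row)
  sq  <- (t(m_cc[, obs, drop = FALSE]) - row[obs])^2
  sqrt(p / sum(obs) * colSums(sq))
}

# One row of distances per incomplete row (4 x 6 for the example data)
distance_matrix <- t(apply(m_miss, 1, dist_to_cc))
```

Each call to dist_to_cc replaces one full pass of the inner j loop, and the apply over m_miss is exactly the work that could be split into per-worker chunks inside the parallel skeleton.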