R: compute the smallest Euclidean distance between two datasets and label it automatically

I'm working with Euclidean distance on a pair of datasets. First of all, my data.

centers <- data.frame(x_ce = c(300,180,450,500),
                      y_ce = c(23,15,10,20),
                      center = c('a','b','c','d'))

points <- data.frame(point = c('p1','p2','p3','p4'),
                     x_p = c(160,600,400,245),
                     y_p = c(7,23,56,12))

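My goal is to find, for each point in points, the minimum distance to all the centers in centers, add the name of that nearest center to the points dataset, and make the whole process automatic.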

So I started from the basics:

#Euclidean distance
sqrt(sum((x-y)^2))
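
As a small worked instance of the formula above, here is a sketch that wraps it in a helper and applies it to point p1 and center b from the data. The helper name `euclid` is introduced here for illustration only; it is not part of the original post.

```r
# Hypothetical helper (not in the original post): Euclidean distance
# between two numeric vectors of equal length.
euclid <- function(x, y) sqrt(sum((x - y)^2))

p1 <- c(160, 7)    # coordinates of point p1
b  <- c(180, 15)   # coordinates of center b

euclid(p1, b)      # sqrt(20^2 + 8^2) = sqrt(464), about 21.54
```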

I actually already have in mind how it should work, but I can't manage to make it automatic.

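  1. choose one row of points and all the rows of centers
  2. calculate the Euclidean distance between that row and each row of centers
  3. choose the smallest distance
  4. attach the label of the center with the smallest distance
  5. repeat for the second row, and so on, until the end of points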
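
The procedure listed above can be sketched as a plain base R loop. This is only one possible way to automate it, under the assumption that the nearest center and its distance should be stored in two new columns of points; the column names `distance` and `center` are chosen here for illustration.

```r
centers <- data.frame(x_ce = c(300, 180, 450, 500),
                      y_ce = c(23, 15, 10, 20),
                      center = c('a', 'b', 'c', 'd'))
points <- data.frame(point = c('p1', 'p2', 'p3', 'p4'),
                     x_p = c(160, 600, 400, 245),
                     y_p = c(7, 23, 56, 12))

# Hypothetical result columns, filled in one point at a time.
points$distance <- NA_real_
points$center <- NA_character_

for (i in seq_len(nrow(points))) {
  # Distance from point i to every center (vectorized over the centers)
  d <- sqrt((points$x_p[i] - centers$x_ce)^2 + (points$y_p[i] - centers$y_ce)^2)
  # Position of the smallest distance (the first one if there are ties)
  k <- which.min(d)
  # Attach the distance and the label of the nearest center
  points$distance[i] <- d[k]
  points$center[i] <- as.character(centers$center[k])
}

points
```

Because the distance to all centers is computed in one vectorized expression, only the loop over points remains explicit.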

So I managed to do it manually, going through all the steps that would need to be automated:

# 1.  
x = (points[1,2:3])   # select the first of points
y1 = (centers[1,1:2]) # select the first center
y2 = (centers[2,1:2]) # select the second center
y3 = (centers[3,1:2]) # select the third center
y4 = (centers[4,1:2]) # select the fourth center

# 2.
# then the distances
distances <- data.frame(distance = c(
                                    sqrt(sum((x-y1)^2)),
                                    sqrt(sum((x-y2)^2)),
                                    sqrt(sum((x-y3)^2)),
                                    sqrt(sum((x-y4)^2))),
                                    center = centers$center
                                    )

# 3.
# then I choose the row with the smallest distance
d <- distances[which(distances$distance==min(distances$distance)),]

# 4.
# last, I put the label near the point
cbind(points[1,],d)

# 5. 
# then I restart for the second point

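The problem is that I cannot manage to automate it. Do you have any idea how to make this process run automatically for each point in points? Also, am I reinventing the wheel, i.e. does a faster approach (such as an existing function) exist that I'm not aware of?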

centers <- data.frame(x_ce = c(300,180,450,500),
                      y_ce = c(23,15,10,20),
                      center = c('a','b','c','d'))

points <- data.frame(point = c('p1','p2','p3','p4'),
                     x_p = c(160,600,400,245),
                     y_p = c(7,23,56,12))

library(tidyverse)

points %>%
  mutate(c = list(centers)) %>%
  unnest(cols = c) %>%               # create all possible combinations of points and centers as a data frame
  rowwise() %>%                      # for each combination
  mutate(d = sqrt(sum((c(x_p,y_p)-c(x_ce,y_ce))^2))) %>%   # calculate distance
  ungroup() %>%                                            # forget the grouping
  group_by(point, x_p, y_p) %>%                            # for each point
  summarise(closest_center = center[d == min(d)]) %>%      # keep the closest center
  ungroup()                                                # forget the grouping

# # A tibble: 4 x 4
#   point   x_p   y_p closest_center
#   <fct> <dbl> <dbl> <fct>         
# 1 p1      160     7 b             
# 2 p2      600    23 d             
# 3 p3      400    56 c             
# 4 p4      245    12 a
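
For comparison, the same nearest-center lookup can be sketched in base R with a full distance matrix. This is an alternative under stated assumptions, not the method used in the answer above: it relies on `outer()` to build all point-center differences, and the names `dmat` and `nearest` are introduced here for illustration.

```r
centers <- data.frame(x_ce = c(300, 180, 450, 500),
                      y_ce = c(23, 15, 10, 20),
                      center = c('a', 'b', 'c', 'd'))
points <- data.frame(point = c('p1', 'p2', 'p3', 'p4'),
                     x_p = c(160, 600, 400, 245),
                     y_p = c(7, 23, 56, 12))

# Distance matrix: one row per point, one column per center.
dx <- outer(points$x_p, centers$x_ce, "-")
dy <- outer(points$y_p, centers$y_ce, "-")
dmat <- sqrt(dx^2 + dy^2)

# Column index of the nearest center for each point (first one if tied).
nearest <- apply(dmat, 1, which.min)

points$closest_center <- as.character(centers$center[nearest])
points$distance <- dmat[cbind(seq_len(nrow(points)), nearest)]
points
```

The matrix has nrow(points) * nrow(centers) entries, so this sketch suits data of moderate size.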

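With the dplyr package, you can use group_by to go through each point, use mutate to build the list of distances to all centers, set distance to the minimum of that list, and set center to the name of the center at that minimum distance. I provide two alternatives, one for duplicate rows and one for non-unique point names.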

library(dplyr)

centers <- data.frame(x_ce = c(300,180,450,500),
                      y_ce = c(23,15,10,20),
                      center = c('a','b','c','d'))
points <- data.frame(point = c('p1','p2','p3','p4', "p4"),
                     x_p = c(160,600,400,245, 245),
                     y_p = c(7,23,56,12, 12))

# If duplicate rows need to be removed
result1 <- points %>%
  group_by(point) %>%
  distinct() %>%
  mutate(lst = with(centers, list(sqrt((x_p - x_ce)^2 + (y_p - y_ce)^2))),
         distance = min(unlist(lst)),
         center = centers$center[which.min(unlist(lst))]) %>%
  select(-lst)

which gives the result

# A tibble: 4 x 5
# Groups:   point [4]
  point   x_p   y_p distance center
  <fct> <dbl> <dbl>    <dbl> <fct> 
1 p1      160     7     21.5 b     
2 p2      600    23    100.  d     
3 p3      400    56     67.9 c     
4 p4      245    12     56.1 a 

# Alternative if point names are not unique
points <- data.frame(point = c('p1','p2','p3','p4', "p4"),
                     x_p = c(160,600,400,245, 550),
                     y_p = c(7,23,56,12, 25))
result2 <- points %>%
  rowwise() %>%
  mutate(lst = with(centers, list(sqrt((x_p - x_ce)^2 + (y_p - y_ce)^2))),
         distance = min(unlist(lst)),
         center = centers$center[which.min(unlist(lst))]) %>%
  ungroup() %>%
  select(-lst)

with the result

# A tibble: 5 x 5
  point   x_p   y_p distance center
  <fct> <dbl> <dbl>    <dbl> <fct> 
1 p1      160     7     21.5 b     
2 p2      600    23    100.  d     
3 p3      400    56     67.9 c     
4 p4      245    12     56.1 a     
5 p4      550    25     50.2 d