R: compute the smallest Euclidean distance between two datasets and label it automatically
I am working with Euclidean distance on a pair of datasets.
First of all, my data.
centers <- data.frame(x_ce = c(300,180,450,500),
y_ce = c(23,15,10,20),
center = c('a','b','c','d'))
points <- data.frame(point = c('p1','p2','p3','p4'),
x_p = c(160,600,400,245),
y_p = c(7,23,56,12))
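My goal is to find, for each point in `points`, the minimum distance to all the centers in `centers`, to append the name of that center to the `points` dataset (obviously the one with the smallest distance), and to automate this process.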
So I started from the basics:
#Euclidean distance
sqrt(sum((x-y)^2))
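As a small worked example of the formula above, here is a sketch that wraps it in a function and applies it to one point and one center from the data. The function name `euclid` and the variables `p1` and `b` are assumptions introduced only for illustration.

```r
# Euclidean distance between two numeric vectors of equal length
# (euclid is a hypothetical helper name, not part of the original post)
euclid <- function(x, y) sqrt(sum((x - y)^2))

p1 <- c(160, 7)    # coordinates of point p1
b  <- c(180, 15)   # coordinates of center b

# sqrt(20^2 + 8^2) = sqrt(464), roughly 21.54
euclid(p1, b)
```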
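In fact I already have an idea of how it should work, but I do not know how to make it run automatically.
- select one row of `points` and all the rows of `centers`
- compute the Euclidean distance between that row and each row of `centers`
- choose the smallest distance
- attach the label of the center with the smallest distance
- repeat for the second row ... until the end of `points`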
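For illustration, the steps above could be sketched in base R roughly as follows. This is only one possible sketch under the assumption that the column names match the data shown earlier; the index names `i` and `j` and the intermediate list `rows` are introduced here and are not part of the original post. Note that `which.min` keeps the first center in case of a tie.

```r
centers <- data.frame(x_ce = c(300, 180, 450, 500),
                      y_ce = c(23, 15, 10, 20),
                      center = c('a', 'b', 'c', 'd'))
points <- data.frame(point = c('p1', 'p2', 'p3', 'p4'),
                     x_p = c(160, 600, 400, 245),
                     y_p = c(7, 23, 56, 12))

# For each row i of points: distances to all centers, then keep the smallest
rows <- lapply(seq_len(nrow(points)), function(i) {
  d <- sqrt((centers$x_ce - points$x_p[i])^2 +
            (centers$y_ce - points$y_p[i])^2)
  j <- which.min(d)   # index of the nearest center (first one on ties)
  data.frame(distance = d[j], center = centers$center[j])
})

# Attach the distance and the center label to each point
result <- cbind(points, do.call(rbind, rows))
result
```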
So I managed to do it manually, going through all the steps that I would like to automate:
# 1.
x = (points[1,2:3]) # select the first of points
y1 = (centers[1,1:2]) # select the first center
y2 = (centers[2,1:2]) # select the second center
y3 = (centers[3,1:2]) # select the third center
y4 = (centers[4,1:2]) # select the fourth center
# 2.
# then the distances
distances <- data.frame(distance = c(
sqrt(sum((x-y1)^2)),
sqrt(sum((x-y2)^2)),
sqrt(sum((x-y3)^2)),
sqrt(sum((x-y4)^2))),
center = centers$center
)
# 3.
# then I choose the row with the smallest distance
d <- distances[which(distances$distance==min(distances$distance)),]
# 4.
# last, I put the label near the point
cbind(points[1,],d)
# 5.
# then I restart for the second point
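The problem is that I cannot manage to automate it. Do you have any idea how to make this process run automatically for every point in `points`?
Furthermore, am I reinventing the wheel, i.e. is there a faster procedure (such as an existing function) that I am not aware of?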
centers <- data.frame(x_ce = c(300,180,450,500),
y_ce = c(23,15,10,20),
center = c('a','b','c','d'))
points <- data.frame(point = c('p1','p2','p3','p4'),
x_p = c(160,600,400,245),
y_p = c(7,23,56,12))
library(tidyverse)
points %>%
mutate(c = list(centers)) %>%
unnest(cols = c) %>% # create all possible combinations of points and centers as a data frame
rowwise() %>% # for each combination
mutate(d = sqrt(sum((c(x_p,y_p)-c(x_ce,y_ce))^2))) %>% # calculate distance
ungroup() %>% # forget the grouping
group_by(point, x_p, y_p) %>% # for each point
summarise(closest_center = center[d == min(d)]) %>% # keep the closest center
ungroup() # forget the grouping
# # A tibble: 4 x 4
# point x_p y_p closest_center
# <fct> <dbl> <dbl> <fct>
# 1 p1 160 7 b
# 2 p2 600 23 d
# 3 p3 400 56 c
# 4 p4 245 12 a
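Using the `dplyr` package, you can use `group_by` to go through each point, use `mutate` to build a list of distances, set `distance` to the minimum of that list, and set `center` to the name of the center with the minimum distance. I have provided two alternatives for the cases of duplicate rows or duplicate point names.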
library(dplyr)
centers <- data.frame(x_ce = c(300,180,450,500),
y_ce = c(23,15,10,20),
center = c('a','b','c','d'))
points <- data.frame(point = c('p1','p2','p3','p4', "p4"),
x_p = c(160,600,400,245, 245),
y_p = c(7,23,56,12, 12))
#
# If duplicate rows need to be removed
#
result1 <- points %>% group_by(point) %>% distinct() %>%
mutate(lst = with(centers, list(sqrt( (x_p-x_ce)^2 + (y_p-y_ce)^2 ) ) ),
distance=min(unlist(lst)),
center = centers$center[which.min(unlist(lst))]) %>%
select(-lst)
which gives the result
# A tibble: 4 x 5
# Groups: point [4]
point x_p y_p distance center
<fct> <dbl> <dbl> <dbl> <fct>
1 p1 160 7 21.5 b
2 p2 600 23 100. d
3 p3 400 56 67.9 c
4 p4 245 12 56.1 a
and
#
# Alternative if point names are not unique
#
points <- data.frame(point = c('p1','p2','p3','p4', "p4"),
x_p = c(160,600,400,245, 550),
y_p = c(7,23,56,12, 25))
result2 <- points %>% rowwise() %>%
mutate( lst = with(centers, list(sqrt( (x_p-x_ce)^2 + (y_p-y_ce)^2 ) ) ),
distance=min(unlist(lst)),
center = centers$center[which.min(unlist(lst))]) %>%
ungroup() %>% select(-lst)
with the result
# A tibble: 5 x 5
point x_p y_p distance center
<fct> <dbl> <dbl> <dbl> <fct>
1 p1 160 7 21.5 b
2 p2 600 23 100. d
3 p3 400 56 67.9 c
4 p4 245 12 56.1 a
5 p4 550 25 50.2 d
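As a cross-check of the numbers above, the same result can probably also be obtained without any packages by building the full distance matrix with `outer`. This is a sketch rather than part of the answer; the names `dx`, `dy`, `dist_mat` and `idx` are assumptions introduced here, and the data are the five-row `points` from the second alternative.

```r
centers <- data.frame(x_ce = c(300, 180, 450, 500),
                      y_ce = c(23, 15, 10, 20),
                      center = c('a', 'b', 'c', 'd'))
points <- data.frame(point = c('p1', 'p2', 'p3', 'p4', 'p4'),
                     x_p = c(160, 600, 400, 245, 550),
                     y_p = c(7, 23, 56, 12, 25))

# Pairwise coordinate differences: one row per point, one column per center
dx <- outer(points$x_p, centers$x_ce, "-")
dy <- outer(points$y_p, centers$y_ce, "-")
dist_mat <- sqrt(dx^2 + dy^2)

# Column index of the nearest center for each point (first one on ties)
idx <- apply(dist_mat, 1, which.min)

# Pick the matching distance from each row and attach the center label
points$distance <- dist_mat[cbind(seq_len(nrow(points)), idx)]
points$center <- centers$center[idx]
points
```

The matrix has one entry per point-center pair, so this is convenient for small inputs like the ones here but its memory use grows with the product of the two row counts.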