How to calculate Euclidean distance within groups in R
Suppose I have a data frame like this:
ID GroupID X Y
1 a 772.7778 226.5
1 a 806.5645 35.3871
1 a 925.5714 300.9286
1 b 708.0909 165.5455
1 b 630.8235 167.4118
2 a 555.3333 151.875
2 a 732.8947 462.3158
This is the result I want:
ID GroupID X Y Distance
1 a 772.7778 226.5 NA
1 a 806.5645 35.3871 dist between((772.7778,226.5),(806.5645,35.3871))
1 a 925.5714 300.9286 dist between((925.5714,300.9286),(806.5645,35.3871))
1 b 708.0909 165.5455 NA
1 b 630.8235 167.4118 dist between((708.0909,165.5455),(630.8235,167.4118))
2 a 555.3333 151.875 NA
2 a 732.8947 462.3158 dist between((732.8947,462.3158),(555.3333,151.875))
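Basically, I want the distance between consecutive points within each combination of ID and GroupID. The NA means that the first row of each subgroup (for example ID = 1, GroupID = a) has no previous point, so its distance is NA. Can anyone help me? Thanks!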
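Here is a solution that uses dplyr and computes the Euclidean distance with dist(). Note that it measures every point against the first row of its group, not against the previous row, so it matches the requested result only for the first two rows of a group (compare row 3 in the output with the other answers):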
library(dplyr)

df <- read.table(text = "
ID GroupID X Y
1 a 772.7778 226.5
1 a 806.5645 35.3871
1 a 925.5714 300.9286
1 b 708.0909 165.5455
1 b 630.8235 167.4118
2 a 555.3333 151.875
2 a 732.8947 462.3158", header = TRUE, stringsAsFactors = FALSE)

df %>%
  group_by(ID, GroupID) %>%
  mutate(rows = row_number()) %>%
  # pair every row with every row of the same (ID, GroupID) group
  left_join(df, by = c('ID', 'GroupID')) %>%
  rowwise() %>%
  # distance between the paired points; a zero distance (a point paired with itself) becomes NA
  mutate(Distance = ifelse(dist(rbind(c(X.x, Y.x), c(X.y, Y.y))) != 0,
                           dist(rbind(c(X.x, Y.x), c(X.y, Y.y))),
                           NA)) %>%
  # keep only the pairs that start at the first row of each group
  filter(rows == 1) %>%
  select(ID, GroupID, X = X.y, Y = Y.y, Distance)
## ID GroupID X Y Distance
## <int> <chr> <dbl> <dbl> <dbl>
## 1 1 a 772.7778 226.5000 NA
## 2 1 a 806.5645 35.3871 194.07648
## 3 1 a 925.5714 300.9286 169.95735
## 4 1 b 708.0909 165.5455 NA
## 5 1 b 630.8235 167.4118 77.28994
## 6 2 a 555.3333 151.8750 NA
## 7 2 a 732.8947 462.3158 357.63325
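If the goal is the distance to the previous row rather than to the first row of the group, dplyr's lag() is one way to get it. The following is only a sketch, not the original answerer's code; it assumes the rows within each group are already in the order in which the distances should be measured, and the name `consecutive` is purely illustrative.

```r
library(dplyr)

df <- read.table(text = "
ID GroupID X Y
1 a 772.7778 226.5
1 a 806.5645 35.3871
1 a 925.5714 300.9286
1 b 708.0909 165.5455
1 b 630.8235 167.4118
2 a 555.3333 151.875
2 a 732.8947 462.3158", header = TRUE, stringsAsFactors = FALSE)

consecutive <- df %>%
  group_by(ID, GroupID) %>%
  # lag() returns NA for the first row of each group, so Distance is NA there
  mutate(Distance = sqrt((X - lag(X))^2 + (Y - lag(Y))^2)) %>%
  ungroup()

consecutive
```

Because lag() is evaluated per group after group_by(), no join or rowwise() step is needed.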
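I have never used dist before, but here is a for loop that might work for you:
# assumes the rows of each group are contiguous and in the desired order
df$Distance <- NA_real_  # the first row of every group keeps NA
for (i in seq_len(nrow(df))) {
  # only measure when the previous row belongs to the same ID and GroupID
  if (i > 1 && df$ID[i] == df$ID[i-1] && df$GroupID[i] == df$GroupID[i-1]) {
    df$Distance[i] <- sqrt((df$X[i] - df$X[i-1])^2 + (df$Y[i] - df$Y[i-1])^2)
  }
}
df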
ID GroupID X Y Distance
1 1 a 772.7778 226.5000 NA
2 1 a 806.5645 35.3871 194.07648
3 1 a 925.5714 300.9286 290.98957
4 1 b 708.0909 165.5455 NA
5 1 b 630.8235 167.4118 77.28994
6 2 a 555.3333 151.8750 NA
7 2 a 732.8947 462.3158 357.63325
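The same idea can be written without an explicit loop. The following is a minimal base R sketch, not part of the original answer; it assumes the rows of each group appear in the order in which the distances should be measured, and the helper name `step_from_previous` is made up for illustration.

```r
df <- read.table(text = "
ID GroupID X Y
1 a 772.7778 226.5
1 a 806.5645 35.3871
1 a 925.5714 300.9286
1 b 708.0909 165.5455
1 b 630.8235 167.4118
2 a 555.3333 151.875
2 a 732.8947 462.3158", header = TRUE, stringsAsFactors = FALSE)

# one key per (ID, GroupID) combination that actually occurs
key <- interaction(df$ID, df$GroupID, drop = TRUE)

# hypothetical helper: difference to the previous value, NA for the first value
step_from_previous <- function(v) c(NA, diff(v))

# ave() applies the helper within each group and returns results in row order
dx <- ave(df$X, key, FUN = step_from_previous)
dy <- ave(df$Y, key, FUN = step_from_previous)

df$Distance <- sqrt(dx^2 + dy^2)
df
```

Grouping on the combined key means that a change in either ID or GroupID starts a new group, so the first row of each group gets NA.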
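Why not try something like this:
split the data by the combination of ID and GroupID, apply the distance function to each piece, and then unsplit the result?
# dat is the data frame from the question
key <- paste(dat$ID, dat$GroupID)
splitted <- split(dat[, c("X", "Y")], key)
distances <- lapply(splitted, function(x) {
  # after dropping the first column, the diagonal holds the distances between consecutive rows;
  # drop = FALSE keeps the matrix shape for groups of one or two rows
  c(NA, diag(as.matrix(dist(x))[, -1, drop = FALSE]))
})
dat$distances <- unsplit(distances, key)
dat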
ID GroupID X Y distances
1 1 a 772.7778 226.5000 NA
2 1 a 806.5645 35.3871 194.07648
3 1 a 925.5714 300.9286 290.98957
4 1 b 708.0909 165.5455 NA
5 1 b 630.8235 167.4118 77.28994
6 2 a 555.3333 151.8750 NA
7 2 a 732.8947 462.3158 357.63325
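Side note: dist will get slow if a group has more than about 10k rows, because it computes the distance between every pair of rows in the group, not just consecutive ones.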
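To see where that cost comes from: for a group of n rows, dist() returns all n(n-1)/2 pairwise distances, while only n-1 consecutive ones are needed here. The sketch below uses made-up random points (`pts` and `n` are illustrative and do not come from the answers) and checks that differencing adjacent rows gives the same values as the sub-diagonal of the full distance matrix.

```r
set.seed(1)
n <- 1000
pts <- cbind(X = runif(n), Y = runif(n))

# every pairwise distance in the group
all_pairs <- dist(pts)
length(all_pairs)    # n * (n - 1) / 2 = 499500

# consecutive distances only: difference adjacent rows, then take the row norms
consecutive <- sqrt(rowSums(diff(pts)^2))
length(consecutive)  # n - 1 = 999

# the same values sit on the sub-diagonal of the full distance matrix
from_dist <- as.matrix(all_pairs)[cbind(2:n, 1:(n - 1))]
all.equal(consecutive, from_dist)
```

For large groups, the diff()-based version avoids building the full distance matrix at all.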