How to loop distance calculations for multiple instances using dplyr in R

Question

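I have position data, an example of which is shown below, where time is the time at which each position was recorded, ref is the reference for each point, x is the x coordinate of each point, and y is the y coordinate of each point.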

    > print(df)
   time ref     x     y
1     1   1 92.80 49.58
2     1   2 90.20 96.02
3     1   3 91.61 80.05
4     1   4 68.75 20.56
5     1   5  5.53 35.27
6     1   6 39.85 85.39
7     1   7 12.04 87.43
8     1   8 42.98 56.53
9     1   9 19.14 63.56
10    1  10 25.72  7.62
11    2   1 50.39  7.16
12    2   2 17.71  7.15
13    2   3 52.96 34.87
14    2   4 52.70 97.07
15    2   5 70.88 44.88
16    2   6 32.12 71.82
17    2   7 24.15 22.77
18    2   8 18.06 31.03
19    2   9 70.55 92.42
20    2  10 45.05 79.67

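The steps I want to carry out are as follows (steps 1 to 4 have been completed successfully):

1. Replicate the x and y coordinates many times, each with a small error
2. Calculate the distance between every pair of points at each time instance
3. Sum these 45 distances for each time instance
4. Repeat this for all of the different iterations created in step 1
5. Create a new data frame containing all of this information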

Step 1

set.seed(456) #set seed to get consistent results

n <- 3 # this is 3 for this example but would likely be 1000 or 10000 and refers to the number of simulations


for(i in seq(5,(2*n+3),2)){ #create simulations of the xy data set
  df[,i] = df[,3] + rnorm(length(df[,2]),0,1) #replicates the x column 
  df[,i+1] = df[,4] + rnorm(length(df[,3]),0,1) # replicates the y column
}

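This code works and is easy to adjust, and it gives me the df below. The first 4 columns are exactly the same as above. V5 and V6 are the x and y coordinates for n=1, with a small error relative to the original x and y (you can see how similar the values are). V7 and V8 are the x and y for n=2, and V9 and V10 are the x and y for n=3.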

 print(df)
   time ref     x     y        V5        V6        V7        V8        V9       V10
1     1   1 92.80 49.58 91.456479 49.105396 92.771058 47.325290 91.720518 49.698151
2     1   2 90.20 96.02 90.821776 94.302691 90.593037 95.037940 89.758626 96.889903
3     1   3 91.61 80.05 92.410875 78.623170 91.360386 79.849432 93.630635 79.958064
4     1   4 68.75 20.56 67.361108 20.768236 68.833450 21.455930 68.822856 20.628899
5     1   5  5.53 35.27  4.815643 35.234164  7.608875 35.226455  6.238817 33.587573
6     1   6 39.85 85.39 39.525939 86.524285 39.970852 87.037308 40.700509 86.506956
7     1   7 12.04 87.43 12.730643 86.967145 12.158149 88.993299 10.553803 86.078642
8     1   8 42.98 56.53 43.230548 56.201616 43.750054 55.098622 43.900530 55.992833
9     1   9 19.14 63.56 20.147352 65.044539 17.964598 63.015406 19.288329 63.189886
10    1  10 25.72  7.62 26.293235  6.530622 26.129039  6.848746 25.483132  7.974012
11    2   1 50.39  7.16 49.474189  6.631206 49.725049  6.990012 49.916764  6.350175
12    2   2 17.71  7.15 19.021097  6.556207 17.453475  7.109238 17.040794  6.970275
13    2   3 52.96 34.87 53.948726 32.871084 53.638782 33.149460 54.318527 33.722340
14    2   4 52.70 97.07 54.353929 97.366153 53.596845 98.514106 54.112918 97.166242
15    2   5 70.88 44.88 69.439195 45.050625 71.498356 44.859985 70.147226 45.694700
16    2   6 32.12 71.82 34.067356 73.635652 32.851454 72.090232 32.039448 72.802941
17    2   7 24.15 22.77 25.886936 22.109397 23.736825 22.657066 24.960197 23.620843
18    2   8 18.06 31.03 18.447483 30.889748 19.617813 30.175112 18.562588 32.237347
19    2   9 70.55 92.42 72.830034 91.996021 71.091699 91.386259 71.674023 90.986222
20    2  10 45.05 79.67 46.587883 79.631264 45.627150 79.892027 44.878720 78.569054

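Step 2

I wrote code with dplyr that groups the data by time and then calculates the distance between each pair of reference points (this code is shown in step 3). With 10 reference points, there are 45 distances to calculate (10 choose 2).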
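As a quick check on the pair count, and as a possible shortcut, base R's `dist()` returns the Euclidean distance for every unordered pair of rows. A minimal sketch, assuming a made-up set of four points (the `pts` data frame is hypothetical and not part of the data above):

```r
# Hypothetical four points on a 3-by-4 rectangle, for illustration only
pts <- data.frame(x = c(0, 3, 0, 3), y = c(0, 0, 4, 4))

n_pairs <- choose(nrow(pts), 2)   # number of unordered pairs: 4 choose 2 = 6
d <- dist(pts[, c("x", "y")])     # Euclidean distance for every unordered pair

length(d) == n_pairs              # TRUE: one distance per pair
sum(d)                            # 3 + 4 + 5 + 5 + 4 + 3 = 24
choose(10, 2)                     # 45, the count for 10 reference points
```

Applied to one time slice of the real data, `sum(dist(...))` should give the same kind of total as the hand-written distance columns, without the need to halve it.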

Step 3

For each time group, I want to calculate the sum of all 45 distances. The code for steps 2 and 3 is below, wrapped in a function.

sumdist = function(data) {
  names(data)[3]<-paste("x") #renames 3rd column x to assist for loop
  names(data)[4]<-paste("y") #renames 4th column y to assist for loop
  data = data %>% 
    group_by(time) %>% 
    mutate(dist1 = sqrt((x[which(ref == 1)] - x)^2 + (y[which(ref == 1)] - y)^2)) %>% #distance between all points and point 1
    mutate(dist2 = sqrt((x[which(ref == 2)] - x)^2 + (y[which(ref == 2)] - y)^2)) %>% #distance between all points and point 2
    mutate(dist3 = sqrt((x[which(ref == 3)] - x)^2 + (y[which(ref == 3)] - y)^2)) %>% #distance between all points and point 3
    mutate(dist4 = sqrt((x[which(ref == 4)] - x)^2 + (y[which(ref == 4)] - y)^2)) %>% #distance between all points and point 4
    mutate(dist5 = sqrt((x[which(ref == 5)] - x)^2 + (y[which(ref == 5)] - y)^2)) %>% #distance between all points and point 5
    mutate(dist6 = sqrt((x[which(ref == 6)] - x)^2 + (y[which(ref == 6)] - y)^2)) %>% #distance between all points and point 6
    mutate(dist7 = sqrt((x[which(ref == 7)] - x)^2 + (y[which(ref == 7)] - y)^2)) %>% #distance between all points and point 7
    mutate(dist8 = sqrt((x[which(ref == 8)] - x)^2 + (y[which(ref == 8)] - y)^2)) %>% #distance between all points and point 8
    mutate(dist9 = sqrt((x[which(ref == 9)] - x)^2 + (y[which(ref == 9)] - y)^2)) %>% #distance between all points and point 9
    mutate(dist10 = sqrt((x[which(ref == 10)] - x)^2 + (y[which(ref == 10)] - y)^2)) %>% #distance between all points and point 10
    summarise(sumdistances = (sum(dist1,dist2,dist3,dist4,dist5,dist6,dist7,dist8,dist9,dist10))/2) #each pair is counted twice, so halve the total
  print(data$sumdistances) #print() also returns the vector, so it is the function's value
}

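When I run this function on my df, it only uses the first x and y for the calculation, but it works. It produces a vector of length 2: the first value is for time 1 and the second is for time 2.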

> sumdist(df) # this calculates it from the original x and y 
[1] 2706.592 2275.045

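Step 4

I now want to repeat this across the multiple iterations I created earlier. For my real data set, n will be in the thousands, so I need to automate this process.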

sumd = matrix(NA, nrow=2, ncol=n+1) # set collection matrix for nrow = number of time and #ncol = number simulations

for(i in 1:(n+1)) {
  datas = df[,c(1,2,((1+2*i)),(2+(2*i))),] # extracts the time, ref along with x and y for each simulations
  sumd[i] = sumdist(datas) # runs function on each simulated data set
}

Because my function prints the calculated data at the end, running the code shows that it has calculated the results I want:

> for(i in 1:(n+1)) {
+   datas = df[,c(1,2,((1+2*i)),(2+(2*i))),] # extracts the time, ref along with x and y for each simulations
+   sumd[i] = sumdist(datas) # runs function on each simulated data set
+ }
[1] 2706.592 2275.045
[1] 2695.796 2282.284
[1] 2713.277 2288.517
[1] 2719.587 2273.316

The bottom 4 lines are what I want to calculate, although not quite in the right arrangement.

Ideally it would look more like this:

 time       V2       V3       V4       V5
1    1 2706.592 2695.796 2713.277 2719.587
2    2 2275.045 2282.284 2288.517 2273.316

Step 5

However, half of my matrix still contains NA, and it is filled like this:

> print(sumd)
         [,1]     [,2] [,3] [,4]
[1,] 2706.592 2713.277   NA   NA
[2,] 2695.796 2719.587   NA   NA

The messages I receive are these warnings:

Warning messages:
1: In sumd[i] <- sumdist(datas) :
  number of items to replace is not a multiple of replacement length
2: In sumd[i] <- sumdist(datas) :
  number of items to replace is not a multiple of replacement length
3: In sumd[i] <- sumdist(datas) :
  number of items to replace is not a multiple of replacement length
4: In sumd[i] <- sumdist(datas) :
  number of items to replace is not a multiple of replacement length

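What has gone wrong seems straightforward: the matrix I created does not fit the output. I have tried changing the matrix in several ways to make it fit, but I keep getting errors and ultimately cannot obtain a matrix or data frame containing the information I want.

Edit: I now understand the mistake in my original code that stopped it from working, and it was naturally a simple one. sumd[i] should read sumd[,i]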
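The effect of the missing comma can be seen on a toy matrix. `sumd[i]` addresses a single cell in column-major order, so only the first element of each length-2 result is stored (hence the warnings and the NA columns), whereas `sumd[, i]` addresses the whole column. A minimal sketch, where `fake_sumdist()` is a hypothetical stand-in for `sumdist()` that returns one value per time point:

```r
# Toy collection matrix: 2 time points (rows) x 3 simulations (columns)
sumd <- matrix(NA_real_, nrow = 2, ncol = 3)

# Hypothetical stand-in for sumdist(): returns one value per time point
fake_sumdist <- function(i) c(10 * i, 10 * i + 1)

for (i in seq_len(ncol(sumd))) {
  sumd[, i] <- fake_sumdist(i)  # fill the whole i-th column (both rows)
}
sumd  # no NA left: row 1 is 10, 20, 30 and row 2 is 11, 21, 31
```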

Answer 1

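OK, after your edit I realise that I had misunderstood your question.

I think there is a problem with your design, because you want to create the columns in advance. Obviously they cannot have proper names, which makes it a bit hard to identify x and y.

Here is my suggestion: add the Gaussian noise and calculate the sum on the fly.

First, let's recreate the data frame (next time you could share this code or some dput output, which makes it easier to help).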

library(tidyverse)
df = read.table(header=TRUE, text="
time ref     x     y
1     1   1 92.80 49.58
2     1   2 90.20 96.02
3     1   3 91.61 80.05
4     1   4 68.75 20.56
5     1   5  5.53 35.27
6     1   6 39.85 85.39
7     1   7 12.04 87.43
8     1   8 42.98 56.53
9     1   9 19.14 63.56
10    1  10 25.72  7.62
11    2   1 50.39  7.16
12    2   2 17.71  7.15
13    2   3 52.96 34.87
14    2   4 52.70 97.07
15    2   5 70.88 44.88
16    2   6 32.12 71.82
17    2   7 24.15 22.77
18    2   8 18.06 31.03
19    2   9 70.55 92.42
20    2  10 45.05 79.67")

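Then, let's rewrite the distance calculation, as I find your code a bit redundant. A programming rule of thumb: DRY (don't repeat yourself). If you repeat a structure more than 3 times, you should probably write a function.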

options(dplyr.summarise.inform=FALSE) #silence summarise()'s informational grouping messages
distance = function(x1,x2,y1,y2) sqrt(((x2-x1)^2)+((y2-y1)^2))
distance2 = function(x,y,.pred) distance(x, x[.pred], y, y[.pred])    
distance_sum = function(x, y, ref){
    dists = map(1:10, ~distance2(x,y, which(ref == .x)))
    sum(unlist(dists))/2
}

Here, I can reproduce your result on x and y:

df %>% 
    group_by(time) %>% 
    summarise(sum=distance_sum(x, y, ref))
#> # A tibble: 2 x 2
#>    time   sum
#>   <int> <dbl>
#> 1     1 2707.
#> 2     2 2275.
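Note that `distance_sum()` hard-codes `1:10`, so it assumes exactly ten reference points. A possible generalisation is to loop over whichever refs are present. This is only a sketch: `distance_to()` and `distance_sum_any()` are hypothetical names introduced here, and `purrr` is assumed to be installed.

```r
library(purrr)

# Euclidean distance from every point to the point selected by `idx`
distance_to <- function(x, y, idx) sqrt((x - x[idx])^2 + (y - y[idx])^2)

# Hypothetical generalisation of distance_sum(): loops over the refs present
distance_sum_any <- function(x, y, ref) {
  dists <- map(unique(ref), ~ distance_to(x, y, which(ref == .x)))
  sum(unlist(dists)) / 2  # every pair is counted twice, so halve the total
}

# Three points forming a 3-4-5 right triangle: 3 + 4 + 5 = 12
tri_sum <- distance_sum_any(x = c(0, 3, 0), y = c(0, 0, 4), ref = c(1, 2, 3))
tri_sum
```

Like the original, this assumes each ref appears once per group, so it would be called inside `group_by(time) %>% summarise(...)`.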

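Finally, we can replicate this a given number of times, adding some random noise beforehand. Again, the resulting values are the same as yours.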

set.seed(456)
n <- 3 #or 10000
xx = rerun(n, {
    df %>% 
        mutate(x=x+rnorm(length(x),0,1), 
               y=y+rnorm(length(y),0,1)) %>% 
        group_by(time) %>% 
        summarise(sum=distance_sum(x, y, ref)) %>% 
        as.data.frame() #needed for the precision in the example, you can drop this line
})
xx
#> [[1]]
#>   time      sum
#> 1    1 2695.796
#> 2    2 2282.284
#> 
#> [[2]]
#>   time      sum
#> 1    1 2713.277
#> 2    2 2288.517
#> 
#> [[3]]
#>   time      sum
#> 1    1 2719.587
#> 2    2 2273.316
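`rerun()` has since been deprecated (as of purrr 1.0.0), so on a current installation the same loop can be written with `map()`. A minimal sketch, assuming `dplyr` and `purrr` are installed; the data frame `toy` and the helper `pair_sum()` are made up for illustration and stand in for `df` and `distance_sum()`:

```r
library(dplyr)
library(purrr)

# Hypothetical toy data: 3 points at each of 2 time steps
toy <- data.frame(
  time = rep(1:2, each = 3),
  ref  = rep(1:3, times = 2),
  x    = c(0, 3, 0, 1, 4, 1),
  y    = c(0, 0, 4, 1, 1, 5)
)

# Stand-in for distance_sum(): sum of all pairwise Euclidean distances
pair_sum <- function(x, y) sum(dist(cbind(x, y)))

set.seed(456)
n <- 3  # number of simulations
sims <- map(seq_len(n), function(i) {
  toy %>%
    mutate(x = x + rnorm(length(x), 0, 1),   # add Gaussian noise to x
           y = y + rnorm(length(y), 0, 1)) %>%  # and to y
    group_by(time) %>%
    summarise(sum = pair_sum(x, y))          # one total per time step
})
length(sims)  # one summary table per simulation
```

The index `i` is unused; it only drives the repetition, which is what `rerun()` did implicitly.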

You can then rbind the list and calculate some statistics:

xx %>% #this was run with n=25
    reduce(rbind) %>% 
    group_by(time) %>% 
    summarise(sum_m=mean(sum), sum_sd=sd(sum))
#> # A tibble: 2 x 3
#>    time sum_m sum_sd
#>   <int> <dbl>  <dbl>
#> 1     1 2711.   22.2
#> 2     2 2280.   16.8


Created on 2020-06-18 by the reprex package (v0.3.0)
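If the wide layout sketched in the question is preferred (one row per time, one column per simulation), the list can be stacked with an id column and then pivoted. A minimal sketch, assuming `dplyr` and `tidyr` are installed; `xx_demo` is a hand-built list that reuses the three simulated results printed earlier in this answer, and the `sim_` column prefix is an arbitrary choice:

```r
library(dplyr)
library(tidyr)

# Hand-built stand-in for `xx`, reusing the three results printed earlier
xx_demo <- list(
  data.frame(time = 1:2, sum = c(2695.796, 2282.284)),
  data.frame(time = 1:2, sum = c(2713.277, 2288.517)),
  data.frame(time = 1:2, sum = c(2719.587, 2273.316))
)

wide <- xx_demo %>%
  bind_rows(.id = "sim") %>%             # stack, recording the list position as `sim`
  pivot_wider(names_from = sim,          # one column per simulation
              values_from = sum,
              names_prefix = "sim_")
wide  # columns: time, sim_1, sim_2, sim_3
```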

Answer 2

library(tidyverse)

df <- tibble(
  ref = rep(c(1, 2, 3), each = 5),
  x = rnorm(15, 10, 8),
  y = rnorm(15, 35, 20)
)

# Number of created points
n <- 3

# Putting x and y as point
df <- df %>%
  mutate(point = map2(x, y, c)) 

# Adding noise to point
new_points <- seq_len(n)
names(new_points) <- new_points %>% str_c("point_", .)
new_cols <- new_points %>%
  map(~list(rnorm(15), rnorm(15)) %>% transpose() %>% map(unlist)) %>%
  map(~map2(.x, df$point, ~.x+.y)) %>%
  as_tibble()

# Binding new points 
df <- df %>%
  bind_cols(new_cols)

# Functions for calculating the Euclidean distance over a list of points
dList <- function(a, b)
  b %>% 
    map_dbl(~(a - .x)^2 %>% sum() %>% sqrt())
sumDistanceList <- function(l)
  seq_len(length(l) - 1) %>%
    map(~dList(l[[.x]], l[(.x+1):length(l)])) %>%
    unlist() %>%
    sum()

# Summarise
df %>%
  group_by(ref) %>%
  summarise(across(str_subset(names(.), "point_"), sumDistanceList))
