如何计算 R 中分组数据行之间距离和时间差的所有可能组合？

Question

我的数据包括美元钞票的行驶距离和时间。我的数据如下所示：

   bid ts latitude longitude
1  123  0 38.40513  41.83777
2  123 23 38.41180  41.68493
3  123 45 42.20771  43.36318
4  123 50 40.22803  43.00208
5  456  0 39.12882  42.73877
6  456 12 38.46078  42.79847
7  456 27 40.53698  42.57617
8  456 19 39.04038  42.17070
9  234  0 39.18274  41.17445
10 234  8 39.58652  43.61317
11 234 15 41.32383  41.49377
12 234 23 40.26008  42.01927

出价=账单编号

ts = 当 t = 0 时从原始数据点计算的时间戳（天）

纬度和经度=位置

此数据显示了美国各地账单 ID 的变动情况。

我想计算每行 4 组的所有可能组合之间距离和时间的平方差。例如，对于出价 123 的组，我想计算以下距离和时间的差值：行1和第2行，第1行和第3行，第1行和第4行，第2行和第3行，第2行和第4行，第 3 行和第 4 行。

这会给我这组出价之间所有可能的计算组合。

我能够像这样在连续的行之间使用 dplyr 做到这一点：

detach("package:plyr", unload=TRUE)
library(magrittr)
library(dplyr)
library(geosphere)

deltadata <- group_by(df, bid) %>%

       mutate(

          dsq = (c(NA,distHaversine(cbind(longitude[-n()], latitude[-n()]),
                    cbind(longitude[  -1], latitude[  -1]))))^2,
          dt = c(NA, diff(ts))

              )%>%

ungroup() %>%
filter( ! is.na(dsq) )
deltadata

# A tibble: 21 x 6
     bid    ts latitude longitude          dsq    dt
   <dbl> <dbl>    <dbl>     <dbl>        <dbl> <dbl>
 1   123    23 38.41180  41.68493    178299634    23
 2   123    45 42.20771  43.36318 198827672092    22
 3   123    50 40.22803  43.00208  49480260636     5
 4   456    12 38.46078  42.79847   5557152213    12
 5   456    27 40.53698  42.57617  53781504422    15
 6   456    19 39.04038  42.17070  28958550947    -8
 7   234     8 39.58652  43.61317  46044153364     8
 8   234    15 41.32383  41.49377  69621429008     7
 9   234    23 40.26008  42.01927  15983792199     8
 10   345     5 40.25700  41.69525  26203255328     5
# ... with 11 more rows

问题：这仅计算连续行之间的平方距离和时间，即：第 1 行和第 2 行、第 2 行和第 3 行、第 3 行和第 4 行

有没有一种实用的方法可以让我对每组中所有可能的行组合执行此操作？

我希望我的输出对每个出价进行 6 次计算，如下所示：

# A tibble: 21 x 6
     bid    ts latitude longitude          dsq    dt
   <dbl> <dbl>    <dbl>     <dbl>        <dbl> <dbl>
 1   123    23 38.41180  41.68493    178299634    23  (for rows 1 and 2)
 2   123    45 42.20771  43.36318 198827672092    22  (for rows 1 and 3)
 3   123    50 40.22803  43.00208  49480260636     5  (for rows 1 and 4)
 4   123    12 38.46078  42.79847   5557152213    12  (for rows 2 and 3)
 5   123    27 40.53698  42.57617  53781504422    15  (for rows 2 and 4)
 6   123    19 39.04038  42.17070  28958550947    -8  (for rows 2 and 5)

我是 R 的新手，所以任何建议都不胜感激！

Answer 1

您可以像这样使用 inner_join：

library(dplyr)
library(geosphere)

df <- read.table(text = '   bid ts latitude longitude
1  123  0 38.40513  41.83777
2  123 23 38.41180  41.68493
3  123 45 42.20771  43.36318
4  123 50 40.22803  43.00208
5  456  0 39.12882  42.73877
6  456 12 38.46078  42.79847
7  456 27 40.53698  42.57617
8  456 19 39.04038  42.17070
9  234  0 39.18274  41.17445
10 234  8 39.58652  43.61317
11 234 15 41.32383  41.49377
12 234 23 40.26008  42.01927')


df %>%
  inner_join(df, by = c("bid" = "bid")) %>%
  mutate(
    dsq = distHaversine(cbind(longitude.x, latitude.x),
                        cbind(longitude.y, latitude.y))^2,
    dt = ts.x -ts.y
  ) %>%
  filter(dt > 0)
#>    bid ts.x latitude.x longitude.x ts.y latitude.y longitude.y          dsq dt
#> 1  123   23   38.41180    41.68493    0   38.40513    41.83777    178300279 23
#> 2  123   45   42.20771    43.36318    0   38.40513    41.83777 195932999496 45
#> 3  123   45   42.20771    43.36318   23   38.41180    41.68493 198827439286 22
#> 4  123   50   40.22803    43.00208    0   38.40513    41.83777  51230447939 50
#> 5  123   50   40.22803    43.00208   23   38.41180    41.68493  53740739037 27
#> 6  123   50   40.22803    43.00208   45   42.20771    43.36318  49479978030  5
#> 7  456   12   38.46078    42.79847    0   39.12882    42.73877   5557111219 12
#> 8  456   27   40.53698    42.57617    0   39.12882    42.73877  24765506646 27
#> 9  456   27   40.53698    42.57617   12   38.46078    42.79847  53781664569 15
#> 10 456   27   40.53698    42.57617   19   39.04038    42.17070  28958542352  8
#> 11 456   19   39.04038    42.17070    0   39.12882    42.73877   2506329323 19
#> 12 456   19   39.04038    42.17070   12   38.46078    42.79847   7133122323  7
#> 13 234    8   39.58652    43.61317    0   39.18274    41.17445  46043956815  8
#> 14 234   15   41.32383    41.49377    0   39.18274    41.17445  57544071797 15
#> 15 234   15   41.32383    41.49377    8   39.58652    43.61317  69621225065  7
#> 16 234   23   40.26008    42.01927    0   39.18274    41.17445  19614888600 23
#> 17 234   23   40.26008    42.01927    8   39.58652    43.61317  24136886438 15
#> 18 234   23   40.26008    42.01927   15   41.32383    41.49377  15983645507  8

Answer 2

并且由于您还使用了 data.table 标签，这里有一个使用该包的解决方案：

library(data.table)
library(geosphere)

df <- read.table(text = '   bid ts latitude longitude
1  123  0 38.40513  41.83777
2  123 23 38.41180  41.68493
3  123 45 42.20771  43.36318
4  123 50 40.22803  43.00208
5  456  0 39.12882  42.73877
6  456 12 38.46078  42.79847
7  456 27 40.53698  42.57617
8  456 19 39.04038  42.17070
9  234  0 39.18274  41.17445
10 234  8 39.58652  43.61317
11 234 15 41.32383  41.49377
12 234 23 40.26008  42.01927')
dt <- data.table(df, key = 'bid')
dt <- dt[dt, allow.cartesian = TRUE][ts < i.ts]
dt[, dt := i.ts - ts][, dsq := distHaversine(cbind(longitude, latitude),
                                             cbind(i.longitude, i.latitude))^2]
dt
#>     bid ts latitude longitude i.ts i.latitude i.longitude dt          dsq
#>  1: 123  0 38.40513  41.83777   23   38.41180    41.68493 23    178300279
#>  2: 123  0 38.40513  41.83777   45   42.20771    43.36318 45 195932999496
#>  3: 123 23 38.41180  41.68493   45   42.20771    43.36318 22 198827439286
#>  4: 123  0 38.40513  41.83777   50   40.22803    43.00208 50  51230447939
#>  5: 123 23 38.41180  41.68493   50   40.22803    43.00208 27  53740739037
#>  6: 123 45 42.20771  43.36318   50   40.22803    43.00208  5  49479978030
#>  7: 234  0 39.18274  41.17445    8   39.58652    43.61317  8  46043956815
#>  8: 234  0 39.18274  41.17445   15   41.32383    41.49377 15  57544071797
#>  9: 234  8 39.58652  43.61317   15   41.32383    41.49377  7  69621225065
#> 10: 234  0 39.18274  41.17445   23   40.26008    42.01927 23  19614888600
#> 11: 234  8 39.58652  43.61317   23   40.26008    42.01927 15  24136886438
#> 12: 234 15 41.32383  41.49377   23   40.26008    42.01927  8  15983645507
#> 13: 456  0 39.12882  42.73877   12   38.46078    42.79847 12   5557111219
#> 14: 456  0 39.12882  42.73877   27   40.53698    42.57617 27  24765506646
#> 15: 456 12 38.46078  42.79847   27   40.53698    42.57617 15  53781664569
#> 16: 456 19 39.04038  42.17070   27   40.53698    42.57617  8  28958542352
#> 17: 456  0 39.12882  42.73877   19   39.04038    42.17070 19   2506329323
#> 18: 456 12 38.46078  42.79847   19   39.04038    42.17070  7   7133122323

如何计算 R 中分组数据行之间距离和时间差的所有可能组合？

How to calculate all possible combinations of distance and time differences between grouped rows of data in R?

r

dataframe

dplyr

data.table

geosphere