如何使用 R 中的 spread() 和 gather() 函数将给定的预订行程数据集重建为所需的链接行程数据集?

How do reconstruct the given booked trip dataset to a desired linked trip dataset using probably spread() and gather() functions in R?

我有一个预订的旅行数据集如下:

bktrips <- data.frame(
  userID =c("P001", "P001", "P001", "P001", "P001", "P002", "P002", "P002", "P002"), 
  mode = c("bus", "train", "taxi", "bus", "train", "taxi","bus", "train", "taxi"), 
  Origin = c("O1", "O2", "O3", "O4", "O5", "O6", "O7", "O8", "O9"), 
  Destination = c("D1", "D2", "D3", "D4", "D5", "D6", "D7","D8", "D9" ), 
  depart_dt = c("2019-11-05 8:00:00","2019-11-05 8:30:00", "2019-11-05 11:00:00", "2019-11-05 11:40:00", "2019-11-06 8:00:00", "2019-11-06 9:10:00", "2019-11-07 8:00:00", "2019-11-08 8:00:00", "2019-11-08 8:50:00"), 
  Olat = c("-33.87085", "-33.87138", "-33.79504", "-33.87832", "-33.89158", "-33.88993", "-33.89173", "-33.88573", "-33.88505"), 
  Olon = c("151.2073", "151.2039", "151.2737", "151.2174","151.2485", "151.2805","151.2469", "151.2169","151.2156"), 
  Dlat = c("-33.87372", "-33.87384", "-33.88323", "-33.89165", "-33.88993", "-33.89177", "-33.88573", "-33.87731", "-33.88573"), 
  Dlon = c("151.1957", "151.2126", "151.2175", "151.2471","151.2471", "151.2805","151.2514", "151.2175","151.2169")
)

现在我需要重组这个预订的旅行数据集以准备 linked 旅行数据集。

例如,link将公共汽车和火车行程(由同一用户ID,P001)合并为一个linked行程,并重新定义此行程的起点和终点(O1和D2,分别)。

我们需要对link次行程使用规则(相同用户ID的行程,上次行程的目的地在下一次行程的起点附近(1公里以内),上次行程结束与下一次行程之间的时间间隔行程少于 60 分钟)。

在预订的行程数据集中,变量是:

有人可以帮我吗?我是 R 的新用户。非常感谢。

这是使用 dplyrgeosphere 计算距离的方法。我使用 lubridate 来修复您的日期列。

首先,我们修复了您的列的 类。接下来,我们依赖于旅行必须按时间顺序发生的事实。因此,我们计算从dplyrlag和从geospheredistHaversine的前一个目的地的距离以及距上次出发的时间。

library(dplyr)
library(geosphere)
library(lubridate)
bktrips %>%
  mutate(depart_dt = ymd_hms(depart_dt)) %>%
  mutate_at(vars(contains(c("lat","lon"))),list(~as.numeric(as.character(.)))) %>%
  group_by(userID) %>% 
  arrange(depart_dt,.by_group = TRUE) %>%
  mutate(DistPrevDest = distHaversine(cbind(Olon,Olat),cbind(lag(Dlon),lag(Dlat))),
         TimePrevDep = difftime(depart_dt,lag(depart_dt))) %>%
  dplyr::select(-depart_dt,-contains(c("lat","lon")))
  userID mode  Origin Destination DistPrevDest TimePrevDep
  <fct>  <fct> <fct>  <fct>              <dbl> <drtn>     
1 P001   bus   O1     D1                   NA    NA mins  
2 P001   train O2     D2                  801.   30 mins  
3 P001   taxi  O3     D3                10434.  150 mins  
4 P001   bus   O4     D4                  547.   40 mins  
5 P001   train O5     D5                  130. 1220 mins  
6 P002   taxi  O6     D6                   NA    NA mins  
7 P002   bus   O7     D7                 3105. 1370 mins  
8 P002   train O8     D8                 3188. 1440 mins  
9 P002   taxi  O9     D9                  879.   50 mins  

现在我们可以使用一些逻辑和 cumsum 添加 TripID

然后我们按TripID分组,用summarize重新定义所有的列。

bktrips %>%
  mutate(depart_dt = ymd_hms(depart_dt)) %>%
  bktrips %>%
  mutate(depart_dt = ymd_hms(depart_dt)) %>%
  mutate_at(vars(contains(c("lat","lon"))),list(~as.numeric(as.character(.)))) %>%
  group_by(userID) %>% 
  arrange(depart_dt,.by_group = TRUE) %>%
  mutate(DistPrevDest = distHaversine(cbind(Olon,Olat),cbind(lag(Dlon),lag(Dlat))),
         TimePrevDep = difftime(depart_dt,lag(depart_dt))) %>%
  mutate(TripID = cumsum(!((is.na(DistPrevDest) | DistPrevDest < 1000) & (is.na(TimePrevDep) |TimePrevDep < 60)))) %>%
  group_by(userID,TripID) %>%
  summarize(mode = paste(mode,collapse = ","),
            Origin = first(Origin),
            Destination = last(Destination),
            depart_dt = paste(depart_dt,collapse = ","),
            Olat = first(Olat),
            Olon = first(Olon),
            Dlat = last(Dlat),
            Dlon = last(Dlon))
  userID TripID mode       Origin Destination depart_dt                                Olat  Olon  Dlat  Dlon
  <fct>   <int> <chr>      <fct>  <fct>       <chr>                                   <dbl> <dbl> <dbl> <dbl>
1 P001        0 bus,train  O1     D2          2019-11-05 08:00:00,2019-11-05 08:30:00 -33.9  151. -33.9  151.
2 P001        1 taxi,bus   O3     D4          2019-11-05 11:00:00,2019-11-05 11:40:00 -33.8  151. -33.9  151.
3 P001        2 train      O5     D5          2019-11-06 08:00:00                     -33.9  151. -33.9  151.
4 P002        0 taxi       O6     D6          2019-11-06 09:10:00                     -33.9  151. -33.9  151.
5 P002        1 bus        O7     D7          2019-11-07 08:00:00                     -33.9  151. -33.9  151.
6 P002        2 train,taxi O8     D9          2019-11-08 08:00:00,2019-11-08 08:50:00 -33.9  151. -33.9  151.

我建议您在数据中也包括到达时间,而不是计算出发时间和上次到达时间之间的差异。

编辑: 漏了一个cumsum()。现在修好了。另外,不再需要rleid

我不清楚你想用它去哪里,但这里是计算每组行程距离和时间的开始(按用户 ID)。我不得不快速找到一个包来计算经度和纬度的距​​离并找到 geosphere。 希望这有帮助。

library(dplyr)
library(tibble)
library(geosphere)

bktrips <- tibble(
  userID =c("P001", "P001", "P001", "P001", "P001", "P002", "P002", "P002", "P002"), 
  mode = c("bus", "train", "taxi", "bus", "train", "taxi","bus", "train", "taxi"), 
  Origin = c("O1", "O2", "O3", "O4", "O5", "O6", "O7", "O8", "O9"), 
  Destination = c("D1", "D2", "D3", "D4", "D5", "D6", "D7","D8", "D9" ), 
  depart_dt = c("2019-11-05 8:00:00","2019-11-05 8:30:00", "2019-11-05 11:00:00", "2019-11-05 11:40:00", "2019-11-06 8:00:00", "2019-11-06 9:10:00", "2019-11-07 8:00:00", "2019-11-08 8:00:00", "2019-11-08 8:50:00"), 
  Olat = c("-33.87085", "-33.87138", "-33.79504", "-33.87832", "-33.89158", "-33.88993", "-33.89173", "-33.88573", "-33.88505"), 
  Olon = c("151.2073", "151.2039", "151.2737", "151.2174","151.2485", "151.2805","151.2469", "151.2169","151.2156"), 
  Dlat = c("-33.87372", "-33.87384", "-33.88323", "-33.89165", "-33.88993", "-33.89177", "-33.88573", "-33.87731", "-33.88573"), 
  Dlon = c("151.1957", "151.2126", "151.2175", "151.2471","151.2471", "151.2805","151.2514", "151.2175","151.2169")
)

bktrips <- bktrips %>%
  mutate(depart_dt = as.POSIXct(depart_dt, format = "%Y-%m-%d %H:%M:%S"),
         Olat = as.numeric(Olat),
         Olon = as.numeric(Olon),
         Dlat = as.numeric(Dlat),
         Dlon = as.numeric(Dlon)) %>%
  group_by(userID) %>%
  mutate(trip_time = as.numeric(depart_dt - lag(depart_dt), units = 'mins')) %>%
  rowwise() %>%
  mutate(trip_distance = distm(x = c(Olon, Olat), y = c(Dlon, Dlat), fun = distHaversine))

> bktrips
Source: local data frame [9 x 11]
Groups: <by row>

# A tibble: 9 x 11
  userID mode  Origin Destination depart_dt            Olat  Olon  Dlat  Dlon trip_time trip_distance
  <chr>  <chr> <chr>  <chr>       <dttm>              <dbl> <dbl> <dbl> <dbl>     <dbl>         <dbl>
1 P001   bus   O1     D1          2019-11-05 08:00:00 -33.9  151. -33.9  151.        NA         1119.
2 P001   train O2     D2          2019-11-05 08:30:00 -33.9  151. -33.9  151.        30          849.
3 P001   taxi  O3     D3          2019-11-05 11:00:00 -33.8  151. -33.9  151.       150        11108.
4 P001   bus   O4     D4          2019-11-05 11:40:00 -33.9  151. -33.9  151.        40         3120.
5 P001   train O5     D5          2019-11-06 08:00:00 -33.9  151. -33.9  151.      1220          225.
6 P002   taxi  O6     D6          2019-11-06 09:10:00 -33.9  151. -33.9  151.        NA          205.
7 P002   bus   O7     D7          2019-11-07 08:00:00 -33.9  151. -33.9  151.      1370          787.
8 P002   train O8     D8          2019-11-08 08:00:00 -33.9  151. -33.9  151.      1440          939.
9 P002   taxi  O9     D9          2019-11-08 08:50:00 -33.9  151. -33.9  151.        50          142.