R:如何将行程段与交通智能卡数据组合成旅程

R: how to combine trip segments into a journey with Transport smart card data

目前正在处理一个有趣的交通智能卡数据集。当前数据中的每一行代表一次旅行(例如,从 A 到 B 的公共汽车旅行)。 60 分钟以内的任何行程都需要分组到旅程中。

当前table:

CustomerID  SegmentID   OriginLocation  Destination Start Time  End Time    Fare    TypeOfTransport ..  ..  ..
A001        101         LocationA   LocationB       7:30am      7:45am      1.5        Bus          
A001        102         LocationB   LocationC       7:50am      8:30am      3.5        Train            
A001        103         LocationC   LocationB       17:10pm     18:00pm     3.5        Train            
A001        104         LocationB   LocationA       18:10pm     18:30pm     1.5        Bus          
A002        105         LocationK   LocationY       11:30am     12:30pm     3          Train            
A003        106         LocationP   LocationO       10:23am     11:13am     4          Ferrie           

然后改头换面:

CustomerID  JourneyID   OriginLocation  Destination Start Time  End Time    Fare    TypeOfTransport    NumOfTrips
A001        1           LocationA       LocationC   7:30am      8:30am      5       Intermodal        2
A001        2           LocationC       LocationA   17:10pm     18:30pm     5       Intermodal        2
A002        6           LocationK       LocationY   11:30am     12:30pm     3       Train             1
A003        8           LocationP       LocationO   10:23am     11:13am     4       Ferrie            1

我是 R 的新手,不知道如何开始,所以任何指导将不胜感激。

这是您的示例数据的入门。使用您的真实数据,事情可能会变得更加复杂...

library(dplyr)

## prepare data
df<- read.table(header=T, stringsAsFactors = FALSE, text="
CustomerID  SegmentID   OriginLocation  Destination StartTime  EndTime    Fare    TypeOfTransport
A001        101         LocationA   LocationB       7:30am      7:45am      1.5        Bus
A001        102         LocationB   LocationC       7:50am      8:30am      3.5        Train
A001        103         LocationC   LocationB       17:10pm     18:00pm     3.5        Train
A001        104         LocationB   LocationA       18:10pm     18:30pm     1.5        Bus
A002        105         LocationK   LocationY       11:30am     12:30pm     3          Train
A003        106         LocationP   LocationO       10:23am     11:13am     4          Ferrie")
df$StartTime <- as.POSIXct(df$StartTime, format="%H:%M%p")
df$EndTime <- as.POSIXct(df$EndTime, format="%H:%M%p")


## Get group for each observation according to start-/endtimes + gap (60secs+60mins)
getGrp <- function(EndTimes, StartTimes, maxgap = 60*60) {
  vbetween <- Vectorize(dplyr::between, c("left", "right"))
  none <- Negate(any)
  if (length(EndTimes) > 1) {
    mat <-   vbetween(EndTimes, StartTimes-maxgap, StartTimes)
    idx <- apply(mat, 1, none, na.rm = T)
    grp <- cumsum(c(TRUE, head(idx, -1)))
  } else {
    grp <- 1L
  }
  return(grp)
}

## Summarise
df %>% 
  group_by(CustomerID) %>% 
  mutate(grp = getGrp(EndTime, StartTime)) %>%
  group_by(grp, add = TRUE) %>%
  summarise(Fare = sum(Fare), 
            NumOfTrips = n(), 
            StartTime = format(min(StartTime), "%H:%M"), 
            EndTime = format(max(EndTime), "%H:%M"), 
            OriginLocation = OriginLocation[row_number(1)],
            Destination = tail(Destination, 1),
            TypeOfTransport = list(TypeOfTransport)) %>%
  select(-grp) 

## Result
# CustomerID  Fare NumOfTrips StartTime EndTime OriginLocation Destination TypeOfTransport
# (chr) (dbl)      (int)     (chr)   (chr)          (chr)       (chr)           (chr)
# 1       A001     5          2     07:30   08:30      LocationA   LocationC        <chr[2]>
# 2       A001     5          2     17:10   18:30      LocationC   LocationA        <chr[2]>
# 3       A002     3          1     11:30   12:30      LocationK   LocationY        <chr[1]>
# 4       A003     4          1     10:23   11:13      LocationP   LocationO        <chr[1]>