Creating unique groups in sequential data that repeats through time

Something like this has probably been asked before, but not anywhere I could find.

Thread about creating sequential IDs, with several additional links

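Creating sequential identifiers is not hard, but my data contains a time element that is throwing me for a loop. The data below is a made-up dataset, just to illustrate the problem in a tractable way: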

    dput(walking_dat)
structure(list(neighborhood = structure(c(3L, 3L, 3L, 3L, 3L, 
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Dinkytown", "Downtown", 
"Uptown"), class = "factor"), street = structure(c(4L, 3L, 3L, 
5L, 3L, 4L, 6L, 7L, 4L, 4L, 1L, 2L, 1L), .Label = c("12thAve", 
"14thAve", "Dupont", "Hennepin", "Lyndale", "Marquette", "Nicolette"
), class = "factor"), sequence = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 
5, 1, 2, 3), visit = c(1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 2)), .Names = c("neighborhood", 
"street", "sequence", "visit"), row.names = c(NA, -13L), class = "data.frame")

   neighborhood    street sequence visit
1        Uptown  Hennepin        1     1
2        Uptown    Dupont        2     1
3        Uptown    Dupont        3     1
4        Uptown   Lyndale        4     1
5        Uptown    Dupont        5     2
6      Downtown  Hennepin        1     1
7      Downtown Marquette        2     1
8      Downtown Nicolette        3     1
9      Downtown  Hennepin        4     2
10     Downtown  Hennepin        5     2
11    Dinkytown   12thAve        1     1
12    Dinkytown   14thAve        2     1
13    Dinkytown   12thAve        3     2

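To picture it, all of the data come from three people walking eastward through three neighborhoods in Minneapolis. Each row is a point in time at which their position was recorded. The first column is the neighborhood they walked through. The second column is the street they were on at each time point. The third column is the order in which the observations occurred.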

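I want to create the visit column so that consecutive time points on the same street in the same neighborhood count as a single visit, and a later return to that street counts as the next visit. How can I create this kind of sequential identifier?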


I was thinking that the ave() with FUN = seq_along trick might work, but I could not find a way of combining the factors that gets me where I want to be.

Create a sequential number (counter) for rows within each group of a dataframe [duplicate]


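Update: Uwe's solution works, but it breaks if someone decides to take all of their measurements at one intersection, which is what happened when I tried it on my real data. When that happens, the original number of rows is not returned in the final data.table. See what happens here: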

dput(walking_dat_2)
structure(list(neighborhood = structure(c(3L, 3L, 3L, 3L, 3L, 
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Dinkytown", "Downtown", 
"Uptown"), class = "factor"), street2 = structure(c(2L, 2L, 2L, 
2L, 2L, 2L, 3L, 4L, 2L, 2L, 1L, 1L, 1L), .Label = c("12thAve", 
"Hennepin", "Marquette", "Nicolette"), class = "factor"), sequence = c(1, 
2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3), visit_2 = c(1, 1, 1, 1, 
1, 1, 1, 1, 2, 2, 1, 1, 1)), .Names = c("neighborhood", "street2", 
"sequence", "visit_2"), row.names = c(NA, -13L), class = "data.frame")

   neighborhood   street2 sequence visit_2
1        Uptown  Hennepin        1       1
2        Uptown  Hennepin        2       1
3        Uptown  Hennepin        3       1
4        Uptown  Hennepin        4       1
5        Uptown  Hennepin        5       1
6      Downtown  Hennepin        1       1
7      Downtown Marquette        2       1
8      Downtown Nicolette        3       1
9      Downtown  Hennepin        4       2
10     Downtown  Hennepin        5       2
11    Dinkytown   12thAve        1       1
12    Dinkytown   12thAve        2       1
13    Dinkytown   12thAve        3       1

In this case, running Uwe's solution returns only 6 rows.

library(data.table)
setDT(walking_dat)[, visit_2 := rleid(neighborhood, street2)][
     , unique(.SD, by = "visit_2")][
         , visit_2 := rowid(neighborhood, street2)][
             walking_dat, on = .(neighborhood, street2, sequence), roll = TRUE, visit_2 := x.visit_2][]

   neighborhood   street2 sequence visit visit_2
1:       Uptown  Hennepin        1     1       1
2:     Downtown  Hennepin        1     2       1
3:     Downtown Marquette        2     3       1
4:     Downtown Nicolette        3     4       1
5:     Downtown  Hennepin        4     5       2
6:    Dinkytown   12thAve        1     6       1

# Not required, but convenient:
walking_dat$combo <- paste(walking_dat$neighborhood, walking_dat$street)

# TRUE on every row where the neighborhood/street combination changes:
is_start <- with(walking_dat, c(TRUE, diff(as.numeric(interaction(neighborhood, street))) != 0))

# Place holder:
walking_dat$visit <- NA

# Create it:
for (i in seq_len(nrow(walking_dat))) {
  if (is_start[i]) {
    # Count the run starts of this combination up to and including row i.
    walking_dat$visit[i] <- sum(is_start[1:i] & walking_dat$combo[1:i] == walking_dat$combo[i])
  } else {
    # Same street as the previous row, so it is still the same visit.
    walking_dat$visit[i] <- walking_dat$visit[i - 1]
  }
}

walking_dat
   neighborhood    street sequence visit              combo
1        Uptown  Hennepin        1     1    Uptown Hennepin
2        Uptown    Dupont        2     1      Uptown Dupont
3        Uptown    Dupont        3     1      Uptown Dupont
4        Uptown   Lyndale        4     1     Uptown Lyndale
5        Uptown    Dupont        5     2      Uptown Dupont
6      Downtown  Hennepin        1     1  Downtown Hennepin
7      Downtown Marquette        2     1 Downtown Marquette
8      Downtown Nicolette        3     1 Downtown Nicolette
9      Downtown  Hennepin        4     2  Downtown Hennepin
10     Downtown  Hennepin        5     2  Downtown Hennepin
11    Dinkytown   12thAve        1     1  Dinkytown 12thAve
12    Dinkytown   14thAve        2     1  Dinkytown 14thAve
13    Dinkytown   12thAve        3     2  Dinkytown 12thAve
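
For what it's worth, the same counting can probably be done in base R without an explicit loop. The following is only a sketch: the names walk, combo, run_id and visit_check are made up for illustration, and the example data are retyped by hand from the table in the question rather than taken from the dput() output.

```r
# Example data retyped from the question (character columns instead of factors).
walk <- data.frame(
  neighborhood = rep(c("Uptown", "Downtown", "Dinkytown"), times = c(5, 5, 3)),
  street = c("Hennepin", "Dupont", "Dupont", "Lyndale", "Dupont",
             "Hennepin", "Marquette", "Nicolette", "Hennepin", "Hennepin",
             "12thAve", "14thAve", "12thAve"),
  sequence = c(1:5, 1:5, 1:3),
  stringsAsFactors = FALSE
)

# One label per neighborhood/street combination.
# (Assumes pasting with a space cannot make two different combinations collide.)
combo <- paste(walk$neighborhood, walk$street)

# Number each run of identical consecutive labels: 1, 2, 2, 3, ...
run_id <- cumsum(c(TRUE, combo[-1] != combo[-length(combo)]))

# Within each combination, rank its runs in order of first appearance.
walk$visit_check <- ave(run_id, combo, FUN = function(r) match(r, unique(r)))

walk
```

Because every row of a run shares the same run_id, rows on the same street in a row get the same visit number, and a later return to that street gets the next one.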

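The difficulty here is that subsequent records on the same street in the same neighborhood should count as one visit. This requires collapsing those rows into one, counting the visits to the distinct neighborhoods and streets, and finally expanding the result back to the original number of rows.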

Note that the visit column, which contains the expected result, is not overwritten but kept for comparison with the computed visit_new column.

library(data.table)
setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][
  , unique(.SD, by = "visit_new")][
    , visit_new := rowid(neighborhood, street)][
      walking_dat, on = .(neighborhood, street, sequence), roll = TRUE, .SD]
    neighborhood    street sequence visit visit_new
 1:       Uptown  Hennepin        1     1         1
 2:       Uptown    Dupont        2     1         1
 3:       Uptown    Dupont        3     1         1
 4:       Uptown   Lyndale        4     1         1
 5:       Uptown    Dupont        5     2         2
 6:     Downtown  Hennepin        1     1         1
 7:     Downtown Marquette        2     1         1
 8:     Downtown Nicolette        3     1         1
 9:     Downtown  Hennepin        4     2         2
10:     Downtown  Hennepin        5     2         2
11:    Dinkytown   12thAve        1     1         1
12:    Dinkytown   14thAve        2     1         1
13:    Dinkytown   12thAve        3     2         2

Step-by-step explanation

The data.frame is coerced to a data.table. The rleid() function creates a unique number for each change in neighborhood and street.

 setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][]
    neighborhood    street sequence visit visit_new
 1:       Uptown  Hennepin        1     1         1
 2:       Uptown    Dupont        2     1         2
 3:       Uptown    Dupont        3     1         2
 4:       Uptown   Lyndale        4     1         3
 5:       Uptown    Dupont        5     2         4
 6:     Downtown  Hennepin        1     1         5
 7:     Downtown Marquette        2     1         6
 8:     Downtown Nicolette        3     1         7
 9:     Downtown  Hennepin        4     2         8
10:     Downtown  Hennepin        5     2         8
11:    Dinkytown   12thAve        1     1         9
12:    Dinkytown   14thAve        2     1        10
13:    Dinkytown   12thAve        3     2        11

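Note that rows 2 and 3 as well as rows 9 and 10 share the same number. These duplicates are removed in the next step, which creates a new temporary data.table object: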

setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][
  , unique(.SD, by = "visit_new")][]
    neighborhood    street sequence visit visit_new
 1:       Uptown  Hennepin        1     1         1
 2:       Uptown    Dupont        2     1         2
 3:       Uptown   Lyndale        4     1         3
 4:       Uptown    Dupont        5     2         4
 5:     Downtown  Hennepin        1     1         5
 6:     Downtown Marquette        2     1         6
 7:     Downtown Nicolette        3     1         7
 8:     Downtown  Hennepin        4     2         8
 9:    Dinkytown   12thAve        1     1         9
10:    Dinkytown   14thAve        2     1        10
11:    Dinkytown   12thAve        3     2        11

Now, we can number the visits to the distinct neighborhoods and streets using the rowid() function:

setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][
  , unique(.SD, by = "visit_new")][
    , visit_new := rowid(neighborhood, street)][]
    neighborhood    street sequence visit visit_new
 1:       Uptown  Hennepin        1     1         1
 2:       Uptown    Dupont        2     1         1
 3:       Uptown   Lyndale        4     1         1
 4:       Uptown    Dupont        5     2         2
 5:     Downtown  Hennepin        1     1         1
 6:     Downtown Marquette        2     1         1
 7:     Downtown Nicolette        3     1         1
 8:     Downtown  Hennepin        4     2         2
 9:    Dinkytown   12thAve        1     1         1
10:    Dinkytown   14thAve        2     1         1
11:    Dinkytown   12thAve        3     2         2

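Finally, we need to expand the result back to the original number of rows. This is done by a rolling join of the temporary data.table with the original data.frame (which includes all rows):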

setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][
  , unique(.SD, by = "visit_new")][
    , visit_new := rowid(neighborhood, street)][
      walking_dat, on = .(neighborhood, street, sequence), roll = TRUE, .SD]

It is perhaps worth noting that visit_new is reused to hold temporary data at the various stages before the final update.
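
As an aside, the collapse-and-expand steps can probably be avoided altogether. The sketch below is not the solution described above but a join-free variant under the same assumptions: it keeps the rleid() run number on every row and then ranks the runs within each neighborhood and street. The column names run_id and visit_alt are made up for illustration, and the data are retyped by hand from the question.

```r
library(data.table)

# Example data retyped from the question.
dt <- data.table(
  neighborhood = rep(c("Uptown", "Downtown", "Dinkytown"), times = c(5, 5, 3)),
  street = c("Hennepin", "Dupont", "Dupont", "Lyndale", "Dupont",
             "Hennepin", "Marquette", "Nicolette", "Hennepin", "Hennepin",
             "12thAve", "14thAve", "12thAve"),
  sequence = c(1:5, 1:5, 1:3)
)

# Step 1: one number per run of identical consecutive neighborhood/street rows.
dt[, run_id := rleid(neighborhood, street)]

# Step 2: within each neighborhood/street, rank the runs by first appearance.
# All rows of a run share the same run_id, so no rows are dropped or re-joined.
dt[, visit_alt := match(run_id, unique(run_id)), by = .(neighborhood, street)]

# Drop the helper column.
dt[, run_id := NULL]

dt[]
```

Since nothing is collapsed, the number of rows cannot change, which should also cover the case where all measurements are taken at a single intersection.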

New dataset

The fixed code also works with the second dataset provided by the OP:

walking_dat_2 <-
structure(list(neighborhood = structure(c(3L, 3L, 3L, 3L, 3L, 
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Dinkytown", "Downtown", 
"Uptown"), class = "factor"), street = structure(c(2L, 2L, 2L, 
2L, 2L, 2L, 3L, 4L, 2L, 2L, 1L, 1L, 1L), .Label = c("12thAve", 
"Hennepin", "Marquette", "Nicolette"), class = "factor"), sequence = c(1, 
2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3), visit = c(1, 1, 1, 1, 1, 
1, 1, 1, 2, 2, 1, 1, 1), visit_new = c(1L, 1L, 1L, 1L, 1L, 2L, 
3L, 4L, 5L, 5L, 6L, 6L, 6L)), .Names = c("neighborhood", "street", 
"sequence", "visit", "visit_new"), row.names = c(NA, -13L), class = "data.frame")

setDT(walking_dat_2)[, visit_new := rleid(neighborhood, street)][
  , unique(.SD, by = "visit_new")][
    , visit_new := rowid(neighborhood, street)][
      walking_dat_2, on = .(neighborhood, street, sequence), 
      roll = TRUE, .SD]
    neighborhood    street sequence visit visit_new
 1:       Uptown  Hennepin        1     1         1
 2:       Uptown  Hennepin        2     1         1
 3:       Uptown  Hennepin        3     1         1
 4:       Uptown  Hennepin        4     1         1
 5:       Uptown  Hennepin        5     1         1
 6:     Downtown  Hennepin        1     1         1
 7:     Downtown Marquette        2     1         1
 8:     Downtown Nicolette        3     1         1
 9:     Downtown  Hennepin        4     2         2
10:     Downtown  Hennepin        5     2         2
11:    Dinkytown   12thAve        1     1         1
12:    Dinkytown   12thAve        2     1         1
13:    Dinkytown   12thAve        3     1         1