How to enumerate each time there is a change in value of a dataframe column?

I have the data frame DFSurv below, and I want to create a column Event such that:

if TF[i]==TF[i-1] then Event[i] = Event[i-1] 
else Event[i] = Event[i-1] + 1

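This should be done within each group of prov, with the lag computed in the order given by the column Per.

The main idea is to add one every time the value of TF changes.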

set.seed(1)
DFSurv = data.frame(Per = c(1:10,1:10,1:10, 1:10),
                    prov = c(rep("A",10),rep("B",10), rep("D",10),rep("F",10)),
                    TF = sample(0:1,size=40,replace=TRUE))

When I use dplyr::lag I get wrong results:

DFSurv %>% mutate(Event = 0) %>%
  arrange(prov, Per) %>%
  group_by(prov) %>%
  mutate(Event = if_else(TF == dplyr::lag(TF, default =0), 
                         dplyr::lag(Event, default =0), 
                         dplyr::lag(Event, default =0)+1))


# A tibble: 40 x 4
# Groups:   prov [4]
     Per prov     TF Event
   <int> <chr> <int> <dbl>
 1     1 A         0     0
 2     2 A         1     1
 3     3 A         0     1
 4     4 A         0     0
 5     5 A         1     1
 6     6 A         0     1
 7     7 A         0     0
 8     8 A         0     0
 9     9 A         1     1
10    10 A         1     0
# ... with 30 more rows

These results are wrong: for Event[3], TF[3] != TF[2], so the value should be Event[2] + 1, i.e. 2.

This could be done with a loop, but a vectorized approach is preferred.
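
The attempt above most likely goes wrong because mutate() evaluates dplyr::lag(Event) once, on the existing Event column, which is all zeros at that point. The result is therefore only a 0/1 change flag, not a running count. For reference, the intended recurrence can be written as a plain loop. This is a minimal sketch on a single group's vector; count_changes_loop is a hypothetical helper name, not part of any package:

```r
# Hypothetical helper: running count of value changes along a vector.
# The first element always gets 0 because it has no predecessor.
count_changes_loop <- function(x) {
  event <- integer(length(x))
  if (length(x) < 2) return(event)
  for (i in 2:length(x)) {
    # Add one whenever the current value differs from the previous one
    event[i] <- event[i - 1] + as.integer(x[i] != x[i - 1])
  }
  event
}

# TF values of group "A" as printed in the output above
tf <- c(0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 1L)
count_changes_loop(tf)
#> [1] 0 1 2 2 3 4 4 4 5 5
```

A vectorized solution should reproduce this sequence within each prov group.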

Try this:

library(tidyverse)

set.seed(1)

DFSurv <- data.frame(
  Per = c(1:10, 1:10, 1:10, 1:10),
  prov = c(rep("A", 10), rep("B", 10), rep("D", 10), rep("F", 10)),
  TF = sample(0:1, size = 40, replace = TRUE)
)

DFSurv %>%
  arrange(prov, Per) %>%
  group_by(prov) %>%
  mutate(event = if_else(TF != lag(TF) & !is.na(lag(TF)), 1, 0),
         event_cum = cumsum(event))
#> # A tibble: 40 × 5
#> # Groups:   prov [4]
#>      Per prov     TF event event_cum
#>    <int> <chr> <int> <dbl>     <dbl>
#>  1     1 A         0     0         0
#>  2     2 A         1     1         1
#>  3     3 A         0     1         2
#>  4     4 A         0     0         2
#>  5     5 A         1     1         3
#>  6     6 A         0     1         4
#>  7     7 A         0     0         4
#>  8     8 A         0     0         4
#>  9     9 A         1     1         5
#> 10    10 A         1     0         5
#> # … with 30 more rows

Created on 2022-05-01 by the reprex package (v2.0.1)
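
The two steps used in this answer (flag each change, then accumulate the flags) can be traced on a plain vector. The following is a base R sketch for a single group, using the TF values of group "A" from the output above; the variable names prev, changed and event_cum are only illustrative:

```r
tf <- c(0, 1, 0, 0, 1, 0, 0, 0, 1, 1)

# Previous value of each element; NA for the first one, like dplyr::lag()
prev <- c(NA, head(tf, -1))

# TRUE where the value differs from its predecessor; the first element is
# FALSE because FALSE & NA evaluates to FALSE
changed <- !is.na(prev) & tf != prev

# Running total of the change flags
event_cum <- cumsum(changed)
event_cum
#> [1] 0 1 2 2 3 4 4 4 5 5
```

The !is.na() guard plays the same role as in the if_else() call: it keeps the first row of a group from producing NA.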

The essence of the solution to your problem is cumsum.

Note that the results I get from set.seed differ from yours.

library(dplyr)

set.seed(1)
DFSurv = data.frame(Per = c(1:10,1:10,1:10, 1:10),
                    prov = c(rep("A",10),rep("B",10), rep("D",10),rep("F",10)),
                    TF = sample(0:1,size=40,replace=TRUE))

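DFSurv %>% 
  arrange(prov, Per) %>%   # make sure rows are in Per order within each prov
  group_by(prov) %>% 
  mutate(Event = cumsum(abs(c(0, diff(TF)))))   # a change in 0/1 TF has |diff| = 1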
#> # A tibble: 40 × 4
#> # Groups:   prov [4]
#>      Per prov     TF Event
#>    <int> <chr> <int> <dbl>
#>  1     1 A         0     0
#>  2     2 A         1     1
#>  3     3 A         0     2
#>  4     4 A         0     2
#>  5     5 A         1     3
#>  6     6 A         0     4
#>  7     7 A         0     4
#>  8     8 A         0     4
#>  9     9 A         1     5
#> 10    10 A         1     5
#> # … with 30 more rows

Created on 2022-05-01 by the reprex package (v2.0.1)
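
One caveat: abs(diff(TF)) counts changes correctly only because TF takes the values 0 and 1, so every change has size 1. If the column could hold other values, a jump from 0 to 5 would add 5 instead of 1. A sketch of a variant that compares neighbours instead, assuming base R only; the data frame df and the helper count_changes are made up for illustration:

```r
# Hypothetical helper: running count of changes, for any comparable values
count_changes <- function(x) cumsum(c(FALSE, x[-1] != x[-length(x)]))

# Made-up example data, deliberately not sorted by Per
df <- data.frame(
  Per  = c(2, 1, 3, 1, 2, 3),
  prov = c("A", "A", "A", "B", "B", "B"),
  TF   = c(5, 0, 5, 2, 2, 7)
)

# Sort by group and period, then apply the helper within each prov
df <- df[order(df$prov, df$Per), ]
df$Event <- ave(df$TF, df$prov, FUN = count_changes)
df$Event
#> [1] 0 1 1 0 0 1
```

The same comparison can be used inside mutate() in place of abs(c(0, diff(TF))).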