Lag function with variable n

Question

I am having some trouble with the lag function in dplyr. Here is my dataset.

ID <- c(100, 100, 100, 200, 200, 300, 300)
daytime <- c("2010-12-21 06:00:00", "2010-12-21 09:00:00", "2010-12-21 13:00:00 ", "2010-12-23 23:00:00", "2010-12-24 02:00:00", "2010-12-25 19:00:00", "2010-12-31 08:00:00")
lagfirstvisit <- c(0, 0, 2, 0, 1, 0, 0) 
table <- cbind(ID, daytime, lagfirstvisit) 
table <- as.data.frame(table)
table$daytime <- as.POSIXct(table$daytime)

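My goal is to create a new column holding a lagged value of daytime, where the number of lags is given by the lagfirstvisit column. That is, if lagfirstvisit == 2, I want the lag-2 daytime value within that ID. If lagfirstvisit == 0, the row keeps its own original daytime value.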

My expected result is as follows:

ID <- c(100, 100, 100, 200, 200, 300, 300)
daytime <- c("2010-12-21 06:00:00", "2010-12-21 09:00:00", "2010-12-21 13:00:00 ", "2010-12-23 23:00:00", "2010-12-24 02:00:00", "2010-12-25 19:00:00", "2010-12-31 08:00:00")
lagfirstvisit <- c(0, 0, 2, 0, 1, 0, 0) 
result <- c("2010-12-21 06:00:00", "2010-12-21 09:00:00", "2010-12-21 06:00:00", "2010-12-23 23:00:00", "2010-12-23 23:00:00", "2010-12-25 19:00:00", "2010-12-31 08:00:00")
table.results <- cbind(ID, daytime, lagfirstvisit, result)

The code I am currently using is:

table <- table %>%  
group_by(ID) %>% 
mutate(result = lag(as.POSIXct(daytime, format="%m/%d/%Y %H:%M:%S", tz= "UTC"), n = as.integer(lagfirstvisit)))

However, I get this error message:

Error in mutate_impl(.data, dots) : Evaluation error: n must be a non-negative integer scalar, not integer of length 3.

Does anyone know how I can solve this? Thank you very much!

Answer 1

    table%>%
      mutate_all(~as.numeric(as.character(.x)))%>%#First ensure all columns are numeric
      mutate(result=day[1:n()-lagfirstvisit])# you can also use row_number() instead of 1:n()

  ID day lagfirstvisit result
1 100  21             0     21
2 100  22             0     22
3 100  23             2     21
4 200  12             0     12
5 200  13             1     12
6 300  19             0     19
7 300  22             0     22

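Note: do not use the names of built-in functions as variable names. For example, you should not use the name table, because it is a function in base R.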

Edit:

For the new data the procedure stays the same, as long as lagfirstvisit is numeric:

table%>%
   mutate(result=daytime[1:n()-as.numeric(as.character(lagfirstvisit))])
   ID             daytime lagfirstvisit              result
1 100 2010-12-21 06:00:00             0 2010-12-21 06:00:00
2 100 2010-12-21 09:00:00             0 2010-12-21 09:00:00
3 100 2010-12-21 13:00:00             2 2010-12-21 06:00:00
4 200 2010-12-23 23:00:00             0 2010-12-23 23:00:00
5 200 2010-12-24 02:00:00             1 2010-12-23 23:00:00
6 300 2010-12-25 19:00:00             0 2010-12-25 19:00:00
7 300 2010-12-31 08:00:00             0 2010-12-31 08:00:00
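
The row-index idea above can also be applied per ID. The following is a minimal sketch, not the answerer's code: it assumes the data is rebuilt with data.frame() so that lagfirstvisit is numeric, uses the hypothetical name visits to avoid masking base::table, and maps any offset that would reach before the first row of an ID to NA.

```r
library(dplyr)

# Assumption: build the data with data.frame() rather than cbind(),
# so ID and lagfirstvisit stay numeric and daytime is POSIXct.
visits <- data.frame(
  ID = c(100, 100, 100, 200, 200, 300, 300),
  daytime = as.POSIXct(c(
    "2010-12-21 06:00:00", "2010-12-21 09:00:00", "2010-12-21 13:00:00",
    "2010-12-23 23:00:00", "2010-12-24 02:00:00", "2010-12-25 19:00:00",
    "2010-12-31 08:00:00"
  ), tz = "UTC"),
  lagfirstvisit = c(0, 0, 2, 0, 1, 0, 0)
)

lagged <- visits %>%
  group_by(ID) %>%
  mutate(
    # Position, within the ID, of the row to copy daytime from.
    idx = row_number() - lagfirstvisit,
    # Offsets that point before the first row of the ID become NA.
    result = daytime[replace(idx, idx < 1, NA)]
  ) %>%
  ungroup() %>%
  select(-idx)
```

Grouping by ID means an offset can never pull a value from a different ID, which the ungrouped version relies on the data to guarantee.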

Answer 2

table.results %>%
  group_by(ID) %>%
  mutate(
    result2=mapply(`[`, list(day), row_number() - lagfirstvisit)
  )
# A tibble: 7 x 5
# Groups:   ID [3]
     ID   day lagfirstvisit result result2
  <dbl> <dbl>         <dbl>  <dbl>   <dbl>
1  100.   21.            0.    21.     21.
2  100.   22.            0.    22.     22.
3  100.   23.            2.    21.     21.
4  200.   12.            0.    12.     12.
5  200.   13.            1.    12.     12.
6  300.   19.            0.    19.     19.
7  300.   22.            0.    22.     22.
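
This answer comes without an explanation, so here is a small base R sketch of what the mapply call does. The vectors day and offsets are illustrative values taken from the first ID in the output above; they are not part of the answer itself.

```r
# Illustrative values for a single ID (assumed, copied from the output above).
day <- c(21, 22, 23)
offsets <- c(0, 0, 2)

# Position of the element each row should take its value from.
idx <- seq_along(day) - offsets

# list(day) has length 1, so mapply recycles it: each call is day[idx[i]].
picked <- mapply(`[`, list(day), idx)

# For plain numeric vectors this matches direct vector indexing.
same_as_indexing <- identical(picked, day[idx])
```

With a date-time column such as daytime, direct indexing (daytime[idx]) is probably the safer choice, since mapply simplifies its results and that simplification is likely to drop the POSIXct class.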

Answer 3

I think this is clearer than the current answers:

table %>%
  group_by(ID, lagfirstvisit) %>%
  mutate(result = dplyr::lag(daytime, n = lagfirstvisit[1])) %>%
  ungroup()

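Because the data is also grouped by lagfirstvisit, all values of lagfirstvisit within a group are identical, so taking the first one is enough.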
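
One thing to check with this grouping: the lag is then taken only among rows that share both ID and lagfirstvisit. In the sample data the row with lagfirstvisit == 2 is alone in its group, so as far as I can tell it would get NA rather than 2010-12-21 06:00:00. Below is a minimal sketch that keeps the idea of passing a scalar n to dplyr::lag but groups by ID only. The helper lag_by and the data frame name visits are hypothetical names introduced for this illustration, and the data is assumed to be built with data.frame() so lagfirstvisit is numeric.

```r
library(dplyr)

# Assumed rebuild of the question's data with proper column types.
visits <- data.frame(
  ID = c(100, 100, 100, 200, 200, 300, 300),
  daytime = as.POSIXct(c(
    "2010-12-21 06:00:00", "2010-12-21 09:00:00", "2010-12-21 13:00:00",
    "2010-12-23 23:00:00", "2010-12-24 02:00:00", "2010-12-25 19:00:00",
    "2010-12-31 08:00:00"
  ), tz = "UTC"),
  lagfirstvisit = c(0, 0, 2, 0, 1, 0, 0)
)

# Hypothetical helper: lag x by a per-element n, calling dplyr::lag
# once per distinct lag value so that n is always a scalar.
lag_by <- function(x, n) {
  out <- x
  for (k in unique(n)) {
    sel <- n == k
    out[sel] <- dplyr::lag(x, n = k)[sel]
  }
  out
}

lagged_by_id <- visits %>%
  group_by(ID) %>%
  mutate(result = lag_by(daytime, lagfirstvisit)) %>%
  ungroup()
```

Rows whose lag reaches before the first observation of their ID come out as NA, which is the default padding of dplyr::lag.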
