How to compare within a group the first value to each subsequent value until a condition is met

I have a data frame with the following general structure:

Data frame:

rownum group date
1 a 2021-05-01
2 a 2021-05-02
3 a 2021-05-03
4 b 2021-05-15
5 b 2021-05-17
6 b 2021-05-30
7 b 2021-05-31
8 b 2021-05-31
9 c 2021-05-01
10 c 2021-05-05

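What I want to do is, within each group, compare the first row to each subsequent row until the difference between the dates reaches some threshold, say 10 days. Once a row meets that threshold, I want to restart the comparison from the next row and test it against the rows that follow. It would look like this:

Result, using a threshold of 10: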

|rownum|group |date       |date diff|
|------|------|-----------|---------|
|1     | a    |2021-05-01 |NA|
|2     | a    |2021-05-02 |1|
|3     | a    |2021-05-03 |2|
|4     | b    |2021-05-15 |NA|
|5     | b    |2021-05-17 |2|
|6     | b    |2021-05-30 |15 (meets criteria, start from row 7 now)|
|7     | b    |2021-05-31 | NA|
|8     | b    |2021-05-31 | 0|
|9     | c    |2021-05-01 | NA|
|10    | c    |2021-05-05 |  4|

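To reiterate: the first row of a group is compared to each subsequent row until the threshold is met. Counting then restarts from the next row in the group, which becomes the new reference row for the rows after it. The difference is recorded as datediff.

I have tried this, but I don't know whether sapply is the right approach: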

dataframe %>% 
    group_by(group) %>%
    mutate(
        datediff = sapply(date, function(x) {
            all(difftime(dataframe$date,dplyr::lag(dataframe, n = 1, default = NA)))
                }
        )
    )

I also tried this, which I think is closer to what I want:

for (m in 1:length(dataframe)) {
    dataframe <- dataframe %>% 
        group_by(group) %>% 
        rowwise() %>% 
        mutate(datediff = difftime(dataframe$date,dplyr::lag(date, n = m, default = NA), units="days"))
    }

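So far I have not been able to get the row-wise comparison right to implement the threshold part.

Here is a roundabout way to get what you want; note that with this solution some of the NAs come out as 0.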

library(tidyverse)

df %>% 
  group_by(group) %>% 
  mutate(date = as.Date(date),
         date_diff = date - first(date),
         flag = date_diff > 10) %>% 
  group_by(group, flag) %>% 
  mutate(temp_group = cur_group_id()) %>% 
  group_by(temp_group) %>% 
  mutate(date_diff = case_when(date_diff == first(date_diff) ~ date_diff,
                               date_diff != first(date_diff) & date_diff < 10 ~ date - first(date),
                               date_diff != first(date_diff) & date_diff > 10 ~ date - nth(date, 2))) %>% 
  ungroup() %>% 
  select(group, date, date_diff) 


# A tibble: 10 x 3
   group date       date_diff
   <chr> <date>     <drtn>   
 1 a     2021-05-01  0 days  
 2 a     2021-05-02  1 days  
 3 a     2021-05-03  2 days  
 4 b     2021-05-15  0 days  
 5 b     2021-05-17  2 days  
 6 b     2021-05-30 15 days  
 7 b     2021-05-31  0 days  
 8 b     2021-05-31  0 days  
 9 c     2021-05-01  0 days  
10 c     2021-05-05  4 days 

base R

func <- function(x, threshold = 10) {
  r <- rle(c(0, diff(x)) > threshold)
  if ((len <- length(r$values)) > 1) {
    r$lengths[len] <- r$lengths[len] - 1L
    r$lengths[1] <- r$lengths[1] + 1L
  }
  cumsum(inverse.rle(r))
}
dat$group2 <- ave(as.numeric(dat$date), dat$group, FUN = func)
dat$datediff <- ave(as.numeric(dat$date), dat[,c("group", "group2")], FUN = function(x) c(NA, (x - x[1])[-1]))
dat$group2 <- NULL
dat
#    rownum group       date datediff
# 1       1     a 2021-05-01       NA
# 2       2     a 2021-05-02        1
# 3       3     a 2021-05-03        2
# 4       4     b 2021-05-15       NA
# 5       5     b 2021-05-17        2
# 6       6     b 2021-05-30       15
# 7       7     b 2021-05-31       NA
# 8       8     b 2021-05-31        0
# 9       9     c 2021-05-01       NA
# 10     10     c 2021-05-05        4
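
To make the `rle` trick in `func` easier to follow, here is a rough trace of its steps on group b alone. The variable names (`jump`, `shifted`, `group2`) are only illustrative, and the dates are copied from the sample data; treat it as a sketch of the idea rather than a replacement for `func`.

```r
# Dates of group "b" from the sample data, as day counts.
x <- as.numeric(as.Date(c("2021-05-15", "2021-05-17", "2021-05-30",
                          "2021-05-31", "2021-05-31")))

# TRUE where the gap to the previous row exceeds the threshold.
jump <- c(0, diff(x)) > 10
jump
# [1] FALSE FALSE  TRUE FALSE FALSE

# Run-length encode, then move one element from the last run to the first.
# This shifts the whole pattern right by one row, so the row that crossed
# the threshold stays in the old sub-group and the NEXT row starts a new one.
r <- rle(jump)
n <- length(r$values)
if (n > 1) {
  r$lengths[n] <- r$lengths[n] - 1L
  r$lengths[1] <- r$lengths[1] + 1L
}
shifted <- inverse.rle(r)
shifted
# [1] FALSE FALSE FALSE  TRUE FALSE

# The cumulative sum turns each TRUE into the start of a new sub-group id.
group2 <- cumsum(shifted)
group2
# [1] 0 0 0 1 1
```

Rows 4 to 6 end up in sub-group 0 and rows 7 and 8 in sub-group 1, which is why the second `ave` call restarts the difference at row 7.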

dplyr

library(dplyr)
dat %>%
  group_by(group) %>%
  mutate(group2 = func(date)) %>%
  group_by(group, group2) %>%
  mutate(datediff = c(NA, (date - date[1])[-1])) %>%
  ungroup() %>%
  select(-group2)
# # A tibble: 10 x 4
#    rownum group date       datediff
#     <int> <chr> <date>        <dbl>
#  1      1 a     2021-05-01       NA
#  2      2 a     2021-05-02        1
#  3      3 a     2021-05-03        2
#  4      4 b     2021-05-15       NA
#  5      5 b     2021-05-17        2
#  6      6 b     2021-05-30       15
#  7      7 b     2021-05-31       NA
#  8      8 b     2021-05-31        0
#  9      9 c     2021-05-01       NA
# 10     10 c     2021-05-05        4

Data

dat <- structure(list(rownum = 1:10, group = c("a", "a", "a", "b", "b", "b", "b", "b", "c", "c"), date = structure(c(18748, 18749, 18750, 18762, 18764, 18777, 18778, 18778, 18748, 18752), class = "Date")), row.names = c(NA, -10L), class = "data.frame")

(I have already converted dat$date to Date class.)
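
One caveat worth noting: `func` starts a new sub-group when the gap between consecutive rows exceeds the threshold, which gives the expected result on the sample data. If the rule should be read literally (compare every row to the current reference row), an explicit loop is one way to sketch it. `datediff_from_ref` is a hypothetical helper name, and the sketch assumes the dates are sorted within each group and that "meets the threshold" means strictly greater than `threshold`; swap in `>=` if exactly 10 days should also trigger a restart.

```r
# Hypothetical helper: day difference from the current reference row,
# restarting at the row after any difference that exceeds the threshold.
datediff_from_ref <- function(x, threshold = 10) {
  x <- as.numeric(x)
  out <- rep(NA_real_, length(x))
  ref <- 1L                          # index of the current reference row
  for (i in seq_along(x)) {
    if (i == ref) next               # reference rows keep NA
    out[i] <- x[i] - x[ref]
    if (out[i] > threshold) ref <- i + 1L   # next row becomes the reference
  }
  out
}

# Same sample data as above, rebuilt here so the sketch runs on its own.
dat <- data.frame(
  rownum = 1:10,
  group  = c("a", "a", "a", "b", "b", "b", "b", "b", "c", "c"),
  date   = as.Date(c("2021-05-01", "2021-05-02", "2021-05-03",
                     "2021-05-15", "2021-05-17", "2021-05-30",
                     "2021-05-31", "2021-05-31",
                     "2021-05-01", "2021-05-05"))
)

dat$datediff <- ave(as.numeric(dat$date), dat$group, FUN = datediff_from_ref)
dat$datediff
# [1] NA  1  2 NA  2 15 NA  0 NA  4
```

The two approaches can differ when many small gaps add up past the threshold without any single gap exceeding it; the loop restarts in that case, while `func` does not.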

Another tidyverse solution. We can use accumulate for this task. dat is from r2evans's example.

library(tidyverse)

dat2 <- dat %>%
  group_by(group) %>%
  mutate(diff_lag = as.integer(date - lag(date))) %>%
  mutate(diff = accumulate(diff_lag, function(x, y){
    if (is.na(x)){
      res <- y
    }  else if (x > 10){
      res <- NA
    } else {
      res <- x + y
    }
    return(res)
  })) %>%
  select(-diff_lag) %>%
  ungroup()

dat2
# # A tibble: 10 x 4
#    rownum group date        diff
#     <int> <chr> <date>     <int>
#  1      1 a     2021-05-01    NA
#  2      2 a     2021-05-02     1
#  3      3 a     2021-05-03     2
#  4      4 b     2021-05-15    NA
#  5      5 b     2021-05-17     2
#  6      6 b     2021-05-30    15
#  7      7 b     2021-05-31    NA
#  8      8 b     2021-05-31     0
#  9      9 c     2021-05-01    NA
# 10     10 c     2021-05-05     4
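
To see what the accumulate step is doing, the fold can be traced on group b by itself. The sketch below uses base R's `Reduce(..., accumulate = TRUE)` as a stand-in for `purrr::accumulate` so that it runs without extra packages; `step` is just an illustrative name for the same function body, and the input vector is the lagged differences for group b written out by hand.

```r
# Lagged day differences for group "b"; the first row has no predecessor.
diff_lag <- c(NA, 2, 13, 1, 0)

# x is the running result for the previous row, y the current lagged difference.
step <- function(x, y) {
  if (is.na(x)) {
    y          # previous row was a reference row: start a new running sum
  } else if (x > 10) {
    NA         # previous row exceeded the threshold: this row is the new reference
  } else {
    x + y      # otherwise keep adding to the running difference
  }
}

running <- Reduce(step, diff_lag, accumulate = TRUE)
running
# [1] NA  2 15 NA  0
```

Because the running sum of consecutive differences equals the difference from the reference row, each value is effectively compared against the reference row rather than only its immediate neighbour.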