Cumulative sum with `all` or `any` by group

Consider the vectors:

group = rep(1:6, each = 2)
x = 1:12

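Now I want to compute a cumulative sum within each group, but only when any member of the group satisfies a condition. For example, the condition is `x %% 3 == 0`.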

## Without the cumulative sum
ave(x, group, FUN = function(x) any(x %% 3 == 0)) 
# [1] 0 0 1 1 1 1 0 0 1 1 1 1

## With the cumulative sum
ave(x, group, FUN = function(x) cumsum(any(x %% 3 == 0)))
# [1] 0 0 1 1 1 1 0 0 1 1 1 1

## Expected result with cumsum:
# [1] 0 0 1 2 1 2 0 0 1 2 1 2

The same thing happens with dplyr:

library(dplyr)

dWithoutCumsum <- data.frame(group, x) %>% 
  group_by(group) %>% 
  mutate(z = +any(x %% 3 == 0))

dWithCumsum <- data.frame(group, x) %>% 
  group_by(group) %>% 
  mutate(z = cumsum(any(x %% 3 == 0)))

all.equal(dWithCumsum, dWithoutCumsum)
# [1] TRUE

However, when cumsum is applied afterwards as a separate step, everything works:

ave(ave(x, group, FUN = function(x) any(x %% 3 == 0)), group, FUN = cumsum)
# [1] 0 0 1 2 1 2 0 0 1 2 1 2

data.frame(group, x) %>% 
  group_by(group) %>% 
  mutate(z = any(x %% 3 == 0),
         z = cumsum(z)) %>% 
  pull(z)
# [1] 0 0 1 2 1 2 0 0 1 2 1 2

Why doesn't cumsum work as expected in these cases (it also fails with `all` instead of `any`)? And is there a one-liner that gives the expected result?

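My understanding is that you want to return an increasing sequence if the group contains at least one multiple of 3, and a vector of zeros otherwise. In that case: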

g <- gl(6, 2)
g
##  [1] 1  1  2  2  3  3  4  4  5  5  6  6
## Levels: 1 2 3 4 5 6

x <- seq_along(g)
x
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12

f <- function(x) if (any(x %% 3 == 0)) seq_along(x) else integer(length(x))

unsplit(tapply(x, g, f, simplify = FALSE), g)
## [1] 0 0 1 2 1 2 0 0 1 2 1 2
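Since the question asks for a one-liner, the same idea can probably be folded into a single `ave` call. This is only a sketch: it multiplies the within-group position by the group-level flag, so groups without a match collapse to zero. The names `one_liner` and `one_liner_all`, and the `v > 4` condition used for the `all` variant, are illustrative assumptions and do not come from the original post.

```r
group <- rep(1:6, each = 2)
x <- 1:12

# seq_along(v) has one value per element; any(...) is a single TRUE/FALSE
# that is recycled across it, zeroing out groups with no multiple of 3
one_liner <- ave(x, group, FUN = function(v) seq_along(v) * any(v %% 3 == 0))
one_liner
#  [1] 0 0 1 2 1 2 0 0 1 2 1 2

# Same pattern with all(); the condition v > 4 is hypothetical, chosen so
# that some groups satisfy it entirely
one_liner_all <- ave(x, group, FUN = function(v) seq_along(v) * all(v > 4))
one_liner_all
#  [1] 0 0 0 0 1 2 1 2 1 2 1 2
```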

Or, with a data frame and dplyr:

library("dplyr")
d <- data.frame(g, x)
d %>% group_by(g) %>% mutate(y = f(x))
## # A tibble: 12 × 3
## # Groups:   g [6]
##    g         x     y
##    <fct> <int> <int>
##  1 1         1     0
##  2 1         2     0
##  3 2         3     1
##  4 2         4     2
##  5 3         5     1
##  6 3         6     2
##  7 4         7     0
##  8 4         8     0
##  9 5         9     1
## 10 5        10     2
## 11 6        11     1
## 12 6        12     2

You aren't actually doing a cumsum; no summing is needed. You are looking for the row number within each group.
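As for why `cumsum(any(...))` in the question gives the same values as `any(...)` alone, it may help to look at a single group. `any()` collapses the group to one logical value, so `cumsum()` only ever sees a vector of length one, and `ave()` or `mutate()` then recycles that single result across the group. The sketch below walks through this; the names `x_grp`, `flag`, `expanded` and `res` are illustrative, not from the original post.

```r
x_grp <- c(3L, 4L)                    # the second group from the question

flag <- any(x_grp %% 3 == 0)          # a single TRUE, not one value per element
length(flag)
# [1] 1
cumsum(flag)                          # cumulative sum of a length-one vector
# [1] 1

# Repeating the flag to the group's length gives cumsum something to accumulate
expanded <- cumsum(rep(flag, length(x_grp)))
expanded
# [1] 1 2

# Applied per group, this keeps cumsum in a single call
group <- rep(1:6, each = 2)
x <- 1:12
res <- ave(x, group, FUN = function(v) cumsum(rep(any(v %% 3 == 0), length(v))))
res
#  [1] 0 0 1 2 1 2 0 0 1 2 1 2
```

Because every element in a group shares the same flag, the running total is either all zeros or simply 1, 2, 3, ..., which is why a row number gives the same answer.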

Here are a couple of ways to do it with dplyr:

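library(dplyr)

# `df` is not defined in the original answer; assumed here to be the
# question's data with x renamed to y, which matches the output below
df <- data.frame(group, y = x)

df %>%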
  group_by(group) %>%
  mutate(
    result1 = row_number() * any(y %% 3 == 0),
    result2 = case_when(
      any(y %% 3 == 0) ~ row_number(),
      TRUE ~ 0L
    )
  )
# # A tibble: 12 × 4
# # Groups:   group [6]
#    group     y result1 result2
#    <int> <int>   <int>   <int>
#  1     1     1       0       0
#  2     1     2       0       0
#  3     2     3       1       1
#  4     2     4       2       2
#  5     3     5       1       1
#  6     3     6       2       2
#  7     4     7       0       0
#  8     4     8       0       0
#  9     5     9       1       1
# 10     5    10       2       2
# 11     6    11       1       1
# 12     6    12       2       2