In R, how can I group by one column and conditionally sum another?

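This is a follow-up to my previous question:

Suppose I have the data frame below. In my previous question, I asked how to calculate, for each row, the number of times that row's customer subsequently ordered product X (literally X, not the product associated with the row); that count is now given in nSubsqX. Now I'd like to know the total cost associated with those subsequent orders of X. I've entered the answer by hand in nCostSubsqX below, but I don't know how to compute it programmatically.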

   Date       Customer Product  cost nSubsqX nCostSubsqX
 1 2020-05-18 A        X           9       0           0
 2 2020-02-10 B        X           2       5          42
 3 2020-02-12 B        Y           3       5          42
 4 2020-03-04 B        Z           4       5          42
 5 2020-03-29 B        X           5       4          37
 6 2020-04-08 B        X           6       3          31
 7 2020-04-30 B        X           7       2          24
 8 2020-05-13 B        X           8       1          16
 9 2020-05-23 B        Y          10       1          16
10 2020-07-02 B        Y          11       1          16
11 2020-08-26 B        Y          12       1          16
12 2020-12-06 B        X          16       0           0
13 2020-01-31 C        X           1       3          42
14 2020-09-19 C        X          13       2          29
15 2020-10-13 C        X          14       1          15
16 2020-11-11 C        X          15       0           0
17 2020-12-26 C        Y          17       0           0

To provide a reprex, here is the code that creates the data frame.

df = data.frame("Date" = as.Date(c("2020-01-31", "2020-02-10", "2020-02-12", 
"2020-03-04", "2020-03-29", "2020-04-08", "2020-04-30", "2020-05-13", "2020-05-18", 
"2020-05-23", "2020-07-02", "2020-08-26", "2020-09-19", "2020-10-13", "2020-11-11", 
"2020-12-06", "2020-12-26")), "Customer" = c("C","B","B","B","B","B","B","B","A",
"B","B","B","C","C","C","B","C"), "Product" = c("X","X","Y","Z","X","X","X","X","X",
"Y","Y","Y","X","X","X","X","Y"))

df$cost = seq(nrow(df))

Here is the code that computes nSubsqX:

library(dplyr)

df %>%
  arrange(Customer, Date) %>%
  group_by(Customer) %>%
  mutate(
    nSubsqX = sum(Product=="X") - cumsum(Product=="X"))
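To see why this works, here is a minimal base-R sketch on a single customer. The vector `product` holds customer B's products in date order, copied from the table above; the variable names are just for illustration. The total number of X orders minus the running count up to and including each row leaves the number of X orders still to come.

```r
# Customer B's products in date order (copied from the table above)
product <- c("X", "Y", "Z", "X", "X", "X", "X", "Y", "Y", "Y", "X")

is_x <- product == "X"                  # TRUE where the product is X
n_subsq_x <- sum(is_x) - cumsum(is_x)   # total X orders minus X orders so far

print(n_subsq_x)
# 5 5 5 4 3 2 1 1 1 1 0
```

These values match the nSubsqX column for customer B.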

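Now I need to work out how to do the same thing with values taken from the cost column, restricted to the rows where Product is X, rather than from the Product column itself. Any ideas?

Attempt 1, which throws an error.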

df %>%
  arrange(Customer, Date) %>%
  group_by(Customer) %>%
  mutate(
    nSubsqX = sum(Product=="X") - cumsum(Product=="X"),
    nCostSubsqX = sum(cost[which(Product == "X")]) - cumsum(cost[which(Product == "X")]))
...
Error in `mutate_cols()`:
  Problem with `mutate()` column `nCostSubsqX`.
  `nCostSubsqX = sum(cost[which(Product == "X")]) - ...`.
  `nCostSubsqX` must be size 11 or 1, not 6.
  The error occurred in group 2: Customer = "B".
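The message points at a length mismatch: inside `mutate()`, a new column needs one value per row of the group (or a single value that can be recycled), but subsetting `cost` to the X rows before calling `cumsum()` leaves only as many values as there are X rows. A small base-R sketch with customer B's values (copied from the data above, variable names chosen for illustration) shows the two sizes:

```r
# Customer B's rows in date order (copied from the data above)
product <- c("X", "Y", "Z", "X", "X", "X", "X", "Y", "Y", "Y", "X")
cost    <- c(2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 16)

# Subsetting first shortens the vector to the X rows only
subset_cumsum <- cumsum(cost[which(product == "X")])

length(product)        # 11 rows in the group
length(subset_cumsum)  # 6 values, one per X row
```

Those are the 11 and 6 quoted in the error for group B.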

Attempt 2: the math is wrong. The nCostSubsqX column needs to have the cumulative cost up to that point subtracted.

df %>%
  arrange(Customer, Date) %>%
  group_by(Customer) %>%
  mutate(
    nSubsqX = sum(Product=="X") - cumsum(Product=="X"),
    nCostSubsqX = zoo::na.locf0(replace(rep(NA_real_, n()), 
                                        Product == "X", rev(seq_len(sum(cost[which(Product == "X")]))))))
...
   Date       Customer Product  cost nSubsqX nCostSubsqX
 1 2020-05-18 A        X           9       0           9
 2 2020-02-10 B        X           2       5          44
 3 2020-02-12 B        Y           3       5          44
 4 2020-03-04 B        Z           4       5          44
 5 2020-03-29 B        X           5       4          43
 6 2020-04-08 B        X           6       3          42
 7 2020-04-30 B        X           7       2          41
 8 2020-05-13 B        X           8       1          40
 9 2020-05-23 B        Y          10       1          40
10 2020-07-02 B        Y          11       1          40
11 2020-08-26 B        Y          12       1          40
12 2020-12-06 B        X          16       0          39
13 2020-01-31 C        X           1       3          43
14 2020-09-19 C        X          13       2          42
15 2020-10-13 C        X          14       1          41
16 2020-11-11 C        X          15       0          40
17 2020-12-26 C        Y          17       0          40

Attempt 3: I don't know what the math is doing here, but it isn't right!

df %>%
  arrange(Customer, Date) %>%
  group_by(Customer) %>%
  mutate(
    nSubsqX = sum(Product=="X") - cumsum(Product=="X"),
    nCostSubsqX = zoo::na.locf0(replace(rep(NA_real_, n()), 
                       Product == "X", rev(seq_len(sum(cost[which(Product == "X")])))))-
                  zoo::na.locf0(ifelse(Product == "X",cumsum(cost[which(Product == "X")]),NA)))

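Attempt 1 is nearly there. The important thing is to keep the number of rows. Inside cumsum(), replace cost[which(Product == "X")] with cost*(Product=="X") (a dirty trick): multiplying by the logical vector zeroes out the cost on non-X rows, so cumsum() still returns one value per row. By the way, which() is unnecessary.

The snippet would be: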

df %>%
  arrange(Customer, Date) %>%
  group_by(Customer) %>%
  mutate(
    nSubsqX = sum(Product=="X") - cumsum(Product=="X"),
    nCostSubsqX = sum(cost[Product == "X"]) - cumsum(cost*(Product == "X")))
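As a dependency-free cross-check of the same idea, here is a sketch in base R. It rebuilds the data frame from the question and uses `ave()` as a stand-in for `group_by()` plus `mutate()`; that substitution and the helper variable `x_cost` are assumptions made for illustration, not part of the snippet above.

```r
# Rebuild the data frame from the question
df <- data.frame(
  Date = as.Date(c("2020-01-31", "2020-02-10", "2020-02-12", "2020-03-04",
                   "2020-03-29", "2020-04-08", "2020-04-30", "2020-05-13",
                   "2020-05-18", "2020-05-23", "2020-07-02", "2020-08-26",
                   "2020-09-19", "2020-10-13", "2020-11-11", "2020-12-06",
                   "2020-12-26")),
  Customer = c("C", "B", "B", "B", "B", "B", "B", "B", "A",
               "B", "B", "B", "C", "C", "C", "B", "C"),
  Product = c("X", "X", "Y", "Z", "X", "X", "X", "X", "X",
              "Y", "Y", "Y", "X", "X", "X", "X", "Y")
)
df$cost <- seq(nrow(df))

# Sort by customer, then date, as arrange(Customer, Date) does
df <- df[order(df$Customer, df$Date), ]

# Cost on X rows, zero elsewhere, so the vector keeps its full length
x_cost <- df$cost * (df$Product == "X")

# Per customer: total X cost minus the running X cost so far
df$nCostSubsqX <- ave(x_cost, df$Customer,
                      FUN = function(v) sum(v) - cumsum(v))

print(df$nCostSubsqX)
# 0 42 42 42 37 31 24 16 16 16 16 0 42 29 15 0 0
```

Row for row, these agree with the nCostSubsqx values in the data.table output further down.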

If you're interested, here is a slightly different approach.

library(data.table)

f <- function(p,co=rep(1,length(p))) {
  sapply(seq_along(p), \(i) sum(co[-i:0][p[-i:0]=="X"]))
}

setDT(df)[
  order(Date,Customer),
  `:=`(nSubsqX = f(Product),nCostSubsqx=f(Product, cost)),
  by=Customer
]

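In this approach I use the same function f() for both nSubsqX and nCostSubsqx; the only difference is whether cost is passed to f() as the co argument or the default co (a vector of ones) is used.

Output: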
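The indexing inside f() can look cryptic, so here is a small sketch of what `-i:0` does. R parses it as `(-i):0`, a run of negative indices ending in zero; negative indices drop elements and the zero is ignored, so `co[-i:0]` keeps only the elements after position `i`. The toy vectors and the helper name `subsequent_x_sum` are made up for illustration.

```r
# Toy data: three orders for one customer, in date order
p  <- c("X", "Y", "X")
co <- c(2, 5, 6)

i <- 1
idx <- -i:0   # parsed as (-i):0, here c(-1, 0)
co[idx]       # drops the first element, the 0 is ignored: 5 6

# Sum of co over the X rows that come after position i
subsequent_x_sum <- function(p, co, i) {
  keep <- -i:0
  sum(co[keep][p[keep] == "X"])
}

subsequent_x_sum(p, co, 1)  # 6, only the third row is a later X
subsequent_x_sum(p, co, 3)  # 0, nothing comes after the last row
```

With the default `co` of all ones the same sum simply counts the later X rows, which is why one function can produce both columns.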

          Date Customer Product  cost nSubsqX nCostSubsqx
        <Date>   <char>  <char> <int>   <num>       <int>
 1: 2020-01-31        C       X     1       3          42
 2: 2020-02-10        B       X     2       5          42
 3: 2020-02-12        B       Y     3       5          42
 4: 2020-03-04        B       Z     4       5          42
 5: 2020-03-29        B       X     5       4          37
 6: 2020-04-08        B       X     6       3          31
 7: 2020-04-30        B       X     7       2          24
 8: 2020-05-13        B       X     8       1          16
 9: 2020-05-18        A       X     9       0           0
10: 2020-05-23        B       Y    10       1          16
11: 2020-07-02        B       Y    11       1          16
12: 2020-08-26        B       Y    12       1          16
13: 2020-09-19        C       X    13       2          29
14: 2020-10-13        C       X    14       1          15
15: 2020-11-11        C       X    15       0           0
16: 2020-12-06        B       X    16       0           0
17: 2020-12-26        C       Y    17       0           0