How can I count a number of conditional rows within dplyr mutate in R?

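I want to take a set of data, group it by one column, sort it by another, and then count the subsequent occurrences of a certain event. For example, in the data below, I want to add a column, call it nSubsqX, that tells me, at each row, how many subsequent orders for that customer have product "X". Row 1 should be 3, because rows 13:15 are all Customer C, Product X; row 9 should be 0, because Customer A has no subsequent orders.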

         Date Customer Product
1  2020-01-31        C       X
2  2020-02-10        B       X
3  2020-02-12        B       Y
4  2020-03-04        B       Z
5  2020-03-29        B       X
6  2020-04-08        B       X
7  2020-04-30        B       X
8  2020-05-13        B       X
9  2020-05-18        A       X
10 2020-05-23        B       Y
11 2020-07-02        B       Y
12 2020-08-26        B       Y
13 2020-09-19        C       X
14 2020-10-13        C       X
15 2020-11-11        C       X
16 2020-12-06        B       X
17 2020-12-26        C       Y

To provide a reprex, here is the code to create the data frame.

df = data.frame("Date" = as.Date(c("2020-01-31", "2020-02-10", "2020-02-12", 
"2020-03-04", "2020-03-29", "2020-04-08", "2020-04-30", "2020-05-13", "2020-05-18", 
"2020-05-23", "2020-07-02", "2020-08-26", "2020-09-19", "2020-10-13", "2020-11-11", 
"2020-12-06", "2020-12-26")), "Customer" = c("C","B","B","B","B","B","B","B","A",
"B","B","B","C","C","C","B","C"), "Product" = c("X","X","Y","Z","X","X","X","X","X",
"Y","Y","Y","X","X","X","X","Y"))

I expect I will need some kind of mutate call, but I can't quite get it right. I tried:

df2 = df %>%
  group_by(Customer) %>%
  arrange(Customer, Date) %>%
  mutate(
    nSubsqX = length(Customer[which(Product == "X")]))

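This gives the total number of times "X" appears for the customer, but what I want is the number of subsequent occurrences. I also tried: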

df2 = df %>%
  group_by(Customer) %>%
  arrange(Customer, Date) %>%
  mutate(
    nSubsqX = length(Customer[which(Product == "X" & Date > Date)]))

which returns 0, probably because Date > Date doesn't make any sense. I need a way to say Date > THIS date. The result I'm trying to reach looks like this:

   Date       Customer Product nSubsqX
 1 2020-05-18 A        X             0
 2 2020-02-10 B        X             5
 3 2020-02-12 B        Y             5
 4 2020-03-04 B        Z             5
 5 2020-03-29 B        X             4
 6 2020-04-08 B        X             3
 7 2020-04-30 B        X             2
 8 2020-05-13 B        X             1
 9 2020-05-23 B        Y             1
10 2020-07-02 B        Y             1
11 2020-08-26 B        Y             1
12 2020-12-06 B        X             0
13 2020-01-31 C        X             3
14 2020-09-19 C        X             2
15 2020-10-13 C        X             1
16 2020-11-11 C        X             0
17 2020-12-26 C        Y             0

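I think this is just a matter of not even knowing what words to search for, so I'm sure that if I could figure out the right search terms, there is something out there that would tell me how to do this. I appreciate anyone pointing me in the right direction.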

Thanks!

How about something like this:

library(data.table)
library(magrittr)  # provides %>%, which data.table does not export
setDT(df)[order(Customer,Date)] %>% 
  .[Product=="X", nSubsqX:=.N-1:.N, by=.(Customer)] %>% 
  .[order(Customer,Date),nSubsqX:=zoo::na.locf(nSubsqX)] %>% 
  .[]

Output:

          Date Customer Product nSubsqX
        <Date>   <char>  <char>   <int>
 1: 2020-05-18        A       X       0
 2: 2020-02-10        B       X       5
 3: 2020-02-12        B       Y       5
 4: 2020-03-04        B       Z       5
 5: 2020-03-29        B       X       4
 6: 2020-04-08        B       X       3
 7: 2020-04-30        B       X       2
 8: 2020-05-13        B       X       1
 9: 2020-05-23        B       Y       1
10: 2020-07-02        B       Y       1
11: 2020-08-26        B       Y       1
12: 2020-12-06        B       X       0
13: 2020-01-31        C       X       3
14: 2020-09-19        C       X       2
15: 2020-10-13        C       X       1
16: 2020-11-11        C       X       0
17: 2020-12-26        C       Y       0

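data.table explanation:

  1. Use setDT() to convert df to a data.table
  2. Order by Customer, then by Date
  3. In i, restrict to the rows where Product == "X"
  4. In j, create nSubsqX (by group) by assigning the number of rows in the group (i.e. .N) minus the row number within the group (which can be generated on the fly as the sequence 1:.N)
  5. In by, set the grouping column; in this case we want to group by Customer
  6. Fill the remaining NA values with zoo::na.locf, which carries the last non-NA value forward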
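
To make steps 3 to 6 concrete, here is a minimal sketch on a made-up five-row table (the `toy` data is hypothetical, not the asker's). It swaps zoo::na.locf for data.table's own nafill() and does the fill by Customer, which is an assumption about what is wanted rather than what the answer above does.

```r
library(data.table)

# Hypothetical toy data: one customer, five orders already sorted by date
toy <- data.table(Customer = "B", Product = c("X", "Y", "X", "X", "Y"))

# `:` binds tighter than `-`, so .N - 1:.N is evaluated as .N - (1:.N).
# Only the X rows are assigned; the other rows are left as NA.
toy[Product == "X", nSubsqX := .N - 1:.N, by = Customer]

# Carry the last observed value forward within each customer
toy[, nSubsqX := nafill(nSubsqX, type = "locf"), by = Customer]

toy$nSubsqX
```

Note that a customer whose first order is not an X would keep a leading NA here, because locf has nothing earlier to carry forward. The sample data never hits that case, since every customer's first order is an X.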

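Here is one option with tidyverse - arrange by 'Customer', 'Date', then grouped by 'Customer', replace a vector of NA elements where the 'Product' is 'X' with the reverse sequence of the count of 'X' values, then either we use tidyr::fill or can use zoo::na.locf0 to fill the NA elements with the previous non-NA values
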
library(dplyr)
df %>% 
  arrange(Customer, Date) %>% 
  group_by(Customer) %>% 
  mutate(new = zoo::na.locf0(replace(rep(NA_real_, n()), 
      Product == "X", rev(seq_len(sum(Product == "X")))-1))) %>%
  ungroup

-output

# A tibble: 17 × 4
   Date       Customer Product   new
   <date>     <chr>    <chr>   <dbl>
 1 2020-05-18 A        X           0
 2 2020-02-10 B        X           5
 3 2020-02-12 B        Y           5
 4 2020-03-04 B        Z           5
 5 2020-03-29 B        X           4
 6 2020-04-08 B        X           3
 7 2020-04-30 B        X           2
 8 2020-05-13 B        X           1
 9 2020-05-23 B        Y           1
10 2020-07-02 B        Y           1
11 2020-08-26 B        Y           1
12 2020-12-06 B        X           0
13 2020-01-31 C        X           3
14 2020-09-19 C        X           2
15 2020-10-13 C        X           1
16 2020-11-11 C        X           0
17 2020-12-26 C        Y           0
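
The tidyr::fill route mentioned above is not shown in the code, so here is a rough sketch of how it might look. The `toy` tibble and the `res` name are made up for illustration; the only change from the zoo version is that the fill happens in a separate fill() step, which is applied within each group of a grouped data frame.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, already sorted by date within each customer
toy <- tibble(
  Customer = c("B", "B", "B", "B", "C", "C"),
  Product  = c("X", "Y", "X", "Y", "X", "X")
)

res <- toy %>%
  group_by(Customer) %>%
  # Reverse sequence of the X count (minus 1) on the X rows, NA elsewhere
  mutate(new = replace(rep(NA_real_, n()),
                       Product == "X",
                       rev(seq_len(sum(Product == "X"))) - 1)) %>%
  # Fill NA with the previous non-NA value, within each customer
  fill(new, .direction = "down") %>%
  ungroup()

res$new
```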

A similar option can be done with data.table

library(data.table)
setDT(df)[order(Customer, Date)][Product == "X", 
   nSubsqx := rev(seq_len(.N)) - 1, Customer][, 
      nSubsqx := nafill(nSubsqx, "locf"), Customer][]

-output

Index: <Product>
          Date Customer Product nSubsqx
        <Date>   <char>  <char>   <num>
 1: 2020-05-18        A       X       0
 2: 2020-02-10        B       X       5
 3: 2020-02-12        B       Y       5
 4: 2020-03-04        B       Z       5
 5: 2020-03-29        B       X       4
 6: 2020-04-08        B       X       3
 7: 2020-04-30        B       X       2
 8: 2020-05-13        B       X       1
 9: 2020-05-23        B       Y       1
10: 2020-07-02        B       Y       1
11: 2020-08-26        B       Y       1
12: 2020-12-06        B       X       0
13: 2020-01-31        C       X       3
14: 2020-09-19        C       X       2
15: 2020-10-13        C       X       1
16: 2020-11-11        C       X       0
17: 2020-12-26        C       Y       0

Here is a dplyr-only solution:

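The trick is to subtract the running count of X (i.e. cumsum(Product=="X")) from the total count of X (i.e. sum(Product=="X")) within each Customer group. Once the rows are sorted by date, the total minus the number of X orders seen so far (including the current row) is exactly the number of X orders still to come.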

library(dplyr)

df %>%
  arrange(Customer, Date) %>%
  group_by(Customer) %>%
  mutate(nSubsqX1 = sum(Product=="X") - cumsum(Product=="X"))

   Date       Customer Product nSubsqX1
   <date>     <chr>    <chr>      <int>
 1 2020-05-18 A        X              0
 2 2020-02-10 B        X              5
 3 2020-02-12 B        Y              5
 4 2020-03-04 B        Z              5
 5 2020-03-29 B        X              4
 6 2020-04-08 B        X              3
 7 2020-04-30 B        X              2
 8 2020-05-13 B        X              1
 9 2020-05-23 B        Y              1
10 2020-07-02 B        Y              1
11 2020-08-26 B        Y              1
12 2020-12-06 B        X              0
13 2020-01-31 C        X              3
14 2020-09-19 C        X              2
15 2020-10-13 C        X              1
16 2020-11-11 C        X              0
17 2020-12-26 C        Y              0
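
As a sanity check, the cumsum trick can be compared against a literal "Date > THIS date" count, which is what the question was originally reaching for. This is only a sketch on made-up data (`toy`, `fast`, `slow` and `check` are hypothetical names), and it assumes dates are unique within a customer; with tied dates the two columns could differ.

```r
library(dplyr)

# Hypothetical toy data with unique dates per customer
toy <- tibble(
  Date     = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01",
                       "2020-04-01", "2020-05-01")),
  Customer = c("B", "B", "B", "C", "C"),
  Product  = c("Y", "X", "X", "X", "Y")
)

check <- toy %>%
  arrange(Customer, Date) %>%
  group_by(Customer) %>%
  mutate(
    # Total X orders minus X orders seen so far (vectorised)
    fast = sum(Product == "X") - cumsum(Product == "X"),
    # For each row i, count X orders strictly later than row i's date
    slow = vapply(seq_along(Date),
                  function(i) sum(Product == "X" & Date > Date[i]),
                  integer(1))
  ) %>%
  ungroup()

identical(check$fast, check$slow)
```

The row-by-row version does work proportional to the square of the group size, so the cumsum form is likely the better choice on larger data.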