Counting previous occurrences of certain variable per group and storing as new column

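I want to create five new columns that count how often a specific "stars" value has occurred for that business before that specific row (i.e., summing over all rows with a smaller rolingcount while holding business constant).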

For the first row of each business (i.e., rolingcount == 1), it should be NA, because that business has no previous occurrences.

Here is a sample data set:

business <-c("aaa","aaa","aaa","bbb","bbb","bbb","bbb","bbb","ccc","ccc","ccc","ccc","ccc","ccc","ccc","ccc") 
rolingcount <- c(1,2,3,1,2,3,4,5,1,2,3,4,5,6,7,8) 
stars <- c(5,5,3,5,5,1,2,3,5,1,2,3,4,5,5,5)
df <- cbind(business, rolingcount, stars)

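I think my problem is related to this one, but I couldn't get it to work: Numbering rows within groups in a data frame

I also tried a while loop, but without success.

Ideally, the output would look like this. (I omitted previousthree, previoustwo and previousone, because I believe they work the same way.)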

business <- c("aaa","aaa","aaa","bbb","bbb","bbb","bbb","bbb","ccc","ccc","ccc","ccc","ccc","ccc","ccc","ccc")
rolingcount <- c(1,2,3,1,2,3,4,5,1,2,3,4,5,6,7,8)
stars <- c(5,5,3,5,5,1,2,3,5,1,2,3,4,5,5,5)
previousfives <- c(NA,1,2,NA,1,2,2,2,NA,1,1,1,1,1,2,3)
previousfours <- c(NA,0,0,NA,0,0,0,0,NA,0,0,0,0,1,1,1)
df <- cbind(business, rolingcount, stars, previousfives, previousfours)

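Since I have to do this for more than 10 million rows, a fast option would be cool. Any help is much appreciated! :)

I don't know whether this option is really fast; I'm not used to working with that many rows. Here is a solution using the dplyr package from the tidyverse: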

library(tidyverse)
df %>% 
  as.data.frame() %>%   # cbind() returns a character matrix, so convert it first
  group_by(business) %>% 
  mutate(stars = as.numeric(stars),
         # stars of the previous row within the business; 0 marks "no previous row"
         # (this relies on 0 never being a real stars value)
         lag_stars = lag(stars, 1, default = 0),
         # running count of earlier rows with each stars value, NA on the first row
         previousfives = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 5)),
         previousfours = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 4)),
         previousthrees = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 3)),
         previoustwos = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 2)),
         previousones = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 1))) %>% 
  ungroup() %>% 
  select(-lag_stars)

Output:

# A tibble: 16 x 8
   business rolingcount stars previousfives previousfours previousthrees previoustwos previousones
   <chr>    <chr>       <dbl>         <dbl>         <dbl>          <dbl>        <dbl>        <dbl>
 1 aaa      1               5            NA            NA             NA           NA           NA
 2 aaa      2               5             1             0              0            0            0
 3 aaa      3               3             2             0              0            0            0
 4 bbb      1               5            NA            NA             NA           NA           NA
 5 bbb      2               5             1             0              0            0            0
 6 bbb      3               1             2             0              0            0            0
 7 bbb      4               2             2             0              0            0            1
 8 bbb      5               3             2             0              0            1            1
 9 ccc      1               5            NA            NA             NA           NA           NA
10 ccc      2               1             1             0              0            0            0
11 ccc      3               2             1             0              0            0            1
12 ccc      4               3             1             0              0            1            1
13 ccc      5               4             1             0              1            1            1
14 ccc      6               5             1             1              1            1            1
15 ccc      7               5             2             1              1            1            1
16 ccc      8               5             3             1              1            1            1

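Basically, group_by makes the computation run separately for each business, and each new column is a cumulative sum over the lagged stars values. If it is too slow, maybe it will give you an idea for another, faster approach. Hope this helps.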
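
In case a version without packages is useful, here is a minimal base R sketch of the same "cumulative sum of earlier rows" idea. It is only an illustration: `count_previous` is a hypothetical helper name, the data is built with data.frame() rather than cbind() so that stars stays numeric, the 16th stars value (5) is taken from the output shown above, and the rows are assumed to be already sorted by rolingcount within each business. I have not timed it on 10 million rows, so treat any speed benefit as unverified.

```r
# Sample data from the question, as a data.frame so stars stays numeric.
df <- data.frame(
  business    = rep(c("aaa", "bbb", "ccc"), times = c(3, 5, 8)),
  rolingcount = c(1:3, 1:5, 1:8),
  stars       = c(5, 5, 3, 5, 5, 1, 2, 3, 5, 1, 2, 3, 4, 5, 5, 5)
)

# Hypothetical helper: for each row, count the earlier rows of the same
# business whose stars equal `value`.
# Assumes rows are sorted by rolingcount within each business.
count_previous <- function(stars, business, value) {
  hit <- as.integer(stars == value)
  # Running total per business, including the current row ...
  running <- ave(hit, business, FUN = cumsum)
  # ... minus the current row, so only earlier rows are counted.
  previous <- running - hit
  # The first row of each business has no history, so it becomes NA.
  previous[!duplicated(business)] <- NA
  previous
}

# Column names follow the answer above; each maps to the stars value it counts.
targets <- c(previousfives = 5, previousfours = 4, previousthrees = 3,
             previoustwos = 2, previousones = 1)
for (name in names(targets)) {
  df[[name]] <- count_previous(df$stars, df$business, targets[[name]])
}

print(df)
```

Unlike the lag-based version, this does not need 0 as a "no previous row" marker, because the first row of each business is found with duplicated(). It should reproduce the previousfives and previousfours columns requested in the question.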