Counting previous occurrences of certain variable per group and storing as new column
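I want to create five new columns that count how often a particular "stars" value has occurred for that business before the given row (i.e. a sum over all rows with a smaller rolingcount and the same business).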
For the first row of each business (i.e. rolingcount == 1) the value should be NA, because that business has not appeared before.
Here is an example dataset:
business <- c("aaa","aaa","aaa","bbb","bbb","bbb","bbb","bbb","ccc","ccc","ccc","ccc","ccc","ccc","ccc","ccc")
rolingcount <- c(1,2,3,1,2,3,4,5,1,2,3,4,5,6,7,8)
# 16 values, one per row
stars <- c(5,5,3,5,5,1,2,3,5,1,2,3,4,5,5,5)
df <- cbind(business, rolingcount, stars)
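I think my problem is related to this question, but I could not get it to work: Numbering rows within groups in a data frame
I also tried a while loop, without success.
Ideally the output would look like this. (I left out previousthree, previoustwo and previousone because I believe they work the same way.)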
business <- c("aaa","aaa","aaa","bbb","bbb","bbb","bbb","bbb","ccc","ccc","ccc","ccc","ccc","ccc","ccc","ccc")
rolingcount <- c(1,2,3,1,2,3,4,5,1,2,3,4,5,6,7,8)
# 16 values, one per row
stars <- c(5,5,3,5,5,1,2,3,5,1,2,3,4,5,5,5)
previousfives <- c(NA,1,2,NA,1,2,2,2,NA,1,1,1,1,1,2,3)
previousfours <- c(NA,0,0,NA,0,0,0,0,NA,0,0,0,0,1,1,1)
df <- cbind(business, rolingcount, stars, previousfives, previousfours)
Since I have to do this for more than 10 million rows, a fast option would be great. Thanks a lot for your help! :)
I don't know whether this option is really fast; I'm not used to working with that many rows.
Here is a solution using the dplyr package from the tidyverse:
library(tidyverse)

df %>%
  as.data.frame() %>%   # cbind() produced a character matrix
  group_by(business) %>%
  mutate(stars = as.numeric(stars),
         # stars of the previous row within the business; 0 marks "no previous row"
         # (this relies on 0 never being a real stars value)
         lag_stars = lag(stars, 1, default = 0),
         previousfives  = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 5)),
         previousfours  = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 4)),
         previousthrees = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 3)),
         previoustwos   = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 2)),
         previousones   = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 1))) %>%
  ungroup() %>%
  select(-lag_stars)
Output:
# A tibble: 16 x 8
   business rolingcount stars previousfives previousfours previousthrees previoustwos previousones
   <chr>    <chr>       <dbl>         <dbl>         <dbl>          <dbl>        <dbl>        <dbl>
 1 aaa      1               5            NA            NA             NA           NA           NA
 2 aaa      2               5             1             0              0            0            0
 3 aaa      3               3             2             0              0            0            0
 4 bbb      1               5            NA            NA             NA           NA           NA
 5 bbb      2               5             1             0              0            0            0
 6 bbb      3               1             2             0              0            0            0
 7 bbb      4               2             2             0              0            0            1
 8 bbb      5               3             2             0              0            1            1
 9 ccc      1               5            NA            NA             NA           NA           NA
10 ccc      2               1             1             0              0            0            0
11 ccc      3               2             1             0              0            0            1
12 ccc      4               3             1             0              0            1            1
13 ccc      5               4             1             0              1            1            1
14 ccc      6               5             1             1              1            1            1
15 ccc      7               5             2             1              1            1            1
16 ccc      8               5             3             1              1            1            1
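Basically, group_by makes the computation run separately for each business, and each new column is a cumulative sum over the lagged stars values.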
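To make the "cumulative sum over the lagged values" idea concrete, here is a small base R sketch for a single business (the stars of "bbb" from the example). It only illustrates the mechanics and is not a replacement for the grouped dplyr code; the variable names are mine.

```r
# Stars of one business, in rolingcount order (business "bbb" in the example).
stars <- c(5, 5, 1, 2, 3)

# Shift the vector down by one row; 0 stands in for "no previous row".
lag_stars <- c(0, head(stars, -1))

# Running count of fives among the previous rows.
previous_fives <- cumsum(lag_stars == 5)

# The first row has no history, so it becomes NA.
previous_fives[1] <- NA

print(previous_fives)
```

For this vector the result is NA, 1, 2, 2, 2, which matches the previousfives values for "bbb" in the output.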
If it is too slow, maybe it will give you an idea for something faster.
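One direction you could try, as a rough sketch only: the same counts can be computed in base R with ave() and cumsum(), without ifelse() or a lag column. I have not benchmarked this on 10 million rows, so treat any speed gain as an assumption. count_previous is a hypothetical helper name, the data frame is built with data.frame() instead of cbind() so that stars stays numeric, and the 16th stars value (5) is read off the output table.

```r
# Example data as a real data frame (stars stays numeric).
df <- data.frame(
  business    = rep(c("aaa", "bbb", "ccc"), times = c(3, 5, 8)),
  rolingcount = c(1:3, 1:5, 1:8),
  stars       = c(5, 5, 3, 5, 5, 1, 2, 3, 5, 1, 2, 3, 4, 5, 5, 5)
)

# Hypothetical helper: for every row, count the earlier rows of the same
# business whose stars equal `value`.
# Assumes rows are already sorted by rolingcount within each business.
count_previous <- function(stars, business, value) {
  hit <- as.integer(stars == value)
  # Running count per business including the current row, minus the current row.
  out <- ave(hit, business, FUN = cumsum) - hit
  # The first row of each business has no history.
  out[!duplicated(business)] <- NA
  out
}

df$previousfives <- count_previous(df$stars, df$business, 5)
df$previousfours <- count_previous(df$stars, df$business, 4)

print(df)
```

The remaining three columns follow the same pattern with value set to 3, 2 and 1.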
Hope this helps.