R dplyr: summarising based on a condition
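I have a set of items downloaded from a website, based on reports we generate. The idea is to delete reports that are no longer needed, based on the number of downloads. The logic is basically: count all the reports downloaded over the last year, check whether they fall outside two absolute deviations around the median for the current year, check whether the report was downloaded within the last 4 weeks and, if so, how many times.
I have the code below, which doesn't work, and I was wondering if anyone can help.
It gives me this error for the n_recent_downloads part:
Error in FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables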
reports <- c("Report_A","Report_B","Report_C","Report_D","Report_A","Report_A","Report_A","Report_D","Report_D","Report_D")
Week_no <- c(36,36,33,32,20,18,36,30,29,27)
New.Downloads <- data.frame (Report1 = reports, DL.Week = Week_no)
test <- New.Downloads %>%
group_by(report1) %>%
summarise(n_downloads = n(),
n_recent_downloads = ifelse(sum((as.integer(DL.Week) >= (as.integer(max(DL.Week))) - 4),value,0)))
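Providing a reproducible example would make life easier. Nonetheless, I have modified your code to do what I think you are trying to achieve.
I have split it into two parts so you can see what is going on. I moved the ifelse statement into a mutate call, which gives: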
library(dplyr)
New.Downloads <- data.frame(
Report1 = c("Report_A","Report_B","Report_C","Report_D","Report_A","Report_A","Report_A","Report_D","Report_D","Report_D"),
DL.Week = as.numeric(c(36,36,33,32,20,18,36,30,29,27))
)
test <- New.Downloads %>%
group_by(Report1) %>%
mutate(
median = median(DL.Week),
mad = 2 * mad(DL.Week),
check = ifelse(DL.Week > median + mad | DL.Week < median - mad, 0, DL.Week)
)
test
Source: local data frame [10 x 5]
Groups: Report1
Report1 DL.Week median mad check
1 Report_A 36 28.0 23.7216 36
2 Report_B 36 36.0 0.0000 36
3 Report_C 33 33.0 0.0000 33
4 Report_D 32 29.5 4.4478 32
5 Report_A 20 28.0 23.7216 20
6 Report_A 18 28.0 23.7216 18
7 Report_A 36 28.0 23.7216 36
8 Report_D 30 29.5 4.4478 30
9 Report_D 29 29.5 4.4478 29
10 Report_D 27 29.5 4.4478 27
Note that, with your example data, none of the values are classed as extreme under the median ± 2 * mad criterion, so the check values are identical to DL.Week.
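To see where the mad column comes from, here is a small sketch that reproduces the Report_A value by hand in base R. It assumes the default scaling constant of mad() (1.4826); the variable names x, med, raw_mad and scaled are only for illustration.

```r
# DL.Week values for Report_A, taken from the example data
x <- c(36, 20, 18, 36)

med <- median(x)                  # 28
raw_mad <- median(abs(x - med))   # 8: median of the absolute deviations
scaled <- 1.4826 * raw_mad        # mad() applies this constant by default

# The hand calculation should agree with stats::mad()
stopifnot(isTRUE(all.equal(scaled, mad(x))))

2 * scaled                        # 23.7216, the value in the mad column
```

Because 28 - 23.7216 and 28 + 23.7216 bracket every Report_A value, none of them is replaced by 0 in check.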
You can then chain a summarise call onto the end of test to get the sum for each report.
test %>%
summarise(
n_recent_downloads = sum(check)
)
Source: local data frame [4 x 2]
Report1 n_recent_downloads
1 Report_A 110
2 Report_B 36
3 Report_C 33
4 Report_D 118
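If what you actually want is the number of downloads in the last 4 weeks rather than a sum of week numbers, a sketch along the following lines might be closer. It assumes that "recent" means within 4 weeks of the latest week in the whole data set; latest_week and recent are names I made up for the illustration.

```r
library(dplyr)

New.Downloads <- data.frame(
  Report1 = c("Report_A", "Report_B", "Report_C", "Report_D", "Report_A",
              "Report_A", "Report_A", "Report_D", "Report_D", "Report_D"),
  DL.Week = c(36, 36, 33, 32, 20, 18, 36, 30, 29, 27)
)

# Assumption: the reference point is the latest week across all reports
latest_week <- max(New.Downloads$DL.Week)

recent <- New.Downloads %>%
  group_by(Report1) %>%
  summarise(
    n_downloads = n(),
    # Summing a logical vector counts the rows where the condition is TRUE
    n_recent_downloads = sum(DL.Week >= latest_week - 4)
  )

recent
```

With the example data this should give 2 recent downloads for Report_A and 1 each for Report_B, Report_C and Report_D. No ifelse is needed here, because the comparison already yields TRUE/FALSE for every row and sum() counts the TRUE values.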