How to remove outliers based on standard deviation using tidyverse?
I tried this code, using the tidyverse package, to filter outliers based on the SD.
rt_trimmed_data_Dec = data_Dec %>%
  group_by(Time_of_Testing, Item_Type, Group) %>%
  summarise(RT_mean = mean(RT, na.rm = TRUE), RT_sd = sd(RT, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(rt_high = RT_mean + (2.5 * RT_sd)) %>%
  mutate(rt_low = RT_mean - (2.5 * RT_sd))
Then I tried to join the two data frames in order to apply the filter.
data_Dec_RT = data_Dec %>%
  inner_join(rt_trimmed_data_Dec) %>%
  filter(RT < rt_high) %>%
  filter(RT > rt_low)
But then I got this error:
Error: `by` required, because the data sources have no common variables
Call `rlang::last_error()` to see a backtrace.
> rlang::last_error()
message: `by` required, because the data sources have no common variables
class: rlang_error
backtrace:
  1. dplyr::inner_join(., rt_trimmed_data_Dec)
  9. dplyr:::common_by.NULL(by, x, y)
 11. dplyr:::bad_args("by", "required, because the data sources have no common variables")
 12. dplyr:::glubort(fmt_args(args), ..., .envir = .envir)
 13. dplyr::inner_join(., rt_trimmed_data_Dec).
How can I solve this problem? Thank you very much for your help.
I think you can do it with:
library(dplyr)
data_Dec %>%
  group_by(Time_of_Testing, Item_Type, Group) %>%
  filter(between(RT,
                 mean(RT, na.rm = TRUE) - (2.5 * sd(RT, na.rm = TRUE)),
                 mean(RT, na.rm = TRUE) + (2.5 * sd(RT, na.rm = TRUE))))
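To see what the grouped filter does on something small, here is a minimal sketch with made-up data. The column names match the question, but `toy`, `trimmed`, and all the values are hypothetical. Because the mean and SD are computed inside `filter()` after `group_by()`, each row is compared against the limits of its own group, so no join is needed.

```r
library(dplyr)

# Hypothetical toy data: two groups of 20 RTs each.
# Group A contains one extreme value (5000); group B has none.
toy <- data.frame(
  Time_of_Testing = "pre",
  Item_Type = "word",
  Group = rep(c("A", "B"), each = 20),
  RT = c(seq(580, 620, length.out = 19), 5000,
         seq(880, 920, length.out = 20))
)

trimmed <- toy %>%
  group_by(Time_of_Testing, Item_Type, Group) %>%
  # Keep rows within 2.5 SD of their own group's mean
  filter(between(RT,
                 mean(RT, na.rm = TRUE) - 2.5 * sd(RT, na.rm = TRUE),
                 mean(RT, na.rm = TRUE) + 2.5 * sd(RT, na.rm = TRUE))) %>%
  ungroup()

nrow(toy)      # 40
nrow(trimmed)  # 39: only the 5000 ms row in group A is dropped
```

Note that rows where `RT` is `NA` are also dropped, since `filter()` only keeps rows where the condition is `TRUE`.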
This is easy to do: just z-score the RT column using scale().
library(tidyverse)
samples = 50
Ps = 10
# data frame that contains participant numbers and RT scores
data <- data.frame(participant = as.factor(rep(1:Ps, each = samples)),
                   RT = rnorm(n = samples * Ps, mean = 600, sd = 50))
data_noOutliers <- data %>%
  group_by(participant) %>%
  # scale() returns a one-column matrix, so convert it to a plain numeric vector
  mutate(zRT = as.numeric(scale(RT))) %>%
  # keep rows within 2.5 SD of each participant's mean
  filter(between(zRT, -2.5, +2.5))
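For completeness, the summarise-then-join approach from the question can probably be made to work as well. The error says the two tables share no column names, which would happen if the summary table lost its grouping columns. One possible cause (an assumption, it cannot be confirmed from the post) is that another package such as plyr was loaded after dplyr and is masking `summarise()`. The sketch below calls `dplyr::summarise()` explicitly and names the join keys in `by`, so a mismatch shows up immediately. `toy`, `keys`, and `rt_limits` are hypothetical names, and the data is made up.

```r
library(dplyr)

# Hypothetical stand-in for data_Dec from the question
toy <- data.frame(
  Time_of_Testing = "pre",
  Item_Type = "word",
  Group = rep(c("A", "B"), each = 20),
  RT = c(seq(580, 620, length.out = 19), 5000,
         seq(880, 920, length.out = 20))
)

# Columns that define a group and also serve as the join keys
keys <- c("Time_of_Testing", "Item_Type", "Group")

# Per-group limits; dplyr::summarise() keeps the grouping columns
rt_limits <- toy %>%
  group_by(Time_of_Testing, Item_Type, Group) %>%
  dplyr::summarise(RT_mean = mean(RT, na.rm = TRUE),
                   RT_sd = sd(RT, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(rt_high = RT_mean + 2.5 * RT_sd,
         rt_low = RT_mean - 2.5 * RT_sd)

# Join the limits back onto the raw rows, then filter
toy_RT <- toy %>%
  inner_join(rt_limits, by = keys) %>%
  filter(RT > rt_low, RT < rt_high)
```

Running `names(rt_limits)` before the join is a quick way to check that the three grouping columns are actually there.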