How to remove outliers based on standard deviation using tidyverse?

I tried the following code, using the tidyverse package, to filter outliers based on the standard deviation.

rt_trimmed_data_Dec = data_Dec %>%
  group_by(Time_of_Testing, Item_Type, Group) %>%
  summarise(RT_mean = mean(RT, na.rm = TRUE), RT_sd = sd(RT, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(rt_high = RT_mean + (2.5 * RT_sd)) %>%
  mutate(rt_low = RT_mean - (2.5 * RT_sd))

Then I tried to join the two data frames in order to apply the filter.

data_Dec_RT = data_Dec %>%
  inner_join(rt_trimmed_data_Dec) %>%
  filter(RT < rt_high) %>%
  filter(RT > rt_low)

But then I got this error:

Error: `by` required, because the data sources have no common variables

Call rlang::last_error() to see a backtrace.
> rlang::last_error()
message: `by` required, because the data sources have no common variables
class: rlang_error
backtrace:
  1. dplyr::inner_join(., rt_trimmed_data_Dec)
  9. dplyr:::common_by.NULL(by, x, y)
 11. dplyr:::bad_args("by", "required, because the data sources have no common variables")
 12. dplyr:::glubort(fmt_args(args), ..., .envir = .envir)
 13. dplyr::inner_join(., rt_trimmed_data_Dec)

How can I solve this problem? Thank you very much for your help.

I think you can do it like this, filtering within each group directly so that no join is needed:
library(dplyr)
data_Dec %>%
  group_by(Time_of_Testing, Item_Type, Group) %>%
  filter(between(RT, mean(RT, na.rm=TRUE) - (2.5 * sd(RT, na.rm=TRUE)), 
                     mean(RT, na.rm=TRUE) + (2.5 * sd(RT, na.rm=TRUE))))
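
If you would rather keep the two-step approach from the question, here is a minimal sketch with the join keys spelled out. The toy `data_Dec` below is made up for illustration, and `limits` is just a name I picked. One possible cause of the error (a guess, since the question does not show which packages were loaded) is that `summarise` was masked by another package such as plyr, whose version ignores the grouping and returns only the summary columns. Calling `dplyr::summarise` explicitly rules that out.

```r
library(dplyr)

# Toy data standing in for data_Dec (values are invented for illustration)
set.seed(1)
data_Dec <- data.frame(
  Time_of_Testing = rep(c("pre", "post"), each = 100),
  Item_Type       = rep(c("A", "B"), times = 100),
  Group           = rep(c("g1", "g2"), each = 50, times = 2),
  RT              = rnorm(200, mean = 600, sd = 50)
)

# Per-group mean, SD, and the 2.5 SD cut-offs
limits <- data_Dec %>%
  group_by(Time_of_Testing, Item_Type, Group) %>%
  dplyr::summarise(RT_mean = mean(RT, na.rm = TRUE),
                   RT_sd   = sd(RT, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(rt_low  = RT_mean - 2.5 * RT_sd,
         rt_high = RT_mean + 2.5 * RT_sd)

# Join the cut-offs back on the grouping columns, then filter
data_Dec_RT <- data_Dec %>%
  inner_join(limits, by = c("Time_of_Testing", "Item_Type", "Group")) %>%
  filter(RT > rt_low, RT < rt_high)
```

Checking `names(limits)` before the join is a quick way to confirm that the grouping columns survived the summarise step.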

This is easy to do: just z-score the RT column using scale().

    library(tidyverse)

    samples = 50
    Ps = 10

    # data frame that contains participant numbers, and RT scores
    data <- data.frame(participant = as.factor(rep(1:Ps, each = samples)),
                       RT = rnorm(n = samples*Ps, mean = 600, sd = 50))

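    # z-score RT within each participant, then keep rows within 2.5 SD
    data_noOutliers <- data %>% 
      group_by(participant) %>% 
      # scale() returns a one-column matrix, so convert it to a plain vector
      mutate(zRT = as.numeric(scale(RT))) %>% 
      filter(between(zRT, -2.5, 2.5))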
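
As a sanity check, the z-score filter and the mean ± 2.5 SD filter should keep the same rows, since `scale()` centers on the group mean and divides by the same sample SD that `sd()` returns. Here is a small sketch on made-up data (`toy`, `by_z` and `by_limits` are hypothetical names):

```r
library(dplyr)

set.seed(42)
# Hypothetical toy data: 10 participants, 50 RTs each
toy <- data.frame(participant = factor(rep(1:10, each = 50)),
                  RT = rnorm(500, mean = 600, sd = 50))

# Filter on the within-participant z-score
by_z <- toy %>%
  group_by(participant) %>%
  filter(abs(as.numeric(scale(RT))) <= 2.5) %>%
  ungroup()

# Filter on explicit mean +/- 2.5 SD limits
by_limits <- toy %>%
  group_by(participant) %>%
  filter(between(RT, mean(RT) - 2.5 * sd(RT), mean(RT) + 2.5 * sd(RT))) %>%
  ungroup()

# Compare the two results
nrow(by_z) == nrow(by_limits)
```

The two could in principle differ for a value sitting exactly on a cut-off because of floating-point rounding, but that is very unlikely with continuous RTs.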