Remove outliers for different groups

Question

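I'm new here, so please go easy on me :-)

I'm looking for a way to remove outliers from one column, where what counts as an outlier depends on the value in another column (here, age):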

   body_mass age
1         19  11
2         20  10
3         26   8
4         21   6
5         18  12
6         18   7
7         30  11
8         17   8
9         17  10
10        18   8

boxplot(body_mass~age, data = df, subset=age %in% c(0:22))$out
outliers <- boxplot(body_mass~age, data = df, subset=age %in% c(0:22))$out

df[which(df$body_mass %in% outliers),]
df <- df[-which(df$body_mass %in% outliers),]

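But this approach removes every row whose body_mass matches an outlier value, across all age classes, even if that value is an outlier in only one age class.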

Answer 1

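It really depends on how you define an "outlier". But if you are willing to accept that an outlier is any value more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile, then you can use the following approach to remove body mass outliers by age group.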
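As a small worked illustration of that rule, the sketch below computes the two cut-offs for a single group. The vector `x` is made up for the example and is not taken from the question's data. Note that `boxplot()` itself uses hinges rather than `quantile()`, so its cut-offs can differ slightly for some sample sizes.

```r
# Hypothetical body masses for one age group (made-up values)
x <- c(17, 18, 18, 19, 30)

q     <- quantile(x, probs = c(0.25, 0.75))  # first and third quartiles
width <- 1.5 * IQR(x)                        # 1.5 times the interquartile range
lower <- q[[1]] - width                      # lower cut-off
upper <- q[[2]] + width                      # upper cut-off

# Values outside [lower, upper] count as outliers under this rule
x[x < lower | x > upper]
```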

Also, I'm assuming you want to treat each age as its own group, since you didn't say otherwise.
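To see why the approach in the question removes too much, here is a minimal sketch on made-up data (the values below are hypothetical, chosen so that 30 is extreme at age 8 but ordinary at age 11). The `$out` element returned by `boxplot()` holds only the outlying values, not the groups they came from, so matching with `%in%` flags every row that has the same value.

```r
# Hypothetical data: 30 is extreme for age 8 but ordinary for age 11
df <- data.frame(
  body_mass = c(17, 18, 18, 19, 30, 28, 29, 30, 31, 32),
  age       = rep(c(8, 11), each = 5)
)

# plot = FALSE returns the boxplot statistics without drawing anything
outliers <- boxplot(body_mass ~ age, data = df, plot = FALSE)$out

# %in% compares values only, so the age-11 row with body_mass 30 is flagged too
flagged <- df[df$body_mass %in% outliers, ]
flagged
```

Any fix therefore has to decide what is an outlier inside each group, rather than matching on values afterwards.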

Define a function that replaces outliers with NA.

#' Replace outliers
#'
#' Replace outliers with NA. Outliers are defined as values below
#' Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
#'
#' @return Numeric vector of same length.
#' @param x Numeric vector to replace.
#' @param na.rm Should NA values be removed before computing the quantiles?
#'
#' @export
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  val <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - val)] <- NA
  y[x > (qnt[2] + val)] <- NA
  y
}

Apply remove_outliers() within each age group, then filter out the rows that were set to NA.

library(dplyr)

df2 <- df %>%
  group_by(age) %>%                                # one group per age
  mutate(body_mass = remove_outliers(body_mass)) %>%  # outliers become NA within each group
  ungroup() %>%
  filter(!is.na(body_mass))                        # drop the rows flagged as NA
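If you would rather not depend on dplyr, the same idea can be sketched in base R with `ave()`, which applies a function within each group and returns the results in the original row order. This is only one possible variant, not part of the approach above: the helper `is_outlier()` and the example data are hypothetical, and the sketch assumes `body_mass` has no missing values.

```r
# Hypothetical data: 30 is an outlier for age 8 but not for age 11
df <- data.frame(
  body_mass = c(17, 18, 18, 19, 30, 28, 29, 30, 31, 32),
  age       = rep(c(8, 11), each = 5)
)

# TRUE where a value lies below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR
# (assumes x contains no NA values)
is_outlier <- function(x) {
  q     <- quantile(x, probs = c(0.25, 0.75))
  width <- 1.5 * IQR(x)
  x < q[[1]] - width | x > q[[2]] + width
}

# ave() runs is_outlier() per age group; the logical result is stored as 0/1
flag <- ave(df$body_mass, df$age, FUN = is_outlier) == 1

# Keep only the rows that are not outliers within their own age group
df2 <- df[!flag, ]
df2
```

Here only the age-8 row with body_mass 30 is dropped, while the age-11 row with the same value is kept.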
