How to do univariate outlier treatment dynamically
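Suppose I have the following data:
df <- iris[, 1:2]  # taking only 2 numeric columns
Now I want to do univariate outlier treatment, where I define an outlier as any value more than 1.5 * IQR below the first quartile or above the third quartile. I then cap such values at the 5th percentile at the lower end or the 95th percentile at the upper end, as shown below: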
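a <- df$Sepal.Length
qnt_a <- quantile(a, probs = c(0.25, 0.75))   # first and third quartiles
caps_a <- quantile(a, probs = c(0.05, 0.95))  # replacement values for outliers
H_a <- 1.5 * IQR(a)                           # distance of the fences from the quartiles
a[a < (qnt_a[1] - H_a)] <- caps_a[1]          # below lower fence: cap at 5th percentile
a[a > (qnt_a[2] + H_a)] <- caps_a[2]          # above upper fence (Q3 + H): cap at 95th percentile
df$Sepal.Length <- a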
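Similarly, I did the same for the remaining numeric variable:
b <- df$Sepal.Width
qnt_b <- quantile(b, probs = c(0.25, 0.75))   # quartiles of b, not a
caps_b <- quantile(b, probs = c(0.05, 0.95))  # percentiles of b, not a
H_b <- 1.5 * IQR(b)
b[b < (qnt_b[1] - H_b)] <- caps_b[1]
b[b > (qnt_b[2] + H_b)] <- caps_b[2]          # upper fence uses the third quartile
df$Sepal.Width <- b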
df
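I would like help writing a loop that identifies and caps outliers for all numeric variables in a data frame, rather than doing it one variable at a time.
The simplest way is to turn it into a function and apply it to every column, i.e.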
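f1 <- function(a) {
  qnt_a <- quantile(a, probs = c(0.25, 0.75))   # first and third quartiles
  caps_a <- quantile(a, probs = c(0.05, 0.95))  # replacement values for outliers
  H_a <- 1.5 * IQR(a)
  a[a < (qnt_a[1] - H_a)] <- caps_a[1]          # cap low outliers at the 5th percentile
  a[a > (qnt_a[2] + H_a)] <- caps_a[2]          # upper fence is Q3 + H, not Q1 + H
  return(a)
}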
df[] <- lapply(df, f1)
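The `lapply` call above assumes every column of `df` is numeric, which holds for `iris[, 1:2]`. For a data frame that also contains non-numeric columns, one possible approach is to select the numeric columns first and apply the capping only to those. The following is a minimal sketch under that assumption. The helper name `cap_outliers`, its arguments `k` and `cap_probs`, the `na.rm = TRUE` handling, and the variable `df_all` are illustrative choices, not part of the original post. It takes the upper fence as the third quartile plus 1.5 * IQR.

```r
# Hypothetical helper (name and arguments are illustrative):
# caps values outside the k * IQR fences at the given percentiles.
cap_outliers <- function(x, k = 1.5, cap_probs = c(0.05, 0.95)) {
  qnt  <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)  # Q1 and Q3
  caps <- quantile(x, probs = cap_probs, na.rm = TRUE)      # replacement values
  h    <- k * IQR(x, na.rm = TRUE)
  # Flag outliers before modifying x, skipping missing values
  low  <- !is.na(x) & x < qnt[1] - h
  high <- !is.na(x) & x > qnt[2] + h
  x[low]  <- caps[1]
  x[high] <- caps[2]
  x
}

df_all <- iris                                       # includes the factor column Species
num_cols <- vapply(df_all, is.numeric, logical(1))   # TRUE for numeric columns only
df_all[num_cols] <- lapply(df_all[num_cols], cap_outliers)
```

Selecting columns with `vapply(..., is.numeric, logical(1))` leaves factor and character columns such as `Species` untouched, so the same call should work on data frames with mixed column types.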