How to identify and remove outliers in a data.frame using R?
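I have a data frame that contains several outliers. I suspect these outliers are producing results that differ from what I expected.
I tried this tip, but it did not work because I still get very different values: https://www.r-bloggers.com/2020/01/how-to-remove-outliers-in-r/
I tried the solution from the rstatix package, but I cannot remove the outliers from my data.frame: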
library(rstatix)
library(dplyr)
df <- data.frame(
  sample = 1:20,
  score = c(rnorm(19, mean = 5, sd = 2), 50)
)
View(df)
out_df <- identify_outliers(df$score)  # identify outliers
df2 <- df  # copy df
df2 <- df2[-which(df2$score %in% out_df), ]  # remove outliers from df2
View(df2)
identify_outliers expects a data.frame as input, i.e. the usage is
identify_outliers(data, ..., variable = NULL)
where
... - One unquoted expressions (or variable name). Used to select a variable of interest. Alternative to the argument variable.
df2 <- subset(df, !score %in% identify_outliers(df, "score")$score)
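As a rule of thumb, data points above Q3 + 1.5 × IQR or below Q1 - 1.5 × IQR are considered outliers.
So you only need to identify them and remove them. I do not know how to do this with the rstatix package, but it can be done in base R, as in the example below: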
# Generate demo data
set.seed(123)
demo.data <- data.frame(
  sample = 1:20,
  score = c(rnorm(19, mean = 5, sd = 2), 50),
  gender = rep(c("Male", "Female"), each = 10)
)
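# Identify outliers: indices of points outside the 1.5 * IQR fences
q <- quantile(demo.data$score, probs = c(0.25, 0.75), names = FALSE)
fence <- 1.5 * IQR(demo.data$score)
outliers <- which(demo.data$score > q[2] + fence | demo.data$score < q[1] - fence)
# Remove them; setdiff() keeps every row when no outliers are found,
# whereas demo.data[-outliers, ] would return zero rows in that case
df2 <- demo.data[setdiff(seq_len(nrow(demo.data)), outliers), ]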
Or, more neatly, wrap this in a function that returns the indices of the outliers:
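get_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * IQR(x)
  which(x > q[2] + fence | x < q[1] - fence)
}
outliers <- get_outliers(demo.data$score)
# setdiff() is safe even when get_outliers() returns no indices
df2 <- demo.data[setdiff(seq_len(nrow(demo.data)), outliers), ]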
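A variant worth considering is to return a logical mask instead of indices, since negating a mask keeps all rows when nothing is flagged and it is easy to make NA-safe. The sketch below is only an illustration under my own assumptions: iqr_outlier_mask and its k argument are hypothetical names, not part of base R or rstatix, and it uses the same 1.5 × IQR rule of thumb as above.

```r
# Hypothetical helper (not from any package): TRUE where x lies outside
# the fences [Q1 - k * IQR, Q3 + k * IQR]; NA values are never flagged.
iqr_outlier_mask <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])  # k times the interquartile range
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}

# Same demo data as in the answer above
set.seed(123)
demo.data <- data.frame(
  sample = 1:20,
  score  = c(rnorm(19, mean = 5, sd = 2), 50)
)

# Keep only the rows that are not flagged as outliers
df2 <- demo.data[!iqr_outlier_mask(demo.data$score), ]
```

Because the filter is a logical vector, a column with no outliers simply yields an all-FALSE mask and the data frame comes back unchanged. Raising k (for example to 3) makes the rule stricter about what counts as an outlier.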