Excluding outliers based on multiple columns in R? (IQR method)
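I am currently trying to exclude outliers based on a subset of selected variables, in order to run a sensitivity analysis. I have adapted the function available here (calculating the outliers in R), but have been unsuccessful so far (I am still a novice R user). Please let me know if you have any suggestions!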
df <- data.frame(ID = c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011),
                 measure1 = rnorm(11, mean = 8, sd = 4),
                 measure2 = rnorm(11, mean = 40, sd = 5),
                 measure3 = rnorm(11, mean = 20, sd = 2),
                 measure4 = rnorm(11, mean = 9, sd = 3))
vars_of_interest <- c("measure1", "measure3", "measure4")
# define a function to remove outliers
FindOutliers <- function(data) {
  lowerq = quantile(data)[2]
  upperq = quantile(data)[4]
  iqr = upperq - lowerq #Or use IQR(data)
  # we identify extreme outliers
  extreme.threshold.upper = (iqr * 3) + upperq
  extreme.threshold.lower = lowerq - (iqr * 3)
  result <- which(data > extreme.threshold.upper | data < extreme.threshold.lower)
}
# use the function to identify outliers
temp <- FindOutliers(df[vars_of_interest])
# remove the outliers
testData <- testData[-temp]
# show the data with the outliers removed
testData
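Separate the concerns:
- Identify outliers in a numeric vector using the IQR method. This can be encapsulated in a function that takes a vector.
- Remove outliers from several columns of a data.frame. This is a function that takes a data.frame.
I would suggest returning a boolean vector rather than indices. This way, the returned value has the same length as the data, which makes it easy to create a new column, for example df$outlier <- is_outlier(df$measure1).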
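To make the boolean-versus-index point concrete, here is a minimal sketch on a made-up toy vector (the values and the names `toy`, `clean`, `idx` are illustrative assumptions, not part of the question's data). It also shows one known pitfall of the index approach: negative indexing with an empty result from `which()` drops every element.

```r
# Hypothetical toy vector with one planted extreme value (100)
x <- c(10, 12, 11, 13, 12, 100)

# Same rule as in the question: more than 3 * IQR beyond the quartiles
qs <- quantile(x, probs = c(0.25, 0.75))
iqr <- qs[[2]] - qs[[1]]
flag <- x > qs[[2]] + 3 * iqr | x < qs[[1]] - 3 * iqr

# The logical vector has the same length as x, so it fits into a new column
toy <- data.frame(x = x, outlier = flag)
kept_by_flag <- x[!flag]

# Pitfall of the index approach when nothing is flagged
clean <- c(10, 12, 11, 13, 12)
idx <- which(clean > 1000)              # integer(0): no outliers found
kept_by_index <- clean[-idx]            # numeric(0): everything is dropped
kept_by_mask <- clean[!(clean > 1000)]  # all five values are kept
```

With a logical mask, the "no outliers" case needs no special handling, which is one more reason to prefer it over `which()` here.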
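Note how the argument names make clear which type of input is expected: x is a standard name for a numeric vector, and df is obviously a data.frame. cols is probably a list or vector of column names.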
I made a point of using only base R, but in real life I would use the dplyr package to manipulate the data.frame.
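For comparison, a dplyr version might look roughly like the sketch below. It is not taken from the answer: it assumes dplyr 1.0.4 or later (for `if_any()`), repeats a compact copy of the outlier rule so that it runs on its own, and uses a made-up data frame `toy`.

```r
library(dplyr)

# Compact copy of the IQR rule (3 * IQR beyond the quartiles)
is_outlier <- function(x, na.rm = FALSE) {
  qs <- quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)
  iqr <- qs[[2]] - qs[[1]]
  x > qs[[2]] + 3 * iqr | x < qs[[1]] - 3 * iqr
}

# Hypothetical data: only row 6 holds an extreme value (column a)
toy <- data.frame(
  ID = 1:6,
  a = c(10, 12, 11, 13, 12, 100),
  b = c(5, 6, 5, 7, 6, 6)
)
cols <- c("a", "b")

# Drop every row that is flagged in at least one of the selected columns
toy_filtered <- toy %>%
  filter(!if_any(all_of(cols), is_outlier))
```

Here all flags are computed on the full data before any row is dropped, so the order of `cols` does not matter.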
#' Detect outliers using IQR method
#'
#' @param x A numeric vector
#' @param na.rm Whether to exclude NAs when computing quantiles
#'
is_outlier <- function(x, na.rm = FALSE) {
  qs <- quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)
  lowerq <- qs[1]
  upperq <- qs[2]
  iqr <- upperq - lowerq
  extreme.threshold.upper <- (iqr * 3) + upperq
  extreme.threshold.lower <- lowerq - (iqr * 3)
  # Return logical vector
  x > extreme.threshold.upper | x < extreme.threshold.lower
}
#' Remove rows with outliers in given columns
#'
#' Any row with at least 1 outlier will be removed
#'
#' @param df A data.frame
#' @param cols Names of the columns of interest. Defaults to all columns.
#'
#'
remove_outliers <- function(df, cols = names(df)) {
  for (col in cols) {
    cat("Removing outliers in column: ", col, " \n")
    df <- df[!is_outlier(df[[col]]),]
  }
  df
}
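One property of the loop above is worth knowing: because df is overwritten on each pass, the quartiles for later columns are computed on rows that survived the earlier columns, so the result can depend on the order of cols. Missing values are also not handled, since an NA in the mask produces an all-NA row when subsetting. If that matters for the sensitivity analysis, a possible variant is sketched below. The name `remove_outliers_once`, the `na.rm = TRUE` default, and the choice to keep rows with missing values are assumptions of this sketch, not part of the answer.

```r
# Compact copy of the IQR rule so the sketch runs on its own
is_outlier <- function(x, na.rm = FALSE) {
  qs <- quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)
  iqr <- qs[[2]] - qs[[1]]
  x > qs[[2]] + 3 * iqr | x < qs[[1]] - 3 * iqr
}

# Hypothetical variant: flag every column on the unfiltered data, then
# drop rows flagged in at least one column
remove_outliers_once <- function(df, cols = names(df), na.rm = TRUE) {
  flags <- lapply(df[cols], is_outlier, na.rm = na.rm)
  # A missing value gives an NA flag; treat it as "not an outlier"
  flags <- lapply(flags, function(f) !is.na(f) & f)
  any_outlier <- Reduce(`|`, flags, rep(FALSE, nrow(df)))
  df[!any_outlier, , drop = FALSE]
}

# Made-up data: row 6 is extreme in column a, row 2 is missing in column b
toy <- data.frame(
  ID = 1:6,
  a = c(10, 12, 11, 13, 12, 100),
  b = c(5, NA, 5, 7, 6, 6)
)
res <- remove_outliers_once(toy, c("a", "b"))
```

In this toy case only row 6 is dropped, and row 2 is kept despite its missing value.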
With these two functions, it becomes very simple:
df <- data.frame(ID = c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011),
                 measure1 = rnorm(11, mean = 8, sd = 4),
                 measure2 = rnorm(11, mean = 40, sd = 5),
                 measure3 = rnorm(11, mean = 20, sd = 2),
                 measure4 = rnorm(11, mean = 9, sd = 3))
vars_of_interest <- c("measure1", "measure3", "measure4")
df_filtered <- remove_outliers(df, vars_of_interest)
#> Removing outliers in column: measure1
#> Removing outliers in column: measure3
#> Removing outliers in column: measure4
df_filtered
#> ID measure1 measure2 measure3 measure4
#> 1 1001 9.127817 40.10590 17.69416 8.6031175
#> 2 1002 18.196182 38.50589 23.65251 7.8630485
#> 3 1003 10.537458 37.97222 21.83248 6.0798316
#> 4 1004 5.590463 46.83458 21.75404 6.9589981
#> 5 1005 14.079801 38.47557 20.93920 -0.6370596
#> 6 1006 3.830089 37.19281 19.56507 6.2165156
#> 7 1007 14.644766 37.09235 19.78774 10.5133674
#> 8 1008 5.462400 41.02952 20.14375 13.5247993
#> 9 1009 5.215756 37.65319 22.23384 7.3131715
#> 10 1010 14.518045 48.97977 20.33128 9.9482211
#> 11 1011 1.594353 44.09224 21.32434 11.1561089
Created on 2020-03-23 by the reprex package (v0.3.0)