Filter multiple R data.table columns to eliminate outliers
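I want to eliminate outliers that lie more than 2 standard deviations above or below the mean, across many variables that share similar names (too many to specify individually in code).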
library(data.table)
irisdt <- data.table(iris)
myCols <- grep("Sepal", colnames(irisdt), value=TRUE)
# This works if I specify one column,
# but I have too many columns to specify, so need to use grep approach.
irisdt[, Sepal.Length.Outlier := (scale(Sepal.Length) < -2 | scale(Sepal.Length) > 2)]
# This does not work
irisdt[, (myCols) := lapply(myCols, function(x) {(scale(x) < -2 | scale(x) > 2)} )]
# This partially works, but changes in place
irisdt[, (myCols) := lapply(myCols, function(x) {(scale(irisdt[[x]]) < -2 | scale(irisdt[[x]]) > 2)} )]
# How do I make new variables, for example "Sepal.Length.Outlier"?
myOutlierCols <- grep(".Outlier", colnames(irisdt), value=TRUE)
# How do I select rows matching multiple columns (&)?
irisdt[myOutlierCols=="FALSE"] # does not work
irisdt[, hasOutlier := lapply(myCols, myCols==TRUE)] # does not work
irisdt[hasOutlier=="FALSE"] # relies on line above, which doesn't work
Perhaps a function could take a data.table column and remove the values above or below a z-score cutoff. It could then be used with lapply.
# This does not work
removeOutliers <- function(myColumn, cutoff = 3) {
lapply(myColumn, function (x) {
if (scale(myColumn[[x]]) < -cutoff | scale(myColumn[[x]]) > cutoff) {
x <- NA #specify individual value instead of column?
}
})
}
removeOutliers(irisdt[,Sepal.Length]) # for testing
trimmedIrisdt <- irisdt[,lapply(.SD, removeOutliers(.SD)), .SDcols = myCols] # could do by = grouping variable
# Once outliers are made NA, this would work:
trimmedIrisdt <- complete.cases(trimmedIrisdt)
I think this achieves the goal:
irisdt[, keep :=
as.logical(do.call(pmin, lapply(.SD, function(x) abs(scale(x)) <= 2)))
, .SDcols = myCols]
res = irisdt[(keep), !"keep"]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 5.1 3.5 1.4 0.2 setosa
2: 4.9 3.0 1.4 0.2 setosa
3: 4.7 3.2 1.3 0.2 setosa
4: 4.6 3.1 1.5 0.2 setosa
5: 5.0 3.6 1.4 0.2 setosa
---
135: 6.7 3.0 5.2 2.3 virginica
136: 6.3 2.5 5.0 1.9 virginica
137: 6.5 3.0 5.2 2.0 virginica
138: 6.2 3.4 5.4 2.3 virginica
139: 5.9 3.0 5.1 1.8 virginica
This should also work if there is a grouping variable. I can't speak to its statistical soundness.
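For the grouped case, a minimal sketch might look like the following. It assumes the z-scores should be computed within each Species (the grouping column is my choice for illustration) and uses Reduce with `&` in place of pmin; `res_by_group` is a hypothetical name.

```r
library(data.table)

irisdt <- data.table(iris)
myCols <- grep("Sepal", colnames(irisdt), value = TRUE)

# With by = Species, scale() sees one group at a time,
# so each value is compared with its own group's mean and sd.
irisdt[, keep := Reduce(`&`, lapply(.SD, function(x) abs(as.vector(scale(x))) <= 2)),
       by = Species, .SDcols = myCols]

# Keep the rows that passed in every column, then drop the helper column.
res_by_group <- irisdt[(keep), !"keep"]
```

Note that a row can be an outlier within its group without being one overall, and vice versa, so the grouped and ungrouped results will generally differ.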
How it works:
- Test abs(scale(x)) <= 2 for every cell.
- If the minimum result across columns is TRUE, keep the row.
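The "minimum across columns" step is just a row-wise logical AND. A tiny illustration with made-up test results (the vectors `a` and `b` are hypothetical, not taken from iris):

```r
# Hypothetical per-column test results for three rows
a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)

via_pmin <- as.logical(pmin(a, b))  # row-wise minimum, converted back to logical
via_and  <- a & b                   # the same rule written as a logical AND

same <- identical(via_pmin, via_and)
```

The two forms agree whenever there are no missing values. With NA present they can differ, because pmin(NA, FALSE) is NA while NA & FALSE is FALSE.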
To inspect how it works one column at a time...
library(data.table)
mynewCols = paste0(myCols,"_outly")
irisdt[, (mynewCols) :=
lapply(.SD, function(x) replace(x, abs(scale(x)) <= 2, NA))
, .SDcols = myCols]
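The new _outly columns keep only the outlying values (everything within 2 standard deviations becomes NA). Then browse the rows that have at least one outlier with View(irisdt[rowSums(!is.na(irisdt[, ..mynewCols])) > 0]).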
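Coming back to the removeOutliers idea from the question, here is one possible sketch of a working version. It assumes the function should be vectorized over a whole column and that rows containing any NA are dropped afterwards; the signature is kept from the question and the cutoff of 2 is passed explicitly.

```r
library(data.table)

irisdt <- data.table(iris)
myCols <- grep("Sepal", colnames(irisdt), value = TRUE)

# scale() standardizes the whole column at once, so no inner loop is needed.
removeOutliers <- function(myColumn, cutoff = 3) {
  z <- as.vector(scale(myColumn))
  replace(myColumn, abs(z) > cutoff, NA)
}

# Pass the function itself to lapply(); calling removeOutliers(.SD) there was the bug.
# copy() keeps the original irisdt untouched.
trimmedIrisdt <- copy(irisdt)[, (myCols) := lapply(.SD, removeOutliers, cutoff = 2),
                              .SDcols = myCols]

# complete.cases() returns a logical vector, so use it to subset rows.
trimmedIrisdt <- trimmedIrisdt[complete.cases(trimmedIrisdt)]
```

Adding by = to the `:=` call would apply the cutoff within groups instead of over the whole column.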