Remove outliers from grouped data
I have a data frame like this:
ID Value
A 70
A 80
B 75
C 10
B 50
A 100
C 60
.. ..
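I want to group this data by ID, remove the outliers from each group (the points a boxplot draws beyond the whiskers), and then calculate the mean.
This is what I have done so far: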
# Summary before removing outliers (summaryBy() comes from the doBy package)
library(doBy)
summaryBy(Value ~ ID, data = df, FUN = c(mean, median, sd))

# Quantiles per ID, one row per group
df_quantile = do.call("rbind", tapply(df$Value, df$ID, quantile))

# Keep only the values inside the "extreme" fences (quartiles -/+ 3 * IQR)
filtered = function(x) {
  lowerq = quantile(x)[2]
  upperq = quantile(x)[4]
  iqr = upperq - lowerq
  # Mild fences (1.5 * IQR) are computed but not used in the filter below
  mild.threshold.upper = (iqr * 1.5) + upperq
  mild.threshold.lower = lowerq - (iqr * 1.5)
  extreme.threshold.upper = (iqr * 3) + upperq
  extreme.threshold.lower = lowerq - (iqr * 3)
  x = x[x > extreme.threshold.lower & x < extreme.threshold.upper]
  return(x)
}

# One filtered vector per ID
filtData = tapply(df$Value, df$ID, filtered)
After removing the outliers, how do I apply mean and sd to filtData?
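Since the data you provided shows no outliers in a boxplot, I used a built-in R dataset (InsectSprays) instead.
You can save the boxplot output, get the outliers from it, remove them, and then either redraw the plot or calculate the mean of each group.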
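# boxplot() returns its statistics invisibly; n$out holds the outlier values
n <- boxplot(count ~ spray, data = InsectSprays, boxwex = 0.25)
# Drop the outliers, which in this dataset occur only in sprays C and D.
# Logical negation is used instead of -which(), which would drop every row if nothing matched.
InsectSprays_without_outlier <- InsectSprays[!(InsectSprays$count %in% n$out & InsectSprays$spray %in% c("C", "D")), ]
# Overlay the boxplots of the cleaned data in red, shifted slightly to the right
boxplot(count ~ spray, data = InsectSprays_without_outlier, add = TRUE, col = 2, at = 1:nlevels(InsectSprays$spray) + 0.2, boxwex = 0.25)
# mean value per group
aggregate(count ~ spray, data = InsectSprays_without_outlier, mean)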
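As for the literal question about filtData: tapply() returns one filtered vector per ID, so a summary function can be mapped over it with sapply(). The following is only a sketch. The toy df contains just the seven rows shown in the question, and filtered() is a shortened rewrite of the question's function that keeps the same 3 * IQR rule.

```r
# Toy data: only the seven rows shown in the question (the real df has more)
df <- data.frame(
  ID    = c("A", "A", "B", "C", "B", "A", "C"),
  Value = c(70, 80, 75, 10, 50, 100, 60)
)

# Same rule as the question's filtered(): keep values within quartiles -/+ 3 * IQR
filtered <- function(x) {
  q <- quantile(x)
  iqr <- q[4] - q[2]
  x[x > q[2] - 3 * iqr & x < q[4] + 3 * iqr]
}

# One filtered vector per ID
filtData <- tapply(df$Value, df$ID, filtered)

# Apply a summary function to each group's filtered values
means <- sapply(filtData, mean)
sds   <- sapply(filtData, sd)
```

With these few rows nothing falls outside the fences, so the results are simply the per-group mean and sd of the raw values.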
Edit: a more general solution.
There must be a more elegant way, but you can try this:
# the boxplot to get the stats (rows 1 and 5 of n$stats are the whisker ends)
n <- boxplot(count ~ spray, data = InsectSprays, boxwex = 0.25)
# make a list of your data per group
a <- split(InsectSprays, InsectSprays$spray)
# Go through the list and exclude the outliers
a <- lapply(1:nlevels(InsectSprays$spray), function(i, x)
  subset(x[[i]], count <= n$stats[5, i] & count >= n$stats[1, i]), a)
# Transform to a data.frame again
InsectSprays_without_outlier <- do.call(rbind, a)
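A more compact variant of the same idea might look like the sketch below. It assumes the goal is the standard boxplot rule (points beyond the 1.5 * IQR whiskers), uses boxplot.stats() so that nothing has to be drawn, and applies it per group with ave(). The helper name is_outlier and the variable names flag and clean are made up for illustration.

```r
# Flag values that boxplot.stats() reports as outliers (beyond the 1.5 * IQR whiskers).
# `is_outlier` is a hypothetical helper name, not part of base R.
is_outlier <- function(x) x %in% boxplot.stats(x)$out

# ave() runs the helper within each spray group and returns a vector aligned
# with the original rows; the result is numeric 0/1, hence as.logical()
flag <- as.logical(ave(InsectSprays$count, InsectSprays$spray, FUN = is_outlier))
clean <- InsectSprays[!flag, ]

# Mean and sd per group after removing the outliers
aggregate(count ~ spray, data = clean,
          FUN = function(v) c(mean = mean(v), sd = sd(v)))
```

Because the flag is computed within each group, a value that is an outlier in one spray is not removed from another spray where it is ordinary.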