R: Remove outliers in a data frame grouped by factor
I have a data frame containing measurements of 3 parameters, grouped by sample:
ORD curv exp rep mu lam abs
1 Combi pH=7 Curva_F_Cor Exp_F Rep1 0.15637365 714.947.305 0.4990000
2 Combi pH=7 Curva_F_Cor Exp_F Rep10 0.12817901 6.797.925.883 0.4914276
3 Combi pH=7 Curva_F_Cor Exp_F Rep11 0.13392221 6.765.638.528 0.5261217
4 Combi pH=7 Curva_F_Cor Exp_F Rep2 0.09683254 6.671.151.868 0.4236507
5 Combi pH=7 Curva_F_Cor Exp_F Rep3 0.11249738 6.868.057.298 0.4899013
6 Combi pH=7 Curva_F_Cor Exp_F Rep4 0.10878719 6.829.856.006 0.4876704
7 Combi pH=7 Curva_F_Cor Exp_F Rep5 0.11019295 6.758.654.665 0.4871269
8 Combi pH=7 Curva_F_Cor Exp_F Rep6 0.12100511 6.733.007.508 0.4923079
9 Combi pH=7 Curva_F_Cor Exp_F Rep7 0.09803942 6.791.743.116 0.4185484
10 Combi pH=7 Curva_F_Cor Exp_F Rep8 0.13842086 6.909.115.228 0.5392007
11 Combi pH=7 Curva_F_Cor Exp_F Rep9 0.12778964 6.779.856.345 0.5475924
12 ORD0793 Curva_F_Cor Exp_F Rep1 0.13910441 7.051.072.489 0.4706000
13 ORD0793 Curva_F_Cor Exp_F Rep2 0.12603702 7.143.108.903 0.4436000
14 ORD0793 Curva_F_Cor Exp_F Rep3 0.12670842 6.989.806.663 0.4258000
15 ORD0795 Curva_F_Cor Exp_F Rep1 0.12982122 7.029.434.508 0.4996000
16 ORD0795 Curva_F_Cor Exp_F Rep2 0.13648100 6.776.386.442 0.4896000
17 ORD0795 Curva_F_Cor Exp_F Rep3 0.13593685 7.161.375.293 0.4766000
18 ORD0799 Curva_F_Cor Exp_F Rep1 0.13906691 7.065.198.206 0.4806000
19 ORD0799 Curva_F_Cor Exp_F Rep2 0.14822216 70.824.584 0.4640000
20 ORD0799 Curva_F_Cor Exp_F Rep3 0.10630870 6.669.130.811 0.4686809
21 ORD0839 Curva_F_Cor Exp_F Rep1 0.16717843 6.133.730.567 0.5458000
22 ORD0839 Curva_F_Cor Exp_F Rep2 0.09995048 7.119.564.022 0.4026000
23 ORD0839 Curva_F_Cor Exp_F Rep3 0.15911022 7.321.225.246 0.5118000
24 ORD0843 Curva_F_Cor Exp_F Rep1 0.12508123 6.579.839.732 0.5458217
25 ORD0843 Curva_F_Cor Exp_F Rep2 0.16396603 6.536.282.149 0.5210000
26 ORD0843 Curva_F_Cor Exp_F Rep3 0.15029945 7.015.299.122 0.4838000
27 ORD0847 Curva_F_Cor Exp_F Rep1 0.11697558 7.076.730.379 0.4148000
28 ORD0847 Curva_F_Cor Exp_F Rep2 0.15276497 7.181.749.575 0.5088000
29 ORD0847 Curva_F_Cor Exp_F Rep3 0.15533901 710.518.294 0.5348000
30 ORD0856 Curva_F_Cor Exp_F Rep1 0.11217122 7.940.648.197 0.4130000
31 ORD0856 Curva_F_Cor Exp_F Rep2 0.12010424 8.359.758.086 0.4446000
32 ORD0856 Curva_F_Cor Exp_F Rep3 0.13337373 811.057.251 0.4780000
I want to remove the mu, lam and abs outliers for each sample listed in the ORD column.
I found a function for removing outliers on this forum:
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
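However, I only know how to apply it to a numeric vector, or to every column of a data frame with lapply. I don't know how to apply the function to the data frame grouped by sample. I am after something like remove_outliers(mu~ORD, data=df, na.rm=TRUE).
Thanks for any help.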
We can do this with functions from dplyr. You will probably want group_by to use the ORD column as the grouping variable, and mutate to update your columns.
library(dplyr)
You can apply the function to a single column by specifying the column name and the function, as follows.
# Apply the function to one column
dt2 <- dt %>%
  group_by(ORD) %>%
  mutate(mu = remove_outliers(mu))
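To see why the grouping matters, here is a minimal sketch on made-up data. The toy data frame and its values are hypothetical and only serve the illustration; remove_outliers is the helper from the question. The value 150 is extreme within group "B", but it is not flagged when the whole column is treated as one sample.

```r
library(dplyr)

# Helper from the question: values more than 1.5 * IQR outside the quartiles become NA
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Hypothetical toy data: two samples on very different scales
toy <- data.frame(
  ORD = rep(c("A", "B"), each = 5),
  mu  = c(1, 2, 3, 4, 5, 100, 101, 102, 103, 150)
)

# Ungrouped: the fences are computed over all 10 values at once
ungrouped <- remove_outliers(toy$mu)

# Grouped: the fences are computed separately for each ORD
grouped <- toy %>%
  group_by(ORD) %>%
  mutate(mu = remove_outliers(mu)) %>%
  ungroup()
```

With these numbers, ungrouped keeps every value, while grouped$mu has NA in the last row only.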
You can also apply it to several columns at once by using mutate_at and listing the column names in vars().
# Apply the function to multiple columns
# (the function is passed directly; the funs() wrapper is deprecated in dplyr >= 0.8.0)
dt3 <- dt %>%
  group_by(ORD) %>%
  mutate_at(vars(mu, abs), remove_outliers)
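In dplyr 1.0.0 and later, across() supersedes the scoped verbs such as mutate_at, so the same step might be written as below. This is only a sketch: the toy data frame is made up for illustration, and remove_outliers is the helper from the question, repeated so that the snippet runs on its own.

```r
library(dplyr)

# Helper from the question: values more than 1.5 * IQR outside the quartiles become NA
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Hypothetical toy data with one extreme value per column
toy <- data.frame(
  ORD = rep(c("A", "B"), each = 5),
  mu  = c(1, 2, 3, 4, 5, 100, 101, 102, 103, 150),
  abs = c(0.40, 0.41, 0.42, 0.43, 0.90, 0.50, 0.51, 0.52, 0.53, 0.54)
)

# Apply remove_outliers to mu and abs within each ORD group
toy_clean <- toy %>%
  group_by(ORD) %>%
  mutate(across(c(mu, abs), remove_outliers)) %>%
  ungroup()
```

The trailing ungroup() is optional; it just returns a plain tibble so that later steps are not silently computed per group.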
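Alternatively, in base R, consider by, which splits the data frame into a list of data frames according to the listed factor, df$ORD. Afterwards, do.call(rbind, ...) row-binds all of the list elements back into a single data frame. sapply is used to run the function over the numeric columns: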
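# Split df by ORD and clean the numeric columns within each piece
dflist <- by(df, df$ORD, function(i){
  i[c("mu","lam","abs")] <- sapply(i[c("mu","lam","abs")], remove_outliers)
  return(i)
})
# Row-bind the pieces back into one data frame and reset the row names
newdf <- do.call(rbind, dflist)
rownames(newdf) <- NULL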
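Another base R option, offered here as a sketch and not as part of the original answer, is ave, which applies a function within groups and returns a vector in the original row order. The toy data frame below is hypothetical; remove_outliers is the helper from the question.

```r
# Helper from the question: values more than 1.5 * IQR outside the quartiles become NA
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Hypothetical toy data with one extreme value per column
toy <- data.frame(
  ORD = rep(c("A", "B"), each = 5),
  mu  = c(1, 2, 3, 4, 5, 100, 101, 102, 103, 150),
  abs = c(0.40, 0.41, 0.42, 0.43, 0.90, 0.50, 0.51, 0.52, 0.53, 0.54)
)

# Columns to clean; each is processed group-wise by ave()
cols <- c("mu", "abs")
toy_clean <- toy
toy_clean[cols] <- lapply(toy[cols], function(col) {
  ave(col, toy$ORD, FUN = remove_outliers)
})
```

Because ave writes its results back in place, the rows keep their original order, whereas the by plus rbind route returns them sorted by group.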