Adapting code for removing outliers: function not running in loop

Essentially, I am trying to adapt code from two sources, one of which is https://www.statology.org/remove-outliers-r/.

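The first piece of code removes outliers above the third quartile. I would like to use the interquartile range (IQR) method instead. I also need it to work with my dataset, i.e. it needs to be nested in a loop that subsets the rows of my data frame by a column called Group. I adapted the code above to work for me, but I cannot get it to run with the quartile method. Please help.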

Groups A-F have 12 rows each; these are replicates, and the columns are different time points (minutes):

# A tibble: 72 x 15
   Group    `0`   `14`     `27`  `42`  `60`  `80`  `95` `110` `125` `139` `169` `200` `229` `311`
   <chr>  <dbl>  <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 A     -0.921 -0.717 -0.310   0.608  2.35  4.22  6.04  7.63  10.5  12.1  15.2  20.1  23.6  34.8
 2 A     -0.926 -0.728 -0.367   0.392  1.92  3.66  5.80  8.17  13.8  12.4  15.5  19.6  23.7  35.0
 3 A     -0.928 -0.699 -0.212   0.864  2.73  4.37  7.13  9.60  14.7  15.2  18.9  25.0  29.7  44.4
 4 A     -0.919 -0.590  0.0813  1.47   4.07  6.47  9.43 12.5   26.3  18.7  24.6  30.8  36.9  55.4
 5 A     -0.925 -0.672 -0.199   0.701  2.37  4.06  5.94  7.97  15.2  12.4  16.1  20.1  24.8  37.9
 6 A     -0.911 -0.563  0.185   1.76   4.99  7.98 11.8  15.2   31.8  25.7  32.2  41.3  48.4  73.4
 7 A     -0.918 -0.675 -0.0975  1.21   3.09  5.43  7.83 10.8   18.3  15.0  20.8  26.7  31.9  48.8
 8 A     -0.923 -0.646 -0.0773  1.24   3.51  5.57  7.48 10.1   15.5  16.3  21.2  25.2  30.4  46.7
 9 A     -0.926 -0.634 -0.00355 1.32   3.80  6.37  8.36 11.8   13.9  16.7  23.3  27.6  32.1  49.8
10 A     -0.920 -0.653 -0.0496  1.17   3.22  5.18  6.98  9.34  11.0  13.3  19.1  22.6  27.2  40.9
# … with 62 more rows

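Adapting the first piece of code, I made this:

(The if-numeric check is there to handle the Group column. Originally I also set all negative values to NA; I recently changed this, but I want to keep that option in case we revert.)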

Outlies = function(x, probs){
  if(class(x) == "numeric"){
    x[x > quantile(x, probs)] = NA
    return(x)
  }else{
    return(x)
  }
  
}


Group<- c('A', 'B', 'C', 'D','E','F')
Cleandata = NULL

for (char in Group){
  subset <- my_data[which(my_data$Group == char),]
  df <- do.call(bind_cols, lapply(subset, Outlies, 3/4))
  Cleandata <- bind_rows(Cleandata, df)
}

Output:

# A tibble: 72 x 15
   Group    `0`   `14`    `27`   `42`  `60`  `80`  `95` `110` `125` `139` `169` `200` `229` `311`
   <chr>  <dbl>  <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 A     -0.921 -0.717 -0.310   0.608  2.35  4.22  6.04  7.63  10.5  12.1  15.2  20.1  23.6  34.8
 2 A     -0.926 -0.728 -0.367   0.392  1.92  3.66  5.80  8.17  13.8  12.4  15.5  19.6  23.7  35.0
 3 A     -0.928 -0.699 -0.212   0.864  2.73  4.37  7.13  9.60  14.7  15.2  18.9  25.0  29.7  44.4
 4 A     -0.919 NA     NA      NA     NA    NA    NA    NA     NA    NA    NA    NA    NA    NA  
 5 A     -0.925 -0.672 -0.199   0.701  2.37  4.06  5.94  7.97  15.2  12.4  16.1  20.1  24.8  37.9
 6 A     NA     NA     NA      NA     NA    NA    NA    NA     NA    NA    NA    NA    NA    NA  
 7 A     NA     -0.675 -0.0975  1.21   3.09  5.43  7.83 10.8   NA    15.0  20.8  26.7  31.9  48.8
 8 A     -0.923 -0.646 -0.0773  1.24   3.51  5.57  7.48 10.1   15.5  16.3  21.2  25.2  30.4  46.7
 9 A     -0.926 NA     NA      NA     NA    NA    NA    NA     13.9  NA    NA    NA    NA    NA  
10 A     -0.920 -0.653 -0.0496  1.17   3.22  5.18  6.98  9.34  11.0  13.3  19.1  22.6  27.2  40.9
# … with 62 more rows

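This runs exactly as I want, but I realised I don't like this method of removing outliers. I would like to use the quartile method, where outliers are "below Q1 − 1.5 IQR or above Q3 + 1.5 IQR". I also tried adapting some code from here: https://www.statology.org/remove-outliers-r/

New code and output: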

Outliesplease = function(x){
  Q1 <- quantile(x, probs=.25)
  Q3 <- quantile(x, probs=.75)
  iqr = Q3-Q1
  
  upper_limit = Q3 + (iqr*1.5)
  lower_limit = Q1 - (iqr*1.5)
  if(class(x) == "numeric"){
    x[x> upper_limit | x < lower_limit] = NA
    return(x)
  }else{
    return(x)
  }
}

> Outliesplease(D$`42`)
 [1]       NA 2.428675 2.428384 2.714187 2.457054 2.464337 2.686667 2.166072 2.690987 2.632692
[11]       NA 2.849850

But this code does not work in the for loop I used at the top. I think it is because it can only handle single-column arrays.

Error:

 Error in (1 - h) * qs[i] : non-numeric argument to binary operator 
5.
quantile.default(x, probs = 0.25) 
4.
quantile(x, probs = 0.25) 
3.
FUN(X[[i]], ...) 
2.
lapply(subset, Outliesplease) 
1.
do.call(bind_cols, lapply(subset, Outliesplease)) 
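
The traceback suggests that quantile() is being reached with a non-numeric input. The following is a minimal sketch of that situation, assuming the culprit is the character Group column; the vector `group_col` is made up to mimic it and is not taken from the real data.

```r
# Made-up stand-in for the character Group column of one subset.
group_col <- rep("A", 12)

# lapply() over a data frame hands every column to the function, including
# this one. In Outliesplease(), quantile() runs before the class check, so
# the check never gets a chance to skip the column.
res <- tryCatch(quantile(group_col, probs = 0.25), error = function(e) e)
inherits(res, "error")

# Testing the type first would avoid calling quantile() on text at all.
is.numeric(group_col)
```

If this is indeed the cause, moving the numeric check above the quantile() calls should be enough for the function to work inside the original loop.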

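I have tried writing new nested loops, but so far they have not worked, and they are getting long. Any help is much appreciated.

The version below returns a tibble but does not produce any NAs, and x[[i]] gives the error "subscript out of bounds" when I try to step through the lines.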

Outliesplease = function(x){
  for(i in 1:ncol(x)){
    if(class(x[[i]]) == "numeric"){
      Q1 <- quantile(x, probs=.25)
      Q3 <- quantile(x, probs=.75)
      iqr = Q3-Q1
      
      upper_limit = Q3 + (iqr*1.5)
      lower_limit = Q1 - (iqr*1.5)
      x[x> upper_limit | x < lower_limit] = NA
      return(x)
    }else{
      return(x)
    }
  }
}

Data (dput(my_data[1:24, 1:10])):

structure(list(Group = c("A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B"), `0` = c(-0.920718681650864, -0.926284584901359, 
-0.928389338231378, -0.919206377406332, -0.925489455865574, -0.910748387172736, 
-0.917616119334762, -0.923166431819849, -0.925629772754242, -0.920438047873528, 
-0.92294036572144, -0.915745227485856, -1.69199983248014, -1.70459419320581, 
-1.70676758766494, -1.69535549688934, -1.70584725776368, -1.70666847521404, 
-1.70083499953221, -1.71102942305384, -1.70695165364519, -1.70062969516963, 
-1.7134930754049, -1.70824719496773), `14` = c(-0.717087694662867, 
-0.727938867386521, -0.699111542147963, -0.590022956591343, -0.671788725771233, 
-0.563409520040657, -0.674930265000854, -0.646204279737445, -0.634129231929299, 
-0.652767991974024, -0.700865503256312, -0.674283248236441, -1.38233713854987, 
-1.4058905045613, -1.331003968442, -1.32096529305751, -1.34241605921761, 
-1.30258701287546, -1.36194121204584, -1.38785911795742, -1.32276347609535, 
-1.41574511396555, -1.43304023664842, -1.43429330120629), `27` = c(-0.310028400637171, 
-0.367347849658025, -0.212126189260467, 0.0813387880058272, -0.198866243281347, 
0.184635404213543, -0.0975496542803897, -0.0773362269295024, 
-0.00354513425557315, -0.0496080506521766, -0.212157370791282, 
-0.145584802501048, -0.532695653169101, -0.741999911097547, -0.470785768657541, 
-0.64911738567826, -0.543272367572791, -0.522904758911869, -0.552964149379117, 
-0.612636924284542, -0.477362587721147, -0.634356709954235, -0.777630837197797, 
-0.788073041846688), `42` = c(0.607589483572235, 0.391672973443089, 
0.864400571365382, 1.47484919551526, 0.700635171524487, 1.76405009844259, 
1.2081067901575, 1.23822814892489, 1.32286261893977, 1.17345631403922, 
0.885775510739131, 1.23909343640501, 0.891691855547406, 0.549640628553635, 
1.24951612115659, 0.674047992821404, 0.89237856324296, 0.887670721824986, 
0.951194723394137, 0.768459681768936, 0.920901710721185, 0.816104452810994, 
0.6222900550668, 0.528862411167537), `60` = c(2.35269503717001, 
1.91507004756281, 2.7283623250476, 4.06823270417236, 2.37029701131513, 
4.99217264375453, 3.09496358284071, 3.51067575166759, 3.79890223175695, 
3.22287801762694, 2.83296856554955, 2.80565354455553, 3.05406359423973, 
2.59544904552122, 3.6861957266493, 2.78357863625995, 2.84691149238807, 
2.9002410704356, 3.33699424426883, 2.78563875934661, 3.24748154218031, 
3.03239336479549, 2.89073335460952, 2.42163412447745), `80` = c(4.22194985570809, 
3.66147743007211, 4.37329721090187, 6.4729211791019, 4.06143513045467, 
7.97675963801785, 5.43250263115993, 5.5739108734064, 6.37457463091108, 
5.1765490354642, 4.88228113377942, 4.80671269384903, 5.47211005896236, 
5.16329690032557, 6.82715424934945, 4.78752620164185, 5.19603940642802, 
5.0833981059748, 5.50798876618987, 5.42161934468718, 5.42710592679083, 
5.27515238063233, 5.28819274738708, 4.39058083576842), `95` = c(6.03690983371438, 
5.7998522456926, 7.12957303653705, 9.43214199804714, 5.94493990857525, 
11.8180436006606, 7.82973872022466, 7.47695867596535, 8.35890488892228, 
6.98265345871901, 6.67122012432051, 6.08213864416168, 6.50883337488658, 
7.03303204826091, 9.19775416793183, 6.64392364546973, 7.52055911480048, 
7.31507776569264, 8.43660594228742, 7.0626383532383, 7.77525687524329, 
7.60673031140136, 7.58134336504819, 5.40544277680737), `110` = c(7.63013794609452, 
8.16696697598995, 9.59730277139201, 12.4602350231802, 7.96996206430016, 
15.2471700889893, 10.8204762714413, 10.1465498459346, 11.785006768762, 
9.33682005834538, 9.48028628162565, 9.04492195300255, 8.75188972807613, 
9.4568624324399, 12.0358816763534, 8.95913386291792, 9.57352486661554, 
9.18373683558959, 11.3559065479999, 9.10912639844067, 9.55695184893212, 
9.52058465891086, 9.83057300803695, 7.86015376890397), `125` = c(10.5039681429026, 
13.8202876481251, 14.6520393918524, 26.3212361902925, 15.1860698793572, 
31.8006028973928, 18.2751539787646, 15.4837911355797, 13.9013596282443, 
10.9885525179173, 12.881450732196, 13.1384800907048, 8.97031233148782, 
10.4978971399785, 15.050280678709, 7.78937332003644, 9.42176246589807, 
10.7493595868453, 12.8327174637843, 10.1658562705261, 12.13019425285, 
11.1022919450563, 12.4059675680316, 10.0799753318171)), row.names = c(NA, 
-24L), class = c("tbl_df", "tbl", "data.frame"))

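I tried to ask this question earlier today, but I did not pose it properly. I was going to delete it, but Stack advises against that. Sorry if you tried to help me earlier.

You can write a function that removes outliers and pass only the numeric columns to it, group by group. With dplyr you can do this as follows: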

Outliesplease = function(x){
  Q1 <- quantile(x, probs=.25)
  Q3 <- quantile(x, probs=.75)
  iqr = Q3-Q1
  upper_limit = Q3 + (iqr*1.5)
  lower_limit = Q1 - (iqr*1.5)
  x[x> upper_limit | x < lower_limit] = NA
  return(x)
}

library(dplyr)

my_data <- my_data %>% 
            group_by(Group) %>% 
            mutate(across(where(is.numeric), Outliesplease))
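
If you would rather keep a loop-style approach without dplyr, the same idea can be sketched in base R. This is only an illustration under some assumptions: the helper name `remove_iqr_outliers` and the data frame `toy` are made up, and `na.rm = TRUE` is an added choice so that columns already containing NA do not stop quantile().

```r
# Hypothetical helper: same IQR rule as above, but non-numeric columns are
# returned untouched before quantile() is ever called.
remove_iqr_outliers <- function(x) {
  if (!is.numeric(x)) return(x)
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  x[x > q[[2]] + 1.5 * iqr | x < q[[1]] - 1.5 * iqr] <- NA
  x
}

# Made-up data: two groups of six rows, one extreme value in each group.
toy <- data.frame(
  Group = rep(c("A", "B"), each = 6),
  t0 = c(1, 2, 3, 4, 5, 50, 10, 11, 12, 13, 14, -40),
  stringsAsFactors = FALSE
)

# Process each group separately, then stack the pieces back together.
pieces <- lapply(split(toy, toy$Group), function(d) {
  d[] <- lapply(d, remove_iqr_outliers)  # d[] keeps the data frame shape
  d
})
clean <- do.call(rbind, pieces)
```

Because the quartiles are computed within each group, a value is only flagged relative to its own replicates, which matches what group_by() does in the dplyr version.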