How can I find means for each row in my dataframe with the outliers removed?

Question

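I'm fairly new to R, so apologies if this is a silly question.

I've used readxl to import a dataframe of countries and their respective scores over a period of time. Here is part of that dataframe: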

structure(list(X1 = 2:5, Argent. = structure(1:4, .Label = c("Austra~", 
"Austria", "Belgium", "Benin"), class = "factor"), ARG = structure(1:4, .Label = c("AUS", 
"AUT", "BEL", "BEN"), class = "factor"), X0.165. = structure(4:1, .Label = c("-0.38~", 
"1.711~", "1.731~", "1.800~"), class = "factor"), X0.376. = structure(c(2L, 
4L, 3L, 1L), .Label = c("-0.22~", "1.682~", "1.838~", "1.872~"
), class = "factor"), X3.217. = structure(c(3L, 4L, 2L, 1L), .Label = c("-0.23~", 
"1.734~", "1.810~", "1.929~"), class = "factor"), X.0.28. = structure(c(2L, 
3L, 4L, 1L), .Label = c("-0.36~", "1.718~", "1.942~", "1.978~"
), class = "factor"), X.4.74. = structure(c(2L, 4L, 3L, 1L), .Label = c("-0.28~", 
"1.837~", "1.933~", "1.995~"), class = "factor"), X.5.75. = structure(c(4L, 
2L, 3L, 1L), .Label = c("-0.35~", "1.865~", "1.875~", "2.006~"
), class = "factor"), X.0.12. = structure(c(4L, 2L, 3L, 1L), .Label = c("-0.65~", 
"1.684~", "1.711~", "1.751~"), class = "factor"), X.4.55. = structure(c(2L, 
4L, 3L, 1L), .Label = c("-0.60~", "1.711~", "1.747~", "1.831~"
), class = "factor")), class = "data.frame", row.names = c(NA, 
-4L))

   Country scode `1996` `1998` `2000` `2002` `2003` `2004` `2005` `2006`
   <chr>   <chr> <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
 1 Argent~ ARG   0.165~ 0.376~ 3.217~ -0.28~ -4.74~ -5.75~ -0.12~ -4.55~
 2 Austra~ AUS   1.800~ 1.682~ 1.810~ 1.718~ 1.837~ 2.006~ 1.751~ 1.711~
 3 Austria AUT   1.731~ 1.872~ 1.929~ 1.942~ 1.995~ 1.865~ 1.684~ 1.831~
 4 Belgium BEL   1.711~ 1.838~ 1.734~ 1.978~ 1.933~ 1.875~ 1.711~ 1.747~
 5 Benin   BEN   -0.38~ -0.22~ -0.23~ -0.36~ -0.28~ -0.35~ -0.65~ -0.60~

I now want to identify the outliers in each row, remove them, and then take a trimmed mean score for each row.

This method seems to work:

ARGgoveff <- as.numeric(wgidataset2[1,c(3:22)])
outlier <- boxplot(ARGgoveff, plot=FALSE)$out
outlier

#mean with outlier removed
ARGsum <- ARGgoveff - outlier
ARGsum <- sum(ARGsum)
ARGmean <- ARGsum/19 #there are 20 columns of observations - with 1 outlying observation removed, sum is divided by 19
ARGmean #-0.4572065

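But I need to repeat this for all 64 countries, and then repeat it again with new scores from a different dataset... Is there a more efficient way? Am I missing something obvious?

Thanks in advance.

Edit: sample data with all 22 columns and the first 5 rows. The column names are the same as in the original Excel dataset, before rename():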

structure(list(`Government Effectiveness` = c("Argentina", "Australia", 
"Austria", "Belgium", "Benin"), ...2 = c("ARG", "AUS", "AUT", 
"BEL", "BEN"), ...3 = c("0.16569004952907562", "1.8005645275115967", 
"1.73164963722229", "1.7115436792373657", "-0.38056445121765137"
), ...9 = c("0.37648507952690125", "1.6824686527252197", "1.8727401494979858", 
"1.8385097980499268", "-0.22818222641944885"), ...15 = c("3.2177139073610306E-2", 
"1.8101874589920044", "1.9292815923690796", "1.7346757650375366", 
"-0.23424351215362549"), ...21 = c("-0.28035733103752136", "1.7189428806304932", 
"1.9426975250244141", "1.9780101776123047", "-0.36772772669792175"
), ...27 = c("-4.7438342124223709E-2", "1.8376433849334717", 
"1.9958460330963135", "1.9337741136550903", "-0.28266650438308716"
), ...33 = c("-5.7569071650505066E-2", "2.0067775249481201", 
"1.8653030395507813", "1.8758641481399536", "-0.35626286268234253"
), ...39 = c("-0.12421452254056931", "1.7512129545211792", "1.6845946311950684", 
"1.7115614414215088", "-0.65762680768966675"), ...45 = c("-4.5504238456487656E-2", 
"1.7119560241699219", "1.8310364484786987", "1.7471009492874146", 
"-0.60658884048461914"), ...51 = c("-1.5958933159708977E-2", 
"1.8255587816238403", "1.8701866865158081", "1.608859658241272", 
"-0.53269058465957642"), ...57 = c("-0.14681185781955719", "1.7939697504043579", 
"1.7808396816253662", "1.3897556066513062", "-0.46236807107925415"
), ...63 = c("-0.31826484203338623", "1.7057873010635376", "1.6665796041488647", 
"1.5713415145874023", "-0.5754820704460144"), ...69 = c("-0.16279000043869019", 
"1.7687559127807617", "1.8417626619338989", "1.5750259160995483", 
"-0.59468859434127808"), ...75 = c("-0.12005612999200821", "1.6959501504898071", 
"1.6177613735198975", "1.6563558578491211", "-0.54589664936065674"
), ...81 = c("-0.23857522010803223", "1.621440052986145", "1.575872540473938", 
"1.6024763584136963", "-0.51528728008270264"), ...87 = c("-0.27754887938499451", 
"1.6398694515228271", "1.5884771347045898", "1.6073391437530518", 
"-0.48944029211997986"), ...93 = c("-0.15913525223731995", "1.6071146726608276", 
"1.5725857019424438", "1.381594181060791", "-0.45681807398796082"
), ...99 = c("-7.5008101761341095E-2", "1.5645337104797363", 
"1.4791269302368164", "1.4384405612945557", "-0.61963582038879395"
), ...105 = c("0.16212919354438782", "1.5685094594955444", "1.5119049549102783", 
"1.3262134790420532", "-0.56419026851654053"), ...111 = c("0.14954197406768799", 
"1.5355139970779419", "1.4612606763839722", "1.1816927194595337", 
"-0.6482049822807312"), ...117 = c("2.5987762957811356E-2", "1.5964949131011963", 
"1.4533424377441406", "1.171748161315918", "-0.56598663330078125"
)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))

Answer 1

Try this.

mean_without_outlier_fun <- function(x) {
  # values flagged by the boxplot rule (more than 1.5 * IQR beyond the hinges)
  outlier <- boxplot(x, plot = FALSE)$out
  vec_without_outlier <- x[!x %in% outlier] # remove outlier values from the vector
  mean(vec_without_outlier) # mean of the values that remain
}
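To see what the boxplot rule actually drops, here is a small check on made-up numbers (not the WGI data). It calls boxplot.stats(), the helper that boxplot() uses internally with its default coef = 1.5, so treat it as a sketch of the rule rather than anything specific to your dataset.

```r
# Made-up values for illustration; 100 lies far from the rest
x <- c(1, 2, 3, 4, 100)

# The hinges are 2 and 4, so the fences sit at 2 - 1.5 * 2 = -1 and 4 + 1.5 * 2 = 7
out <- boxplot.stats(x)$out
out
# [1] 100

# Keep everything that was not flagged, then average
kept <- x[!x %in% out]
mean(kept)
# [1] 2.5
```

A row can have zero, one or several flagged values, which is why the function divides by the number of values kept rather than by a fixed count.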

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(cols = -c(Id, Country, scode), names_to = "year", values_to = "value") %>%
  select(-year, -Id) %>%
  group_by(Country, scode) %>%
  summarise(mean_no_outlier = mean_without_outlier_fun(as.numeric(value)))

Assumptions

Your column names are as follows:

[1] "Id"      "Country" "scode"   "x1"     
 [5] "x2"      "x3"      "x4"      "x5"     
 [9] "x6"      "x7"      "x8"

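Note: at a minimum, the columns Id, Country and scode must be present. The remaining column names can be anything. Also, if you don't have an Id column, remove it from the code.

If needed, rename the columns as follows.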

names(df) <- c("Id", "Country", "scode", paste0("x",1:8))
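If you would rather stay in base R and work row by row, something along these lines should also do it. This is only a sketch on a made-up data frame: the helper name mean_no_outlier, the column names y1 to y5 and the values are all hypothetical, and it assumes the first two columns are identifiers and every other column holds scores stored as text, as in the question.

```r
# Hypothetical helper: mean of x after dropping values flagged by the boxplot rule
mean_no_outlier <- function(x) {
  out <- boxplot.stats(x)$out
  mean(x[!x %in% out])
}

# Made-up data in the same shape as the question: two id columns, then scores as text
df <- data.frame(
  Country = c("A", "B"),
  scode   = c("AAA", "BBB"),
  y1 = c("1", "10"), y2 = c("2", "11"), y3 = c("3", "12"),
  y4 = c("4", "13"), y5 = c("100", "14"),
  stringsAsFactors = FALSE
)

# For each row: take the score columns, convert them to numeric, apply the helper
df$mean_no_outlier <- apply(
  df[, -(1:2)], 1,
  function(r) mean_no_outlier(as.numeric(r))
)

df[, c("Country", "scode", "mean_no_outlier")]
```

Here the first row loses its 100 and averages 2.5, while the second row has nothing flagged and averages 12. To reuse it on another dataset, adjust the -(1:2) index so that it excludes whatever non-score columns that dataset has.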
