R: Operations on data.table by group, removing outliers

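I have a data.table in which I want to detect the presence of outliers (based on skewness and kurtosis) and, if any are found, correct them.

To do this, when an outlier is detected and var is the highest value, I want to set the highest value of var equal to the second highest. Below is a minimal (almost) working example of my code: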

library(data.table)
`%>%` <- fastpipe::`%>>%`

country <- rep(c("AA", "BB", "CC", "ZZ"), times = 4)
year <- rep(c("2014", "2015", "2016", "2017"), each = 4)
var <- c(NA, rnorm(8, 2, 4), NA, NA, 1, 25, 19, 2, 3)

melted_data <- data.table(country, year, var)

melted_data %>%
  .[, skew := e1071::skewness(var, na.rm = TRUE), by=year] %>%
  .[, kurt:= moments::kurtosis(var, na.rm = TRUE), by=year] %>%
  .[, outliers := kurt>1 || kurt>3.5 & abs(skew)>2, by=year] %>%

  # Ranks
  .[, rank_high_first := as.integer(frank(-var, na.last="keep", ties.method="min")), by=year] %>%
  .[, rank_low_first := as.integer(frank(var, na.last="keep", ties.method="min")), by=year]  %>%

  # Identify and correct outliers
  .[rank_high_first==1, highest1 := var, by=year] %>%
  .[rank_high_first==2, highest2 := var, by=year] %>%
  .[rank_low_first==1, lowest1 := var, by=year] %>%
  .[rank_low_first==2, lowest2 := var, by=year] %>%
  .[outliers==TRUE & skew>0 & var==highest1, var<-highest2, by=year]

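Everything I want to achieve is in the last line. However, it does not work, because the values highest1 and highest2 do not span the whole year group (edit: see also the screenshot below). I think the solution is to modify the following lines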

.[rank_high_first==1, highest1 := var, by=year] %>%
.[rank_high_first==2, highest2 := var, by=year] %>%

so that highest1 and highest2 are copied to all rows of that year. How can I do this? I also tried the following, but it did not work:

.[, highest1 := var[rank_high_first==1], by=year]

In the "Identify and correct outliers" section, I believe you can make the following change:

f <- function(v,r,i) v[r==i & !is.na(r)]
melted_data[, `:=`(
  highest1 = f(var,rank_high_first,1),highest2=f(var,rank_high_first,2),
  lowest1 = f(var,rank_low_first,1),lowest2=f(var,rank_low_first,2)
),by=year]
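To illustrate why the helper filters on !is.na(r), here is a small sketch on made-up data (the table dt and its values are hypothetical, not the asker's). A logical index that contains NA returns an NA element, so var[rank_high_first == 1] yields more than one value in a group with a missing var, and := cannot recycle it across the group. The sketch assumes there are no ties at rank 1 or 2; with ties, the helper would no longer return a single value per group.

```r
library(data.table)

# Hypothetical toy data: the 2014 group contains one missing value.
dt <- data.table(year = c("2014", "2014", "2014", "2015", "2015", "2015"),
                 var  = c(NA, 5, 9, 1, 7, 3))
dt[, rank_high_first := frank(-var, na.last = "keep", ties.method = "min"), by = year]

# The NA rank produces an NA in the logical index, which returns an NA element:
# the result has length 2 (NA and 9) instead of 1.
bad <- dt[year == "2014", var[rank_high_first == 1]]
length(bad)

# Same helper as in the answer: NA ranks are treated as FALSE,
# so exactly one value per group is left and := recycles it to every row.
f <- function(v, r, i) v[r == i & !is.na(r)]
dt[, `:=`(highest1 = f(var, rank_high_first, 1),
          highest2 = f(var, rank_high_first, 2)), by = year]
dt
```

After this, every row of a given year carries the same highest1 and highest2, which is what the final comparison var == highest1 needs.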

Also, I wonder about your original definition of outliers. Should parentheses be added, as below?

  .[, outliers := kurt>1 | (kurt>3.5 & abs(skew)>2), by=year]
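As a side note on that line, here is a small sketch with made-up vectors (kurt and skew below are hypothetical values, not computed from the data). It shows two things: & binds more tightly than |, so the explicit parentheses only make the existing grouping visible, and the element-wise | is the operator to use on a column, since || is meant for single values (from R 4.3.0 on, || with an operand of length greater than one is an error).

```r
# Hypothetical per-row values, for illustration only.
kurt <- c(0.5, 2, 4)
skew <- c(3, 0.1, -2.5)

# & has higher precedence than |, so these two expressions are the same.
without_parens <- kurt > 1 | kurt > 3.5 & abs(skew) > 2
with_parens    <- kurt > 1 | (kurt > 3.5 & abs(skew) > 2)
identical(without_parens, with_parens)

# A different grouping gives a different answer, so it is worth writing
# the parentheses for whichever rule is actually intended.
other_grouping <- (kurt > 1 | kurt > 3.5) & abs(skew) > 2
other_grouping
```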

The problem you are seeing in the last two lines can, I believe, be solved like this:

  .[outliers==TRUE & skew>0 & var==highest1, var:=highest2] %>%
  .[outliers==TRUE & skew<0 & var==lowest1, var:=lowest2]

Note: by=year is not needed here, and := should be used instead of <-.
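Putting the pieces together, a minimal sketch of the correction step on made-up data might look like this. The table dt is hypothetical, and the skew and outliers columns are hard-coded rather than computed with e1071 or moments, just to keep the example self-contained.

```r
library(data.table)

# Hypothetical data: 2016 has one extreme value and is flagged as an outlier year.
dt <- data.table(
  year     = rep(c("2016", "2017"), each = 4),
  var      = c(1, 2, 3, 50, 4, 5, 6, 7),
  skew     = rep(c(1.9, 0), each = 4),        # hard-coded for illustration
  outliers = rep(c(TRUE, FALSE), each = 4)    # hard-coded for illustration
)

dt[, rank_high_first := frank(-var, na.last = "keep", ties.method = "min"), by = year]
f <- function(v, r, i) v[r == i & !is.na(r)]
dt[, `:=`(highest1 = f(var, rank_high_first, 1),
          highest2 = f(var, rank_high_first, 2)), by = year]

# i selects the rows to fix and := updates var by reference.
# No by is needed because highest2 already holds the group value on every row.
dt[outliers == TRUE & skew > 0 & var == highest1, var := highest2]
dt$var
```

Only the flagged maximum in 2016 is replaced by the second-highest value; the 2017 rows are left as they were.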