R: Operating on a data.table by group to remove outliers
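I have a data.table in which I want to detect outliers (based on skewness and kurtosis) and, if any are found, correct them.
To do this, when an outlier is detected and var is the highest value in its group, I want to set that highest value of var equal to the second highest. Below is a minimal (almost) working example of my code: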
`%>%` <- fastpipe::`%>>%`
country <- rep(c("AA", "BB", "CC", "ZZ"), times = 4)
year <- rep(c("2014", "2015", "2016", "2017"), each = 4)
var <- c(NA, rnorm(8, 2, 4), NA, NA, 1, 25, 19, 2, 3)
melted_data <- data.table(country, year, var)
melted_data %>%
.[, skew := e1071::skewness(var, na.rm = TRUE), by=year] %>%
.[, kurt:= moments::kurtosis(var, na.rm = TRUE), by=year] %>%
.[, outliers := kurt>1 || kurt>3.5 & abs(skew)>2, by=year] %>%
# Ranks
.[, rank_high_first := as.integer(frank(-var, na.last="keep", ties.method="min")), by=year] %>%
.[, rank_low_first := as.integer(frank(var, na.last="keep", ties.method="min")), by=year] %>%
# Identify and correct outliers
.[rank_high_first==1, highest1 := var, by=year] %>%
.[rank_high_first==2, highest2 := var, by=year] %>%
.[rank_low_first==1, lowest1 := var, by=year] %>%
.[rank_low_first==2, lowest2 := var, by=year] %>%
.[outliers==TRUE & skew>0 & var==highest1, var<-highest2, by=year]
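Everything I want to achieve is in the last line. However, it does not work, because the values of highest1 and highest2 do not span the whole year group: each is filled in only on the single row that holds that rank (edit: see also the screenshot below). I think the solution is to modify the following lines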
.[rank_high_first==1, highest1 := var, by=year] %>%
.[rank_high_first==2, highest2 := var, by=year] %>%
so that highest1 and highest2 are copied to every row of that year. How can I do this? I also tried the following, but it did not work:
.[, highest1 := var[rank_high_first==1], by=year]
In the "Identify and correct outliers" section, I believe you can make the following change:
f <- function(v,r,i) v[r==i & !is.na(r)]
melted_data[, `:=`(
highest1 = f(var,rank_high_first,1),highest2=f(var,rank_high_first,2),
lowest1 = f(var,rank_low_first,1),lowest2=f(var,rank_low_first,2)
),by=year]
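To see why the helper filters on !is.na(r), here is a small base R sketch on a made-up group (the vectors v and r are illustrative, not taken from the question's data). A rank of NA turns the comparison r == i into NA, and indexing with NA returns an extra NA element, so the result is no longer a single value that data.table can recycle across the group:

```r
# Toy group: one value is missing, so its rank is NA as well
v <- c(NA, 1, 25, 19)
r <- c(NA, 3L, 1L, 2L)   # ranks, highest first, NA kept (as with na.last = "keep")

# Plain logical indexing: the NA in the index leaks into the result
naive <- v[r == 1]

# The helper from the answer: NA ranks are treated as "not a match"
f <- function(v, r, i) v[r == i & !is.na(r)]
clean <- f(v, r, 1)

print(naive)  # NA 25  (length 2)
print(clean)  # 25     (length 1, safe to recycle within the group)
```

Note that with ties.method="min", two rows tied for first place would both get rank 1, so the helper would again return more than one value; this sketch assumes no ties.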
I also wonder about your original definition of outliers. Should it use the vectorized | and parentheses, like this?
.[, outliers := kurt>1 | (kurt>3.5 & abs(skew)>2), by=year]
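A short base R sketch of the difference, using made-up kurt and skew values. The operator || is meant for length-one conditions (from R 4.3.0 it is an error to give it a longer vector), whereas | works element by element, which is what a column assignment needs. Since & already binds more tightly than |, the parentheses only make the grouping explicit:

```r
# Hypothetical per-row statistics, just for illustration
kurt <- c(0.5, 2, 4)
skew <- c(0, 1, 3)

# Element-wise version: one TRUE/FALSE per row
flag <- kurt > 1 | (kurt > 3.5 & abs(skew) > 2)
print(flag)  # FALSE TRUE TRUE

# Same result without parentheses, because & has higher precedence than |
flag_no_parens <- kurt > 1 | kurt > 3.5 & abs(skew) > 2

# As written, the condition reduces to kurt > 1, because kurt > 3.5 implies kurt > 1
same_as_simple <- identical(flag, kurt > 1)
print(same_as_simple)  # TRUE
```

If the intent was to require both conditions, or to use a different first threshold, the expression would need to change accordingly; that is a question about the definition rather than the syntax.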
The problem you are seeing in the last two lines can, I believe, be solved as follows:
.[outliers==TRUE & skew>0 & var==highest1, var:=highest2] %>%
.[outliers==TRUE & skew<0 & var==lowest1, var:=lowest2]
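Note: by=year is not needed here, and you should use := instead of <-. Inside j, var<-highest2 only assigns to a local variable and leaves the column unchanged.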
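Putting the pieces together, here is a minimal sketch of the correction step on its own. It makes several assumptions that are not in the original code: the outliers and skew columns are hard-coded with hypothetical values instead of being computed with e1071 and moments, the data are made up, and sort() is used in place of the rank columns to get the two highest and two lowest values per year (sort() drops NA by default).

```r
library(data.table)

# Made-up data; outliers and skew are hypothetical stand-ins for the computed columns
dt <- data.table(
  year     = rep(c("2016", "2017", "2018"), each = 4),
  var      = c(NA, 1, 25, 19,   2, 3, 4, NA,   -30, 5, 6, 7),
  outliers = rep(c(TRUE, FALSE, TRUE), each = 4),
  skew     = rep(c(1, 0.5, -1), each = 4)
)

# Each expression returns one value per group, which is recycled to all rows of that year
dt[, `:=`(
  highest1 = sort(var, decreasing = TRUE)[1],
  highest2 = sort(var, decreasing = TRUE)[2],
  lowest1  = sort(var)[1],
  lowest2  = sort(var)[2]
), by = year]

# Row-wise corrections: no by= needed, and := updates the column in place
dt[outliers == TRUE & skew > 0 & var == highest1, var := highest2]
dt[outliers == TRUE & skew < 0 & var == lowest1,  var := lowest2]

print(dt[, .(year, var)])
```

In this toy data, 25 in 2016 is replaced by 19 and -30 in 2018 is replaced by 5, while 2017 is left untouched because it is not flagged. Rows where var is NA are skipped, since data.table treats NA in the i condition as FALSE.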