Replace NA with the nearest value based on another variable, while keeping NA for observations that have no non-missing neighbour

Question

I have data that looks like this:

year <- c(2000,2001,2002,2003,2005,2006,2007,2008,2009,2010)
x <- c(1,2,3,NA,5,NA,NA,NA,9,10)
dat <- data.frame(year, x)

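I want to replace each NA with the value of its nearest neighbour, where nearness is measured by the year variable.

For example, the fourth element (the first NA) should take the value of its left neighbour rather than its right neighbour, because its year, 2003, is closer to 2002 than to 2005.

I also want to leave an NA in place when it has no nearest non-NA neighbour.

For example, the seventh element (the third NA) should stay NA, because neither of its adjacent neighbours is non-NA.

After imputation, the resulting x should be 1, 2, 3, 3, 5, 5, NA, 9, 9, 10.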

Answer 1

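One option is case_when() from the tidyverse. For each NA row: if the previous row is closer in year and is not NA, return that row's x. If instead the next row is closer and is not NA, return the next row's x. If the previous row is NA, return the next row's x, and if the next row is NA, return the previous row's x. A row that is not NA simply keeps its own x. Note that when both neighbours are non-NA and equally distant, none of the NA branches match, so that row stays NA.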

library(tidyverse)

dat %>%
  mutate(x = case_when(is.na(x) & !is.na(lag(x)) & year - lag(year) < lead(year) - year ~ lag(x),
                       is.na(x) & !is.na(lead(x)) & year - lag(year) > lead(year) - year ~ lead(x),
                       is.na(x) & is.na(lag(x)) ~ lead(x),
                       is.na(x) & is.na(lead(x)) ~ lag(x),
                       TRUE ~ x))

Output: x becomes 1, 2, 3, 3, 5, 5, NA, 9, 9, 10, which matches the result requested in the question.
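
To see what the case_when() conditions are comparing, here is a minimal base R sketch that builds the same quantities by hand. The column names prev_x, next_x, gap_prev and gap_next are made up for illustration and are not part of the answer; the shifted vectors are a hand-rolled stand-in for dplyr's lag() and lead().

```r
year <- c(2000, 2001, 2002, 2003, 2005, 2006, 2007, 2008, 2009, 2010)
x <- c(1, 2, 3, NA, 5, NA, NA, NA, 9, 10)
n <- length(x)

# Shift by one position and pad with NA, mimicking lag() and lead()
prev_x <- c(NA, x[-n])
next_x <- c(x[-1], NA)

# Distance in years to the previous and the next row
gap_prev <- year - c(NA, year[-n])
gap_next <- c(year[-1], NA) - year

checks <- data.frame(year, x, prev_x, next_x, gap_prev, gap_next)

# Show only the rows that need imputing
checks[is.na(x), ]
```

For the 2003 row, gap_prev is 1 and gap_next is 2, so the first branch picks the previous value. For the 2007 row, both prev_x and next_x are NA, which is why that row is left as NA.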

Answer 2

Using imap_dbl() from purrr:

library(tidyverse)

dat %>%
  mutate(new = imap_dbl(x, ~ {
    if(is.na(.x)) {
      dist <- abs(year[-.y] - year[.y])
      res <- x[-.y][dist == min(dist, na.rm = TRUE)]
      if(all(is.na(res))) NA else na.omit(res)
    } else .x
  }))

#    year  x new
# 1  2000  1   1
# 2  2001  2   2
# 3  2002  3   3
# 4  2003 NA   3
# 5  2005  5   5
# 6  2006 NA   5
# 7  2007 NA  NA
# 8  2008 NA   9
# 9  2009  9   9
# 10 2010 10  10
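
The same idea can be sketched in base R without purrr. This is only an illustration under stated assumptions: fill_nearest is a hypothetical helper name, and the tie rule (keep the earlier value when two non-missing values are equally close) is a choice made here, not something the answer specifies.

```r
# Hypothetical helper: fill each NA in x from the observation(s) closest in key,
# leaving NA where every closest observation is itself NA.
fill_nearest <- function(x, key) {
  if (length(x) < 2) return(x)
  vapply(seq_along(x), function(i) {
    if (!is.na(x[i])) return(x[i])
    # distance from row i to every other row
    dist <- abs(key[-i] - key[i])
    # values held by the rows at the minimum distance
    candidates <- x[-i][dist == min(dist)]
    candidates <- candidates[!is.na(candidates)]
    # assumption: on a tie between two non-missing values, keep the earlier one
    if (length(candidates) == 0) NA_real_ else candidates[1]
  }, numeric(1))
}

year <- c(2000, 2001, 2002, 2003, 2005, 2006, 2007, 2008, 2009, 2010)
x <- c(1, 2, 3, NA, 5, NA, NA, NA, 9, 10)
fill_nearest(x, year)
```

Taking only the first candidate keeps the result length one per row, so a tie between two non-missing neighbours does not raise an error.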

Answer 3

A data.table approach:

library(data.table)
setDT(dat)
# flag NAs whose previous and next values are both NA;
# these have no non-missing neighbour and must be reset to NA afterwards
dat[is.na(x) & is.na(shift(x, type = "lag")) & is.na(shift(x, type = "lead")), excl := "1"]
# rolling self-join on year: fill each NA from the nearest year with a non-NA x
dat[is.na(x), x := dat[!is.na(x), ][.SD, x, on = .(year), roll = "nearest"]]
# set x back to NA for the flagged rows, then remove the excl column
dat[excl == "1", x := NA][, excl := NULL][]
#    year  x
# 1: 2000  1
# 2: 2001  2
# 3: 2002  3
# 4: 2003  3
# 5: 2005  5
# 6: 2006  5
# 7: 2007 NA
# 8: 2008  9
# 9: 2009  9
#10: 2010 10
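
The reason for the flag step can be illustrated with a small base R sketch of the same three stages: flag, fill from the nearest known year, then reset. It is not the data.table code above; nearest_any, isolated and filled are illustrative names, and which.min() resolves ties by taking the first match, which is an assumption of this sketch only.

```r
year <- c(2000, 2001, 2002, 2003, 2005, 2006, 2007, 2008, 2009, 2010)
x <- c(1, 2, 3, NA, 5, NA, NA, NA, 9, 10)
n <- length(x)

# Stage 1: flag NAs whose previous and next rows are both NA
isolated <- is.na(x) & is.na(c(NA, x[-n])) & is.na(c(x[-1], NA))

# Stage 2: for every row, take x from the nearest year that has a known value
known <- !is.na(x)
nearest_any <- vapply(
  year,
  function(y) x[known][which.min(abs(year[known] - y))],
  numeric(1)
)

# Stage 3: put NA back on the flagged rows
filled <- nearest_any
filled[isolated] <- NA
filled
```

Before stage 3, the 2007 row has already been filled from a year two steps away, so without the flag it would not stay NA as the question requires.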

