Is there an R function for imputing missing year values, consecutively, by group?

My data frame looks like this:

df <- data.frame(ID=c("A", "A", "A", "A", 
                      "B", "B", "B", "B",
                      "C", "C", "C", "C",
                      "D", "D", "D", "D"),
                 grade=c("KG", "01", "02", "03",
                         "KG", "01", "02", "03",
                         "KG", "01", "02", "03",
                         "KG", "01", "02", "03"),
                 year=c(2002, 2003, NA, 2005,
                        2007, NA, NA, 2010,
                        NA, 2005, 2006, NA,
                        2009, 2010, NA, NA))

I would like to impute the missing year values within each ID and get the following expected result:

wanted_df <- data.frame(ID=c("A", "A", "A", "A", 
                             "B", "B", "B", "B",
                             "C", "C", "C", "C",
                             "D", "D", "D", "D"),
                       grade=c("KG", "01", "02", "03",
                               "KG", "01", "02", "03",
                               "KG", "01", "02", "03",
                               "KG", "01", "02", "03"),
                       year=c(2002, 2003, 2004, 2005,
                              2007, 2008, 2009, 2010,
                              2004, 2005, 2006, 2007,
                              2009, 2010, 2011, 2012))

I tried several approaches to impute the values, but none of them worked. Any help would be greatly appreciated. Thank you.

We can use na_interpolate()/na_extrapolate() from the threadr package:

library(dplyr)
# remotes::install_github("skgrange/threadr")
library(threadr)
df %>% 
   group_by(ID) %>% 
   mutate(year = na_extrapolate(na_interpolate(year))) %>%
   ungroup

Output:

# A tibble: 16 × 3
   ID    grade  year
   <chr> <chr> <dbl>
 1 A     KG    2002 
 2 A     01    2003 
 3 A     02    2004 
 4 A     03    2005 
 5 B     KG    2007 
 6 B     01    2008 
 7 B     02    2009 
 8 B     03    2010 
 9 C     KG    2004.
10 C     01    2005 
11 C     02    2006 
12 C     03    2007 
13 D     KG    2009 
14 D     01    2010 
15 D     02    2011 
16 D     03    2012.
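
The trailing dots in rows 9 and 16 suggest that the extrapolated values are doubles that are not exactly whole numbers. If integer years are needed, rounding before converting is one option. The sketch below uses made-up values (not the actual threadr output) to show the idea in base R:

```r
# Hypothetical values standing in for slightly inexact extrapolated years
year <- c(2003.9999999999, 2005, 2006, 2007.0000000001)

# Round first: as.integer() alone truncates, which would turn 2003.99... into 2003
year_int <- as.integer(round(year))
year_int
```

Inside the pipeline this would be an extra mutate(year = as.integer(round(year))) step after the imputation.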

You could do it like this:

year_imputer <- function(years){
    # Find one non-missing data-point
    ref_indx <- which(!is.na(years))[1]
    # Make it the reference point
    ref <- years[ref_indx]
    # Get the length of the years
    years_len <- length(years)
    # Generate the sequence
    (ref - (ref_indx - 1)):(ref+(years_len - ref_indx))
}
library(dplyr)
df %>% 
    group_by(ID) %>% 
    mutate(
        year = year_imputer(year)
    ) %>% 
    ungroup()

Output:

# A tibble: 16 x 3
   ID    grade  year
   <chr> <chr> <int>
 1 A     KG     2002
 2 A     01     2003
 3 A     02     2004
 4 A     03     2005
 5 B     KG     2007
 6 B     01     2008
 7 B     02     2009
 8 B     03     2010
 9 C     KG     2004
10 C     01     2005
11 C     02     2006
12 C     03     2007
13 D     KG     2009
14 D     01     2010
15 D     02     2011
16 D     03     2012
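
Note that year_imputer() rebuilds the whole sequence from the first non-missing value, so any other observed years in the group are overwritten without being checked. A minimal sketch of a stricter variant follows; the name year_imputer_checked is made up for illustration, and it assumes the rows of each group are already in grade order:

```r
# Hypothetical variant of year_imputer() that validates the observed years
year_imputer_checked <- function(years) {
  idx <- which(!is.na(years))
  # With no observed year there is nothing to anchor the sequence on
  if (length(idx) == 0) return(years)
  # Anchor on the first observed year and count up/down by position
  filled <- years[idx[1]] + seq_along(years) - idx[1]
  # Every observed year must agree with the consecutive sequence
  stopifnot(all(filled[idx] == years[idx]))
  filled
}

year_imputer_checked(c(NA, 2005, 2006, NA))
```

For group C this anchors on 2005 at position 2, so the sequence runs from 2005 - 1 to 2005 + 2. A group such as c(2002, NA, 2005) would stop with an error instead of silently returning 2002, 2003, 2004.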

This is not a solution I would recommend, but this question led me to ask a follow-up question of my own:

library(dplyr)
df %>% 
  group_by(ID) %>% 
  mutate(year= ifelse(is.na(year), lag(year)+1, year),
         year= ifelse(is.na(year), lag(year)+1, year),
         year= ifelse(is.na(year), lead(year)-1, year))

Output:

   ID    grade  year
   <chr> <chr> <dbl>
 1 A     KG     2002
 2 A     01     2003
 3 A     02     2004
 4 A     03     2005
 5 B     KG     2007
 6 B     01     2008
 7 B     02     2009
 8 B     03     2010
 9 C     KG     2004
10 C     01     2005
11 C     02     2006
12 C     03     2007
13 D     KG     2009
14 D     01     2010
15 D     02     2011
16 D     03     2012
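
One reason to be cautious with the repeated lag()/lead() passes above: each pass fills at most one position per run of NAs, so the number of passes has to match the longest run in the data. The base R sketch below illustrates this with a made-up vector; shift_down() and shift_up() are hypothetical stand-ins for dplyr's lag() and lead() with their default arguments:

```r
# Hypothetical base R stand-ins for dplyr::lag() and dplyr::lead()
shift_down <- function(x) c(NA, head(x, -1))
shift_up   <- function(x) c(tail(x, -1), NA)

# Made-up group with four trailing NAs (the example data has at most two)
x <- c(2000, NA, NA, NA, NA)

# The same three passes as in the answer above
x <- ifelse(is.na(x), shift_down(x) + 1, x)  # fills position 2
x <- ifelse(is.na(x), shift_down(x) + 1, x)  # fills position 3
x <- ifelse(is.na(x), shift_up(x) - 1, x)    # nothing after the gap to borrow from
x
```

Positions 4 and 5 are still NA after the three passes, whereas the example data only works because no group has more than two consecutive missing years.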

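You can use any (linear) interpolation function for your specific problem, for example from the imputeTS package, or from zoo and other packages. However, once you are working with POSIXct data types rather than plain numerics, this solution may no longer apply. Keep in mind that interpolation fills gaps between observed values; how leading or trailing NAs (as in groups C and D) are handled depends on the function, so check the result against the expected output.

It is also worth noting that this only works when each missing year is actually present as an NA value (rather than simply being left out). For that case, known as implicit missing values (years that were left out), the tsibble package has a function called fill_gaps().
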
library("imputeTS")
library("dplyr")

df <- data.frame(ID=c("A", "A", "A", "A", 
                      "B", "B", "B", "B",
                      "C", "C", "C", "C",
                      "D", "D", "D", "D"),
                 grade=c("KG", "01", "02", "03",
                         "KG", "01", "02", "03",
                         "KG", "01", "02", "03",
                         "KG", "01", "02", "03"),
                 year=c(2002, 2003, NA, 2005,
                        2007, NA, NA, 2010,
                        NA, 2005, 2006, NA,
                        2009, 2010, NA, NA))

df %>% group_by(ID) %>% mutate(year = na_interpolation(year)) %>% ungroup
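
To illustrate the implicit-missing case mentioned above without relying on tsibble, here is a minimal base R sketch. The data frame obs and the column observed are made up for illustration; the idea is to build the full year range per ID and flag which rows were actually present:

```r
# Hypothetical data: the missing years have no row at all (no NA placeholder)
obs <- data.frame(ID   = c("A", "A", "A", "B", "B"),
                  year = c(2002, 2003, 2005, 2007, 2010))

# For each ID, generate every year between its first and last observed year
full <- do.call(rbind, lapply(split(obs, obs$ID), function(g) {
  data.frame(ID = g$ID[1], year = seq(min(g$year), max(g$year)))
}))
rownames(full) <- NULL

# Flag the rows that existed in the original data
full$observed <- paste(full$ID, full$year) %in% paste(obs$ID, obs$year)
full
```

The rows where observed is FALSE are the ones that were implicit gaps. Note that this only recovers years between the first and last observed year of each ID; years missing before or after that range cannot be detected this way.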

Another solution:

library(dplyr)

df %>% 
  group_by(ID) %>% 
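  # Offset each row's position from the position of the smallest observed year;
  # seq_len() avoids the 1:0 pitfall when the minimum sits in the last row
  mutate(year = min(year, na.rm = TRUE) + seq_len(n()) - which.min(year)) %>% 
  ungroup()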

#> # A tibble: 16 × 3
#>    ID    grade  year
#>    <chr> <chr> <dbl>
#>  1 A     KG     2002
#>  2 A     01     2003
#>  3 A     02     2004
#>  4 A     03     2005
#>  5 B     KG     2007
#>  6 B     01     2008
#>  7 B     02     2009
#>  8 B     03     2010
#>  9 C     KG     2004
#> 10 C     01     2005
#> 11 C     02     2006
#> 12 C     03     2007
#> 13 D     KG     2009
#> 14 D     01     2010
#> 15 D     02     2011
#> 16 D     03     2012