Replace NA with values from previous date

I have a data frame of dated observations like this, with about 1 million rows:

  id       date   variable
1  1 2015-01-01         NA
2  1 2015-01-02 -1.1874087
3  1 2015-01-03 -0.5936396
4  1 2015-01-04 -0.6131957
5  1 2015-01-05  1.0291688
6  1 2015-01-06 -1.5810152

A reproducible example is here:

#create example data set
Df <- data.frame(id = factor(rep(1:3, each = 10)), 
     date = rep(seq.Date(from = as.Date('2015-01-01'), 
             to = as.Date('2015-01-10'), by = 1),3),
     variable = rnorm(30))
Df$variable[c(1,7,12,18,22,23,29)] <- NA

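What I want to do is replace the NA values in variable with the value from the previous date for each id. I wrote a loop, but it runs very slowly (you can find it below). Could you suggest a fast alternative for this task? Thank you!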

library(dplyr)

#create new variable
Df$variableNew <- Df$variable
#order data frame by date
Df <- arrange(Df, date)
#create row numbers vector (after sorting, so n matches the row positions)
Df$n <- seq_len(nrow(Df))


for (id in levels(Df$id)){
    I <- Df$n[Df$id == id] # create vector of rows for specific id

    for (row in seq_along(I)){
        # if variable is NA on the first date, change it to the overall mean value
        if (is.na(Df$variableNew[I[1]])) {
            Df$variableNew[I[row]] <- mean(Df$variable, na.rm = TRUE)
        }
        # if variable is NA, assign to this date the value from the previous date
        if (is.na(Df$variableNew[I[row]])){
            Df$variableNew[I[row]] <- Df$variableNew[I[row-1]]
        }
    }
}

If you get the development version of tidyr (0.3.0), available on GitHub, there is a function fill that does exactly this:

#devtools::install_github("hadley/tidyr")
library(tidyr)
library(dplyr)
Df %>% group_by(id) %>% 
       fill(variable)

It won't fill the first value, though; we can handle that with mutate and replace:

Df %>% group_by(id) %>%
       mutate(variable = ifelse(is.na(variable) & row_number()==1, 
                                replace(variable, 1, mean(variable, na.rm = TRUE)),
                                variable)) %>% 
       fill(variable)
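
If tidyr is not available, the same two steps (seed a leading NA with the group mean, then carry values forward) can be sketched in base R with ave(). This is only an illustration: the toy data and the helper fill_group are made up for the example, and it assumes the rows are already in date order within each id and that every id has at least one non-NA value.

```r
# Toy data (hypothetical values): two ids, the first starting with an NA
toy <- data.frame(id = factor(c(1, 1, 1, 2, 2, 2)),
                  variable = c(NA, 2, NA, 5, NA, 7))

# Hypothetical helper: seed a leading NA with the group mean,
# then carry the last observed value forward
fill_group <- function(x) {
  if (is.na(x[1])) x[1] <- mean(x, na.rm = TRUE)
  idx <- cumsum(!is.na(x))   # index of the most recent non-NA value
  x[!is.na(x)][idx]          # look that value up for every position
}

# ave() applies fill_group within each id and keeps the original row order
toy$variable <- ave(toy$variable, toy$id, FUN = fill_group)
toy
```

Because the helper is fully vectorised within each group, it avoids the row-by-row loop from the question.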

This data.table solution should be very fast.

library(zoo)         # for na.locf(...)
library(data.table)
setDT(Df)[,variable:=na.locf(variable, na.rm=FALSE),by=id]
Df[,variable:=if (is.na(variable[1])) c(mean(variable,na.rm=TRUE),variable[-1]) else variable,by=id]
Df
#     id       date     variable
#  1:  1 2015-01-01 -0.288720759
#  2:  1 2015-01-02 -0.005344028
#  3:  1 2015-01-03  0.707310667
#  4:  1 2015-01-04  1.034107735
#  5:  1 2015-01-05  0.223480415
#  6:  1 2015-01-06 -0.878707613
#  7:  1 2015-01-07 -0.878707613
#  8:  1 2015-01-08 -2.000164945
#  9:  1 2015-01-09 -0.544790740
# 10:  1 2015-01-10 -0.255670709
# ...

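So this replaces all embedded NAs using locf by id, and then makes a second pass that replaces any leading NA with the mean of variable for that id. Note that if you do this in the reverse order, you may get a different answer.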
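
To see why the order matters, here is a small base R sketch on a single made-up vector. The helpers locf and fill_first are hypothetical stand-ins for na.locf(..., na.rm=FALSE) and the leading-NA replacement above, written out so the example runs without extra packages.

```r
# Hypothetical stand-in for zoo::na.locf(x, na.rm = FALSE):
# carry the last observed value forward, leaving leading NAs as NA
locf <- function(x) {
  idx <- cumsum(!is.na(x))        # 0 before the first non-NA value
  c(NA, x[!is.na(x)])[idx + 1]    # slot 1 is NA, so idx == 0 maps to NA
}

# Hypothetical stand-in for the second pass: replace a leading NA with the mean
fill_first <- function(x) {
  if (is.na(x[1])) x[1] <- mean(x, na.rm = TRUE)
  x
}

x <- c(NA, 1, NA, 3)

# locf first: the mean is taken over the filled values 1, 1, 3
locf_then_mean <- fill_first(locf(x))

# mean first: the mean is taken over the observed values 1, 3
mean_then_locf <- locf(fill_first(x))

locf_then_mean
mean_then_locf
```

In the first order the carried-forward value is counted twice in the mean, so the leading value becomes 5/3; in the second order it becomes 2. Which one is appropriate depends on whether filled values should influence the mean.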