data.table 按组填充其他行的缺失值

data.table fill missing values from other rows by group

# have
> aDT <- data.table(colA = c(1,1,1,1,2,2,2,2,3,3,3,3), colB = c(4,NA,NA,1,4,3,NA,NA,4,NA,2,NA))
> aDT
    colA colB
 1:    1    4
 2:    1   NA
 3:    1   NA
 4:    1    1
 5:    2    4
 6:    2    3
 7:    2   NA
 8:    2   NA
 9:    3    4
10:    3   NA
11:    3    2
12:    3   NA
# want
> bDT <- data.table(colA = c(1,1,1,1,2,2,2,2,3,3,3,3), colB = c(4,1,1,1,4,3,3,3,4,2,2,2))
> bDT
    colA colB
 1:    1    4
 2:    1    1
 3:    1    1
 4:    1    1
 5:    2    4
 6:    2    3
 7:    2    3
 8:    2    3
 9:    3    4
10:    3    2
11:    3    2
12:    3    2

想根据以下算法填充缺失值: 在每个组内 ('colA'),

  1. 使用下面一行的值,如果它仍然是 NA,则继续直到该组中的最后一行
  2. 如果下面的行中的所有 NA,请查看上面的行(一次向上 1 行)
  3. 如果所有 NA,则 NA

由于数据集很大,算法效率是考虑的一部分。不确定是否已经有此类操作的软件包。怎么做?

这是使用 tidyversezoo::na.locf 的一种方法:

library(tidyverse);
library(zoo);
df %>%
    group_by(colA) %>%
    arrange(colA) %>%
    mutate(colB = na.locf(colB, na.rm = F, fromLast = TRUE)) %>%
    mutate(colB = na.locf(colB, na.rm = F));
## A tibble: 12 x 2
## Groups:   colA [3]
#    colA  colB
#   <dbl> <dbl>
# 1  1.00  4.00
# 2  1.00  1.00
# 3  1.00  1.00
# 4  1.00  1.00
# 5  2.00  4.00
# 6  2.00  3.00
# 7  2.00  3.00
# 8  2.00  3.00
# 9  3.00  4.00
#10  3.00  2.00
#11  3.00  2.00
#12  3.00  2.00

或者data.table方式:

library(data.table);
dt[, .(na.locf(na.locf(colB, na.rm = F, fromLast = T), na.rm = F)), by = .(colA)];
#    colA V1
# 1:    1  4
# 2:    1  1
# 3:    1  1
# 4:    1  1
# 5:    2  4
# 6:    2  3
# 7:    2  3
# 8:    2  3
# 9:    3  4
#10:    3  2
#11:    3  2
#12:    3  2

这两种情况的关键是两次应用na.locf:首先从底部替换NA,然后从顶部替换剩余的NA


示例数据

# As data.frame
df <- data.frame(colA = c(1,1,1,1,2,2,2,2,3,3,3,3), colB = c(4,NA,NA,1,4,3,NA,NA,4,NA,2,NA));
# As data.table
dt <- data.table(colA = c(1,1,1,1,2,2,2,2,3,3,3,3), colB = c(4,NA,NA,1,4,3,NA,NA,4,NA,2,NA));

data.tablezoo:

library(data.table)
library(zoo)

# Last observation carried forward from last row of group
dt <- dt[, colB := na.locf0(colB, fromLast = TRUE), by = colA]

# Last observation carried forward for first row of group
dt[, colB := na.locf(colB), by = colA][]

或在单链中:

dt[, colB := na.locf0(colB, fromLast = TRUE), by = colA][
   , colB := na.locf(colB), by = colA][]

两个return:

    colA colB
 1:    1    4
 2:    1    1
 3:    1    1
 4:    1    1
 5:    2    4
 6:    2    3
 7:    2    3
 8:    2    3
 9:    3    4
10:    3    2
11:    3    2
12:    3    2

数据:

text <- "colA colB
    1    4
    1   NA
    1   NA
    1    1
    2    4
    2    3
    2   NA
    2   NA
    3    4
    3   NA
    3    2
    3   NA"

dt <- fread(input = text, stringsAsFactors = FALSE)
library(tidyverse)

aDT%>%group_by(colA)%>%fill(colB,.direction="up")%>%fill(colB)
# A tibble: 12 x 2
# Groups:   colA [3]
    colA  colB
   <dbl> <dbl>
 1     1     4
 2     1     1
 3     1     1
 4     1     1
 5     2     4
 6     2     3
 7     2     3
 8     2     3
 9     3     4
10     3     2
11     3     2
12     3     2