Fill up a numeric variable in data.table efficiently by id

Question

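As a natural result of data.table's subsetting logic in i, I often end up with a variable that is defined for only part of each id's rows (e.g. "total economic crisis before 2007" per country is calculated on data < 2007, so every later row is NA). Here is a slightly more general example: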

library("data.table")
Data <- data.table(id = rep(c(1,2,3), each = 4),
                   variable =c(3,3,NA,NA,NA,NA,4,NA,NA,NA,NA,NA))

When I later need this variable defined on the entire dataset, I want to fill in the NAs by group. I usually use max by group:

Data[, variable_full := max(variable, na.rm = TRUE), by = id]
Data[variable_full == -Inf, variable_full := NA] # max() of an all-NA group gives -Inf (with a warning); reset it to NA

However, for whatever reason, this takes a very long time on large datasets. Is there a more efficient, more data.table-like way to do this?

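Edit: the "large dataset" currently has 8 million observations, and this step stalls my workflow because it takes several minutes. Other data.table operations take only seconds, because data.table is awesome.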

Answer 1

Maybe a join?

Data[, variable_full := variable]
Data[is.na(variable), variable_full := Data[!is.na(variable), 
                                       max(variable), 
                                       by = .(id)][Data[is.na(variable), ], V1, on = .(id)]][]

A (slightly) shorter version of the line with the join is

Data[is.na(variable), variable_full := Data[!is.na(variable), max(variable), by = .(id)][.SD, V1, on = .(id)]]

Here, the [Data[is.na(variable), ], part has been replaced with [.SD, because .SD already holds the rows selected by i (at the start of the line)...
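
A variant of the same idea, as a minimal sketch rather than a tested recommendation: aggregate the non-missing rows once, then write the result back with an update join. The helper table name `maxima` is made up for illustration. Unlike the lines above, this overwrites every row of an id with the group maximum (as the max-by-group code in the question does), and ids that have no non-missing value simply stay NA, so there is no -Inf to clean up.

```r
library(data.table)

Data <- data.table(id = rep(c(1, 2, 3), each = 4),
                   variable = c(3, 3, NA, NA, NA, NA, 4, NA, NA, NA, NA, NA))

# Per-id maximum over the non-missing rows only;
# ids whose values are all NA do not appear in this table.
maxima <- Data[!is.na(variable), .(variable_full = max(variable)), by = id]

# Update join: matched ids receive their maximum,
# unmatched ids (here id 3) are left as NA in the new column.
Data[maxima, variable_full := i.variable_full, on = "id"]

print(Data)
```

On the example data this gives 3 for id 1, 4 for id 2, and NA for id 3.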

Answer 2

If you are willing to install the collapse package, I think this would be faster:

library(collapse)
Data = Data |> gby(id) |> fmutate(variable_full=fmax(variable)) |> setDT()

gby is 'group by' and fmutate is 'fast mutate'. The output is a grouped data frame by default, so setDT is needed at the end.
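
To check which approach is actually faster on your machine, a rough timing harness along these lines may help. It is only a sketch on synthetic data: the size `n_ids`, the column names `full_max` and `full_by_join`, and the share of all-NA ids are assumptions, not the asker's data. It compares the max-by-group code from the question with an aggregate-then-join variant; the collapse line can be timed the same way if the package is installed.

```r
library(data.table)

set.seed(1)
n_ids <- 1e4  # hypothetical size; increase to approach the 8 million rows mentioned

# Synthetic data: 4 rows per id, every second id entirely NA
DT <- data.table(id = rep(seq_len(n_ids), each = 4))
DT[, variable := as.numeric(sample(c(1:5, NA), .N, replace = TRUE))]
DT[id %% 2 == 0, variable := NA]

# Approach from the question: max by group, then reset -Inf to NA
t_max <- system.time({
  suppressWarnings(DT[, full_max := max(variable, na.rm = TRUE), by = id])
  DT[is.infinite(full_max), full_max := NA]
})

# Aggregate the non-missing rows once, then update join by id
t_join <- system.time({
  maxima <- DT[!is.na(variable), .(m = max(variable)), by = id]
  DT[maxima, full_by_join := i.m, on = "id"]
})

# Both columns should agree, including the NA positions
stopifnot(isTRUE(all.equal(DT$full_max, DT$full_by_join)))

print(rbind(max_by_group = t_max, update_join = t_join))
```

Timings depend heavily on the data.table version and on how many groups are entirely NA, so treat the numbers as indicative only.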


r

data.table