Multivariate deterministic regression imputation via mice leads to unstable results

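The R package mice offers deterministic regression imputation via method = "norm.predict". Because deterministic regression imputation adds no noise to the imputed values, I would expect the imputed values to be identical no matter which seed I use. For univariate missingness this seems to hold. However, when I impute multivariate missing values, the results are inconsistent. The following reproducible example illustrates the problem: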

library("mice")

# Example 1: Univariate missings (works fine)
data1 <- data.frame(x1 = c(NA, NA, NA, 8, 5, 1, 7, 4),
                    x2 = c(2, 13, 12, 5, 6, 6, 1, 2),
                    x3 = c(4, 7, 4, 5, 1, 2, 7, 3))

# Impute univariate missings
imp <- mice(data1, method = "norm.predict", m = 1)
complete(imp) # Always the same result


# Example 2: Multivariate missings (leads to inconsistent imputations)
data2 <- data1
data2[4, 2] <- NA

# Impute multivariate missings
imp1 <- mice(data2, method = "norm.predict", m = 1, seed = 111)
imp2 <- mice(data2, method = "norm.predict", m = 1, seed = 222)

# Results are different
complete(imp1)
complete(imp2)

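Question: Why is multivariate deterministic regression imputation in mice inconsistent?

Have a look at the description of the data.init argument in ?mice: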

data.init A data frame of the same size and type as data, without missing data, used to initialize imputations before the start of the iterative process. The default NULL implies that starting imputation are created by a simple random draw from the data. Note that specification of data.init will start the m Gibbs sampling streams from the same imputations.

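That is where the randomness comes from. It does not come from the norm.predict method itself, which, as you say, is fully deterministic. (You can confirm this by typing mice.impute.norm.predict in the console to view the method.) With univariate missingness the predictors x2 and x3 are complete, so the random starting values for x1 never enter a regression and the result does not depend on the seed. With multivariate missingness, the regression for x1 uses the currently imputed value of x2 and vice versa, so after a finite number of iterations the result still depends on where the chain started.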
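One way to look at the starting values directly, as a rough sketch: running mice with maxit = 0 performs no iterations, so the stored imputations are just the initial draws. The object names init1 and init2 are made up for this illustration, and data2 is rebuilt here so the snippet runs on its own.

```r
library("mice")

# Same data as data2 in the question (x2 is missing in row 4)
data2 <- data.frame(x1 = c(NA, NA, NA, 8, 5, 1, 7, 4),
                    x2 = c(2, 13, 12, NA, 6, 6, 1, 2),
                    x3 = c(4, 7, 4, 5, 1, 2, 7, 3))

# maxit = 0: no iterations are run, so $imp holds only the starting values,
# which by default are drawn at random from the observed values of each column
init1 <- mice(data2, method = "norm.predict", m = 1, maxit = 0, seed = 111)
init2 <- mice(data2, method = "norm.predict", m = 1, maxit = 0, seed = 222)

# Compare the starting values for x1 under the two seeds
init1$imp$x1
init2$imp$x1
```

Each starting value is one of the observed x1 values (8, 5, 1, 7, 4); which ones are picked depends on the seed.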

So, to avoid the random draw, we have to supply data.init to mice:

# Starting values: a copy of data2 with every NA replaced by 1 (an arbitrary fixed value)
data.init <- data2
for (i in seq_len(ncol(data.init))) data.init[, i][is.na(data.init[, i])] <- 1

imp1 <- mice(data2, method = "norm.predict", m = 1, data.init = data.init, seed = 111)
imp2 <- mice(data2, method = "norm.predict", m = 1, data.init = data.init, seed = 222)

# Results are the same
complete(imp1)
complete(imp2)
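Instead of comparing the two completed data sets by eye, the check can be done programmatically. The sketch below also swaps the constant 1 for column means as starting values; that choice is an assumption of this example, not something mice requires, and the names init_means, imp_a and imp_b are hypothetical. Any fixed, complete data frame of the same shape should serve the same purpose.

```r
library("mice")

# Same data as data2 in the question (x2 is missing in row 4)
data2 <- data.frame(x1 = c(NA, NA, NA, 8, 5, 1, 7, 4),
                    x2 = c(2, 13, 12, NA, 6, 6, 1, 2),
                    x3 = c(4, 7, 4, 5, 1, 2, 7, 3))

# Fixed starting values: replace each column's NAs with that column's observed mean
init_means <- as.data.frame(lapply(data2, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
}))

imp_a <- mice(data2, method = "norm.predict", m = 1,
              data.init = init_means, seed = 111, printFlag = FALSE)
imp_b <- mice(data2, method = "norm.predict", m = 1,
              data.init = init_means, seed = 222, printFlag = FALSE)

# Compare the completed data sets; with fixed starting values the seed
# should no longer matter
all.equal(complete(imp_a), complete(imp_b))
```

Note that different choices of data.init can still give different imputations from each other; fixing data.init only removes the dependence on the seed.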