如何使用鼠标对纵向数据中的缺失值进行多重插补?

How to use mice for multiple imputation of missing values in longitudinal data?

我有一个数据集,其中包含重复测量的连续结果和一些不同 类 的协变量,如下例所示。

 Id    y       Date          Soda    Team
 1    -0.4521  1999-02-07    Coke    Eagles 
 1     0.2863  1999-04-15    Pepsi   Raiders
 2     0.7956  1999-07-07    Coke    Raiders
 2    -0.8248  1999-07-26    NA      Raiders 
 3     0.8830  1999-05-29    Pepsi   Eagles 
 4     0.1303  2005-03-04    NA      Cowboys
 5     0.1375  2013-11-02    Coke    Cowboys
 5     0.2851  2015-06-23    Coke    Eagles 
 5    -0.3538  2015-07-29    Pepsi   NA 
 6     0.3349  2002-10-11    NA      NA
 7    -0.1756  2005-01-11    Pepsi   Eagles
 7     0.5507  2007-10-16    Pepsi   Cowboys
 7     0.5132  2012-07-13    NA      Cowboys
 7    -0.5776  2017-11-25    Coke    Cowboys
 8     0.5486  2009-02-08    Coke    Cowboys

我正在尝试使用 mice 包来乘以 SodaTeam 中的缺失值。据我了解,因为MI不是因果模型,所以没有因变量和自变量的概念。我不确定如何使用 mice 设置此 MI 进程。我喜欢其他人在像这样的重复测量设置中遇到缺失数据的一些建议或建议,以及他们如何使用鼠标来解决这个问题。提前致谢。

编辑

这是我迄今为止尝试过的方法,但这并没有捕获数据集的重复测量部分。

library(mice)

init = mice(dat, maxit=0)

methd = init$method

predM = init$predictorMatrix

methd [c("Soda")]="logreg"; 
methd [c("Team")]="logreg";  

imputed = mice(data, method=methd , predictorMatrix=predM, m=5)

有多种选择可以实现您的要求。我决定以所谓的 'wide' 格式估算协变量中的缺失值。我将通过以下工作示例来说明这一点,您可以轻松地将其应用于您自己的数据。

我们先做一个reprex。在这里,我使用 JM 包中附带的纵向 Mayo Clinic 原发性胆汁性肝硬化数据 (pbc2)。此数据以所谓的 'long' 格式组织,这意味着每个患者 i 有多个行,每行包含按时测量的变量 x 的测量值j。您的数据集也是长格式的。在这个例子中,我假设 pbc2$serBilir 是我们的结果变量。

# install.packages('JM')
library(JM)

# note: use function(x) instead of \(x) if you use a version of R <4.1.0

# missing values per column
miss_abs <- \(x) sum(is.na(x)) 
miss_perc <- \(x) round(sum(is.na(x)) / length(x) * 100, 1L)
miss <- cbind('Number' = apply(pbc2, 2, miss_abs), '%' = apply(pbc2, 2, miss_perc))
# --------------------------------
> miss[which(miss[, 'Number'] > 0),]
             Number    %
ascites          60  3.1
hepatomegaly     61  3.1
spiders          58  3.0
serChol         821 42.2
alkaline         60  3.1
platelets        73  3.8

根据此输出,pbc2 中的 6 个变量至少包含一个缺失值。让我们从中选择 alkaline。我们还需要患者 id 和时间变量 years.

# subset
pbc_long <- subset(pbc2, select = c('id', 'years', 'alkaline', 'serBilir'))

# sort ascending based on id and, within each id, years
pbc_long <- with(pbc_long, pbc_long[order(id, years), ])
# ------------------------------------------------------
> head(pbc_long, 5)
  id    years alkaline serBilir
1  1  1.09517     1718     14.5
2  1  1.09517     1612     21.3
3  2 14.15234     7395      1.1
4  2 14.15234     2107      0.8
5  2 14.15234     1711      1.0

通过快速观察,我们观察到 years 在受试者中似乎没有差异,即使重复测量了变量。为了这个例子,让我们为 years 的所有行添加一点时间,但第一次测量。

set.seed(1)

# add little bit of time to each row of 'years' but the first row
new_years <- lapply(split(pbc_long, pbc_long$id), \(x) {
  add_time <- 1:(length(x$years) - 1L) + rnorm(length(x$years) - 1L, sd = 0.25)
  c(x$years[1L], x$years[-1L] + add_time)
})
# replace the original 'years' variable
pbc_long$years <- unlist(new_years)

# integer time variable needed to store repeated measurements as separate columns
pbc_long$measurement_number <- unlist(sapply(split(pbc_long, pbc_long$id), \(x) 1:nrow(x)))

# only keep the first 4 repeated measurements per patient
pbc_long <- subset(pbc_long, measurement_number %in% 1:4)

因为我们将以宽格式执行多重插补(这意味着每个参与者 i 有一行并且 x 上的重复测量存储在 j个不同的列,所以总共xj列),我们要把数据从长数据转换成宽数据。现在我们已经准备好数据,我们可以使用 reshape 为我们做这件事。

# convert long format into wide format
v_names <- c('years', 'alkaline', 'serBilir')
pbc_wide <- reshape(pbc_long,
                    idvar = 'id',
                    timevar = "measurement_number",
                    v.names = v_names, direction = "wide")
# -----------------------------------------------------------------
> head(pbc_wide, 4)[, 1:9]
   id   years.1 alkaline.1 serBilir.1   years.2 alkaline.2 serBilir.2   years.3 alkaline.3
1   1  1.095170       1718       14.5  1.938557       1612       21.3        NA         NA
3   2 14.152338       7395        1.1 15.198249       2107        0.8 15.943431       1711
12  3  2.770781        516        1.4  3.694434        353        1.1  5.148726        218
16  4  5.270507       6122        1.8  6.115197       1175        1.6  6.716832       1157

现在让我们乘以协变量中的缺失值。

library(mice)

# Setup-run
ini <- mice(pbc_wide, maxit = 0)
meth <- ini$method
pred <- ini$predictorMatrix
visSeq <- ini$visitSequence

# avoid collinearity issues by letting only variables measured
# at the same point in time predict each other
pred[grep("1", rownames(pred), value = TRUE),
     grep("2|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("2", rownames(pred), value = TRUE),
     grep("1|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("3", rownames(pred), value = TRUE),
     grep("1|2|4", colnames(pred), value = TRUE)] <- 0
pred[grep("4", rownames(pred), value = TRUE),
     grep("1|2|3", colnames(pred), value = TRUE)] <- 0

# variables that should not be imputed
pred[c("id", grep('^year', names(pbc_wide), value = TRUE)), ] <- 0
# variables should not serve as predictors
pred[, c("id", grep('^year', names(pbc_wide), value = TRUE))] <- 0

# multiply imputed missing values ------------------------------
imp <- mice(pbc_wide, pred = pred, m = 10, maxit = 20, seed = 1)
# Time difference of 2.899244 secs

如以下三个示例跟踪图所示(可以使用 plot(imp) 获得),算法收敛得很好。请参阅 Stef van Buuren 的书的 this section 了解更多关于收敛的信息.

现在我们需要将乘法插补数据(宽格式)转换回长格式,以便我们可以将其用于分析。我们还需要确保我们排除了结果变量 serBilir 缺失值的所有行,因为我们不想使用结果的估算值。

# need unlisted data
implong <- complete(imp, 'long', include = FALSE)

# 'smart' way of getting all the names of the repeated variables in a usable format
v_names <- as.data.frame(matrix(apply(
  expand.grid(grep('ye|alk|ser', names(implong), value = TRUE)),
  1, paste0, collapse = ''), nrow = 4, byrow = TRUE), stringsAsFactors = FALSE)
names(v_names) <- names(pbc_long)[2:4]

# convert back to long format
longlist <- lapply(split(implong, implong$.imp),
                   reshape, direction = 'long',
                   varying = as.list(v_names),
                   v.names = names(v_names),
                   idvar = 'id', times = 1:4)

# logical that is TRUE if our outcome was not observed
# which should be based on the original, unimputed data
orig_data <- reshape(imp$data, direction = 'long',
                     varying = as.list(v_names),
                     v.names = names(v_names),
                     idvar = 'id', times = 1:4)
orig_data$logical <- is.na(orig_data$serBilir)

# merge into the list of imputed long-format datasets:
longlist <- lapply(longlist, merge, y = subset(orig_data, select = c(id, time, logical)))

# exclude rows for which logical == TRUE
longlist <- lapply(longlist, \(x) subset(x, !logical))

最后,使用 miceadds 包中的 datalist2midslonglist 转换回 mids

imp <- miceadds::datalist2mids(longlist)
# ----------------
> imp$loggedEvents
NULL