R 中 MI 数据的描述性统计:Take 3
Descriptive stats for MI data in R: Take 3
作为 R 初学者,我发现很难弄清楚如何计算乘法估算数据的描述性统计(比 运行 其他一些基本分析更难,例如相关性和回归) .
这些类型的问题以道歉(Descriptive statistics (Means, StdDevs) using multiply imputed data: R) but have have not been answered (https://stats.stackexchange.com/questions/296193/pooling-basic-descriptives-from-several-multiply-imputed-datasets-using-mice)开头,或者很快投反对票。
这里是 miceadds 函数的描述(https://www.rdocumentation.org/packages/miceadds/versions/2.10-14/topics/stats0),我发现很难理解以 mids 格式存储的数据。
我已经使用 summary(complete(imp)) 获得了一些输出,例如平均值、中位数、最小值、最大值,但很想知道如何获得额外的汇总输出(例如,skew/kurtosis、标准差, 方差).
从上一张海报借来的插图:
> imp <- mice(nhanes, seed = 23109)
iter imp variable
1 1 bmi hyp chl
1 2 bmi hyp chl
1 3 bmi hyp chl
1 4 bmi hyp chl
1 5 bmi hyp chl
2 1 bmi hyp chl
2 2 bmi hyp chl
2 3 bmi hyp chl
> summary(complete(imp))
age bmi hyp chl
1:12 Min. :20.40 1:18 Min. :113
2: 7 1st Qu.:24.90 2: 7 1st Qu.:186
3: 6 Median :27.40 Median :199
Mean :27.37 Mean :194
3rd Qu.:30.10 3rd Qu.:218
Max. :35.30 Max. :284
有人愿意花时间说明如何使用 mids 对象来获得基本描述吗?
以下是您可以执行的一些步骤,以更好地理解每一步之后 R 对象会发生什么。我还建议您查看本教程:
https://gerkovink.github.io/miceVignettes/
library(mice)
# nhanes object is just a simple dataframe:
data(nhanes)
str(nhanes)
#'data.frame': 25 obs. of 4 variables:
# $ age: num 1 2 1 3 1 3 1 1 2 2 ...
#$ bmi: num NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
#$ hyp: num NA 1 1 NA 1 NA 1 1 1 NA ...
#$ chl: num NA 187 187 NA 113 184 118 187 238 NA ...
# you can generate multivariate imputation using mice() function
imp <- mice(nhanes, seed=23109)
#The output variable is an object of class "mids" which you can explore using str() function
str(imp)
# List of 17
# $ call : language mice(data = nhanes)
# $ data :'data.frame': 25 obs. of 4 variables:
# ..$ age: num [1:25] 1 2 1 3 1 3 1 1 2 2 ...
# ..$ bmi: num [1:25] NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
# ..$ hyp: num [1:25] NA 1 1 NA 1 NA 1 1 1 NA ...
# ..$ chl: num [1:25] NA 187 187 NA 113 184 118 187 238 NA ...
# $ m : num 5
# ...
# $ imp :List of 4
#..$ age: NULL
#..$ bmi:'data.frame': 9 obs. of 5 variables:
#.. ..$ 1: num [1:9] 28.7 30.1 22.7 24.9 30.1 35.3 27.5 29.6 33.2
#.. ..$ 2: num [1:9] 27.2 30.1 27.2 25.5 29.6 26.3 26.3 30.1 30.1
#.. ..$ 3: num [1:9] 22.5 30.1 20.4 22.5 27.4 22 26.3 27.4 35.3
#.. ..$ 4: num [1:9] 27.2 22 22.7 21.7 25.5 27.2 24.9 30.1 22
#.. ..$ 5: num [1:9] 28.7 28.7 20.4 21.7 25.5 22.5 22.5 25.5 22.7
#...
#You can extract individual components of this object using $, for example
#To view the actual imputation for bmi column
imp$imp$bmi
# 1 2 3 4 5
# 1 28.7 27.2 22.5 27.2 28.7
# 3 30.1 30.1 30.1 22.0 28.7
# 4 22.7 27.2 20.4 22.7 20.4
# 6 24.9 25.5 22.5 21.7 21.7
# 10 30.1 29.6 27.4 25.5 25.5
# 11 35.3 26.3 22.0 27.2 22.5
# 12 27.5 26.3 26.3 24.9 22.5
# 16 29.6 30.1 27.4 30.1 25.5
# 21 33.2 30.1 35.3 22.0 22.7
# The above output is again just a regular dataframe:
str(imp$imp$bmi)
# 'data.frame': 9 obs. of 5 variables:
# $ 1: num 28.7 30.1 22.7 24.9 30.1 35.3 27.5 29.6 33.2
# $ 2: num 27.2 30.1 27.2 25.5 29.6 26.3 26.3 30.1 30.1
# $ 3: num 22.5 30.1 20.4 22.5 27.4 22 26.3 27.4 35.3
# $ 4: num 27.2 22 22.7 21.7 25.5 27.2 24.9 30.1 22
# $ 5: num 28.7 28.7 20.4 21.7 25.5 22.5 22.5 25.5 22.7
# complete() function returns imputed dataset:
mat <- complete(imp)
# The output of this function is a regular data frame:
str(mat)
# 'data.frame': 25 obs. of 4 variables:
# $ age: num 1 2 1 3 1 3 1 1 2 2 ...
# $ bmi: num 28.7 22.7 30.1 22.7 20.4 24.9 22.5 30.1 22 30.1 ...
# $ hyp: num 1 1 1 2 1 2 1 1 1 1 ...
# $ chl: num 199 187 187 204 113 184 118 187 238 229 ...
# So you can run any descriptive statistics you need with this object
# Just like you would do with a regular dataframe:
> summary(mat)
# age bmi hyp chl
# Min. :1.00 Min. :20.40 Min. :1.00 Min. :113.0
# 1st Qu.:1.00 1st Qu.:24.90 1st Qu.:1.00 1st Qu.:187.0
# Median :2.00 Median :27.50 Median :1.00 Median :204.0
# Mean :1.76 Mean :27.48 Mean :1.24 Mean :204.9
# 3rd Qu.:2.00 3rd Qu.:30.10 3rd Qu.:1.00 3rd Qu.:229.0
# Max. :3.00 Max. :35.30 Max. :2.00 Max. :284.0
您的代码和 Katia 的回答都存在一些错误,并且 Katia 提供的 link 不再可用。
要在多重插补后计算简单的统计数据,您必须遵循鲁宾法则,这是在老鼠身上使用的方法,用于选择一组模型拟合。
使用时
library(mice)
imp <- mice(nhanes, seed = 23109)
mat <- complete(imp)
mat
age bmi hyp chl
1 1 28.7 1 199
2 2 22.7 1 187
3 1 30.1 1 187
4 3 22.7 2 204
5 1 20.4 1 113
6 3 24.9 2 184
7 1 22.5 1 118
8 1 30.1 1 187
9 2 22.0 1 238
10 2 30.1 1 229
11 1 35.3 1 187
12 2 27.5 1 229
13 3 21.7 1 206
14 2 28.7 2 204
15 1 29.6 1 238
16 1 29.6 1 238
17 3 27.2 2 284
18 2 26.3 2 199
19 1 35.3 1 218
20 3 25.5 2 206
21 1 33.2 1 238
22 1 33.2 1 229
23 1 27.5 1 131
24 3 24.9 1 284
25 2 27.4 1 186
您只return 第一个 估算数据集,而默认情况下您估算了五个。有关详细信息,请参阅 ?mice::complete
"The default is action = 1L returns the first imputed data set."
要获得五个估算数据集,您必须指定 mice::complete
的 action
参数
mat2 <- complete(imp, "long")
mat2
.imp .id age bmi hyp chl
1 1 1 1 28.7 1 199
2 1 2 2 22.7 1 187
3 1 3 1 30.1 1 187
4 1 4 3 22.7 2 204
5 1 5 1 20.4 1 113
6 1 6 3 24.9 2 184
7 1 7 1 22.5 1 118
8 1 8 1 30.1 1 187
9 1 9 2 22.0 1 238
10 1 10 2 30.1 1 229
...
115 5 15 1 29.6 1 187
116 5 16 1 25.5 1 187
117 5 17 3 27.2 2 284
118 5 18 2 26.3 2 199
119 5 19 1 35.3 1 218
120 5 20 3 25.5 2 218
121 5 21 1 22.7 1 238
122 5 22 1 33.2 1 229
123 5 23 1 27.5 1 131
124 5 24 3 24.9 1 186
125 5 25 2 27.4 1 186
summary(mat)
和summary(mat2)
都是假的。
让我们关注 bmi。第一个提供第一个估算数据集的平均 bmi。第二个提供了人工 m 倍大数据集的平均值。第二个数据集也有不适当的低方差。
mean(mat$bmi)
27.484
mean(mat2$bmi)
26.5192
我没有找到比手动将鲁宾规则应用于均值估计更好的解决方案。正确的估计只是所有估算数据集的估计平均值
res <- with(imp, mean(bmi)) #get the mean for each imputed dataset, stored in res$analyses
do.call(sum, res$analyses) / 5 #compute mean over m = 5 mean estimations
26.5192
必须适当计算方差/标准差。您可以使用鲁宾规则来计算您想要的任何简单统计数据。你可以在这里找到这样做的方法 https://bookdown.org/mwheymans/bookmi/rubins-rules.html
希望这对您有所帮助。
作为 R 初学者,我发现很难弄清楚如何计算乘法估算数据的描述性统计(比 运行 其他一些基本分析更难,例如相关性和回归) .
这些类型的问题以道歉(Descriptive statistics (Means, StdDevs) using multiply imputed data: R) but have have not been answered (https://stats.stackexchange.com/questions/296193/pooling-basic-descriptives-from-several-multiply-imputed-datasets-using-mice)开头,或者很快投反对票。
这里是 miceadds 函数的描述(https://www.rdocumentation.org/packages/miceadds/versions/2.10-14/topics/stats0),我发现很难理解以 mids 格式存储的数据。
我已经使用 summary(complete(imp)) 获得了一些输出,例如平均值、中位数、最小值、最大值,但很想知道如何获得额外的汇总输出(例如,skew/kurtosis、标准差, 方差).
从上一张海报借来的插图:
> imp <- mice(nhanes, seed = 23109)
iter imp variable
1 1 bmi hyp chl
1 2 bmi hyp chl
1 3 bmi hyp chl
1 4 bmi hyp chl
1 5 bmi hyp chl
2 1 bmi hyp chl
2 2 bmi hyp chl
2 3 bmi hyp chl
> summary(complete(imp))
age bmi hyp chl
1:12 Min. :20.40 1:18 Min. :113
2: 7 1st Qu.:24.90 2: 7 1st Qu.:186
3: 6 Median :27.40 Median :199
Mean :27.37 Mean :194
3rd Qu.:30.10 3rd Qu.:218
Max. :35.30 Max. :284
有人愿意花时间说明如何使用 mids 对象来获得基本描述吗?
以下是您可以执行的一些步骤,以更好地理解每一步之后 R 对象会发生什么。我还建议您查看本教程: https://gerkovink.github.io/miceVignettes/
library(mice)
# nhanes object is just a simple dataframe:
data(nhanes)
str(nhanes)
#'data.frame': 25 obs. of 4 variables:
# $ age: num 1 2 1 3 1 3 1 1 2 2 ...
#$ bmi: num NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
#$ hyp: num NA 1 1 NA 1 NA 1 1 1 NA ...
#$ chl: num NA 187 187 NA 113 184 118 187 238 NA ...
# you can generate multivariate imputation using mice() function
imp <- mice(nhanes, seed=23109)
#The output variable is an object of class "mids" which you can explore using str() function
str(imp)
# List of 17
# $ call : language mice(data = nhanes)
# $ data :'data.frame': 25 obs. of 4 variables:
# ..$ age: num [1:25] 1 2 1 3 1 3 1 1 2 2 ...
# ..$ bmi: num [1:25] NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
# ..$ hyp: num [1:25] NA 1 1 NA 1 NA 1 1 1 NA ...
# ..$ chl: num [1:25] NA 187 187 NA 113 184 118 187 238 NA ...
# $ m : num 5
# ...
# $ imp :List of 4
#..$ age: NULL
#..$ bmi:'data.frame': 9 obs. of 5 variables:
#.. ..$ 1: num [1:9] 28.7 30.1 22.7 24.9 30.1 35.3 27.5 29.6 33.2
#.. ..$ 2: num [1:9] 27.2 30.1 27.2 25.5 29.6 26.3 26.3 30.1 30.1
#.. ..$ 3: num [1:9] 22.5 30.1 20.4 22.5 27.4 22 26.3 27.4 35.3
#.. ..$ 4: num [1:9] 27.2 22 22.7 21.7 25.5 27.2 24.9 30.1 22
#.. ..$ 5: num [1:9] 28.7 28.7 20.4 21.7 25.5 22.5 22.5 25.5 22.7
#...
#You can extract individual components of this object using $, for example
#To view the actual imputation for bmi column
imp$imp$bmi
# 1 2 3 4 5
# 1 28.7 27.2 22.5 27.2 28.7
# 3 30.1 30.1 30.1 22.0 28.7
# 4 22.7 27.2 20.4 22.7 20.4
# 6 24.9 25.5 22.5 21.7 21.7
# 10 30.1 29.6 27.4 25.5 25.5
# 11 35.3 26.3 22.0 27.2 22.5
# 12 27.5 26.3 26.3 24.9 22.5
# 16 29.6 30.1 27.4 30.1 25.5
# 21 33.2 30.1 35.3 22.0 22.7
# The above output is again just a regular dataframe:
str(imp$imp$bmi)
# 'data.frame': 9 obs. of 5 variables:
# $ 1: num 28.7 30.1 22.7 24.9 30.1 35.3 27.5 29.6 33.2
# $ 2: num 27.2 30.1 27.2 25.5 29.6 26.3 26.3 30.1 30.1
# $ 3: num 22.5 30.1 20.4 22.5 27.4 22 26.3 27.4 35.3
# $ 4: num 27.2 22 22.7 21.7 25.5 27.2 24.9 30.1 22
# $ 5: num 28.7 28.7 20.4 21.7 25.5 22.5 22.5 25.5 22.7
# complete() function returns imputed dataset:
mat <- complete(imp)
# The output of this function is a regular data frame:
str(mat)
# 'data.frame': 25 obs. of 4 variables:
# $ age: num 1 2 1 3 1 3 1 1 2 2 ...
# $ bmi: num 28.7 22.7 30.1 22.7 20.4 24.9 22.5 30.1 22 30.1 ...
# $ hyp: num 1 1 1 2 1 2 1 1 1 1 ...
# $ chl: num 199 187 187 204 113 184 118 187 238 229 ...
# So you can run any descriptive statistics you need with this object
# Just like you would do with a regular dataframe:
> summary(mat)
# age bmi hyp chl
# Min. :1.00 Min. :20.40 Min. :1.00 Min. :113.0
# 1st Qu.:1.00 1st Qu.:24.90 1st Qu.:1.00 1st Qu.:187.0
# Median :2.00 Median :27.50 Median :1.00 Median :204.0
# Mean :1.76 Mean :27.48 Mean :1.24 Mean :204.9
# 3rd Qu.:2.00 3rd Qu.:30.10 3rd Qu.:1.00 3rd Qu.:229.0
# Max. :3.00 Max. :35.30 Max. :2.00 Max. :284.0
您的代码和 Katia 的回答都存在一些错误,并且 Katia 提供的 link 不再可用。
要在多重插补后计算简单的统计数据,您必须遵循鲁宾法则,这是在老鼠身上使用的方法,用于选择一组模型拟合。
使用时
library(mice)
imp <- mice(nhanes, seed = 23109)
mat <- complete(imp)
mat
age bmi hyp chl
1 1 28.7 1 199
2 2 22.7 1 187
3 1 30.1 1 187
4 3 22.7 2 204
5 1 20.4 1 113
6 3 24.9 2 184
7 1 22.5 1 118
8 1 30.1 1 187
9 2 22.0 1 238
10 2 30.1 1 229
11 1 35.3 1 187
12 2 27.5 1 229
13 3 21.7 1 206
14 2 28.7 2 204
15 1 29.6 1 238
16 1 29.6 1 238
17 3 27.2 2 284
18 2 26.3 2 199
19 1 35.3 1 218
20 3 25.5 2 206
21 1 33.2 1 238
22 1 33.2 1 229
23 1 27.5 1 131
24 3 24.9 1 284
25 2 27.4 1 186
您只return 第一个 估算数据集,而默认情况下您估算了五个。有关详细信息,请参阅 ?mice::complete
"The default is action = 1L returns the first imputed data set."
要获得五个估算数据集,您必须指定 mice::complete
action
参数
mat2 <- complete(imp, "long")
mat2
.imp .id age bmi hyp chl
1 1 1 1 28.7 1 199
2 1 2 2 22.7 1 187
3 1 3 1 30.1 1 187
4 1 4 3 22.7 2 204
5 1 5 1 20.4 1 113
6 1 6 3 24.9 2 184
7 1 7 1 22.5 1 118
8 1 8 1 30.1 1 187
9 1 9 2 22.0 1 238
10 1 10 2 30.1 1 229
...
115 5 15 1 29.6 1 187
116 5 16 1 25.5 1 187
117 5 17 3 27.2 2 284
118 5 18 2 26.3 2 199
119 5 19 1 35.3 1 218
120 5 20 3 25.5 2 218
121 5 21 1 22.7 1 238
122 5 22 1 33.2 1 229
123 5 23 1 27.5 1 131
124 5 24 3 24.9 1 186
125 5 25 2 27.4 1 186
summary(mat)
和summary(mat2)
都是假的。
让我们关注 bmi。第一个提供第一个估算数据集的平均 bmi。第二个提供了人工 m 倍大数据集的平均值。第二个数据集也有不适当的低方差。
mean(mat$bmi)
27.484
mean(mat2$bmi)
26.5192
我没有找到比手动将鲁宾规则应用于均值估计更好的解决方案。正确的估计只是所有估算数据集的估计平均值
res <- with(imp, mean(bmi)) #get the mean for each imputed dataset, stored in res$analyses
do.call(sum, res$analyses) / 5 #compute mean over m = 5 mean estimations
26.5192
必须适当计算方差/标准差。您可以使用鲁宾规则来计算您想要的任何简单统计数据。你可以在这里找到这样做的方法 https://bookdown.org/mwheymans/bookmi/rubins-rules.html
希望这对您有所帮助。