R - Is there a way to restrict the range of values imputed by 'mi'? (Working with Kaggle Titanic data set)

I have been working through the How to perform a Logistic Regression in R tutorial on R-bloggers, which uses the data set from the Kaggle Titanic challenge. A gist with all of the code in the post can be found here.

There is missing data in the training data set.

The data set contains data for 891 passengers (891 rows), and 177 of them are missing an Age value:

                             type missing method  model
PassengerId            continuous       0   <NA>   <NA>
Survived                   binary       0   <NA>   <NA>
Pclass        ordered-categorical       0   <NA>   <NA>
Name        unordered-categorical       0   <NA>   <NA>
Sex                        binary       0   <NA>   <NA>
Age                    continuous     177    ppd linear   <----
SibSp                  continuous       0   <NA>   <NA>
Parch                  continuous       0   <NA>   <NA>
Ticket      unordered-categorical       0   <NA>   <NA>
Fare                   continuous       0   <NA>   <NA>
Cabin       unordered-categorical     687    ppd mlogit
Embarked    unordered-categorical       2    ppd mlogit

In the tutorial, the missing values are simply replaced with the mean of the observed Age values:

data$Age[is.na(data$Age)] <- mean(data$Age,na.rm=T)
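
To see what mean replacement does to a variable, here is a minimal sketch on a made-up age vector (the values are illustrative and are not taken from train.csv). Every gap receives the same number, so the mean is preserved while the spread shrinks:

```r
# Toy ages with gaps; illustrative values only, not the Titanic data
age <- c(22, 38, NA, 35, NA, 54, 2, 27, NA, 14)

# Same replacement as in the tutorial, applied to a copy
age_filled <- age
age_filled[is.na(age_filled)] <- mean(age, na.rm = TRUE)

mean(age, na.rm = TRUE)   # mean of the observed values
mean(age_filled)          # unchanged by mean replacement
sd(age, na.rm = TRUE)     # spread of the observed values
sd(age_filled)            # smaller: the filled-in values add no variation
```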

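I am interested in imputing the missing values rather than doing mean or median replacement. Several imputation libraries exist, such as amelia and MICE, but I have used mi in the past, which is why I chose mi for this problem.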

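The main problem is that the range of the imputed values is unreasonable when I use mi:

The red bars are the mean of each distribution. Passenger ages range from 0.42 to 80 (years). The imputed values range from less than -100 to greater than 200.

Obviously this is not useful at all. Below is the code I used. I used the mi vignette as a guide.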

    library(mi)

    training.data.raw <- read.csv("train.csv", header = TRUE, na.strings = c(""))
    # create missing data frame for use with mi
    training.data.raw.mdf <- missing_data.frame(training.data.raw)
    #image(training.data.raw.mdf)


    # adjust variable types
    training.data.raw.mdf <- change(training.data.raw.mdf, y = "Parch", what = "type", to = "ord")
    training.data.raw.mdf <- change(training.data.raw.mdf, y = "SibSp", what = "type", to = "count")
    training.data.raw.mdf <- change(training.data.raw.mdf, y = "PassengerId", what = "type", to = "irrelevant")

    # parallel imputation should be default on non-Windows systems (i.e. Linux)
    imputations <- mi(training.data.raw.mdf, n.iter = 30, n.chains = 4, max.minutes = 20)
    round(mipply(imputations, mean, to.matrix = TRUE), 3)

    # get data frames
    imputed.dataframes <- complete(imputations, m = 1)
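
One way to quantify the problem is to compare the imputed ages against the observed range. The sketch below uses made-up vectors in place of the real output: `observed_age`, `completed_age`, and the helper `outside_range()` are hypothetical names, and in practice the completed column would come from `imputed.dataframes`.

```r
# Hypothetical helper: flag values that fall outside [lower, upper]
outside_range <- function(x, lower, upper) {
  x < lower | x > upper
}

# Made-up stand-ins: the original column with NAs, and the same column after imputation
observed_age  <- c(22, NA, 0.42, NA, 80, NA)
completed_age <- c(22, -103.5, 0.42, 41.2, 80, 212.7)

# Only look at the positions that were missing in the original data
was_missing <- is.na(observed_age)
bounds <- range(observed_age, na.rm = TRUE)

bad <- outside_range(completed_age[was_missing], bounds[1], bounds[2])
sum(bad)    # number of imputed ages outside the observed range
mean(bad)   # share of imputed ages outside the observed range
```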

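Is there a way to control the range of the imputed values so that they fall between 0 and 80?

I am happy to use any imputation library - mi, MICE, amelia - as long as it produces reasonable results. Any method and any library that gives sensible imputations is of interest.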

Try the bounded-continuous-class option in the mi package. That should work for you.

Here is the example from the documentation:

# STEP 0: GET DATA
data(CHAIN, package = "mi")

# STEP 0.5 CREATE A missing_variable (you never need to actually do this)
lo_bound <- 0
hi_bound <- rep(Inf, nrow(CHAIN))
hi_bound[CHAIN$log_virus == 0] <- 6

log_virus <- missing_variable(ifelse(CHAIN$log_virus == 0, NA, CHAIN$log_virus),
                              type = "bounded-continuous",
                              lower = lo_bound, upper = hi_bound)

show(log_virus)
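
Adapting that example to the question, a minimal sketch might look like the following. It uses a short made-up age vector instead of train.csv and only builds the missing_variable. The bounds 0 and 80 come from the question, and the shape of `lower` and `upper` mirrors the documentation example (a scalar lower bound and a per-row upper bound). How the bounded variable is wired back into the missing_data.frame is not shown here, so check the mi documentation for that step.

```r
library(mi)

# Made-up ages with gaps; in the question this would be training.data.raw$Age
age <- c(22, 38, NA, 35, NA, 54, 2, 27, NA, 14, 0.42, 80)

# Bounds taken from the question: ages should stay between 0 and 80
lo_bound <- 0
hi_bound <- rep(80, length(age))

# Same constructor as in the documentation example above
age_bounded <- missing_variable(age,
                                type = "bounded-continuous",
                                lower = lo_bound, upper = hi_bound)

show(age_bounded)
```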