R - Is there a way to restrict the range of values imputed by 'mi'? (Working with the Kaggle Titanic data set)
I have been working through the "How to perform a Logistic Regression in R" tutorial on R-bloggers, which uses the data set from the Kaggle Titanic challenge. A gist with all of the code in the post can be found here.
The training data set has missing data. It contains data on 891 passengers (891 rows), and 177 of them are missing an Age value:
                             type missing method  model
PassengerId            continuous       0   <NA>   <NA>
Survived                   binary       0   <NA>   <NA>
Pclass        ordered-categorical       0   <NA>   <NA>
Name        unordered-categorical       0   <NA>   <NA>
Sex                        binary       0   <NA>   <NA>
Age                    continuous     177    ppd linear  <----
SibSp                  continuous       0   <NA>   <NA>
Parch                  continuous       0   <NA>   <NA>
Ticket      unordered-categorical       0   <NA>   <NA>
Fare                   continuous       0   <NA>   <NA>
Cabin       unordered-categorical     687    ppd mlogit
Embarked    unordered-categorical       2    ppd mlogit
In the tutorial, the missing values are simply replaced with the mean of the observed Age values:
data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)
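I am interested in imputing the missing values rather than doing mean or median replacement. Several imputation libraries exist, such as amelia and MICE, but I have used mi in the past, which is why I chose mi for this problem.
The main problem is that the range of the imputed values is unreasonable when I use mi:
The red bars are the means of each distribution. Passenger ages range from 0.42 to 80 (years). The imputed values range from less than -100 to more than 200.
Clearly this is not useful at all. Below is the code I used. I used the mi vignette as a guide.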
library(mi)
training.data.raw <- read.csv("train.csv", header = TRUE, na.strings = c(""))
# create missing data frame for use with mi
training.data.raw.mdf <- missing_data.frame(training.data.raw)
#image(training.data.raw.mdf)
# adjust variable types
training.data.raw.mdf <- change(training.data.raw.mdf, y = "Parch", what = "type", to = "ord")
training.data.raw.mdf <- change(training.data.raw.mdf, y = "SibSp", what = "type", to = "count")
training.data.raw.mdf <- change(training.data.raw.mdf, y = "PassengerId", what = "type", to = "irrelevant")
# parallel imputation should be default on non-Windows systems (i.e. Linux)
imputations <- mi(training.data.raw.mdf, n.iter = 30, n.chains = 4, max.minutes = 20)
round(mipply(imputations, mean, to.matrix = TRUE), 3)
# get data frames
imputed.dataframes <- complete(imputations, m = 1)
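Is there a way to control the range of the imputed values so that they fall between 0 and 80?
I am happy to use any imputation library (mi, MICE, amelia) as long as it produces reasonable results. Any method and any library that gives sensible results is of interest.
Try the bounded-continuous class option in the mi package. It should work for you.
Here is the example from the documentation: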
# STEP 0: GET DATA
data(CHAIN, package = "mi")
# STEP 0.5 CREATE A missing_variable (you never need to actually do this)
lo_bound <- 0
hi_bound <- rep(Inf, nrow(CHAIN))
hi_bound[CHAIN$log_virus == 0] <- 6
log_virus <- missing_variable(ifelse(CHAIN$log_virus == 0, NA, CHAIN$log_virus),
                              type = "bounded-continuous",
                              lower = lo_bound, upper = hi_bound)
show(log_virus)
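The same call pattern could be applied to an age variable. The following is only a minimal sketch, not something taken from the mi documentation: the `age_values` vector is a made-up stand-in for the Titanic `Age` column, and it assumes that `missing_variable()` accepts a scalar `upper` in the same way the documented example uses a scalar `lower`.

```r
library(mi)

# Hypothetical stand-in for training.data.raw$Age (illustrative values only)
age_values <- c(22, 38, NA, 35, NA, 54, 2, 27, 14, 0.42, 71, NA)

# Assumption: a scalar upper bound is accepted just like the scalar lower bound
# in the documented CHAIN example. 0 and 80 are the bounds asked for in the question.
age <- missing_variable(age_values,
                        type = "bounded-continuous",
                        lower = 0, upper = 80)
show(age)
```

How to carry such a bounded variable into the `missing_data.frame` passed to `mi()` (for instance through `change()`) is not shown in the example above, so that step should be checked against the package documentation.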