Does extractPrediction() support factors?
I'm trying a random forest model as one of several models I'm testing, which also include neural networks (nnet and neuralnet), all through the handy caret package. Random forests support factors, so for this model, rather than converting the factors to numeric contrasts with dummyVars(), I figured I would just leave them as factors. That works fine in the training step (train()):
library(caret)
#Set dependent
seed = 123
y = "Sepal.Length"
#Partition (iris) data into train and test sets
set.seed(seed)
train.idx = createDataPartition(y = iris[,y], p = .8, list = FALSE)
train.set = iris[train.idx,]
test.set = iris[-train.idx,]
train.set = data.frame(train.set)
test.set = data.frame(test.set)
#Select features
features = c("Sepal.Width", "Petal.Length", "Petal.Width", "Species")
mod.features = paste(features, collapse = " + ")
#Create formula
mod.formula = as.formula(paste(y, mod.features, sep = " ~ "))
#Train model
mod <- train(mod.formula, data = train.set,
             method = "rf")
But when I try to use extractPrediction(), it fails:
#Test model with extractPrediction()
testPred = extractPrediction(models = list(mod),
                             testX = test.set[,features],
                             testY = test.set[,y])
Error in predict.randomForest(modelFit, newdata) : variables in the training data missing in newdata
Now, as far as I can tell, this is because one-hot encodings/contrasts are created for the factors during the call to train(), which creates some new variable names. The basic predict() method seems to work fine even with the factors present:
#Test model with predict()
testPred = predict(mod$finalModel,
                   newData = test.set[, features])
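To see where the name mismatch comes from, here is a quick check I added (not part of the original post), reusing mod, mod.formula and train.set from above:
#Added check: the formula interface expands the Species factor into
#contrast columns, so the fitted randomForest expects names like
#"Speciesversicolor" rather than the original "Species" column.
colnames(model.matrix(mod.formula, data = train.set))
rownames(mod$finalModel$importance)  #predictor names the model was fit on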
When I use dummyVars() to convert my factors to numeric contrasts up front, extractPrediction() works fine:
#Train and test model using dummyVar
data.dummies = dummyVars(~.,data = iris)
data = predict(data.dummies, newdata = iris)
set.seed(seed)
train.idx = createDataPartition(y = data[,y], p = .8, list = FALSE)
train.set = data[train.idx,]
test.set = data[-train.idx,]
features = c("Sepal.Width", "Petal.Length", "Petal.Width", "Species.setosa",
"Species.versicolor", "Species.virginica")
mod.features = paste(features, collapse = " + ")
#Create formula
mod.formula = as.formula(paste(y, mod.features, sep = " ~ "))
train.set = data.frame(train.set)
test.set = data.frame(test.set)
mod <- train(mod.formula, data = train.set,
             method = "rf")
testPred = extractPrediction(models = list(mod),
                             testX = test.set[,features],
                             testY = test.set[,y])
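As a sanity check (again something I added, not in the original post), the predictor names stored in the fitted model now match the columns of the encoded test set:
#Added check: with dummyVars() the expansion happens before train(),
#so training and test data share the same column names.
setdiff(rownames(mod$finalModel$importance), colnames(test.set))  #expect character(0)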
Can anyone explain why this happens? It would be great to get extractPrediction() to handle factors in my multi-model testing pipeline. I suppose I could just convert everything with dummyVars() at the start, but I'd really like to understand why extractPrediction() can't work with factors in this case even though predict() does.
If you use the default function interface instead of the formula interface, you should be in business. The formula interface runs the predictors through model.matrix(), so factors are expanded into dummy/contrast columns with new names before the model is fit, while the default x/y interface hands the factors to randomForest as-is:
set.seed(1234)
#Formula interface: reproduces the setup from the question
mod_formula <- train(
  Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
  , data = iris
  , method = "rf")
test_formula <- extractPrediction(
  models = list(mod_formula)
)

set.seed(1234)
#Default x/y interface: the Species factor is passed through untouched
mod_default <- train(
  y = iris$Sepal.Length
  , x = iris[, c('Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species')]
  , method = "rf")
test_default <- extractPrediction(
  models = list(mod_default)
)
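For the test-set scoring in the question, a minimal sketch along the same lines could look like this (mod_rf is just a name introduced here; the split mirrors the one in the question):
library(caret)
#Sketch: rebuild the factor-based split from the question, then train
#through the default x/y interface so Species stays a factor.
set.seed(123)
train.idx <- createDataPartition(iris$Sepal.Length, p = .8, list = FALSE)
train.set <- iris[train.idx, ]
test.set  <- iris[-train.idx, ]
features  <- c("Sepal.Width", "Petal.Length", "Petal.Width", "Species")

mod_rf <- train(x = train.set[, features],
                y = train.set$Sepal.Length,
                method = "rf")

#The column names in testX now match what the model was trained on,
#so extractPrediction() can score the held-out rows directly.
testPred <- extractPrediction(models = list(mod_rf),
                              testX = test.set[, features],
                              testY = test.set$Sepal.Length)
head(testPred)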