Error: BoxCox error during preprocess imputation R language
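I am working through the solution to Exercise 3 in Chapter 6 of Applied Predictive Modeling by Max Kuhn, and I hit an error at the imputation predict step (despite following their solution exactly). Reproducible code and the problem are below: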
library(AppliedPredictiveModeling)
library(caret)
library(RANN)
data(ChemicalManufacturingProcess)
predictors <- subset(ChemicalManufacturingProcess,select= -Yield)
yield <- subset(ChemicalManufacturingProcess,select="Yield")
# Impute
#Split data into training and test sets
set.seed(517)
trainingRows <- createDataPartition(yield$Yield,
                                    p = 0.7,
                                    list = FALSE)
trainPredictors <- predictors[trainingRows,]
trainYield <- yield[trainingRows,]
testPredictors <- predictors[-trainingRows,]
testYield <- yield[-trainingRows,]
#Pre-process trainPredictors and apply to trainPredictors and testPredictors
pp <- preProcess(trainPredictors,method=c("BoxCox","center","scale","knnImpute"))
ppTrainPredictors <- predict(pp,newdata=trainPredictors)
ppTestPredictors <- predict(pp,newdata=testPredictors) # This results in an error
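The error it gives is: Error in RANN::nn2(old[, non_missing_cols, drop = FALSE], new[, non_missing_cols, : NA/NaN/Inf in foreign function call (arg 2)
When I switch to the YeoJohnson transformation it seems to work (I have read that it can handle non-positive values).
However, I don't understand why it fails on the test data, given that the test set is just a different subset of the same data as the training set. Is it only the imputation step that is the problem?
I couldn't find any answers about this, which seems odd, since surely other people who have worked through this book would have noticed? Or am I being thick?
Thanks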
You get that error because the Box-Cox transformation does not accept zeros. If you look at the help page for BoxCoxTrans, it says:
If any(y <= 0) or if length(unique(y)) < numUnique, lambda is not
estimated and no transformation is applied.
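So if your preProcess() is run on a training set in which a column has no zeros, the Box-Cox transformation is applied to that column, but it then fails on a test set that does contain zeros in it.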
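To make the failure mode concrete, here is a minimal base R sketch of the textbook Box-Cox formula. It is not caret's internal code, and `lambda_hat` and `test_col` are made-up values chosen for illustration. With a lambda that is zero or negative, a zero in the new data maps to a non-finite value, which is the kind of input that `RANN::nn2` rejects with "NA/NaN/Inf in foreign function call".

```r
# Textbook Box-Cox transform; lambda is assumed to have been estimated elsewhere.
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) {
    log(y)
  } else {
    (y^lambda - 1) / lambda
  }
}

lambda_hat <- -0.5          # hypothetical value, as if estimated on a training column
test_col   <- c(2, 0, 4.5)  # made-up test column that contains a zero

transformed <- box_cox(test_col, lambda_hat)
is.finite(transformed)      # TRUE FALSE TRUE: the zero becomes -Inf
```

Note that a positive lambda would keep the zero finite under this formula, so whether a given column breaks depends on the lambda estimated for it.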
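In the book example above, the seed was most likely set under an older version of R, where it worked. R 3.6.0 changed the default sampling algorithm, so the same seed now produces a different split, and with a newer version of R the example no longer works. So if I check your example: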
cbind(colSums(trainPredictors==0,na.rm=TRUE),colSums(testPredictors==0,na.rm=TRUE))
                       [,1] [,2]
BiologicalMaterial01      0    0
BiologicalMaterial02      0    0
BiologicalMaterial03      0    0
BiologicalMaterial04      0    0
BiologicalMaterial05      0    0
BiologicalMaterial06      0    0
BiologicalMaterial07      0    0
BiologicalMaterial08      0    0
BiologicalMaterial09      0    0
BiologicalMaterial10      0    0
BiologicalMaterial11      0    0
BiologicalMaterial12      0    0
ManufacturingProcess01    1    2
ManufacturingProcess02   29    6
ManufacturingProcess03    0    0
ManufacturingProcess04    0    0
ManufacturingProcess05    0    0
ManufacturingProcess06    0    0
ManufacturingProcess07    0    0
ManufacturingProcess08    0    0
ManufacturingProcess09    0    0
ManufacturingProcess10    0    0
ManufacturingProcess11    0    0
ManufacturingProcess12  104   38
ManufacturingProcess13    0    0
ManufacturingProcess14    0    0
ManufacturingProcess15    0    0
ManufacturingProcess16    1    0
ManufacturingProcess17    0    0
ManufacturingProcess18    1    0
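You can see that ManufacturingProcess16 and ManufacturingProcess18, each of which has a zero in one split but not in the other, are the columns that will cause you problems.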
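Rather than reading the counts off by eye, the same check can be automated. The sketch below uses a hypothetical helper, `positivity_mismatch()`, which is not part of caret, and toy data frames standing in for `trainPredictors` and `testPredictors`. It lists the columns that are strictly positive in one split but not in the other.

```r
# Hypothetical helper (not part of caret): list the columns whose
# "all values strictly positive" status differs between two splits.
positivity_mismatch <- function(train, test) {
  all_pos <- function(df) {
    vapply(df, function(x) all(x > 0, na.rm = TRUE), logical(1))
  }
  names(train)[all_pos(train) != all_pos(test)]
}

# Toy data standing in for trainPredictors / testPredictors
toy_train <- data.frame(p1 = c(1, 2, 3), p2 = c(0, 4, 5), p3 = c(2, NA, 6))
toy_test  <- data.frame(p1 = c(4, 5),    p2 = c(1, 2),    p3 = c(0, 3))

positivity_mismatch(toy_train, toy_test)  # "p2" "p3"
```

On the real data you would call `positivity_mismatch(trainPredictors, testPredictors)` after the split and before `preProcess()`.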
The Yeo-Johnson transformation can handle zero and negative values, so it does not run into this problem.
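For comparison, here is a base R sketch of the standard Yeo-Johnson formula. Again this is an illustration, not caret's implementation; the function name, the `eps` tolerance and the input values are assumptions, and the sketch does not handle NA inputs. Because non-negative values are shifted by one before the power or log is taken, and negative values get their own branch, zeros and negatives stay finite.

```r
# Standard Yeo-Johnson transform for a numeric vector without NAs.
yeo_johnson <- function(y, lambda, eps = 1e-8) {
  out <- numeric(length(y))
  pos <- y >= 0
  # Branch for non-negative values
  if (abs(lambda) < eps) {
    out[pos] <- log1p(y[pos])
  } else {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  }
  # Branch for negative values
  if (abs(lambda - 2) < eps) {
    out[!pos] <- -log1p(-y[!pos])
  } else {
    out[!pos] <- -((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda)
  }
  out
}

# Made-up values, including a zero and a negative, with a negative lambda
yj <- yeo_johnson(c(-1.5, 0, 2, 4.5), lambda = -0.5)
all(is.finite(yj))  # TRUE
```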
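If you want to carry on with the worked example, you can try a different seed, that is, change the value passed to set.seed() in the split below: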
set.seed(517)
trainingRows <- createDataPartition(yield$Yield,
                                    p = 0.7,
                                    list = FALSE)
trainPredictors <- predictors[trainingRows,]
trainYield <- yield[trainingRows,]
testPredictors <- predictors[-trainingRows,]
testYield <- yield[-trainingRows,]
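If the R version really is the cause, as suggested above, another thing to try is asking R for its pre-3.6.0 sampler when setting the seed. The sketch below assumes R 3.6.0 or later, where `set.seed()` accepts a `sample.kind` argument. Whether this reproduces the book's exact split also depends on the caret version, so treat it as something to experiment with rather than a guaranteed fix.

```r
# Old-style sampler (the default before R 3.6.0); R warns that it is
# non-uniform, so the warning is suppressed here.
suppressWarnings(set.seed(517, sample.kind = "Rounding"))
idx_old <- sample(100, 10)

# Current default sampler with the same seed
set.seed(517, sample.kind = "Rejection")
idx_new <- sample(100, 10)

# Compare the two draws made from the same seed
identical(idx_old, idx_new)
```

In the worked example, you would replace `set.seed(517)` with the `"Rounding"` call before `createDataPartition()`, then switch back to `"Rejection"` afterwards so later code uses the current default.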