How to fix the "undefined columns selected" error in R?
Although my project does not rely entirely on the caret R package, I intend to use lasso or random forest for prediction. I tried random forest on my data, but I get the following strange error:
> Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE) :
> undefined columns selected
> In addition: There were 50 or more warnings (use warnings() to see the first 50)
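I don't understand why this happens. Any pointers on how to make this work? Why am I getting this error?
Minimal reproducible data
Here is a minimal reproducible dataset: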
mydf = structure(list(taken_time = c(15L, 5L, 39L,
-21L, 46L, 121L, 9L, 100L, 70L, 92L, 31L, 37L), ap6xl = c(203.2893857,
4.858269406, 200, 14220, 218.2215352, 115.5227706, 4.858269406,
516.18125, 72.06166523, 4.858269406, 96.68516046, 386.1480917
), pct5 = c(732.074484, 25.67901235, 1900, 120.0477168, 3621.328567,
79.30561111, 8376.70314, 4183.709089, 59.77649029, 997.7490228,
118.9774144, 171.2285804), crp4 = c(196115424.7, 1073624.455,
10007, 1457496.474, 10343851.7, 81288042.73, 320405225.1, 334807893.9,
112950094.2, 15775090.31, 3008739.881, 127837638.1), age = c(52L,
74L, 52L, 67L, 82L, 67L, 71L, 84L, 58L, 52L, 81L, 60L), gender = structure(c(2L,
2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("F",
"M"), class = "factor"), inpatient_readmission_time_rtd = c(79.78819444,
57.59068053, 57.59068053, 57.59068053, 57.59068053, 9.893055556,
150.1951389, 57.59068053, 134.05625, 57.59068053, 65.16041667,
17.46527778), infection_flag = c(0L, 0L, 1L, 1L, 0L, 1L, 0L,
1L, 1L, 1L, 1L, 0L), temperature_value = c(98.9, 98.9, 98, 101.3,
99.5, 98.1, 98.7, 97.1, 98.1, 98.2, 100.4, 98.8), heartrate_value = c(106,
61, 78, 91, 120, 68, 93.55081001, 122, 110, 75, 116, 111), pH_result_time_rta = c(11,
85.50402145, 85.50402145, 85.50402145, 85.50402145, 85.50402145,
85.50402145, 85.50402145, 85.50402145, 85.50402145, 50, 85.50402145
), gcst_value = c(15, 15, 15, 14.63769293, 15, 14.63769293, 15,
15, 15, 14.63769293, 15, 15)), row.names = c(NA, 12L), class = "data.frame")
My attempt
This is what I tried, but caret just complains about it. Why? Any ideas?
library(caret)
fitControl <- trainControl(method = "repeatedcv",number = 10,repeats = 10, search = "random")
model_cv <- train(mydf$gcst_value ~ .,data = dat,method = "randomforest",
trControl = fitControl,na.action = na.omit)
immunoscore = predict(model_cv, mydf)
Update:
Here is my R session:
> > sessionInfo() R version 3.6.3 (2020-02-29) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build
> 18362)
>
> Matrix products: default
>
> Random number generation: RNG: Mersenne-Twister Normal:
> Inversion Sample: Rounding locale: [1] LC_COLLATE=English_United
> States.1252 LC_CTYPE=English_United States.1252 [3]
> LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages: [1] stats graphics grDevices utils
> datasets methods base
>
> other attached packages: [1] randomForest_4.6-14 data.table_1.12.8
> stringr_1.4.0 ranger_0.12.1 caret_6.0-86 [6]
> ggplot2_3.3.0 lattice_0.20-38 jsonlite_1.6.1
> dplyr_0.8.5
>
> loaded via a namespace (and not attached): [1] Rcpp_1.0.3
> pillar_1.4.3 compiler_3.6.3 gower_0.2.1
> plyr_1.8.6 [6] class_7.3-15 iterators_1.0.12
> tools_3.6.3 elasticnet_1.1.1 rpart_4.1-15 [11]
> ipred_0.9-9 lubridate_1.7.4 lifecycle_0.2.0
> tibble_2.1.3 gtable_0.3.0 [16] nlme_3.1-144
> pkgconfig_2.0.3 rlang_0.4.5 Matrix_1.2-18
> foreach_1.5.0 [21] rstudioapi_0.11 prodlim_2019.11.13
> withr_2.1.2 pROC_1.16.2 generics_0.0.2 [26]
> recipes_0.1.10 stats4_3.6.3 nnet_7.3-12
> grid_3.6.3 tidyselect_1.0.0 [31] glue_1.3.2
> R6_2.4.1 survival_3.1-8 lava_1.6.7
> reshape2_1.4.3 [36] purrr_0.3.3 magrittr_1.5
> lars_1.2 ModelMetrics_1.2.2.2 splines_3.6.3 [41]
> MASS_7.3-51.5 scales_1.1.0 codetools_0.2-16
> assertthat_0.2.1 timeDate_3043.102 [46] colorspace_1.4-1
> stringi_1.4.6 munsell_0.5.0 crayon_1.3.4
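You need to fix two things:
1. Every variable in the formula must be a column of data. This is what causes the error in your question: the response is written as mydf$gcst_value, which refers to a data.frame other than the one passed as the data argument (dat).
2. randomforest is not a valid model name. In the method argument, random forest is specified as rf.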
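The first point can be reproduced in base R without caret. The sketch below is an assumption about the mechanism, inferred from the call shown in the error message (data[, all.vars(Terms), drop = FALSE]); the data frame df and its columns are made up for illustration.

```r
# Hypothetical toy data; not the poster's data set.
df <- data.frame(y = c(1, 2, 3), x = c(4, 5, 6))

f_bad  <- df$y ~ x   # response written with `$`
f_good <- y ~ x      # response written as a bare column name

# all.vars() returns every symbol in the formula, including `df` itself.
vars_bad  <- all.vars(f_bad)
vars_good <- all.vars(f_good)

# Selecting columns by those names fails for the `$` formula,
# because `df` is not a column of df.
bad_fails  <- inherits(try(df[, vars_bad, drop = FALSE], silent = TRUE), "try-error")
good_fails <- inherits(try(df[, vars_good, drop = FALSE], silent = TRUE), "try-error")

print(vars_bad)
print(bad_fails)
print(good_fails)
```

Writing the response as a bare column name and passing the same data.frame as data avoids the extra symbol.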
With both issues fixed (see the notes below):
fitControl <- trainControl(method = "repeatedcv",number = 10,repeats = 10,
search = "random")
model_cv <- train(gcst_value ~ .,data = mydf,method = "rf",
trControl = fitControl,
na.action = na.omit)
immunoscore = predict(model_cv, mydf)
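As a side check on the method = "rf" choice, the model names caret knows about can be listed with getModelInfo(). This is a minimal sketch that assumes caret is installed; the variable name available is only for illustration.

```r
# Only runs when caret is installed; otherwise nothing is printed.
if (requireNamespace("caret", quietly = TRUE)) {
  # Names of all models that train() accepts in its `method` argument.
  available <- names(caret::getModelInfo())

  print("rf" %in% available)            # random forest via the randomForest package
  print("randomforest" %in% available)  # the string used in the question
}
```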
Summary:
summary(model_cv)
Length Class Mode
call 4 -none- call
type 1 -none- character
predicted 12 -none- numeric
mse 500 -none- numeric
rsq 500 -none- numeric
oob.times 12 -none- numeric
importance 11 -none- numeric
importanceSD 0 -none- NULL
localImportance 0 -none- NULL
proximity 0 -none- NULL
Getting the RMSE (purely for illustration):
RMSE(immunoscore,mydf$gcst_value)
[1] 0.08737056
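For reference, RMSE is the square root of the mean squared difference between predictions and observed values. The sketch below computes it by hand in base R; the function name rmse and the numbers are hypothetical and are not taken from the model above.

```r
# Root mean squared error between predictions and observed values.
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Hypothetical numbers for illustration only.
pred <- c(1, 2, 3)
obs  <- c(1, 2, 5)

print(rmse(pred, obs))
```

Note that the value reported above is computed on the same rows the model was trained on, so it is likely optimistic. The cross-validated estimates should be available in model_cv$results.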
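Notes
1. The validity of this model is left to the original poster.
2. The warnings are probably due to model validity issues. I have omitted them from the answer.
Further note
On inspecting the warning messages (see note 1 above):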
50: In randomForest.default(x, y, mtry = param$mtry, ...) :
The response has five or fewer unique values. Are you sure you want to do regression?
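The warning can be checked directly against the response column. The sketch below copies the gcst_value values from the question's data and counts the distinct ones; the variable name n_unique is only for illustration.

```r
# gcst_value exactly as listed in the question's data.
gcst_value <- c(15, 15, 15, 14.63769293, 15, 14.63769293, 15,
                15, 15, 14.63769293, 15, 15)

# Number of distinct response values.
n_unique <- length(unique(gcst_value))
print(n_unique)

# How often each value occurs.
print(table(gcst_value))
```

With so few distinct values, regression may not be the right framing for this response; whether to recode it or collect more varied data is a modelling decision for the original poster.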