R does not recognize column names as predictors

Question

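I am trying to build a random forest model that predicts arsenic concentration, using Sentinel-2 bands together with VNIR and MNIR spectra as predictors. The model is constructed as follows: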

RF_RAW <- randomForest(Arsenic ~  ., data = merged[, c(5, 10:1944)], importance= TRUE, na.action=na.omit )

The column names cause a problem, because the function does not recognize them:

Error in eval(predvars, data, env) : object '400' not found

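If I leave out 400, the same thing happens with any other wavelength column from the VNIR and MNIR spectra. The Sentinel band columns, B1 to B12, work fine, and the random forest runs when given only those. I can't figure it out.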

    head(merged)
# A tibble: 6 x 1,944
  Sample_ID   Corg   H2O   KCL Arsenic Phospate   FID Longitude Latitude     B1     B2     B3     B4     B5    B6    B7    B8
  <chr>      <dbl> <dbl> <dbl>   <dbl>    <dbl> <dbl>     <dbl>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
1 P1_S1_1    1.16   5.6   4.32  47958.     597.    10      16.9     50.4 0.0371 0.0358 0.0527 0.0557 0.101  0.204 0.224 0.219
2 P1_S1_2    0.470  6.24  4.7   50720.     398.     9      16.9     50.4 0.0371 0.0395 0.0606 0.0641 0.101  0.204 0.224 0.259
3 P2_S1_1    1.33   6.28  4.91  31055.     316.     8      16.9     50.4 0.0323 0.0436 0.0682 0.0729 0.107  0.182 0.226 0.232
4 P2_S1_2   12.7    7.67  7.07  37492.     695.     7      16.9     50.4 0.0322 0.034  0.0576 0.0422 0.102  0.223 0.231 0.288
5 P3_S3_1    0.617  6.76  5.64  54245.     249.     6      16.9     50.4 0.0322 0.0438 0.0662 0.0696 0.113  0.213 0.253 0.233
6 P4_S4_1   20.8    4.41  3.21   7175.     731.    11      16.9     50.4 0.027  0.0253 0.0554 0.0272 0.0966 0.281 0.356 0.404
# ... with 1,927 more variables: B8A <dbl>, B9 <dbl>, B11 <dbl>, B12 <dbl>, 400 <dbl>, 401 <dbl>, 402 <dbl>, 403 <dbl>,
#   404 <dbl>, 405 <dbl>, 406 <dbl>, 407 <dbl>, 408 <dbl>, 409 <dbl>, 410 <dbl>, 411 <dbl>, 412 <dbl>, 413 <dbl>, 414 <dbl>,
#   415 <dbl>, 416 <dbl>, 417 <dbl>, 418 <dbl>, 419 <dbl>, 420 <dbl>, 421 <dbl>, 422 <dbl>, 423 <dbl>, 424 <dbl>, 425 <dbl>,
#   426 <dbl>, 427 <dbl>, 428 <dbl>, 429 <dbl>, 430 <dbl>, 431 <dbl>, 432 <dbl>, 433 <dbl>, 434 <dbl>, 435 <dbl>, 436 <dbl>,
#   437 <dbl>, 438 <dbl>, 439 <dbl>, 440 <dbl>, 441 <dbl>, 442 <dbl>, 443 <dbl>, 444 <dbl>, 445 <dbl>, 446 <dbl>, 447 <dbl>,
#   448 <dbl>, 449 <dbl>, 450 <dbl>, 451 <dbl>, 452 <dbl>, 453 <dbl>, 454 <dbl>, 455 <dbl>, 456 <dbl>, 457 <dbl>, 458 <dbl>,
#   459 <dbl>, 460 <dbl>, 461 <dbl>, 462 <dbl>, 463 <dbl>, 464 <dbl>, 465 <dbl>, 466 <dbl>, 467 <dbl>, 468 <dbl>, ...,

colnames(merged)
   [1] "Sample_ID" "Corg"      "H2O"       "KCL"       "Arsenic"   "Phospate"  "FID"       "Longitude" "Latitude" 
  [10] "B1"        "B2"        "B3"        "B4"        "B5"        "B6"        "B7"        "B8"        "B8A"      
  [19] "B9"        "B11"       "B12"       "400"       "401"       "402"       "403"       "404"       "405"      
  [28] "406"       "407"       "408"       "409"       "410"       "411"       "412"       "413"       "414"      
  [37] "415"       "416"       "417"       "418"       "419"       "420"       "421"       "422"       "423"      
  [46] "424"       "425"       "426"       "427"       "428"       "429"       "430"       "431"       "432"

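And so on. https://drive.google.com/file/d/1xifstUBv6sqa8-c51ukRsw9oZyKtzK03/view?usp=sharing is the merged CSV used in this example. I have similar datasets with spectral transformations applied, and they behave the same way.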

Thanks in advance.

Answer 1

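It looks like randomForest's formula interface does not like numeric column names. A name such as 400 starts with a digit, so it is not a syntactically valid R name. Fix the column names and the model should run. Try this example: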

# example data
x <- mtcars[1:10, 1:3]
x[, "400"] <- x$disp

library(randomForest)
colnames(x)
# [1] "mpg"  "cyl"  "disp" "400" 

# as expected we get error:
randomForest(mpg ~ ., data = x)
# Error in eval(predvars, data, env) : object '400' not found

# now fix the column names
colnames(x) <- make.names(colnames(x))
colnames(x)
# [1] "mpg"  "cyl"  "disp" "X400"

randomForest(mpg ~ ., data = x)
# Call:
#   randomForest(formula = mpg ~ ., data = x) 
# Type of random forest: regression
# Number of trees: 500
# No. of variables tried at each split: 1
# 
# Mean of squared residuals: 5.218906
# % Var explained: 31.39
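
The same idea can be sketched against data shaped like the question's table. This is only an illustration under stated assumptions: `toy` is a made-up stand-in for `merged` with random values, and the object names `renamed`, `complete_rows`, `rf_formula` and `rf_xy` are hypothetical. The second option, passing `x` and `y` directly instead of a formula, is an alternative that should sidestep the name lookup entirely, but it is not something the original error message confirms, so treat it as a suggestion to try.

```r
library(randomForest)

# Toy stand-in for `merged` (hypothetical random values). check.names = FALSE
# keeps the non-syntactic names "400" and "401", as a tibble would.
set.seed(1)
toy <- data.frame(
  Arsenic = rnorm(30),
  B1      = runif(30),
  `400`   = runif(30),
  `401`   = runif(30),
  check.names = FALSE
)

# Option 1: make the names syntactic, then use the formula interface as before.
renamed <- toy
names(renamed) <- make.names(names(renamed), unique = TRUE)
names(renamed)
# [1] "Arsenic" "B1"      "X400"    "X401"
rf_formula <- randomForest(Arsenic ~ ., data = renamed,
                           importance = TRUE, na.action = na.omit)

# Option 2 (assumption: avoids the formula lookup): keep the original names
# and pass predictors and response separately. na.action is not used on this
# path, so incomplete rows are dropped by hand first.
complete_rows <- toy[complete.cases(toy), ]
rf_xy <- randomForest(
  x = complete_rows[, setdiff(names(complete_rows), "Arsenic")],
  y = complete_rows$Arsenic,
  importance = TRUE
)
```

For the data in the question, the first option would amount to running `names(merged) <- make.names(names(merged), unique = TRUE)` before the original `randomForest()` call. The wavelength columns then appear as X400, X401 and so on in the variable importance output.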

