R train, svmRadial "Cannot scale data"

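I am using R with the breastCancer data frame. I want to use the train function from the caret package, but it fails with the error below. The same function works when I use a different data frame.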

library(mlbench)
library(caret)

data("breastCancer")
BC = na.omit(breastCancer[,-1])
a = train(Class~., data = as.matrix(BC), method = "svmRadial")

This is the error:

error : In .local(x, ...) : Variable(s) `' constant. Cannot scale data.

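Your code contains some typos: for example, the package is named caret, not caren, and the dataset is named BreastCancer, not breastCancer. You can use the following code to get rid of the error: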

library(mlbench)
library(caret)

data(BreastCancer)
BC = na.omit(BreastCancer[,-1])
a = train(Class~., data = as.matrix(BC), method = "svmRadial")

It returns:

#> Support Vector Machines with Radial Basis Function Kernel 
#> 
#> 683 samples
#>   9 predictor
#>   2 classes: 'benign', 'malignant' 
#> 
#> No pre-processing
#> Resampling: Bootstrapped (25 reps) 
#> Summary of sample sizes: 683, 683, 683, 683, 683, 683, ... 
#> Resampling results across tuning parameters:
#> 
#>   C     Accuracy   Kappa    
#>   0.25  0.9550137  0.9034390
#>   0.50  0.9585504  0.9107666
#>   1.00  0.9611485  0.9161541
#> 
#> Tuning parameter 'sigma' was held constant at a value of 0.02349173
#> Accuracy was used to select the optimal model using the largest value.
#> The final values used for the model were sigma = 0.02349173 and C = 1.

We can start with the data you have:

library(mlbench)
library(caret)

data(BreastCancer)
BC = na.omit(BreastCancer[,-1])

str(BC)

'data.frame':   683 obs. of  10 variables:
 $ Cl.thickness   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
 $ Cell.size      : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
 $ Cell.shape     : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
 $ Marg.adhesion  : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
 $ Epith.c.size   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
 $ Bare.nuclei    : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
 $ Bl.cromatin    : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
 $ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
 $ Mitoses        : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
 $ Class          : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...

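BC is a data.frame, and you can see that all of the predictors are categorical or ordered factors. You are trying to fit svmRadial, that is, an SVM with a radial basis function kernel. Computing Euclidean distances between categorical features is not straightforward, and if you look at the distribution of the categories: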

sapply(BC,table)
$Cl.thickness

  1   2   3   4   5   6   7   8   9  10 
139  50 104  79 128  33  23  44  14  69 

$Cell.size

  1   2   3   4   5   6   7   8   9  10 
373  45  52  38  30  25  19  28   6  67 

$Cell.shape

  1   2   3   4   5   6   7   8   9  10 
346  58  53  43  32  29  30  27   7  58 

$Marg.adhesion

  1   2   3   4   5   6   7   8   9  10 
393  58  58  33  23  21  13  25   4  55 

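When you train the model, the default resampling method is the bootstrap, so some of your training resamples will miss the poorly represented levels, for example category 9 of Marg.adhesion in the table above. The column for that level is then all zeros (constant) in that resample, which is why the error is thrown. It most likely will not affect the overall result much, because those levels are rare.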
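To make the argument above concrete, here is a minimal base-R sketch. It uses the counts from the Marg.adhesion table (level 9 appears 4 times in 683 rows) and the 25 bootstrap repetitions reported in the output above. The `toy` data frame is made up for illustration, and the dummy coding via `model.matrix` is only an assumption about how the factor ends up as columns, not a trace of what caret does internally.

```r
# Counts from the Marg.adhesion table: level 9 appears 4 times in 683 rows.
n <- 683
k <- 4

# Probability that one bootstrap resample (n draws with replacement)
# contains none of the k rows; this is close to exp(-k).
p_miss <- (1 - k / n)^n

# Probability that at least one of 25 bootstrap resamples misses the level.
p_any <- 1 - (1 - p_miss)^25

# Toy illustration (made-up data): dummy-code a factor on the full data,
# then keep only the rows an unlucky resample would contain.
toy <- data.frame(f = factor(c(rep("1", 6), rep("2", 5), "9")))
X <- model.matrix(~ f, data = toy)
resample_rows <- which(toy$f != "9")          # the rare level is absent
X_res <- X[resample_rows, , drop = FALSE]

# Columns that take a single value in the resample cannot be scaled.
is_constant <- apply(X_res, 2, function(col) length(unique(col)) == 1)
constant_cols <- colnames(X_res)[is_constant]

print(c(p_miss = p_miss, p_any = p_any))
print(constant_cols)
```

The intercept column is always constant; the interesting entry is the column for the rare level. With many rare levels across nine predictors, the chance that some resample produces such a column is higher than for one level alone.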

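One solution is to use cross-validation (you are unlikely to select all of the rare observations into the test fold). Also note that when you have a data.frame with factors and characters, you should not convert it to a matrix with as.matrix(). caret can handle a data.frame like this: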

train(Class ~.,data=BC,method="svmRadial",trControl=trainControl(method="cv"))
Support Vector Machines with Radial Basis Function Kernel 

683 samples
  9 predictor
  2 classes: 'benign', 'malignant' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 614, 615, 615, 615, 616, 615, ... 
Resampling results across tuning parameters:

  C     Accuracy   Kappa    
  0.25  0.9575654  0.9101995
  0.50  0.9619346  0.9190284
  1.00  0.9633838  0.9220161

Tuning parameter 'sigma' was held constant at a value of 0.01841092
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.01841092 and C = 1.
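As a side note on the as.matrix() advice above, the following small sketch shows what the conversion does to factor columns. The `toy` data frame is made up; it only mimics the column types of BC (an ordered factor and an unordered factor).

```r
# Made-up data frame with the same kinds of columns as BC.
toy <- data.frame(
  size  = factor(c("1", "10", "3"), levels = c("1", "3", "10"), ordered = TRUE),
  class = factor(c("benign", "malignant", "benign"))
)

# as.matrix() on a data frame with non-numeric columns yields a character matrix.
m <- as.matrix(toy)

print(class(toy$size))  # still an ordered factor in the data frame
print(typeof(m))        # the matrix stores plain strings
print(m[, "size"])      # level ordering is no longer attached to the values
```

Passing the data.frame directly keeps the factor levels and their ordering available to the modelling code.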

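If you want to keep using the bootstrap for resampling, another option is to drop the observations that fall in these rare levels, or to merge those levels with others.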
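A minimal sketch of the merging option, in base R. The helper `lump_rare`, its `min_count` threshold, and the `"other"` label are hypothetical names chosen for illustration, and the threshold of 5 is arbitrary. Note that this sketch discards the ordering of ordered factors, which may or may not be acceptable for your analysis.

```r
# Hypothetical helper: merge every level seen fewer than min_count times
# into a single catch-all level.
lump_rare <- function(x, min_count = 10, other = "other") {
  x <- factor(as.character(x))               # drops ordering and unused levels
  counts <- table(x)
  rare <- names(counts)[counts < min_count]
  levels(x)[levels(x) %in% rare] <- other    # duplicate level names are merged
  x
}

# Made-up factor with two rare levels ("9" and "10").
x <- factor(c(rep("1", 20), rep("2", 15), rep("9", 2), rep("10", 3)))
y <- lump_rare(x, min_count = 5)
print(table(y))

# Possible use on the predictors of BC (assumes BC from the code above):
# preds <- setdiff(names(BC), "Class")
# BC[preds] <- lapply(BC[preds], lump_rare)
```

After merging, it is worth rerunning sapply(BC, table) to confirm that no level is left with only a handful of observations.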