R train, svmRadial "Cannot scale data"

Question

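I am using R with the breastCancer data frame. I want to use the train function from the caret package, but it fails with the error below. The same call works when I use a different data frame.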

library(mlbench)
library(caret)

data("breastCancer")
BC = na.omit(breastCancer[,-1])
a = train(Class~., data = as.matrix(BC), method = "svmRadial")

This is the error:

error : In .local(x, ...) : Variable(s) `' constant. Cannot scale data.

Answer 1

Your code contains a few typos: the package name is caret, not caren, and the dataset name is BreastCancer, not breastCancer. The following code ran without the error for me:

library(mlbench)
library(caret)

data(BreastCancer)
BC = na.omit(BreastCancer[,-1])
a = train(Class~., data = as.matrix(BC), method = "svmRadial")

It returns:

#> Support Vector Machines with Radial Basis Function Kernel 
#> 
#> 683 samples
#>   9 predictor
#>   2 classes: 'benign', 'malignant' 
#> 
#> No pre-processing
#> Resampling: Bootstrapped (25 reps) 
#> Summary of sample sizes: 683, 683, 683, 683, 683, 683, ... 
#> Resampling results across tuning parameters:
#> 
#>   C     Accuracy   Kappa    
#>   0.25  0.9550137  0.9034390
#>   0.50  0.9585504  0.9107666
#>   1.00  0.9611485  0.9161541
#> 
#> Tuning parameter 'sigma' was held constant at a value of 0.02349173
#> Accuracy was used to select the optimal model using the largest value.
#> The final values used for the model were sigma = 0.02349173 and C = 1.

Answer 2

We can start with the data you have:

library(mlbench)
library(caret)

data(BreastCancer)
BC = na.omit(BreastCancer[,-1])

str(BC)

'data.frame':   683 obs. of  10 variables:
 $ Cl.thickness   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
 $ Cell.size      : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
 $ Cell.shape     : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
 $ Marg.adhesion  : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
 $ Epith.c.size   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
 $ Bare.nuclei    : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
 $ Bl.cromatin    : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
 $ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
 $ Mitoses        : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
 $ Class          : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...

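BC is a data.frame, and you can see that all the predictors are either categorical or ordinal. You are trying to fit svmRadial, that is, an SVM with a radial basis function kernel. Computing Euclidean distances between categorical features is not straightforward, and the problem becomes clearer if you look at the distribution of the categories: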

sapply(BC,table)
$Cl.thickness

  1   2   3   4   5   6   7   8   9  10 
139  50 104  79 128  33  23  44  14  69 

$Cell.size

  1   2   3   4   5   6   7   8   9  10 
373  45  52  38  30  25  19  28   6  67 

$Cell.shape

  1   2   3   4   5   6   7   8   9  10 
346  58  53  43  32  29  30  27   7  58 

$Marg.adhesion

  1   2   3   4   5   6   7   8   9  10 
393  58  58  33  23  21  13  25   4  55

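When you train the model, the default resampling method is the bootstrap, so some of your resampled training sets will miss the sparsely represented levels, for example level 9 of Marg.adhesion in the table above. The variable encoding that level is then all zeros in that resample, which is what triggers the error. It most likely does not affect the overall result much, because these levels are rare.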
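The effect described above can be sketched in base R. This is only an illustration under stated assumptions: the toy factor and the hand-picked resample are made up, the factor is assumed to be unordered and expanded into 0/1 dummy columns, and the count of 4 is taken from the Marg.adhesion table above.

```r
# Chance that a single bootstrap resample of n rows contains none of the
# k rows carrying a rare level: (1 - k/n)^n.
n <- 683
k <- 4                                   # count of level 9 in the table above
p_miss_one <- (1 - k / n)^n
# Chance that at least one of 25 resamples (caret's default) misses it.
p_miss_any <- 1 - (1 - p_miss_one)^25

# Toy data (hypothetical): level "9" appears only once, in row 7.
toy <- data.frame(f = factor(c("1", "1", "1", "2", "2", "2", "9")))

# A hand-picked resample that happens to skip row 7.
resample <- toy[c(1, 2, 3, 4, 5, 6, 1), , drop = FALSE]

# Subsetting keeps the unused level, so its dummy column is all zeros.
X <- model.matrix(~ f, data = resample)[, -1, drop = FALSE]  # drop the intercept
constant <- apply(X, 2, function(col) length(unique(col)) == 1)
print(constant)
```

A column flagged as constant here has zero variance within the resample, so it cannot be centered and scaled, which matches the wording of the error message.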

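One solution is to use cross-validation (it is unlikely that all the rare observations end up in the test fold). Note that when you have a data.frame with factors and characters, you should not convert it to a matrix with as.matrix(). caret can handle a data.frame directly: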

train(Class ~.,data=BC,method="svmRadial",trControl=trainControl(method="cv"))
Support Vector Machines with Radial Basis Function Kernel 

683 samples
  9 predictor
  2 classes: 'benign', 'malignant' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 614, 615, 615, 615, 616, 615, ... 
Resampling results across tuning parameters:

  C     Accuracy   Kappa    
  0.25  0.9575654  0.9101995
  0.50  0.9619346  0.9190284
  1.00  0.9633838  0.9220161

Tuning parameter 'sigma' was held constant at a value of 0.01841092
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.01841092 and C = 1.

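If you want to keep using the bootstrap for resampling, another option is to leave out the observations in these sparse levels, or to merge them with other levels.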
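Merging sparse levels could look roughly like the sketch below. The helper name lump_rare, the threshold min_n, and the label "other" are all hypothetical choices made for illustration; they are not part of caret or of the original answer.

```r
# Hypothetical helper: merge every level with fewer than min_n observations
# into a single level called `other`.
lump_rare <- function(f, min_n = 10, other = "other") {
  f <- as.factor(f)
  counts <- table(f)
  rare <- names(counts)[counts < min_n]
  # Assigning the same name to several levels merges them into one.
  levels(f)[levels(f) %in% rare] <- other
  f
}

# Made-up example: levels "7" and "9" are sparse and get merged.
f <- factor(c(rep("1", 20), rep("2", 15), rep("7", 3), rep("9", 2)))
g <- lump_rare(f, min_n = 10)
print(table(g))
```

Applied to a predictor before calling train, this trades some detail in the rare categories for levels that are much less likely to vanish from a bootstrap resample.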
