Good example of modeling an SVM string kernel in caret?

I am trying to model an SVM string kernel with caret, using the following datasets:

library(caret)
library(mlbench)
library(dplyr)
data("HouseVotes84")
dummy_data_classif <- HouseVotes84[,2:length(colnames(HouseVotes84))] %>% 
  mutate_if(is.factor, as.numeric)
dummy_data_classif <- data.frame(cbind(Class=HouseVotes84[,1], dummy_data_classif))
dummy_data_classif[is.na(dummy_data_classif)] <- 0
dummy_data_classif <- as.matrix(dummy_data_classif)
dummy_y_classif <- as.matrix(dummy_data_classif[,which(colnames(dummy_data_classif) == "Class")])
colnames(dummy_y_classif) <- "Class"
dummy_x_classif <- dummy_data_classif[,-which(colnames(dummy_data_classif) == "Class")]

data("cars") #available from caret
dummy_data_regr <- cars
dummy_data_regr <- dummy_data_regr %>%
  mutate_if(is.numeric, as.character)
dummy_data_regr <- dummy_data_regr %>%
  mutate_if(is.integer, as.character)
dummy_data_regr <- as.matrix(dummy_data_regr)
dummy_y_regr <- as.matrix(dummy_data_regr[,which(colnames(dummy_data_regr) == "Price")])
colnames(dummy_y_regr) <- "Price"
dummy_x_regr <- dummy_data_regr[,-which(colnames(dummy_data_regr) == "Price")]

With this resampling setup:

resampling <- trainControl(method = "cv",
                               number = 5,
                               allowParallel = FALSE) 

I tried to test them with three methods: svmBoundrangeString, svmExpoString and svmSpectrumString.

test_method <- c("svmBoundrangeString", "svmExpoString", "svmSpectrumString")
model_reg <- caret::train(x=dummy_x_regr,
                      y=dummy_y_regr, 
                      data = dummy_data, 
                      method = test_method[1], 
                      trControl = resampling)

model_cls <- caret::train(x=dummy_x_classif,
                      y=dummy_y_classif, 
                      data = dummy_data, 
                      method = test_method[1], 
                      trControl = resampling)

But this does not work. The metrics are missing whenever I try any of these methods:

Something is wrong; all the Accuracy metric values are missing

 Accuracy       Kappa    
 Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA  
 Median : NA   Median : NA  
 Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA   Max.   : NA  
 NA's   :9     NA's   :9  

What should I do to make this work? Or do these methods perhaps require a specific kind of data frame?

These three methods are string kernel based. I am not sure how they can be used for regression, but for classification the text itself is the independent variable. With kernlab you provide it as a list (see the kernlab vignette too):

library(kernlab)
data(reuters)

head(reuters[1:2])
[[1]]
[1] "Computer Terminal Systems Inc said \nit has completed the sale of 200,000 shares of its common \nstock, and warrants to acquire an additional one mln shares, to \n<Sedio N.V.> of Lugano, Switzerland for 50,000 dlrs. \n    The company said the warrants are exercisable for five \nyears at a purchase price of .125 dlrs per share. \n    Computer Terminal said Sedio also has the right to buy \nadditional shares and increase its total holdings up to 40 pct \nof the Computer Terminal's outstanding common stock under \ncertain circumstances involving change of control at the \ncompany. \n    The company said if the conditions occur the warrants would \nbe exercisable at a price equal to 75 pct of its common stock's \nmarket price at the time, not to exceed 1.50 dlrs per share. \n    Computer Terminal also said it sold the technolgy rights to \nits Dot Matrix impact technology, including any future \nimprovements, to <Woodco Inc> of Houston, Tex. for 200,000 \ndlrs. But, it said it would continue to be the exclusive \nworldwide licensee of the technology for Woodco. \n    The company said the moves were part of its reorganization \nplan and would help pay current operation costs and ensure \nproduct delivery. \n    Computer Terminal makes computer generated labels, forms, \ntags and ticket printers and terminals. \n Reuter"

[[2]]
[1] "Ohio Mattress Co said its first \nquarter, ending February 28, profits may be below the 2.4 mln \ndlrs, or 15 cts a share, earned in the first quarter of fiscal \n1986. \n    The company said any decline would be due to expenses \nrelated to the acquisitions in the middle of the current \nquarter of seven licensees of Sealy Inc, as well as 82 pct of \nthe outstanding capital stock of Sealy. \n    Because of these acquisitions, it said, first quarter sales \nwill be substantially higher than last year's 67.1 mln dlrs. \n    Noting that it typically reports first quarter results in \nlate march, said the report is likely to be issued in early \nApril this year. \n    It said the delay is due to administrative considerations, \nincluding conducting appraisals, in connection with the \nacquisitions. \n Reuter"

 str(rlabels)
 Factor w/ 2 levels "acq","crude": 1 1 1 1 1 1 1 1 1 1 ...

mdl <- ksvm(reuters,rlabels,kernel="stringdot",kpar=list(length=5,type = "boundrange"),C=3)
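To make the idea of a string kernel more concrete, here is a minimal sketch that builds the same kind of kernel with kernlab's `stringdot()` and applies it to pairs of raw strings. The three short documents are made up for illustration and are not taken from `reuters`; `length = 5` and `type = "boundrange"` mirror the `kpar` used above, while `normalized = TRUE` is an assumption chosen so that a document compared with itself scores 1.

```r
library(kernlab)

# Bounded-range string kernel: compares documents through the substrings
# of up to `length` characters that they share.
sk <- stringdot(type = "boundrange", length = 5, normalized = TRUE)

# Made-up toy documents (illustrative only, not part of the reuters data).
doc_a <- "The company said it completed the sale of its common stock."
doc_b <- "The company said first quarter profits may be lower."
doc_c <- "Crude oil prices rose sharply after the production cut."

# The kernel object is itself a function of two strings.
sim_aa <- as.numeric(sk(doc_a, doc_a))
sim_ab <- as.numeric(sk(doc_a, doc_b))
sim_ac <- as.numeric(sk(doc_a, doc_c))

print(c(aa = sim_aa, ab = sim_ab, ac = sim_ac))
```

This is why the input has to be text: the kernel works on the characters of each document, so a matrix of numeric codes such as the recoded vote columns gives it nothing meaningful to compare.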

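Now, if you use caret for this, you can see how it is called with getModelInfo("svmBoundrangeString"). Basically, you supply the independent variable as a one-column matrix with a column name (I use cbind below):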

mdl = train(x=cbind(reuters=reuters),y=rlabels,
method="svmBoundrangeString",trControl=trainControl(method="cv"))

Support Vector Machines with Boundrange String Kernel 

40 samples
 1 predictor
 2 classes: 'acq', 'crude' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 36, 36, 36, 36, 36, 36, ... 
Resampling results across tuning parameters:

  length  C     Accuracy  Kappa
  2       0.25  0.775     0.55 
  2       0.50  0.775     0.55 
  2       1.00  0.775     0.55 
  3       0.25  0.800     0.60 
  3       0.50  0.800     0.60 
  3       1.00  0.800     0.60 
  4       0.25  0.825     0.65 
  4       0.50  0.825     0.65 
  4       1.00  0.825     0.65 

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were length = 4 and C = 0.25.
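For your own data, the same pattern could be sketched as follows. `as_text_matrix()` is a hypothetical helper (it is not part of caret or kernlab) that wraps a character vector into a one-column list matrix, the same shape as `cbind(reuters = reuters)` above; the toy `texts` are made up. The second half only inspects what caret registers for the method, and is guarded so it is skipped when caret is not installed. As far as I can tell, the calls in the question fail because `x` is a multi-column matrix of recoded values rather than a single column of documents, and because `data = dummy_data` refers to an object that was never created.

```r
# Hypothetical helper (not from caret or kernlab): wrap a character vector
# as a one-column list matrix, like cbind(reuters = reuters).
as_text_matrix <- function(texts, name = "text") {
  stopifnot(is.character(texts), length(texts) > 0)
  x <- cbind(as.list(texts))  # n x 1 matrix whose cells are strings
  colnames(x) <- name
  x
}

# Made-up toy documents, only to show the resulting shape.
texts <- c("first toy document", "second toy document", "third toy document")
x <- as_text_matrix(texts)

print(dim(x))           # rows = documents, one column
print(colnames(x))
print(is.list(x[, 1]))  # the single column is a list of strings

# Inspect how caret wires this method to kernlab (skipped without caret).
if (requireNamespace("caret", quietly = TRUE)) {
  info <- caret::getModelInfo("svmBoundrangeString", regex = FALSE)[[1]]
  print(info$type)        # problem types registered for this method
  print(info$parameters)  # tuning parameters (length and C in the grid above)
  print(info$fit)         # the function caret uses to call kernlab
}

# With a factor `labels` of the same length as `texts` (an assumption here),
# the call would then follow the reuters example:
# fit <- caret::train(x = x, y = labels, method = "svmBoundrangeString",
#                     trControl = caret::trainControl(method = "cv", number = 5))
```

Note that no `data` argument is passed when `x` and `y` are given directly.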