Error when trying to pass custom metric in Caret package

I have a dataset like this:

> head(training_data)
  year     month channelGrouping visitStartTime visitNumber timeSinceLastVisit browser
1 2016   October          Social     1477775021           1                  0  Chrome
2 2016 September          Social     1473037945           1                  0  Safari
3 2017      July  Organic Search     1500305542           1                  0  Chrome
4 2017      July  Organic Search     1500322111           2              16569  Chrome
5 2016    August          Social     1471890172           1                  0  Safari
6 2017       May          Direct     1495146428           1                  0  Chrome         
  operatingSystem isMobile continent     subContinent       country      source   medium
1         Windows        0  Americas    South America        Brazil youtube.com referral
2       Macintosh        0  Americas Northern America United States youtube.com referral
3         Windows        0  Americas Northern America        Canada      google  organic
4         Windows        0  Americas Northern America        Canada      google  organic
5       Macintosh        0    Africa   Eastern Africa        Zambia youtube.com referral
6         Android        1  Americas Northern America United States    (direct)         
  isTrueDirect hits pageviews positiveTransaction
1            0    1         1                  No
2            0    1         1                  No
3            0    5         5                  No
4            1    3         3                  No
5            0    1         1                  No
6            1    6         6                  No

> str(training_data)
'data.frame':   1000 obs. of  18 variables:
 $ year               : int  2016 2016 2017 2017 2016 2017 2016 2017 2017 2016 ...
 $ month              : Factor w/ 12 levels "January","February",..: 10 9 7 7 8 5 10 3 3 12 ...
 $ channelGrouping    : chr  "Social" "Social" "Organic Search" "Organic Search" ...
 $ visitStartTime     : int  1477775021 1473037945 1500305542 1500322111 1471890172 1495146428 1476003570 1488556031 1490323225 1480696262 ...
 $ visitNumber        : int  1 1 1 2 1 1 1 1 1 1 ...
 $ timeSinceLastVisit : int  0 0 0 16569 0 0 0 0 0 0 ...
 $ browser            : chr  "Chrome" "Safari" "Chrome" "Chrome" ...
 $ operatingSystem    : chr  "Windows" "Macintosh" "Windows" "Windows" ...
 $ isMobile           : int  0 0 0 0 0 1 0 1 0 0 ...
 $ continent          : Factor w/ 5 levels "Africa","Americas",..: 2 2 2 2 1 2 3 3 2 4 ...
 $ subContinent       : chr  "South America" "Northern America" "Northern America" "Northern America" ...
 $ country            : chr  "Brazil" "United States" "Canada" "Canada" ...
 $ source             : chr  "youtube.com" "youtube.com" "google" "google" ...
 $ medium             : chr  "referral" "referral" "organic" "organic" ...
 $ isTrueDirect       : int  0 0 0 1 0 1 0 0 0 0 ...
 $ hits               : int  1 1 5 3 1 6 1 1 2 1 ...
 $ pageviews          : int  1 1 5 3 1 6 1 1 2 1 ...
 $ positiveTransaction: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...

Then I define my custom RMSLE function using the Metrics package:

rmsleMetric <- function(data, lev = NULL, model = NULL){
    out <- Metrics::rmsle(data$obs, data$pred)
    names(out) <- c("rmsle")
    return (out)
}

Then I define the trainControl:

tc <- trainControl(method = "repeatedcv",
   number = 5,
   repeats = 5,
   summaryFunction = rmsleMetric,
   classProbs = TRUE)

My tuning grid:

tg <- expand.grid(alpha = 0, lambda = seq(0, 1, by = 0.1))

Finally, my model:

penalizedLogit_ridge <- train(positiveTransaction ~ .,
    data = training_data,
    metric="rmsle",
    method = "glmnet",
    family = "binomial",
    trControl = tc,
    tuneGrid = tg
)

When I try to run the command above, I get this error:

Something is wrong; all the rmsle metric values are missing:
     rmsle
 Min.   : NA
 1st Qu.: NA
 Median : NA
 Mean   :NaN
 3rd Qu.: NA
 Max.   : NA
 NA's   :11
Error: Stopping
In addition: There were 50 or more warnings (use warnings() to see the first 50)

Looking at the warnings, I find:

1: In Ops.factor(1, actual) : ‘+’ not meaningful for factors
2: In Ops.factor(1, predicted) : ‘+’ not meaningful for factors

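repeated 25 times.

If I change the metric to AUC by using prSummary as my summary function, the same setup works fine, so I don't think there is anything wrong with my data.

So I believe my function is wrong, but I don't know how to find out what is wrong with it.

Any help is highly appreciated.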

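Your custom metric is not defined correctly. If you use classProbs = TRUE and savePredictions = "final" with trainControl, you will see that there are two columns named after your target classes which hold the predicted probabilities, while the data$pred column holds the predicted class. Both data$obs and data$pred are factors, so they cannot be used directly to calculate the desired metric.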
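To make this concrete, here is a minimal base R sketch of the kind of data frame a summary function might receive when classProbs = TRUE. The data frame `mock` and its values are invented for illustration; only the column layout (obs, pred, and one probability column per class level) follows the description above. It also reproduces the source of the warnings in the question: adding 1 to a factor, as in Ops.factor(1, actual), yields nothing but NA.

```r
# Hypothetical mock of the data passed to a summary function when
# classProbs = TRUE: obs and pred are factors, plus one probability
# column per class level. All values are made up.
mock <- data.frame(
  obs  = factor(c("No", "No", "Yes", "No"), levels = c("No", "Yes")),
  pred = factor(c("No", "No", "No", "Yes"), levels = c("No", "Yes")),
  No   = c(0.9, 0.8, 0.6, 0.3),
  Yes  = c(0.1, 0.2, 0.4, 0.7)
)

# Arithmetic on a factor is not defined: R warns that '+' is not
# meaningful for factors and returns NA for every element.
shifted <- suppressWarnings(1 + mock$obs)
print(shifted)

# The probability columns are plain numerics, so a metric can use them.
first_class_prob <- mock[, levels(mock$obs)[1]]
print(is.numeric(first_class_prob))
```

Since every fold produces NA this way, all resampled metric values are missing and train stops.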

A correct way to define the function is to get the possible levels and use them to extract the probabilities for one of the classes:

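rmsleMetric <- function(data, lev = NULL, model = NULL){
  # Class levels of the outcome; the probability columns are named after them
  lvls <- levels(data$obs)
  # Encode obs as 1 for the first class and 0 otherwise, then compare it
  # with the predicted probability of that same first class
  out <- Metrics::rmsle(ifelse(data$obs == lvls[1], 1, 0),
                        data[, lvls[1]])
  names(out) <- "rmsle"
  out
}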
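Before handing such a function to trainControl, it can help to call it directly on a small hand-made data frame. The sketch below does that in base R only. `rmsle_base` is a hypothetical stand-in for Metrics::rmsle, assuming the usual definition (root mean squared difference of log(1 + x)), and `rmsle_metric_sketch` mirrors the summary function above; the mock values are invented.

```r
# Hypothetical base R stand-in for Metrics::rmsle, assuming the usual
# definition: root mean squared difference of log(1 + x).
rmsle_base <- function(actual, predicted) {
  sqrt(mean((log1p(actual) - log1p(predicted))^2))
}

# Mirrors the summary function above, but uses the stand-in metric.
rmsle_metric_sketch <- function(data, lev = NULL, model = NULL) {
  lvls <- levels(data$obs)
  actual <- ifelse(data$obs == lvls[1], 1, 0)
  out <- rmsle_base(actual, data[, lvls[1]])
  names(out) <- "rmsle"
  out
}

# Made-up predictions in the layout described above.
mock <- data.frame(
  obs  = factor(c("No", "No", "Yes", "No"), levels = c("No", "Yes")),
  pred = factor(c("No", "No", "No", "Yes"), levels = c("No", "Yes")),
  No   = c(0.9, 0.8, 0.6, 0.3),
  Yes  = c(0.1, 0.2, 0.4, 0.7)
)

result <- rmsle_metric_sketch(mock, lev = levels(mock$obs))
print(result)

# Probabilities that match the observed classes exactly give zero error.
perfect <- transform(mock, No = ifelse(obs == "No", 1, 0))
perfect_result <- rmsle_metric_sketch(perfect)
print(perfect_result)
```

A named, finite, single number is what train needs in order to find the metric requested through its metric argument.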

To check that it works:

library(caret)
library(mlbench)
data(Sonar)
tc <- trainControl(method = "repeatedcv",
                   number = 2,
                   repeats = 2,
                   summaryFunction = rmsleMetric,
                   classProbs = TRUE,
                   savePredictions = "final")
tg <- expand.grid(alpha = 0, lambda = seq(0, 1, by = 0.1))

penalizedLogit_ridge <- train(Class ~ .,
                              data = Sonar,
                              metric="rmsle",
                              method = "glmnet",
                              family = "binomial",
                              trControl = tc,
                              tuneGrid = tg)

#output
glmnet 

208 samples
 60 predictor
  2 classes: 'M', 'R' 

No pre-processing
Resampling: Cross-Validated (2 fold, repeated 2 times) 
Summary of sample sizes: 105, 103, 104, 104 
Resampling results across tuning parameters:

  lambda  rmsle    
  0.0     0.2835407
  0.1     0.2753197
  0.2     0.2768288
  0.3     0.2797847
  0.4     0.2827953
  0.5     0.2856088
  0.6     0.2881894
  0.7     0.2905501
  0.8     0.2927171
  0.9     0.2947169
  1.0     0.2965505

Tuning parameter 'alpha' was held constant at a value of 0
rmsle was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0 and lambda = 1.

You can look at caret::twoClassSummary; it is defined in a very similar way.
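One thing to watch in the output above: it says the optimal model was selected using the largest value, which for an error measure such as RMSLE picks the worst candidate. train has a maximize argument, and passing maximize = FALSE should make it select the smallest value instead. The base R sketch below does not rerun the model. It only re-reads the resampling table printed above (the name `results` is made up here) to show how the two selection rules differ.

```r
# Resampling results copied from the printed output above.
results <- data.frame(
  lambda = seq(0, 1, by = 0.1),
  rmsle  = c(0.2835407, 0.2753197, 0.2768288, 0.2797847, 0.2827953,
             0.2856088, 0.2881894, 0.2905501, 0.2927171, 0.2947169,
             0.2965505)
)

# Selecting by the largest value reproduces the lambda reported above.
picked_by_max <- results$lambda[which.max(results$rmsle)]

# Selecting by the smallest value is what an error measure calls for.
picked_by_min <- results$lambda[which.min(results$rmsle)]

print(c(largest = picked_by_max, smallest = picked_by_min))
```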