Error with caret and summaryFunction mnLogLoss: columns consistent with 'lev'

I am trying to use log loss as the loss function when training with caret, using data from the Kaggle Kobe Bryant shot selection competition.

This is my script:

library(caret)
data <- read.csv("./data.csv")

data$shot_made_flag <- factor(data$shot_made_flag)
data$team_id <- NULL
data$team_name <- NULL

train_data_kaggle <- data[!is.na(data$shot_made_flag),]
test_data_kaggle <- data[is.na(data$shot_made_flag),]

inTrain <- createDataPartition(y=train_data_kaggle$shot_made_flag,p=.8,list=FALSE)
train <- train_data_kaggle[inTrain,]
test <- train_data_kaggle[-inTrain,]

folds <- createFolds(train$shot_made_flag, k = 10)

ctrl <- trainControl(method = "repeatedcv", index = folds, repeats = 3, summaryFunction = mnLogLoss)
res <- train(shot_made_flag~., data = train, method = "gbm", preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss", verbose = FALSE)

This is the traceback of the error:

7: stop("'data' should have columns consistent with 'lev'")
6: ctrl$summaryFunction(testOutput, lev, method)
5: evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels, 
       metric = metric, method = method)
4: train.default(x, y, weights = w, ...)
3: train(x, y, weights = w, ...)
2: train.formula(shot_made_flag ~ ., data = train, method = "gbm", 
       preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss", 
       verbose = FALSE)
1: train(shot_made_flag ~ ., data = train, method = "gbm", preProc = c("zv", 
       "center", "scale"), trControl = ctrl, metric = "logLoss", 
       verbose = FALSE)

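When I use defaultSummary as the summaryFunction and do not specify a metric in train, it works, but it does not work with mnLogLoss. I guess it expects the data in a different format from the one I am passing, but I cannot find where the error is.

From the help file for defaultSummary: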

To use twoClassSummary and/or mnLogLoss, the classProbs argument of trainControl should be TRUE. multiClassSummary can be used without class probabilities but some statistics (e.g. overall log loss and the average of per-class area under the ROC curves) will not be in the result set.
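To see why the probabilities matter, mnLogLoss can be called directly on a small data frame. The sketch below uses made-up values (the `with_probs` and `no_probs` frames and the "made"/"miss" labels are illustrative, not taken from the Kaggle data) and assumes caret is installed. It suggests that the function looks for one probability column per class level, and that the "columns consistent with 'lev'" error appears when those columns are absent.

```r
library(caret)

# Made-up resampling output: observed classes, predicted classes, and one
# probability column per class level (the columns classProbs = TRUE provides).
lev <- c("made", "miss")
with_probs <- data.frame(
  obs  = factor(c("made", "miss", "made", "miss"), levels = lev),
  pred = factor(c("made", "miss", "miss", "miss"), levels = lev),
  made = c(0.8, 0.3, 0.4, 0.1),
  miss = c(0.2, 0.7, 0.6, 0.9)
)

# Every level in lev has a matching probability column, so this call succeeds.
ll <- mnLogLoss(with_probs, lev = lev)
print(ll)

# Drop the probability columns and capture the error message instead of stopping.
no_probs <- with_probs[, c("obs", "pred")]
msg <- tryCatch(
  mnLogLoss(no_probs, lev = lev),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Without classProbs = TRUE, the data frame that train passes to the summary function resembles `no_probs`, which would explain the traceback in the question.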

So I think you need to change your trainControl() call to the following:

ctrl <- trainControl(method = "repeatedcv", index = folds, repeats = 3, summaryFunction = mnLogLoss, classProbs = TRUE)

If you do that and run your code, you will get the following error:

Error: At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to  X0, X1 . Please use factor levels that can be used as valid R variable names  (see ?make.names for help).

You just need to change the 0/1 levels of shot_made_flag to something that is a valid R variable name:

data$shot_made_flag <- ifelse(data$shot_made_flag == 0, "miss", "made")
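As a quick check of which levels caret would object to, make.names() can be compared against the levels themselves. The sketch below uses a made-up `flag` vector (an assumption standing in for shot_made_flag) and also shows an alternative relabelling through factor() with explicit labels. Missing values stay missing either way, so the later split on is.na() should be unaffected.

```r
# Made-up 0/1 outcome with a missing value, standing in for shot_made_flag.
flag <- c(0, 1, NA, 1, 0)

# make.names() rewrites invalid names, so a level is valid only if it
# comes back unchanged.
lv <- levels(factor(flag))
valid <- make.names(lv) == lv
print(data.frame(level = lv, converted = make.names(lv), valid = valid))

# Alternative to ifelse(): relabel while building the factor.
# NA entries remain NA.
relabelled <- factor(flag, levels = c(0, 1), labels = c("miss", "made"))
print(relabelled)
```

Note that this variant puts "miss" first in the level order, whereas ifelse() followed by factor() sorts the levels alphabetically ("made", then "miss"). That should not matter for log loss, but it can matter for summaries that treat the first level as the event of interest.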

With the above changes, your code will look like this:

library(caret)
data <- read.csv("./data.csv") 

data$shot_made_flag <- ifelse(data$shot_made_flag == 0, "miss", "made")
data$shot_made_flag <- factor(data$shot_made_flag)
data$team_id <- NULL
data$team_name <- NULL

train_data_kaggle <- data[!is.na(data$shot_made_flag),]
test_data_kaggle <- data[is.na(data$shot_made_flag),]

inTrain <- createDataPartition(y=train_data_kaggle$shot_made_flag,p=.8,list=FALSE)
train <- train_data_kaggle[inTrain,]
test <- train_data_kaggle[-inTrain,]

# index expects the training rows of each resample, so request them explicitly
folds <- createFolds(train$shot_made_flag, k = 3, returnTrain = TRUE)

ctrl <- trainControl(method = "repeatedcv", classProbs = TRUE, index = folds, repeats = 3, summaryFunction = mnLogLoss)
res <- train(shot_made_flag~., data = train, method = "gbm", preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss", verbose = FALSE)
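
One detail that may be worth checking when passing folds to trainControl(): the index argument takes the rows used for training in each resample, while createFolds() returns the held-out rows unless returnTrain = TRUE. A small sketch of the difference on made-up data (the vector `y` is an assumption, not the Kaggle outcome):

```r
library(caret)

set.seed(1)
# Made-up balanced outcome with 100 rows.
y <- factor(rep(c("made", "miss"), each = 50))

# Default: each list element holds the rows held out in that fold.
held_out <- createFolds(y, k = 5)

# returnTrain = TRUE: each list element holds the rows used for training.
training <- createFolds(y, k = 5, returnTrain = TRUE)

# Compare how many rows each element contains.
print(lengths(held_out))
print(lengths(training))

# Training rows can also be derived from held-out folds by complement.
derived <- lapply(held_out, function(i) setdiff(seq_along(y), i))
print(lengths(derived))
```

If held-out folds are passed as index, each model would be fit on the small part and evaluated on the rest, which is usually not what is intended.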