Tidy predictions and confusion matrix with glmnet

Question

Consider this example:

library(quanteda)
library(caret)
library(glmnet)
library(dplyr)

# tibble() replaces the deprecated data_frame(); dplyr re-exports it
dtrain <- tibble(text = c("Chinese Beijing Chinese",
                          "Chinese Chinese Shanghai",
                          "Chinese Macao",
                          "Tokyo Japan Chinese"),
                 doc_id = 1:4,
                 class = c("Y", "Y", "Y", "N"))

# now we make the dataframe bigger 
dtrain <- purrr::map_df(seq_len(100), function(x) dtrain)

Let's create a sparse document-term matrix and run glmnet on it:

> dtrain <- dtrain %>% mutate(class = as.factor(class))
> mycorpus <- corpus(dtrain,  text_field = 'text')
> trainingdf <- dfm(mycorpus)
> trainingdf
Document-feature matrix of: 400 documents, 6 features (62.5% sparse).

Now we finally turn to the lasso model:

mymodel <- cv.glmnet(x = trainingdf, y =dtrain$class, 
                     type.measure ='class',
                     nfolds = 3,
                     alpha = 1,
                     parallel = FALSE,
                     family = 'binomial')

I have two simple questions.

How can I add the predictions to the original dtrain data? The output of

# call the predict() generic; it dispatches to the cv.glmnet method
mypred <- predict(mymodel, newx = trainingdf,
                  s = 'lambda.min', type = 'class')

looks very untidy:

> mypred
    1  
1   "Y"
2   "Y"
3   "Y"

How can I use caret::confusionMatrix in this setting? Simply running the following produces an error:

confusion <- caret::confusionMatrix(data = mypred,
                                    reference = dtrain$class)
Error: `data` and `reference` should be factors with the same levels.

Thanks!

Answer 1

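In every classification model, the target variable needs to be of class factor.

For example:

my_data is the dataset you train the model on, and my_target is the variable being predicted.

Note that as.factor(my_data$my_target) automatically finds the right levels for you.

That is, you do not need to specify the levels by hand; R works them out from the values it sees.

Look at the difference when we print target: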

target <- c("y", "n", "y", "n")
target
#[1] "y" "n" "y" "n" # this is a simple char
as.factor(target)
# [1] y n y n
# Levels: n y # this is a correct format, a factor with levels

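This matters because even if your predictions (or test data) contain only one of the two classes in target, a factor created with the full set of levels still records that the other class exists.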
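A minimal sketch of that point, using a made-up vector pred_chr (not from the question) in which only one class happens to appear. It suggests that as.factor() can only infer the levels it sees, while factor() with explicit levels keeps both:

# Hypothetical predictions that happen to contain a single class
pred_chr <- c("Y", "Y", "Y", "Y")

# as.factor() infers levels from the values present, so "N" is missing
lv_auto <- levels(as.factor(pred_chr))

# Supplying the levels explicitly keeps both classes
pred_fct <- factor(pred_chr, levels = c("N", "Y"))
lv_full <- levels(pred_fct)

# table() then reports the unused class with a zero count
counts <- table(pred_fct)
print(lv_auto)
print(lv_full)
print(counts)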

You can of course set them yourself:

my_pred <- factor(mypred, levels = c("Y", "N"))
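As a rough illustration of why the conversion helps, the sketch below uses stand-in objects (mypred_demo and reference are invented here, shaped like the one-column character matrix shown in the question and like dtrain$class). It takes the levels from the reference factor rather than typing them out, so both factors share the same levels in the same order, and then builds a plain cross-tabulation in base R:

# Stand-ins: a factor of true classes and a 1-column character matrix
# shaped like the predict(..., type = 'class') output in the question
reference <- factor(c("Y", "Y", "Y", "N"))
mypred_demo <- matrix(c("Y", "Y", "Y", "Y"), ncol = 1,
                      dimnames = list(as.character(1:4), "1"))

# Drop the matrix shape and reuse the reference levels
pred <- factor(as.vector(mypred_demo), levels = levels(reference))

# Base R cross-tabulation of predicted against actual classes
tab <- table(predicted = pred, actual = reference)
print(tab)

# With caret installed, the same two factors could be passed on:
# caret::confusionMatrix(data = pred, reference = reference)

Reusing levels(reference) is a design choice of this sketch, not something the original answer prescribes; it simply avoids a mismatch in level order between the two factors.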

To add them to your data, you can use

my_data$newpred <- my_pred

or

library(dplyr)
# assign the result back, otherwise the new column is only printed
my_data <- my_data %>% mutate(newpred = my_pred)
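For the tidy-output part of the question, here is a small base R sketch with invented stand-ins (dtrain_demo and mypred_demo are not the question's objects). It assumes the prediction matrix has one row per row of the data, in the same order, which is how it appears in the question; as.vector() strips the matrix shape so the new column is an ordinary factor rather than a matrix column:

# Stand-in data frame and prediction matrix with matching row order
dtrain_demo <- data.frame(
  doc_id = 1:4,
  class = factor(c("Y", "Y", "Y", "N"))
)
mypred_demo <- matrix(c("Y", "Y", "Y", "Y"), ncol = 1)

# Attach the predictions as a plain factor column
dtrain_demo$pred <- factor(as.vector(mypred_demo),
                           levels = levels(dtrain_demo$class))

# Share of rows where the prediction matches the true class
accuracy <- mean(dtrain_demo$pred == dtrain_demo$class)
print(dtrain_demo)
print(accuracy)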
