如何将朴素贝叶斯模型应用于新数据

Question

我今天早上问了一个关于这个的问题，但我删除了它并在此处发布了更好的措辞。

我使用训练和测试数据创建了我的第一个机器学习模型。我返回了一个混淆矩阵并看到了一些摘要统计信息。

我现在想将模型应用于新数据以进行预测，但我不知道如何做。

上下文：预测每月 "churn" 取消。目标变量是 "churned"，它有两个可能的标签 "churned" 和 "not churned".

    head(tdata)
  months_subscription nvk_medium                                org_type     churned
1                  25       none                               Community not churned
2                   7       none                            Sports clubs not churned
3                  28       none                            Sports clubs not churned
4                  18    unknown Religious congregations and communities not churned
5                  15       none              Association - Professional not churned
6                   9       none              Association - Professional not churned

这是我的训练和测试：

 library("klaR")
 library("caret")

# import data
test_data_imp <- read.csv("tdata.csv")

# subset only required vars
# had to remove "revenue" since all churned records are 0 (need last price point)
variables <- c("months_subscription", "nvk_medium", "org_type", "churned")
tdata <- test_data_imp[variables]

#training
rn_train <- sample(nrow(tdata),
                   floor(nrow(tdata)*0.75))
train <- tdata[rn_train,]
test <- tdata[-rn_train,]
model <- NaiveBayes(churned ~., data=train)

# testing
predictions <- predict(model, test)
confusionMatrix(test$churned, predictions$class)

到目前为止一切正常。

现在我有了新的数据，结构和布局与上面的 tdata 相同。我如何将我的模型应用到这个新数据来做出预测？凭直觉，我正在寻找一个新的 cbinded 列，其中每条记录都有预测的 class。

我试过这个：

## prediction ##
# import data
data_imp <- read.csv("pdata.csv")
pdata <- data_imp[variables]

actual_predictions <- predict(model, pdata)

#append to data and output (as head by default)
predicted_data <- cbind(pdata, actual_predictions$class)

# output
head(predicted_data)

哪个抛出错误

actual_predictions <- predict(model, pdata)
Error in object$tables[[v]][, nd] : subscript out of bounds
In addition: Warning messages:
1: In FUN(1:6433[[4L]], ...) :
  Numerical 0 probability for all classes with observation 1
2: In FUN(1:6433[[4L]], ...) :
  Numerical 0 probability for all classes with observation 2
3: In FUN(1:6433[[4L]], ...) :
  Numerical 0 probability for all classes with observation 3

如何将我的模型应用于新数据？我想要一个带有新列的新数据框，该列具有预测的 class?

** 在评论之后，这里是用于预测的新数据的 head 和 str**

head(pdata)
  months_subscription nvk_medium                                org_type     churned
1                  26       none                               Community not churned
2                   8       none                            Sports clubs not churned
3                  30       none                            Sports clubs not churned
4                  19    unknown Religious congregations and communities not churned
5                  16       none              Association - Professional not churned
6                  10       none              Association - Professional not churned
> str(pdata)
'data.frame':   6433 obs. of  4 variables:
 $ months_subscription: int  26 8 30 19 16 10 3 5 14 2 ...
 $ nvk_medium         : Factor w/ 16 levels "cloned","CommunityIcon",..: 9 9 9 16 9 9 9 3 12 9 ...
 $ org_type           : Factor w/ 21 levels "Advocacy and civic activism",..: 8 18 18 14 6 6 11 19 6 8 ...
 $ churned            : Factor w/ 1 level "not churned": 1 1 1 1 1 1 1 1 1 1 ...

Answer 1

这很可能是由于训练数据中的因子编码不匹配（在您的情况下为变量 tdata）和 predict 函数中使用的新数据（变量 pdata)，通常表示您在测试数据中具有训练数据中不存在的因子水平。特征编码的一致性必须由您强制执行，因为 predict 函数不会检查它。因此，我建议您仔细检查两个变量中 nvk_medium 和 org_type 特征的水平。

错误信息：

 Error in object$tables[[v]][, nd] : subscript out of bounds

在评估数据点中的给定特征（第v个特征）时引发，其中nd是该特征对应的因子的数值。你也有warnings，说明数据点("observation")1、2、3中所有case的后验概率都为零，但不清楚这是否也与因子的编码有关...

要重现您看到的错误，请考虑以下玩具数据（来自 http://amunategui.github.io/binary-outcome-modeling/），它具有一组与您的数据有些相似的特征：

# Data setup
# From http://amunategui.github.io/binary-outcome-modeling/
titanicDF <- read.csv('http://math.ucdenver.edu/RTutorial/titanic.txt', sep='\t')
titanicDF$Title <- as.factor(ifelse(grepl('Mr ',titanicDF$Name),'Mr',ifelse(grepl('Mrs ',titanicDF$Name),'Mrs',ifelse(grepl('Miss',titanicDF$Name),'Miss','Nothing'))) )
titanicDF$Age[is.na(titanicDF$Age)] <- median(titanicDF$Age, na.rm=T)
titanicDF$Survived <- as.factor(titanicDF$Survived)
titanicDF <- titanicDF[c('PClass', 'Age',    'Sex',   'Title', 'Survived')]

# Separate into training and test data
inds_train <- sample(1:nrow(titanicDF), round(0.5 * nrow(titanicDF)), replace = FALSE)
Data_train <- titanicDF[inds_train, , drop = FALSE]
Data_test <- titanicDF[-inds_train, , drop = FALSE]

与：

> str(Data_train)

'data.frame':   656 obs. of  5 variables:
    $ PClass  : Factor w/ 3 levels "1st","2nd","3rd": 1 3 3 3 1 1 3 3 3 3 ...
$ Age     : num  35 28 34 28 29 28 28 28 45 28 ...
$ Sex     : Factor w/ 2 levels "female","male": 2 2 2 1 2 1 1 2 1 2 ...
$ Title   : Factor w/ 4 levels "Miss","Mr","Mrs",..: 2 2 2 1 2 4 3 2 3 2 ...
$ Survived: Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 2 1 ...

> str(Data_test)

'data.frame':   657 obs. of  5 variables:
    $ PClass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
$ Age     : num  47 63 39 58 19 28 50 37 25 39 ...
$ Sex     : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 1 2 2 2 ...
$ Title   : Factor w/ 4 levels "Miss","Mr","Mrs",..: 2 1 2 3 3 2 3 2 2 2 ...
$ Survived: Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 2 2 2 ...

然后一切如期进行：

model <- NaiveBayes(Survived ~ ., data = Data_train)

# This will work
pred_1 <- predict(model, Data_test)

> str(pred_1)
List of 2
$ class    : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 2 1 1 1 ...
..- attr(*, "names")= chr [1:657] "6" "7" "8" "9" ...
$ posterior: num [1:657, 1:2] 0.8352 0.0216 0.8683 0.0204 0.0435 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:657] "6" "7" "8" "9" ...
.. ..$ : chr [1:2] "0" "1"

但是，如果编码不一致，例如：

# Mess things up, by "displacing" the factor values (i.e., 'Nothing' 
# will now be encoded as number 5, which was not present in the 
# training data)
Data_test_2 <- Data_test
Data_test_2$Title <- factor(
    as.character(Data_test_2$Title), 
    levels = c("Dr", "Miss", "Mr", "Mrs", "Nothing")
)

> str(Data_test_2)

'data.frame':   657 obs. of  5 variables:
    $ PClass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
$ Age     : num  47 63 39 58 19 28 50 37 25 39 ...
$ Sex     : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 1 2 2 2 ...
$ Title   : Factor w/ 5 levels "Dr","Miss","Mr",..: 3 2 3 4 4 3 4 3 3 3 ...
$ Survived: Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 2 2 2 ...

然后：

> pred_2 <- predict(model, Data_test_2)
Error in object$tables[[v]][, nd] : subscript out of bounds

如何将朴素贝叶斯模型应用于新数据

How to apply Naive Bayes model to new data

r

naivebayes