How can I use a machine learning model to predict on data whose features differ slightly?

Question

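I have a random forest model trained on a set of NLP data (tf-idf values for each word). I want to use it to predict on a new dataset. The features in the model overlap with, but do not exactly match, the features in the new data, so when I predict on the new data I get: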

Error in predict.randomForest(object = model, newdata = new_data) : 
  variables in the training data missing in newdata

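I wanted to work around this error by excluding every feature in the model that does not appear in the new data, and every feature in the new data that does not appear in the model. Setting aside the effect on model accuracy for now (this would cut the number of features significantly, but there would still be plenty left to predict with), I did something like this: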

model$forest$xlevels <- model$forest$xlevels[colnames(new_data)]
# and vice versa
new_data <- new_data[names(model$forest$xlevels)]

This works in the sense that names(model$forest$xlevels) == colnames(new_data) returns TRUE for every feature name.

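However, when I try to predict on the resulting new_data, I still get the "variables in the training data missing in newdata" error. I am fairly certain I am modifying the right part of the model (model$forest$xlevels), so why doesn't it work?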

Answer 1

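I think you should do it the other way around: add the missing columns to the new data.

When you work with a bag of words, it is common for some words not to appear in a given batch of new data. Those missing words should simply be encoded as columns of zeros.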

# Do something like this (and exclude the target variable from names(traindata), obviously)
names_missing <- setdiff(names(traindata), names(new_data))
new_data[, names_missing] <- 0L

Then you should be able to predict.
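As a rough end-to-end illustration of the idea above, here is a minimal base R sketch. The toy data frames, the word columns, and the target column name `label` are all made up for illustration. Beyond filling in the missing words, it also drops words the model never saw and puts the columns in training order, which is an extra step not strictly required by the fix described above.

```r
# Toy tf-idf tables; column names and values are hypothetical.
traindata <- data.frame(
  apple  = c(0.1, 0.0),
  banana = c(0.0, 0.3),
  cherry = c(0.2, 0.1),
  label  = factor(c("a", "b"))  # assumed name of the target column
)
new_data <- data.frame(
  banana = c(0.5, 0.0),
  durian = c(0.4, 0.2)          # a word the model never saw
)

# Features the model was trained on (everything except the target).
train_features <- setdiff(names(traindata), "label")

# Words seen in training but absent from the new data get a column of zeros.
names_missing <- setdiff(train_features, names(new_data))
for (nm in names_missing) {
  new_data[[nm]] <- 0
}

# Keep only the training features, in training order; unseen words are dropped.
new_data_aligned <- new_data[, train_features, drop = FALSE]

print(names(new_data_aligned))
```

With a fitted model, `new_data_aligned` would then be what gets passed as `newdata` to `predict()`.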


nlp

r

machine-learning

random-forest