scikit classifier with unknown prediction

Question

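I am going to do some text classification with Python scikit-learn, and I plan to use TfidfVectorizer and MultinomialNB.

But I realized that MultinomialNB will always assign my samples to one of the existing (known) categories.

For example, if I have: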

category A: trained with sample "this is green"
category B: trained with sample "this is blue"
category C: trained with sample "this is red"

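and I try to predict: "this is yellow"

it will give me category A (or any of the others, since in this case every category has the same probability).

My question is: for the test case above, is there a classifier that would give me "unknown" (or none, or an error)?

I want to know when my test case cannot be predicted from the given training set.

I suppose I could check whether my_classifier.predict_proba(X_test) returns an array whose values are all equal or close to each other (in this example: [[ 0.33333333 0.33333333 0.33333333]]).

Actually, I would have to check whether the values are close to the default (prior) values, since the prior probability of each category may differ :)

So... is there a better way, or is there a classifier with some kind of confidence threshold that I could use?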
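
For reference, the check described above could be sketched roughly as follows. This is only a minimal illustration, not a recommended method: the helper name predict_or_unknown and the tolerance tol=0.05 are assumptions made up for this example, and the training data is the toy set from the question.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training set from the question: one sample per category.
train_texts = ["this is green", "this is blue", "this is red"]
train_labels = ["A", "B", "C"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
clf = MultinomialNB().fit(X_train, train_labels)


def predict_or_unknown(texts, tol=0.05):
    """Return the predicted label, or "unknown" when the posterior is
    (almost) the same as the class priors.

    `tol` is an arbitrary, hypothetical tolerance; it would need tuning
    on real data.
    """
    proba = clf.predict_proba(vectorizer.transform(texts))
    # Class priors learned by the model, in the same order as clf.classes_.
    priors = np.exp(clf.class_log_prior_)
    # A sample is "uninformative" if no class moved away from its prior.
    uninformative = np.all(np.abs(proba - priors) < tol, axis=1)
    labels = clf.classes_[np.argmax(proba, axis=1)]
    return ["unknown" if u else str(label)
            for u, label in zip(uninformative, labels)]


print(predict_or_unknown(["this is yellow", "this is green"]))
```

Here "yellow" is not in the vocabulary, so only "this" and "is" contribute, and those occur equally in every category; the posterior therefore stays at the priors.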

Answer 1

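If you have some unlabeled training data, you could add a garbage class that contains all of the unlabeled data. In your example, this class would have the interpretation "not one of the colors green, blue or red". This approach is described in detail in http://arxiv.org/abs/1511.03719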
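
A minimal sketch of the garbage-class idea with the classifier from the question might look like this. The label name "other" and the extra unlabeled sentences are invented for illustration; this is not the procedure from the linked paper, just the basic idea of adding one catch-all class.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Labeled samples from the question.
labeled_texts = ["this is green", "this is blue", "this is red"]
labels = ["A", "B", "C"]

# Hypothetical unlabeled samples; they all go into one garbage class.
unlabeled_texts = ["this is yellow", "that was purple", "an orange thing"]
GARBAGE = "other"  # assumed label name, meaning "none of the known classes"

texts = labeled_texts + unlabeled_texts
targets = labels + [GARBAGE] * len(unlabeled_texts)

vectorizer = TfidfVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(texts), targets)


def predict(new_texts):
    """Predict a known category or the garbage class for each text."""
    return [str(p) for p in clf.predict(vectorizer.transform(new_texts))]


print(predict(["that was purple", "this is green"]))
```

How well this works depends heavily on how representative the unlabeled data is of what the classifier will see later.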

Answer 2

You could consider doing novelty detection. I would check out the scikit-learn documentation on novelty detection and the associated example. In that example, the idea is to use a:

One-class SVM is an unsupervised algorithm that learns a decision function for novelty detection: classifying new data as similar or different to the training set.

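(Emphasis mine.) I don't know how it would do with the small amount of data in your example, I would guess "poorly", but I believe novelty detection is the kind of thing you are looking for.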
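
As a rough sketch of what that could look like on the data from the question, one could fit sklearn.svm.OneClassSVM on the TF-IDF vectors of the training set and flag anything it rejects. The helper name is_novel is an assumption for this example, the default hyperparameters are used untuned, and with only three training samples the result should not be trusted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# All known training samples, regardless of category.
train_texts = ["this is green", "this is blue", "this is red"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Default RBF kernel and default nu; both would need tuning on real data.
detector = OneClassSVM().fit(X_train)


def is_novel(texts):
    """Return True for each text the detector considers unlike the training set.

    OneClassSVM.predict returns +1 for inliers and -1 for outliers.
    """
    predictions = detector.predict(vectorizer.transform(texts))
    return [bool(p == -1) for p in predictions]


print(is_novel(["this is yellow", "this is green"]))
```

A text flagged as novel would then be reported as "unknown", and only the remaining texts would be passed on to the actual classifier.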


scikit-learn

text-classification

naivebayes