Why does Spark ML NaiveBayes output labels that are different from the training data?

I am using the NaiveBayes classifier in Apache Spark ML (version 1.5.1) to predict some text categories. However, the classifier outputs labels that are different from the labels in my training set. Am I doing it wrong?

Here is a small example that can be pasted into, for example, a Zeppelin notebook:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row

// Prepare training documents from a list of (id, text, label) tuples.
val training = sqlContext.createDataFrame(Seq(
  (0L, "X totally sucks :-(", 100.0),
  (1L, "Today was kind of meh", 200.0),
  (2L, "I'm so happy :-)", 300.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and nb.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val nb = new NaiveBayes()

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, nb))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

// Prepare test documents, which are unlabeled (id, text) tuples.
val test = sqlContext.createDataFrame(Seq(
  (4L, "roller coasters are fun :-)"),
  (5L, "i burned my bacon :-("),
  (6L, "the movie is kind of meh")
)).toDF("id", "text")

// Make predictions on test documents.
model.transform(test)
  .select("id", "text", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prediction: Double) =>
    println(s"($id, $text) --> prediction=$prediction")
  }

Output of the small program:

(4, roller coasters are fun :-)) --> prediction=2.0
(5, i burned my bacon :-() --> prediction=0.0
(6, the movie is kind of meh) --> prediction=1.0

The set of predicted labels {0.0, 1.0, 2.0} is disjoint from my training set labels {100.0, 200.0, 300.0}.

Question: How can I map these predicted labels back to my original training set labels?

Bonus question: why do the training set labels have to be doubles, when any other type would work just as well as a label? It seems unnecessary.

However, the classifier outputs labels that are different from the labels in my training set. Am I doing it wrong?

Sort of. As far as I can tell, you have run into the issue described in SPARK-9137. In general, all classifiers in ML expect 0-based labels (0.0, 1.0, 2.0, ...), but there is no validation step in ml.NaiveBayes. Under the hood the data is passed to mllib.NaiveBayes, which doesn't have this restriction, so the training process works just fine.
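For illustration, this is what the training set from the question would look like with the 0-based labels the ml classifiers expect (just the original example with labels remapped; the name trainingZeroBased is my own):

// Same training documents, but with labels remapped to 0.0, 1.0, 2.0,
// which is the form the ml classifiers assume.
val trainingZeroBased = sqlContext.createDataFrame(Seq(
  (0L, "X totally sucks :-(", 0.0),
  (1L, "Today was kind of meh", 1.0),
  (2L, "I'm so happy :-)", 2.0)
)).toDF("id", "text", "label")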

When the model is converted back to ml, the prediction function simply assumes the labels are correct and returns the predicted label using Vector.argmax, hence the results you get. You can fix the labels using, for example, StringIndexer.
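For example, one way to wire this up (a sketch, assuming the training and test DataFrames from the question and a Spark 1.5+ build where IndexToString is available; the column names indexedLabel and predictedLabel are arbitrary choices):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, IndexToString, StringIndexer, Tokenizer}

// Index the original labels (100.0, 200.0, 300.0) into the 0-based indices
// that ml classifiers expect. Fitting up front lets us reuse the learned
// label mapping for the inverse transformation below.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(training)

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val nb = new NaiveBayes()
  .setLabelCol("indexedLabel")

// Map the 0-based predictions back to the original label values
// (note: the output column is a string, e.g. "100.0").
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, tokenizer, hashingTF, nb, labelConverter))

val model = pipeline.fit(training)

model.transform(test)
  .select("id", "text", "predictedLabel")
  .show()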

why do the training set labels have to be doubles, when any other type would work just as well as a label?

I guess it is mostly a matter of keeping the API simple and reusable. This way LabeledPoint can be used for both classification and regression problems. Moreover, it is an efficient representation in terms of memory usage and computational cost.
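To illustrate the point, the same LabeledPoint type covers both cases (a small sketch using the mllib API; the variable names are my own):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// The label is always a plain Double, so one type serves both tasks:
val classificationPoint = LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1))   // class index
val regressionPoint     = LabeledPoint(37.5, Vectors.dense(0.0, 1.1, 0.1))  // continuous target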