Why does Spark ML NaiveBayes output labels that are different from the training data?
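I'm using the NaiveBayes classifier in Apache Spark ML (version 1.5.1) to predict some text categories. However, the classifier outputs labels that are different from the labels in my training set. Am I doing it wrong?
Here is a small example that can be pasted into, e.g., a Zeppelin notebook: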
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row
// Prepare training documents from a list of (id, text, label) tuples.
val training = sqlContext.createDataFrame(Seq(
(0L, "X totally sucks :-(", 100.0),
(1L, "Today was kind of meh", 200.0),
(2L, "I'm so happy :-)", 300.0)
)).toDF("id", "text", "label")
// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and nb.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val nb = new NaiveBayes()
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, nb))
// Fit the pipeline to training documents.
val model = pipeline.fit(training)
// Prepare test documents, which are unlabeled (id, text) tuples.
val test = sqlContext.createDataFrame(Seq(
(4L, "roller coasters are fun :-)"),
(5L, "i burned my bacon :-("),
(6L, "the movie is kind of meh")
)).toDF("id", "text")
// Make predictions on test documents.
model.transform(test)
  .select("id", "text", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prediction: Double) =>
    println(s"($id, $text) --> prediction=$prediction")
  }
Output:
(4, roller coasters are fun :-)) --> prediction=2.0
(5, i burned my bacon :-() --> prediction=0.0
(6, the movie is kind of meh) --> prediction=1.0
The set of predicted labels {0.0, 1.0, 2.0} is disjoint from my training set labels {100.0, 200.0, 300.0}.
However, the classifier outputs labels that are different from the labels in my training set. Am I doing it wrong?
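Kind of. As far as I can tell, you are hitting the issue described in SPARK-9137. Generally speaking, all classifiers in ML expect 0-based labels (0.0, 1.0, 2.0, ...), but ml.NaiveBayes has no validation step. Under the hood the data is passed to mllib.NaiveBayes, which accepts arbitrary label values, so training succeeds. When the model is converted back to ml, the prediction function simply assumes the labels were correct and returns the predicted label using Vector.argmax, that is, the index of the winning class, hence the results you see. You can use, for example, StringIndexer to map your labels to 0-based indices.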
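A minimal sketch of the StringIndexer approach, not a definitive implementation. It assumes a notebook or shell session that already provides `sqlContext`, and a Spark version in which `StringIndexerModel.labels` is publicly accessible. The names `train`, `unlabeled`, `labelIndexer`, `indexedModel`, and `restored` are illustrative and do not come from the question. The indexer is fitted separately and applied only to the training data, so the unlabeled test data never needs a label column.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, StringIndexer, Tokenizer}
import org.apache.spark.sql.Row

// Assumption: `sqlContext` is provided by the notebook/shell session.
val train = sqlContext.createDataFrame(Seq(
  (0L, "X totally sucks :-(", 100.0),
  (1L, "Today was kind of meh", 200.0),
  (2L, "I'm so happy :-)", 300.0)
)).toDF("id", "text", "label")

// Learn a mapping from the original labels to 0-based indices.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(train)

val tok = new Tokenizer().setInputCol("text").setOutputCol("words")
val tf = new HashingTF().setInputCol("words").setOutputCol("features")
// Train on the indexed column instead of the raw label.
val nbOnIndex = new NaiveBayes().setLabelCol("indexedLabel")

val indexedModel = new Pipeline()
  .setStages(Array(tok, tf, nbOnIndex))
  .fit(labelIndexer.transform(train))

val unlabeled = sqlContext.createDataFrame(Seq(
  (4L, "roller coasters are fun :-)"),
  (5L, "i burned my bacon :-("),
  (6L, "the movie is kind of meh")
)).toDF("id", "text")

// labelIndexer.labels(i) is the original label (as a String) for index i,
// so the predicted index can be translated back after prediction.
val restored = indexedModel.transform(unlabeled)
  .select("id", "text", "prediction")
  .collect()
  .map { case Row(id: Long, text: String, prediction: Double) =>
    (id, text, labelIndexer.labels(prediction.toInt))
  }

restored.foreach { case (id, text, label) =>
  println(s"($id, $text) --> label=$label")
}
```

StringIndexer stores the original labels as strings, so convert them back with `toDouble` if numeric labels are needed. Later Spark releases also ship `IndexToString`, which performs this reverse mapping as a pipeline stage.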
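Why do the training set labels have to be doubles, when any other type would work just as well as a label?
I guess it is mostly a matter of keeping the API simple and reusable. This way LabeledPoint can be used for both classification and regression problems.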
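As a small illustration of that reuse, the sketch below builds two `LabeledPoint` values from the `mllib` package with the same type but different label semantics. The feature values and the regression target are arbitrary and chosen only for the example.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Classification: the Double holds a 0-based class index.
val classificationPoint = LabeledPoint(2.0, Vectors.dense(0.0, 1.0, 3.0))

// Regression: the Double holds a continuous target (arbitrary value here).
val regressionPoint = LabeledPoint(37.5, Vectors.dense(0.0, 1.0, 3.0))
```

Because the label is always a `Double`, the same data structure can feed either kind of algorithm, at the cost of having to encode categorical labels as indices first.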