Why does Spark ML NaiveBayes output labels that are different from the training data?
I am using the NaiveBayes classifier in Apache Spark ML (version 1.5.1) to predict some text categories. However, the classifier outputs labels that differ from the labels in my training set. Am I doing it wrong?
Here is a small example that can be pasted into, e.g., a Zeppelin notebook:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row
// Prepare training documents from a list of (id, text, label) tuples.
val training = sqlContext.createDataFrame(Seq(
(0L, "X totally sucks :-(", 100.0),
(1L, "Today was kind of meh", 200.0),
(2L, "I'm so happy :-)", 300.0)
)).toDF("id", "text", "label")
// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and nb.
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val nb = new NaiveBayes()
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, nb))
// Fit the pipeline to training documents.
val model = pipeline.fit(training)
// Prepare test documents, which are unlabeled (id, text) tuples.
val test = sqlContext.createDataFrame(Seq(
(4L, "roller coasters are fun :-)"),
(5L, "i burned my bacon :-("),
(6L, "the movie is kind of meh")
)).toDF("id", "text")
// Make predictions on test documents.
model.transform(test)
.select("id", "text", "prediction")
.collect()
.foreach { case Row(id: Long, text: String, prediction: Double) =>
println(s"($id, $text) --> prediction=$prediction")
}
Output of the small program:
(4, roller coasters are fun :-)) --> prediction=2.0
(5, i burned my bacon :-() --> prediction=0.0
(6, the movie is kind of meh) --> prediction=1.0
The set of predicted labels {0.0, 1.0, 2.0} is disjoint from my training set labels {100.0, 200.0, 300.0}.
Question: how can I map these predicted labels back to my original training set labels?
Bonus question: why do the training set labels have to be of type Double, when any other type would work just as well as a label? It seems unnecessary.
However, the classifier outputs labels that are different from the labels in my training set. Am I doing it wrong?
Sort of. As far as I can tell, you are hitting the issue described by SPARK-9137. Generally speaking, all classifiers in ML expect 0-based labels (0.0, 1.0, 2.0, ...), but there is no validation step in ml.NaiveBayes. Under the hood, the data is passed to mllib.NaiveBayes, which does not have this restriction, so the training process works fine.
When the model is converted back to ml, the prediction function simply assumes the labels are correct and returns the predicted label using Vector.argmax, hence the results you get. You can fix the labels using, for example, StringIndexer.
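As a sketch of that fix (reusing the `training`, `tokenizer`, and `hashingTF` values from the question; the column names `indexedLabel` and `originalLabel` are arbitrary choices), you can index the raw labels to 0-based values before training and convert predictions back afterwards with IndexToString:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

// Map the raw labels (100.0, 200.0, 300.0) to 0-based indices (0.0, 1.0, 2.0).
val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(training)

// Train on the indexed labels instead of the raw ones.
val nb = new NaiveBayes()
  .setLabelCol("indexedLabel")

// Translate 0-based predictions back to the original label values.
val converter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("originalLabel")
  .setLabels(indexer.labels)

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, indexer, nb, converter))
```

After fitting, `model.transform(test)` will carry an `originalLabel` column holding strings such as "300.0" (StringIndexer stores labels as strings, so a final cast back to Double may be needed).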
why do the training set labels have to be doubles, when any other type would work just as well as a label?
I guess it is mostly a matter of keeping the API simple and reusable. This way LabeledPoint can be used for both classification and regression problems. Moreover, it is an efficient representation in terms of memory usage and computational cost.
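To make that concrete, the same (Double, Vector) pair serves both tasks; only the interpretation of the label changes (the feature values below are made up for illustration):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Classification: the Double is a 0-based class index.
val classificationPoint = LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 0.5))

// Regression: the Double is a continuous target value.
val regressionPoint = LabeledPoint(37.5, Vectors.dense(0.0, 1.0, 0.5))
```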