Spark load Data for Decision tree - Change Label in LabelledPoint
I am trying the decision tree example for Spark from https://spark.apache.org/docs/latest/mllib-decision-tree.html.
I downloaded the a1a dataset from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a1a.
The dataset is in LIBSVM format, and the two classes have the labels +1.0 and -1.0.
When I try:
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "/user/cloudera/testDT/a1a.t")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
// Train a DecisionTree model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)
I get:
java.lang.IllegalArgumentException: GiniAggregator given label -1.0 but requires label is non-negative.
So I tried to change the label -1.0 to 0.0. I tried:
def changeLabel(a: org.apache.spark.mllib.regression.LabeledPoint) =
{ if (a.label == -1.0) {a.label = 0.0} }
which fails with the error:
reassignment to val
So my question is: how can I change the labels of my data? Or is there a workaround so that DecisionTree.trainClassifier() can handle data with negative labels?
TL;DR You cannot reassign a value parameter of a Product class, and even if it were possible (i.e. if it were declared as a var), you should never mutate data in Spark.
How about:
def changeLabel(a: org.apache.spark.mllib.regression.LabeledPoint) =
if (a.label == -1.0) a.copy(label = 0.0) else a
scala> changeLabel(LabeledPoint(-1.0, Vectors.dense(1.0, 2.0, 3.0)))
res1: org.apache.spark.mllib.regression.LabeledPoint = (0.0,[1.0,2.0,3.0])
scala> changeLabel(LabeledPoint(1.0, Vectors.dense(1.0, 2.0, 3.0)))
res2: org.apache.spark.mllib.regression.LabeledPoint = (1.0,[1.0,2.0,3.0])
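Since RDDs are immutable, you apply the fix by mapping changeLabel over the loaded data to produce a new RDD, then train on that. A sketch, reusing the names (data, numClasses, etc.) from the question's code:

```scala
// Produce a new RDD with all -1.0 labels rewritten to 0.0;
// the original `data` RDD is left untouched.
val relabeled = data.map(changeLabel)

// Split and train exactly as before, but on the relabeled RDD.
val Array(trainingData, testData) = relabeled.randomSplit(Array(0.7, 0.3))
val model = DecisionTree.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, impurity, maxDepth, maxBins)
```

Note that copy on a case class creates a new LabeledPoint, which is the idiomatic way to "modify" immutable records in Scala, and it fits Spark's model of transforming data rather than mutating it in place.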