How to generate tuples of (original label, predicted label) on Spark with MLlib?
I'm trying to make predictions with a model returned by MLlib on Spark. The goal is to generate tuples of (originalLabelInData, predictedLabel), which can then be used for model evaluation. What is the best way to achieve this? Thanks.
Assuming parsedTrainData is an RDD of LabeledPoint:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils

# A toy training set of labeled feature vectors
parsedTrainData = sc.parallelize([LabeledPoint(1.0, [11.0, -12.0, 23.0]),
                                  LabeledPoint(3.0, [-1.0, 12.0, -23.0])])

# Train a decision tree classifier on the labeled points
model = DecisionTree.trainClassifier(parsedTrainData, numClasses=7,
                                     categoricalFeaturesInfo={}, impurity='gini',
                                     maxDepth=8, maxBins=32)

# Predict using only the feature vectors
model.predict(parsedTrainData.map(lambda x: x.features)).take(1)
This returns the predictions, but I'm not sure how to match each prediction back to the original label in the data.
I tried

parsedTrainData.map(lambda x: (x.label, model.predict(x.features))).take(1)
However, the way I'm shipping the model to the workers doesn't seem to be valid here:
/spark140/python/pyspark/context.pyc in __getnewargs__(self)
250 # This method is called when attempting to pickle SparkContext, which is always an error:
251 raise Exception(
--> 252 "It appears that you are attempting to reference SparkContext from a broadcast "
253 "variable, action, or transforamtion. SparkContext can only be used on the driver, "
254 "not in code that it run on workers. For more information, see SPARK-5063."
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Well, according to the official documentation, you can simply zip the predictions with the labels. Because the PySpark model wrapper calls into the JVM through the SparkContext, predict has to be invoked from the driver on a whole RDD of feature vectors rather than inside a transformation:
# Run predict once, on the driver, over the whole RDD of features
predictions = model.predict(parsedTrainData.map(lambda x: x.features))

# Pair each original label with its prediction; zip aligns the two RDDs row by row
labelsAndPredictions = parsedTrainData.map(lambda x: x.label).zip(predictions)
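The zipped pairs then feed directly into evaluation. As a minimal sketch, following the decision tree example in the official MLlib documentation, the training error is the fraction of pairs whose two labels disagree:

# Fraction of examples where the predicted label differs from the original label
trainErr = labelsAndPredictions.filter(lambda lp: lp[0] != lp[1]).count() \
    / float(parsedTrainData.count())
print('Training Error = ' + str(trainErr))

The same pairs, swapped into (prediction, label) order, can also be passed to pyspark.mllib.evaluation.MulticlassMetrics to get precision, recall, and a confusion matrix.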