How to use pyspark mllib RegressionMetrics with real predictions
With pyspark 1.4, I am trying to use RegressionMetrics() for predictions generated by LinearRegressionWithSGD.
All the RegressionMetrics() examples in the pyspark mllib documentation use "artificial" predictions and observations, like
predictionAndObservations = sc.parallelize([ (2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0)])
For such an "artificial" RDD (built with sc.parallelize) everything works fine. However, when doing the same with an RDD generated another way, I get
TypeError: DoubleType can not accept object in type <type 'numpy.float64'>
A short reproducible example is below.
What could be the problem?
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import LinearRegressionWithSGD, LinearRegressionModel
from pyspark.mllib.evaluation import RegressionMetrics
dataRDD = sc.parallelize([LabeledPoint(1, [1,1]), LabeledPoint(2, [2,2]), LabeledPoint(3, [3,3])])
lrModel = LinearRegressionWithSGD.train(dataRDD)
prediObserRDD = dataRDD.map(lambda p: (lrModel.predict(p.features), p.label)).cache()
Let's check that the RDD really does contain (prediction, observation) pairs:
prediObserRDD.take(4) # looks OK
Now try to compute the metrics:
metrics = RegressionMetrics(prediObserRDD)
which raises the following error:
TypeError Traceback (most recent call last)
<ipython-input-1-ca9ad8e9faf1> in <module>()
7 prediObserRDD = dataRDD.map(lambda p: (lrModel.predict(p.features), p.label)).cache()
8 prediObserRDD.take(4)
----> 9 metrics = RegressionMetrics(prediObserRDD)
10 #metrics.explainedVariance
11 #metrics.meanAbsoluteError
/usr/local/spark-1.4.0-bin-hadoop2.6/python/pyspark/mllib/evaluation.py in __init__(self, predictionAndObservations)
99 df = sql_ctx.createDataFrame(predictionAndObservations, schema=StructType([
100 StructField("prediction", DoubleType(), nullable=False),
--> 101 StructField("observation", DoubleType(), nullable=False)]))
102 java_class = sc._jvm.org.apache.spark.mllib.evaluation.RegressionMetrics
103 java_model = java_class(df._jdf)
/usr/local/spark-1.4.0-bin-hadoop2.6/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
337
338 for row in rows:
--> 339 _verify_type(row, schema)
340
341 # convert python objects to sql data
/usr/local/spark-1.4.0-bin-hadoop2.6/python/pyspark/sql/types.py in _verify_type(obj, dataType)
1027 "length of fields (%d)" % (len(obj), len(dataType.fields)))
1028 for v, f in zip(obj, dataType.fields):
-> 1029 _verify_type(v, f.dataType)
1030
1031 _cached_cls = weakref.WeakValueDictionary()
/usr/local/spark-1.4.0-bin-hadoop2.6/python/pyspark/sql/types.py in _verify_type(obj, dataType)
1011 if type(obj) not in _acceptable_types[_type]:
1012 raise TypeError("%s can not accept object in type %s"
-> 1013 % (dataType, type(obj)))
1014
1015 if isinstance(dataType, ArrayType):
TypeError: DoubleType can not accept object in type <type 'numpy.float64'>
The same problem also occurs with BinaryClassificationMetrics (on another dataset, for a classification task).
As the error says, TypeError: DoubleType can not accept object in type <type 'numpy.float64'>, you are trying to pass a numpy.float64 where a Double is required, and that conversion is not done for you.
To fix the TypeError, you must cast your values to an acceptable type.
Example:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import LinearRegressionWithSGD, LinearRegressionModel
from pyspark.mllib.evaluation import RegressionMetrics
dataRDD = sc.parallelize([LabeledPoint(1, [1,1]), LabeledPoint(2, [2,2]), LabeledPoint(3, [3,3])])
lrModel = LinearRegressionWithSGD.train(dataRDD)
prediObserRDD = dataRDD.map(lambda p: (float(lrModel.predict(p.features)), p.label)).cache()
Notice that I converted the predicted label to a double using Python's built-in float function.
Now you can compute the metrics:
>>> metrics = RegressionMetrics(prediObserRDD)
>>> metrics.explainedVariance
1.0
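As an aside, the traceback itself shows why the cast is needed: Spark 1.4's _verify_type does an exact `type(obj) not in _acceptable_types[...]` lookup, so a numpy.float64 is rejected even though (in CPython) it subclasses Python's float. A minimal sketch without Spark, just to illustrate the type relationship:

```python
import numpy as np

x = np.float64(1.5)

# numpy.float64 subclasses Python's built-in float...
print(isinstance(x, float))      # True
# ...but Spark 1.4's _verify_type (see the traceback above) compares
# the exact type, so the subclass is rejected:
print(type(x) is float)          # False
# Casting with the built-in float() yields the exact type Spark expects:
print(type(float(x)) is float)   # True
```

This is also why the same float() cast fixes the analogous error with BinaryClassificationMetrics.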