How to print the probability of prediction in LogisticRegressionWithLBFGS for pyspark

Question

I am using Spark 1.5.1 and, in pyspark, after I fit the model using:

model = LogisticRegressionWithLBFGS.train(parsedData)

I can print the prediction using:

model.predict(p.features)

Is there a function to print the probability score along with the predicted value?

Answer 1

You have to clear the threshold first, and this works only for binary classification:

 from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
 from pyspark.mllib.regression import LabeledPoint

 parsed_data = [LabeledPoint(0.0, [4.6,3.6,1.0,0.2]),
                LabeledPoint(0.0, [5.7,4.4,1.5,0.4]),
                LabeledPoint(1.0, [6.7,3.1,4.4,1.4]),
                LabeledPoint(0.0, [4.8,3.4,1.6,0.2]),
                LabeledPoint(1.0, [4.4,3.2,1.3,0.2])]   

 model = LogisticRegressionWithLBFGS.train(sc.parallelize(parsed_data))
 model.threshold
 # 0.5
 model.predict(parsed_data[2].features)
 # 1

 model.clearThreshold()
 model.predict(parsed_data[2].features)
 # 0.9873840020002339
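
For intuition, the score returned after clearThreshold() is the logistic function applied to the linear margin, and the default behaviour simply compares that score with the threshold. Below is a minimal pure-Python sketch of that relationship. The function names and the weight and intercept values are hypothetical, chosen only for illustration; they are not the values learned by the model above (on a trained model they would come from model.weights and model.intercept).

```python
import math

def predict_probability(weights, intercept, features):
    """Logistic function applied to the linear margin w.x + b."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1.0 / (1.0 + math.exp(-margin))

def predict_label(weights, intercept, features, threshold=0.5):
    """Label 1 if the probability exceeds the threshold, else 0."""
    probability = predict_probability(weights, intercept, features)
    return 1 if probability > threshold else 0

# Hypothetical parameters, NOT taken from the trained model above.
example_weights = [0.5, -1.0, 1.5, 0.8]
example_intercept = -0.3
example_features = [6.7, 3.1, 4.4, 1.4]

probability = predict_probability(example_weights, example_intercept, example_features)
label = predict_label(example_weights, example_intercept, example_features)
print(probability, label)
```

With the threshold in place you only see the label; clearing it exposes the raw probability instead.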

Answer 2

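I think the question is about computing the probability scores for predictions over the whole training set. If so, I did the following computation. Not sure whether the post is still active, but this is how I did it: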

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

# Start from the original training data, before it was converted to rows of LabeledPoint.
# Assume it is otd (a Spark DataFrame) with the label in column 0.
# Extract the feature set as an RDD:
fs = otd.rdd.map(lambda x: x[1:])

# One sample way of creating the LabeledPoint rows (label shifted down by 1):
parsedData = otd.rdd.map(lambda x: LabeledPoint(int(x[0] - 1), x[1:]))

# Convert otd to a pandas DataFrame:
ptd = otd.toPandas()
m = ptd.shape[0]

# Train and get the model (train on parsedData, the LabeledPoint RDD built above)
model = LogisticRegressionWithLBFGS.train(parsedData, numClasses=10)

# Store the predictions returned by model.predict (an RDD) as a local list
predict = model.predict(fs)
pr = predict.collect()

# Percentage of rows whose predicted class matches the (shifted) label
correct = ((ptd.label - 1) == pr).sum()
print(100.0 * correct / m)

Note that the above is multiclass classification.
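
In the multiclass case model.predict returns only the class index, so per-class probabilities have to be computed separately. The following is a minimal pure-Python sketch, assuming the common pivot-class formulation of multinomial logistic regression: class 0 has a margin of 0 and every other class has its own weight row and intercept, with a softmax over the margins. Whether this layout matches how your Spark version stores model.weights is an assumption to verify; the function name and all numbers below are hypothetical.

```python
import math

def class_probabilities(weight_rows, intercepts, features):
    """Softmax over margins, with class 0 as the pivot (margin 0)."""
    margins = [0.0]
    for row, intercept in zip(weight_rows, intercepts):
        margins.append(sum(w * x for w, x in zip(row, features)) + intercept)
    # Subtract the largest margin before exponentiating for numerical stability.
    max_margin = max(margins)
    exps = [math.exp(m - max_margin) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 3-class model with 2 features: one weight row per non-pivot class.
weight_rows = [[1.0, -0.5], [-1.0, 2.0]]
intercepts = [0.0, 0.1]

probs = class_probabilities(weight_rows, intercepts, [2.0, 1.0])
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The predicted class is the one with the largest probability, which is the same class that has the largest margin.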

machine-learning

logistic-regression

apache-spark

pyspark

apache-spark-mllib