Error with training logistic regression model on Apache Spark. SPARK-5063

I am trying to build a logistic regression model with Apache Spark. Here is my code:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.classification import LogisticRegressionWithSGD

parsedData = raw_data.map(mapper)  # mapper generates a (label, feature vector) pair as a LabeledPoint
featureVectors = parsedData.map(lambda point: point.features)  # extract the feature vectors
scaler = StandardScaler(True, True).fit(featureVectors)  # fit a standardization model for the features
scaledData = parsedData.map(lambda lp: LabeledPoint(lp.label, scaler.transform(lp.features)))  # transform the features to zero mean and unit std deviation
modelScaledSGD = LogisticRegressionWithSGD.train(scaledData, iterations=10)

But I get this error:

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that runs on workers. For more information, see SPARK-5063.

I am not sure how to fix this. Any help would be greatly appreciated.

The problem you're seeing is pretty much the same as the one I described before. To transform, you have to call a Scala function, and that requires access to the SparkContext, hence the error you see.

The standard way to handle this is to process only the required part of the data and then zip the results:

labels = parsedData.map(lambda point: point.label)
featuresTransformed = scaler.transform(featureVectors)

scaledData = (labels
    .zip(featuresTransformed)
    .map(lambda p: LabeledPoint(p[0], p[1])))

modelScaledSGD = LogisticRegressionWithSGD.train(...)
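The zip step pairs the two RDDs element by element, by position. A plain-Python analogue (no Spark required, with made-up numbers) of what the recombination into labeled points does:

```python
# Plain-Python analogue of labels.zip(featuresTransformed).map(...):
# pair each label with its transformed feature vector by position.
labels = [0.0, 1.0, 0.0]
features_transformed = [[0.5, -1.2], [1.3, 0.4], [-0.8, 0.9]]

scaled_data = [(label, features)
               for label, features in zip(labels, features_transformed)]

# Each pair plays the role of LabeledPoint(label, features).
print(scaled_data[0])  # -> (0.0, [0.5, -1.2])
```

Because the pairing is positional, this only works when both sides come from the same source in the same order, which holds here since labels and featureVectors are both derived from parsedData.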

If you don't plan to implement your own methods on top of MLlib components, it could be easier to use the high-level ML API.

Edit:

There are two possible problems here.

  1. At this point LogisticRegressionWithSGD supports only binomial classification (thanks to eliasah for pointing that out). If you need multi-class classification, you can replace it with LogisticRegressionWithLBFGS.
  2. StandardScaler supports only dense vectors, so its applicability is limited.
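For reference, the transformation the scaler applies is just per-feature standardization to zero mean and unit standard deviation. A plain-Python sketch of that computation (using the sample standard deviation; the input numbers are made up):

```python
import math

def standardize(column):
    """Scale a list of values to zero mean and unit (sample) std deviation."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / (len(column) - 1)
    std = math.sqrt(variance)
    return [(x - mean) / std for x in column]

scaled = standardize([1.0, 2.0, 3.0])  # -> [-1.0, 0.0, 1.0]
```

This is what makes features with very different ranges comparable before SGD, which is why scaling is worth doing at all here.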