Error with training logistic regression model on Apache Spark. SPARK-5063
I am trying to build a logistic regression model with Apache Spark. Here is my code:
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint

parsedData = raw_data.map(mapper)  # mapper generates a (label, feature vector) pair as a LabeledPoint object
featureVectors = parsedData.map(lambda point: point.features)  # get feature vectors from parsed data
scaler = StandardScaler(True, True).fit(featureVectors)  # this creates a standardization model to scale the features
# transform the features to zero mean and unit standard deviation
scaledData = parsedData.map(lambda lp: LabeledPoint(lp.label, scaler.transform(lp.features)))
modelScaledSGD = LogisticRegressionWithSGD.train(scaledData, iterations=10)
But I get this error:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
I am not sure how to get around this. Any help would be greatly appreciated.
The problem you are seeing is pretty much the same as the one I've described elsewhere. To transform the data, scaler.transform has to call a Scala function through the SparkContext, and the SparkContext is only available on the driver; calling it inside a map, which runs on the workers, is what produces the error you see.
The standard way to handle this is to transform only the required part of the data on the driver and then zip the results back together:
labels = parsedData.map(lambda point: point.label)
featuresTransformed = scaler.transform(featureVectors)  # called on the driver over the whole RDD

scaledData = (labels
    .zip(featuresTransformed)
    .map(lambda p: LabeledPoint(p[0], p[1])))

modelScaledSGD = LogisticRegressionWithSGD.train(scaledData, iterations=10)
If you don't plan to implement your own methods on top of MLlib components, it may be easier to use the high-level ML API.
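A minimal sketch of that route, assuming a DataFrame df with "label" and "features" columns (for example obtained with parsedData.toDF()); the column names and df itself are assumptions, not the original poster's code:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StandardScaler

# df is assumed to be a DataFrame with "label" and "features" columns,
# e.g. df = parsedData.toDF()
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withMean=True, withStd=True)
lr = LogisticRegression(featuresCol="scaledFeatures", maxIter=10)
pipeline = Pipeline(stages=[scaler, lr])

model = pipeline.fit(df)           # fits the scaler and the classifier in one pass
predictions = model.transform(df)  # adds prediction and probability columns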
Edit:

There are two possible problems here:
- At this time LogisticRegressionWithSGD supports only binomial classification (thanks to eliasah for pointing this out). If you need multiclass classification you can replace it with LogisticRegressionWithLBFGS, as sketched after this list.
- StandardScaler supports only dense vectors, so it has limited applications; see the second sketch below.
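A minimal sketch of the LBFGS swap, assuming scaledData is the zipped RDD of LabeledPoint built above and that labels are encoded as 0..k-1 (numClasses=3 is just an illustrative value):

from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# numClasses=3 is an assumption for illustration; set it to your real number of labels
modelLBFGS = LogisticRegressionWithLBFGS.train(scaledData, iterations=10, numClasses=3)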
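And a sketch of one way around the dense-vector limitation, assuming featureVectors may contain SparseVectors: with withMean=False the scaler only divides by the per-feature standard deviation, so it never has to densify the input (centering on the mean would turn zeros into non-zeros):

from pyspark.mllib.feature import StandardScaler

scalerNoMean = StandardScaler(withMean=False, withStd=True).fit(featureVectors)
scaledSparse = scalerNoMean.transform(featureVectors)  # sparsity is preserved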