Spark 模块中 Parallelize 方法的正确用法是什么 pyspark.mllib.classification

Question

运行来自笔记本的 Databricks 社区版 spark 集群 UI
尝试为小数据样本训练 NaiveBayes 时遇到此错误 - TypeError: unbound method parallelize() must be called with SparkContext 实例作为第一个参数（取而代之的是列表实例）

代码：

from pyspark.mllib.classification import LabeledPoint, NaiveBayes
from pyspark import SparkContext as sc
data = [
LabeledPoint(0.0, [0.0, 0.0]),
LabeledPoint(0.0, [0.0, 1.0]),
LabeledPoint(1.0, [1.0, 0.0])]
model = NaiveBayes.train(sc.parallelize(data))
model.predict(array([0.0, 1.0]))
model.predict(array([1.0, 0.0]))
model.predict(sc.parallelize([[1.0, 0.0]])).collect()

Answer 1

这里的问题是在你的例子的第二行导入：

from pyspark import SparkContext as sc

这是用 SparkContext class 覆盖内置 SparkContext 实例（存储在 sc 中），导致稍后的 sc.parallelize() 调用失败。

在Databricks中，你不需要自己创建SparkContext；它在 Databricks 笔记本中自动预定义为 sc。有关 Databricks 中预定义变量的更完整列表，请参阅 https://docs.databricks.com/user-guide/getting-started.html#predefined-variables。

Spark 模块中 Parallelize 方法的正确用法是什么 pyspark.mllib.classification

What is the correct usage of the method Parallelize in the Spark module pyspark.mllib.classification

apache-spark-mllib