Spark returning Pickle error: cannot lookup attribute

Question

我在尝试在我的 RDD 中启动 class 时遇到了一些属性查找问题。

我的工作流程：

1- 从 RDD 开始

2-取RDD的每一个元素，为每个元素初始化一个对象

3-reduce（后面会写一个方法来定义reduce操作）

这是#2：

>class test(object):
def __init__(self, a,b):
    self.total = a + b

>a = sc.parallelize([(True,False),(False,False)])
>a.map(lambda (x,y): test(x,y))

这是我得到的错误：

PicklingError: Can't pickle < class 'main.test' >: attribute lookup main.test failed

我想知道是否有解决办法。请回答一个工作示例以实现预期结果（即创建 class "tests" 对象的 RDD）。

相关问题：

https://groups.google.com/forum/#!topic/edx-code/9xzRJFyQwnI
Can't pickle <type 'instancemethod'> when using python's multiprocessing Pool.map()

Answer 1

来自 Davies Liu (DataBricks)：

“目前，PySpark 不支持 pickle 当前的 class 对象脚本（'main'），解决方法可以放在实现中的 class 到一个单独的模块，然后使用“bin/spark-submit --py-files xxx.py" 在部署它。

在xxx.py中：

class test(object):
     def __init__(self, a, b):
        self.total = a + b

在job.py中：

from xxx import test
a = sc.parallelize([(True,False),(False,False)])
a.map(lambda (x,y): test(x,y))

运行它来自：

bin/spark-submit --py-files xxx.py job.py

"

只是想指出您也可以在 Spark Shell 中传递相同的参数 (--py-files)。

Spark returning Pickle error: cannot lookup attribute

Spark returning Pickle error: cannot lookup attribute

python

pickle

apache-spark