Why do I got TypeError: cannot pickle '_thread.RLock' object when using pyspark
I'm processing my data with Spark like this:
dataframe_mysql = spark.read.format('jdbc').options(
url='jdbc:mysql://xxxxxxx',
driver='com.mysql.cj.jdbc.Driver',
dbtable='(select * from test_table where id > 100) t',
user='xxxxxx',
password='xxxxxx'
).load()
result = spark.sparkContext.parallelize(dataframe_mysql, 1)
But Spark throws this error:
Traceback (most recent call last):
File "/private/var/www/http/hawk-scripts/hawk_etl/scripts/spark_rds_to_parquet.py", line 46, in
process()
File "/private/var/www/http/hawk-scripts/hawk_etl/scripts/spark_rds_to_parquet.py", line 36, in process
result = spark.sparkContext.parallelize(dataframe_mysql, 1).map(func)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/context.py", line 574, in parallelize
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/context.py", line 611, in _serialize_to_jvm
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 211, in dump_stream
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 133, in dump_stream
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 143, in _write_with_length
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 427, in dumps
TypeError: cannot pickle '_thread.RLock' object
Am I using it wrong? How should I use SparkContext.parallelize to process a DataFrame?
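For background on what is failing here: `parallelize` expects a local Python collection and pickles it so it can be shipped to the executors. A DataFrame is not a local collection; it wraps a reference into the JVM, and that wrapper holds thread locks, which `pickle` refuses to serialize. The same error can be reproduced with nothing but the standard library:

```python
import pickle
import threading

# A DataFrame's internal state includes objects like this lock; pickling
# any of them raises the exact TypeError seen in the traceback above.
lock = threading.RLock()
try:
    pickle.dumps(lock)
except TypeError as e:
    print(e)  # cannot pickle '_thread.RLock' object
```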
I figured it out: dataframe_mysql is already a DataFrame. To get an RDD from it, use dataframe.rdd instead of parallelize.
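Putting that together, a sketch of the corrected script (same placeholder JDBC URL and credentials as above; this needs a live SparkSession and a reachable MySQL instance to actually run). The data already lives on the cluster, so there is nothing to `parallelize`; `.rdd` exposes the DataFrame as an RDD of `Row` objects that `map` can consume:

```python
# Assumes `spark` is an existing SparkSession and `func` is the mapping
# function from the original script.
dataframe_mysql = spark.read.format('jdbc').options(
    url='jdbc:mysql://xxxxxxx',
    driver='com.mysql.cj.jdbc.Driver',
    dbtable='(select * from test_table where id > 100) t',
    user='xxxxxx',
    password='xxxxxx'
).load()

# .rdd instead of spark.sparkContext.parallelize(dataframe_mysql, 1):
# each element of the RDD is a pyspark.sql.Row.
result = dataframe_mysql.rdd.map(func)
```

If a specific partition count is still wanted (the original passed `1` to `parallelize`), `dataframe_mysql.rdd.coalesce(1)` before the `map` achieves that.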