"No module named 'pandas' " error occurs when using pyspark pandas_udf with AWS EMR
I am running the code from the Spark documentation (https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#co-grouped-map) with Zeppelin on AWS EMR.
    %pyspark
    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    df1 = spark.createDataFrame(
        [(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0), (20000102, 2, 4.0)],
        ("time", "id", "v1"))
    df2 = spark.createDataFrame(
        [(20000101, 1, "x"), (20000101, 2, "y")],
        ("time", "id", "v2"))

    def asof_join(l, r):
        return pd.merge_asof(l, r, on="time", by="id")

    df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
        asof_join, schema="time int, id int, v1 double, v2 string").show()
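For reference, `asof_join` itself is plain pandas; a minimal standalone sketch (no Spark involved, data reduced to the `id=1` rows) of what `pd.merge_asof` produces here:

```python
import pandas as pd

# Left and right frames, both sorted by the "on" key, as merge_asof requires
l = pd.DataFrame({"time": [20000101, 20000102], "id": [1, 1], "v1": [1.0, 3.0]})
r = pd.DataFrame({"time": [20000101], "id": [1], "v2": ["x"]})

# For each left row, take the most recent right row (same id) with time <= left time
out = pd.merge_asof(l, r, on="time", by="id")
print(out)
```

Both left rows pick up `v2 = "x"`, since 20000101 is the latest right-side time at or before each left-side time.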
Running the last statement raises "ModuleNotFoundError: No module named 'pandas'":

    df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
        asof_join, schema="time int, id int, v1 double, v2 string").show()
> pyspark.sql.utils.PythonException: An exception was thrown from Python worker in the executor. The below is the Python worker stacktrace.
> Traceback (most recent call last):
>   File "/mnt/yarn/usercache/zeppelin/appcache/application_1765329837897_0004/container_1765329837897_0004_01_000026/pyspark.zip/pyspark/worker.py", line 589, in main
>     func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
>   File "/mnt/yarn/usercache/zeppelin/appcache/application_1765329837897_0004/container_1765329837897_0004_01_000026/pyspark.zip/pyspark/worker.py", line 434, in read_udfs
>     arg_offsets, f = read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=0)
>   File "/mnt/yarn/usercache/zeppelin/appcache/application_1765329837897_0004/container_1765329837897_0004_01_000026/pyspark.zip/pyspark/worker.py", line 254, in read_single_udf
>     f, return_type = read_command(pickleSer, infile)
>   File "/mnt/yarn/usercache/zeppelin/appcache/application_1765329837897_0004/container_1765329837897_0004_01_000026/pyspark.zip/pyspark/worker.py", line 74, in read_command
>     command = serializer._read_with_length(file)
>   File "/mnt/yarn/usercache/zeppelin/appcache/application_1765329837897_0004/container_1765329837897_0004_01_000026/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_length
>     return self.loads(obj)
>   File "/mnt/yarn/usercache/zeppelin/appcache/application_1765329837897_0004/container_1765329837897_0004_01_000026/pyspark.zip/pyspark/serializers.py", line 458, in loads
>     return pickle.loads(obj, encoding=encoding)
>   File "/mnt/yarn/usercache/zeppelin/appcache/application_1765329837897_0004/container_1765329837897_0004_01_000026/pyspark.zip/pyspark/cloudpickle.py", line 1110, in subimport
>     __import__(name)
> ModuleNotFoundError: No module named 'pandas'
The library versions in use are:

> pyspark 3.0.0
> Spark 3.0.0
> pyarrow 0.15.1
> Zeppelin 0.9.0

and the zeppelin.pyspark.python configuration property is set to python3.
Since pandas was not installed in the stock EMR environment, I installed it with "sudo python3 -m pip install pandas". I have confirmed that running "import pandas" in Zeppelin works fine. However, when I use pandas_udf from pyspark, I get the error that pandas cannot be found. Why is this, and how can I do it correctly?
Putting "sudo python3 -m pip install pandas" into a shell script used as an EMR bootstrap action solves this. Installing pandas by hand only affects the master node, where Zeppelin and the Spark driver run; the pandas_udf executes inside Python workers on the core/task nodes, which still lack pandas. A bootstrap action runs on every node of the cluster, so the executors get the package as well.
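A minimal sketch of such a bootstrap script (the script name and S3 path below are placeholders, not from the original question):

```shell
#!/bin/bash
# EMR bootstrap action: executed on every node (master, core, task) at cluster
# startup, so the Python workers on the executors also get pandas.
set -euxo pipefail

sudo python3 -m pip install pandas
```

Upload the script to S3 and reference it when creating the cluster, e.g. with the AWS CLI: `aws emr create-cluster ... --bootstrap-actions Path=s3://your-bucket/install-pandas.sh` (bucket and file name are placeholders).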