Getting GeoSpark error with upload_jars function
I am trying to run GeoSpark on an AWS EMR cluster. The code is:
# coding=utf-8
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t
from geospark.register import GeoSparkRegistrator
from geospark.utils import GeoSparkKryoRegistrator
from geospark.register import upload_jars
import config as cf
import yaml
if __name__ == "__main__":
    # Read files
    with open("/tmp/param.yml", 'r') as ymlfile:
        param = yaml.load(ymlfile, Loader=yaml.SafeLoader)

    # Register jars
    upload_jars()

    # Creation of spark session
    print("Creating Spark session")
    spark = SparkSession \
        .builder \
        .getOrCreate()

    GeoSparkRegistrator.registerAll(spark)
I get the following error in the upload_jars() function:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/findspark.py", line 143, in init
    py4j = glob(os.path.join(spark_python, "lib", "py4j-*.zip"))[0]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "geo_processing.py", line 21, in <module>
    upload_jars()
  File "/usr/local/lib/python3.7/site-packages/geospark/register/uploading.py", line 39, in upload_jars
    findspark.init()
  File "/usr/local/lib/python3.7/site-packages/findspark.py", line 146, in init
    "Unable to find py4j, your SPARK_HOME may not be configured correctly"
Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly
How can I solve this error?
Solution
You should remove upload_jars() from your code and load the jars another way: either copy them into SPARK_HOME (located at /usr/lib/spark as of emr-4.0.0) as part of an EMR bootstrap action, or pass them with the --jars option of your spark-submit command.
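As a minimal sketch of the second alternative expressed in the driver itself, the same effect as --jars can be obtained by setting spark.jars on the SparkSession builder. The jar paths below are placeholders, not the actual GeoSpark artifact locations:

# Minimal sketch: load the GeoSpark jars via spark.jars instead of upload_jars().
# The paths below are placeholders; point them at wherever the GeoSpark jars
# were staged (e.g. by a bootstrap action).
from pyspark.sql import SparkSession
from geospark.register import GeoSparkRegistrator

spark = SparkSession \
    .builder \
    .config("spark.jars", "/tmp/geospark_jars/geospark.jar,/tmp/geospark_jars/geospark-sql.jar") \
    .getOrCreate()

GeoSparkRegistrator.registerAll(spark)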
Explanation
I could not get the upload_jars() function to work on a multi-node EMR cluster. According to the geospark documentation, upload_jars():
uses findspark Python package to upload jar files to executor and nodes. To avoid copying all the time, jar files can be put in directory SPARK_HOME/jars or any other path specified in Spark config files.
Spark is installed in YARN mode on EMR, which means it is only installed on the master node, not on the core/task nodes. As a result, findspark does not find Spark on the core/task nodes, and you get the error Unable to find py4j, your SPARK_HOME may not be configured correctly.
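For illustration only, the check that fails inside findspark can be reproduced by hand. This diagnostic sketch mirrors the glob shown in the traceback and simply prints what is (or is not) found on the node where it runs:

# Diagnostic sketch: reproduce what findspark.init() looks for.
import os
from glob import glob

spark_home = os.environ.get("SPARK_HOME")
print("SPARK_HOME:", spark_home)

if spark_home:
    # findspark expects a py4j-*.zip under $SPARK_HOME/python/lib;
    # an empty list here corresponds to the IndexError in the traceback.
    print(glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip")))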