Module error on AWS EMR when running PySpark code through Apache Livy via a Lambda function

I am running PySpark code on an AWS EMR cluster. I submit the job as a Livy batch from a Lambda function, passing the Spark properties in the request payload:

import requests
import json

def lambda_handler(event, context):
    # Resolve the EMR master node's public DNS from the incoming event
    master_dns = event.get('clusterDetails', {}).get('Cluster', {}).get('MasterPublicDnsName')

    headers = {"Content-Type": "application/json"}

    # Livy's REST API listens on port 8998; POST /batches submits a batch job
    url = "http://" + master_dns + ":8998/batches"
    print(url)

    payload = {
        "file": "s3://dtrack-test/epay/usap/USAPPIDBAL/scripts/spark_wc.py",
        "args": [
            "s3://dtrack-test/epay/usap/USAPPIDBAL/raw_data/sample-test.txt",
            "s3://dtrack-test/epay/usap/USAPPIDBAL/sample-op/"
        ]
    }
    res = requests.post(url, data=json.dumps(payload), headers=headers)
    return res.json()
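(For reference, once the batch is submitted its progress can be polled through Livy's REST API; a minimal sketch, assuming the batch id is taken from the `id` field of the JSON returned by the POST above:)

def get_batch_state(master_dns, batch_id):
    # GET /batches/{batchId} returns the batch metadata, including its state
    # (e.g. "starting", "running", "success", "dead")
    url = "http://" + master_dns + ":8998/batches/" + str(batch_id)
    res = requests.get(url, headers={"Content-Type": "application/json"})
    return res.json().get("state")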

But the job fails with the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 49, ip-172-31-16-64.ap-south-1.compute.internal, executor 1): org.apache.spark.SparkException:
Error from python worker:
  /usr/bin/python3: Error while finding module specification for 'pyspark.daemon' (ModuleNotFoundError: No module named 'pyspark')
PYTHONPATH was:
  /mnt/yarn/usercache/livy/filecache/10/__spark_libs__1402648699103959205.zip/spark-core_2.11-2.4.5-amzn-0.jar
org.apache.spark.SparkException: No port number in pyspark.daemon's stdout

I had set the configuration livy.master to local; when I remove this setting, everything works fine.
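One way to avoid a cluster-wide master override is to pass the Spark settings per batch in the request's "conf" map, which Livy's batches API accepts. A sketch under the assumption that the job should run on YARN, as EMR configures by default:

payload = {
    "file": "s3://dtrack-test/epay/usap/USAPPIDBAL/scripts/spark_wc.py",
    "args": [
        "s3://dtrack-test/epay/usap/USAPPIDBAL/raw_data/sample-test.txt",
        "s3://dtrack-test/epay/usap/USAPPIDBAL/sample-op/"
    ],
    # Per-batch Spark configuration; "cluster" deploy mode runs the driver
    # inside YARN, where pyspark is on the workers' PYTHONPATH
    "conf": {
        "spark.submit.deployMode": "cluster"
    }
}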