Setting spark.driver.maxResultSize in an EMR Jupyter notebook

Question

I am using a Jupyter notebook on EMR to process large chunks of data. While processing the data I see this error:

An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 108 tasks (1027.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

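It looks like I need to raise maxResultSize in the Spark configuration. How can I set spark.driver.maxResultSize from a Jupyter notebook?

I have already checked this post:

Also, the EMR notebook already provides a Spark context. Is there a way to edit that Spark context and increase maxResultSize?

Any leads would be very helpful.

Thanks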

Answer 1

You can set the Livy configuration when the Spark session starts; see https://github.com/cloudera/livy#request-body

Put this at the beginning of your code:

%%configure -f
{"conf":{"spark.driver.maxResultSize":"15G"}}

Check your session settings by printing the value in the next cell:

print(spark.conf.get('spark.driver.maxResultSize'))

This should fix the problem.
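As a sanity check on the chosen value, here is a minimal plain-Python sketch (no Spark needed) that renders the %%configure cell text and confirms the new limit exceeds the size reported in the error. The helpers size_to_bytes and configure_cell are hypothetical names invented for this illustration, and the unit table assumes Spark reads suffixes such as m and g as binary multiples.

```python
import json
import re

# Binary multipliers; assumed to mirror how Spark reads size strings
# such as "15G" or "1024m" (no suffix is treated as bytes here).
_UNITS = {
    "": 1, "b": 1,
    "k": 1024, "kb": 1024,
    "m": 1024**2, "mb": 1024**2,
    "g": 1024**3, "gb": 1024**3,
    "t": 1024**4, "tb": 1024**4,
}


def size_to_bytes(value: str) -> int:
    """Convert a size string like '15G' to bytes (hypothetical helper)."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-zA-Z]*)\s*", value)
    if match is None or match.group(2).lower() not in _UNITS:
        raise ValueError(f"unrecognized size string: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2).lower()]


def configure_cell(conf: dict) -> str:
    """Render the text of a %%configure cell for the given Spark conf
    (hypothetical helper)."""
    return "%%configure -f\n" + json.dumps({"conf": conf})


reported_mb = 1027.9  # total serialized result size from the error message
new_limit = "15G"     # value used in the answer above

# The new limit must exceed what the failed job tried to return.
assert size_to_bytes(new_limit) > reported_mb * 1024**2

print(configure_cell({"spark.driver.maxResultSize": new_limit}))
```

Pasting the printed text into the first notebook cell reproduces the configuration above. Since collected results are held in driver memory, a limit this large probably only helps if the driver has enough memory to hold them.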

amazon-emr

apache-spark

spark-notebook

jupyter-notebook