Spark on Hive progress bar stuck at 10%

Question

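We recently upgraded to Spark 1.6 and tried to use SparkQL as the default query engine for Hive. We added the Spark Gateway role on the same machine as HiveServer2 and enabled the Spark on YARN service. However, when I run a query like the following: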

SET hive.execution.engine=spark;
INSERT OVERWRITE DIRECTORY '/user/someuser/spark_test_job' SELECT country, COUNT(*) FROM country_date GROUP BY country;

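we see that the job is accepted by YARN, resources are allocated, and the status shows as running. However, it shows a constant progress of 10% and never advances in either the Hue or the YARN UI. If I check the Spark UI, the job has completed, and I can actually see the output on HDFS. Has anyone run into a similar problem? Any clues on how to debug this kind of behavior? I am using Cloudera CDH 5.12.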

Answer 1

Just sharing my past experience. Please read this post:

https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Hive-on-Spark-tasks-never-finish/td-p/52565

Hope this helps.

Answer 2

Found the answer. A patch was recently released that addresses this issue. The behavior sits somewhere between a bug and a feature:

When a Hive session is initiated, and a query is submitted to the Spark processing engine, Hive maintains one or more Spark Executors on the cluster until the session is terminated. The initial setup of the Spark processing engine is time intensive. To avoid the overhead of having to create a new Spark processing engine for each query submitted, Hive maintains a Spark Application Master (YARN Spark Driver) and one or more Spark Executors for each Hive session. The trade-off however is that the Spark components will consume resources on YARN even though they may be in an idle phase, between queries, for long periods of time.

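So, to work around this issue without applying the patch, you should either terminate the Hive session or switch back to the MapReduce engine after the query completes. If you use Hue, only the second option is available.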
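A minimal sketch of the second option, assuming the same session and the same query as in the question. It relies only on the standard `hive.execution.engine` property, set back to `mr` once the Spark query has finished:

```sql
-- Run the query on Spark, as in the question.
SET hive.execution.engine=spark;
INSERT OVERWRITE DIRECTORY '/user/someuser/spark_test_job'
SELECT country, COUNT(*) FROM country_date GROUP BY country;

-- After the query completes, switch this session back to MapReduce.
SET hive.execution.engine=mr;
```

As described above, the idea is that the session stops holding on to its idle Spark application on YARN. How quickly the application is actually released may depend on the CDH version and on whether the patch is installed.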

hadoop

hive

hadoop-yarn

apache-spark

cloudera-cdh