How to fetch Spark Streaming job statistics using REST calls when running in yarn-cluster mode

Question

I have a Spark Streaming program running on a YARN cluster in "yarn-cluster" mode (--master yarn-cluster). I want to fetch Spark job statistics in JSON format using REST APIs. I can fetch basic statistics with the REST URL http://yarn-cluster:8088/proxy/application_1446697245218_0091/metrics/json, but this returns only very basic statistics.

However, I want per-executor or per-RDD statistics. How can I get these with REST calls, and where can I find the exact REST URLs for them? The $SPARK_HOME/conf/metrics.properties file does reveal something about the URLs, namely:

5. MetricsServlet is added by default as a sink in master, worker and client driver, you can send http request "/metrics/json" to get a snapshot of all the registered metrics in json format. For master, requests "/metrics/master/json" and "/metrics/applications/json" can be sent seperately to get metrics snapshot of instance master and applications. MetricsServlet may not be configured by self.

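But those requests return HTML pages, not JSON. Only "/metrics/json" returns statistics in JSON format. On top of that, finding out the application_id programmatically when running in yarn-cluster mode is a challenge in itself.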

I checked the REST API section of the Spark Monitoring page, but it did not work when the Spark job runs in yarn-cluster mode. Any pointers/answers are welcome.

Answer 1

You should be able to access the Spark REST API using:

http://yarn-cluster:8088/proxy/application_1446697245218_0091/api/v1/applications/

From there you can select the app-id from the list and then use the following endpoint to get information about executors, for example:

http://yarn-cluster:8088/proxy/application_1446697245218_0091/api/v1/applications/{app-id}/executors

I verified this with my Spark Streaming application, which runs in yarn-cluster mode.
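
The two URLs above can also be requested from a script. The following is a minimal sketch, assuming the YARN proxy is reachable without authentication and that each entry in the applications list carries an "id" field. The helper names (spark_api_base, executors_url, fetch_executors) are made up for illustration and are not part of Spark.

```python
import json
from urllib.request import urlopen


def spark_api_base(proxy_root, yarn_app_id):
    """Build the Spark REST API root behind the YARN proxy."""
    return f"{proxy_root.rstrip('/')}/proxy/{yarn_app_id}/api/v1"


def executors_url(proxy_root, yarn_app_id, spark_app_id):
    """Build the executors endpoint for one Spark application."""
    base = spark_api_base(proxy_root, yarn_app_id)
    return f"{base}/applications/{spark_app_id}/executors"


def get_json(url, timeout=10):
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def fetch_executors(proxy_root, yarn_app_id):
    """Return {spark_app_id: executor list} for every listed application."""
    base = spark_api_base(proxy_root, yarn_app_id)
    apps = get_json(f"{base}/applications")
    return {
        app["id"]: get_json(executors_url(proxy_root, yarn_app_id, app["id"]))
        for app in apps
    }
```

Calling fetch_executors("http://yarn-cluster:8088", "application_1446697245218_0091") would issue the same requests as the URLs shown above; the host and application ID are the ones from the question and have to be replaced with your own.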

I will explain how I got the JSON response using a web browser. (This is for a Spark 1.5.2 streaming application in yarn-cluster mode.)

First, use the Hadoop URL to view the running applications: http://{yarn-cluster}:8088/cluster/apps/RUNNING.

Next, select a running application, say http://{yarn-cluster}:8088/cluster/app/application_1450927949656_0021.

Next, click the TrackingUrl link. This goes through a proxy, and the port is different in my case: http://{yarn-proxy}l:20888/proxy/application_1450927949656_0021/. This shows the Spark UI. Now append api/v1/applications to this URL: http://{yarn-proxy}l:20888/proxy/application_1450927949656_0021/api/v1/applications.

You should see a JSON response containing the application name supplied to SparkConf and the start time of the application.
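
The browser steps above can probably be scripted as well, which also addresses the problem of finding the application ID programmatically. This is a rough sketch, assuming the YARN ResourceManager REST API is enabled at /ws/v1/cluster/apps and that the streaming job is reported with applicationType "SPARK" and a full trackingUrl. The function names and the optional app_name filter are illustrative only.

```python
import json
from urllib.request import urlopen


def running_apps_url(rm_root):
    """ResourceManager endpoint that lists applications in state RUNNING."""
    return f"{rm_root.rstrip('/')}/ws/v1/cluster/apps?states=RUNNING"


def spark_tracking_urls(rm_response, app_name=None):
    """Map YARN application IDs to Spark REST URLs.

    rm_response is the decoded JSON from the ResourceManager. The "apps"
    value can be null when nothing is running, so it is guarded here.
    """
    apps = (rm_response.get("apps") or {}).get("app") or []
    found = {}
    for app in apps:
        if app.get("applicationType") != "SPARK":
            continue
        if app_name is not None and app.get("name") != app_name:
            continue
        # Same step as in the browser: append api/v1/applications
        # to the tracking URL.
        found[app["id"]] = app["trackingUrl"].rstrip("/") + "/api/v1/applications"
    return found


def discover(rm_root, app_name=None, timeout=10):
    """Query the ResourceManager and return the Spark REST URLs."""
    with urlopen(running_apps_url(rm_root), timeout=timeout) as resp:
        return spark_tracking_urls(json.load(resp), app_name)
```

For example, discover("http://{yarn-cluster}:8088", app_name="my-streaming-app") would return the URLs to query next, where "my-streaming-app" stands for whatever name was passed to SparkConf.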

Answer 2

You will need to go through the HTML pages to get the relevant metrics. There is no Spark REST endpoint for capturing this information.

Answer 3

I was able to reconstruct the metrics shown in the columns of the Spark Streaming web UI (batch start time, processing delay, scheduling delay) using the /jobs/ endpoint.

The script I used is available here. I wrote a short post describing it and tying its functionality back to the Spark codebase. This does not require any web scraping.

It works for Spark 2.0.0 and YARN 2.7.2, but it may work for other version combinations as well.
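
The script mentioned above is not reproduced here. As a loose illustration of the kind of data the /jobs/ endpoint returns, the sketch below computes the wall-clock duration of each finished job from its submissionTime and completionTime fields. It assumes the timestamp format used by the Spark REST API (for example 2016-08-01T10:00:00.500GMT) and takes the already decoded JSON list from .../api/v1/applications/{app-id}/jobs as input. Per-job duration is only a rough proxy for processing delay; scheduling delay cannot be derived this way.

```python
from datetime import datetime, timezone

# Assumed timestamp layout of the Spark REST API: millisecond precision
# followed by a literal "GMT" suffix.
SPARK_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fGMT"


def parse_spark_time(value):
    """Parse a Spark REST timestamp into an aware UTC datetime."""
    return datetime.strptime(value, SPARK_TIME_FORMAT).replace(tzinfo=timezone.utc)


def job_durations(jobs):
    """Return (jobId, submission time, duration in seconds) per finished job.

    Jobs that are still running have no completionTime and are skipped.
    The result is ordered by submission time.
    """
    rows = []
    for job in jobs:
        if "submissionTime" not in job or "completionTime" not in job:
            continue
        start = parse_spark_time(job["submissionTime"])
        end = parse_spark_time(job["completionTime"])
        rows.append((job["jobId"], start, (end - start).total_seconds()))
    return sorted(rows, key=lambda row: row[1])
```

Grouping these rows by batch would be needed to approximate the per-batch columns of the streaming UI; how to do that depends on the Spark version and is left out here.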
