Does a 'Broken pipe' exception cancel my job?

I am currently running a Flink program on a remote cluster of 4 machines using 144 TaskSlots. After running for about 30 minutes, I received the following error:

INFO org.apache.flink.runtime.jobmanager.web.JobManagerInfoServlet - Info server for jobmanager: Failed to write json updates for job b2eaff8539c8c9b696826e69fb40ca14, because org.eclipse.jetty.io.RuntimeIOException: org.eclipse.jetty.io.EofException
    at org.eclipse.jetty.io.UncheckedPrintWriter.setError(UncheckedPrintWriter.java:107)
    at org.eclipse.jetty.io.UncheckedPrintWriter.write(UncheckedPrintWriter.java:280)
    at org.eclipse.jetty.io.UncheckedPrintWriter.write(UncheckedPrintWriter.java:295)
    at org.apache.flink.runtime.jobmanager.web.JobManagerInfoServlet.writeJsonUpdatesForJob(JobManagerInfoServlet.java:588)
    at org.apache.flink.runtime.jobmanager.web.JobManagerInfoServlet.doGet(JobManagerInfoServlet.java:209)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:734)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:847)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:532)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:227)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:965)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:388)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:187)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:901)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
    at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:47)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:113)
    at org.eclipse.jetty.server.Server.handle(Server.java:352)
    at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:596)
    at org.eclipse.jetty.server.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:1048)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:549)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:211)
    at org.eclipse.jetty.server.HttpConnection.handle(HttpConnection.java:425)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:489)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.run(QueuedThreadPool.java:436)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.eclipse.jetty.io.EofException
    at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:905)
    at org.eclipse.jetty.http.AbstractGenerator.flush(AbstractGenerator.java:427)
    at org.eclipse.jetty.server.HttpOutput.flush(HttpOutput.java:78)
    at org.eclipse.jetty.server.HttpConnection$Output.flush(HttpConnection.java:1139)
    at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:159)
    at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:86)
    at java.io.ByteArrayOutputStream.writeTo(ByteArrayOutputStream.java:154)
    at org.eclipse.jetty.server.HttpWriter.write(HttpWriter.java:258)
    at org.eclipse.jetty.server.HttpWriter.write(HttpWriter.java:107)
    at org.eclipse.jetty.io.UncheckedPrintWriter.write(UncheckedPrintWriter.java:271)
    ... 24 more
Caused by: java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:51)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470)
    at org.eclipse.jetty.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:185)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:256)
    at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:849)
    ... 33 more

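I know that java.io.IOException: Broken pipe means the JobManager lost some kind of connection, so I assume the whole job has failed and I have to restart it. However, although I believe the job is no longer running, the web interface still lists it as running. In addition, when I use jps to list the running processes on the cluster, the JobManager is still there. So my question is whether my job is lost, and whether this error just happens randomly from time to time or is caused by my program.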

Edit: My TaskManagers are still sending heartbeats every few seconds and appear to be running.

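This is actually a problem in Flink's web server, in JobManagerInfoServlet. Because of java.io.IOException: Broken pipe at sun.nio.ch.FileDispatcherImpl.write0(Native Method), it could not send the latest JSON updates for the requested job to your browser. So the only thing that failed is a single GET request to the server.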

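Such a failure should not affect the execution of the Flink job that is currently running. Simply refreshing your browser (with Flink's web UI open) should send another GET request, which will hopefully complete successfully.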
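To illustrate why an error like this stays local to one HTTP response, here is a minimal Java sketch. It is not Flink or Jetty code, and the names BrokenPipeDemo and writeToClosedClient are made up for this example. It assumes the typical cause of a broken pipe on a web server: the client (here standing in for the browser) closes the connection while the server is still writing the response.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class BrokenPipeDemo {

    /**
     * Accepts one connection, lets the client close it, then keeps writing
     * until a write fails. Returns the IOException seen on the server side,
     * or null if no write failed within the attempt limit.
     */
    public static IOException writeToClosedClient() throws IOException, InterruptedException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            // The "browser": connects, then goes away without reading the response.
            Socket client = new Socket(loopback, server.getLocalPort());
            try (Socket connection = server.accept()) {
                client.close();
                OutputStream out = connection.getOutputStream();
                byte[] chunk = new byte[64 * 1024];
                for (int attempt = 0; attempt < 100; attempt++) {
                    try {
                        out.write(chunk);
                        out.flush();
                    } catch (IOException e) {
                        // The message depends on the OS, e.g. "Broken pipe"
                        // or "Connection reset".
                        return e;
                    }
                    // Give the peer's reset time to arrive before the next write.
                    Thread.sleep(20);
                }
                return null;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        IOException failure = writeToClosedClient();
        System.out.println("Write to the closed connection failed with: " + failure);
        System.out.println("The JVM is still running; only this one response was lost.");
    }
}
```

The exception is raised only on the thread that was writing to the vanished client. Nothing else in the process is touched, which is consistent with the JobManager still showing up in jps and the TaskManagers still sending heartbeats.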