breaking change to distributed training moving from TF v1.3 to v1.4: "UnavailableError: Trying to connect an http1.x server"
When creating a managed session for distributed training with this line:
with sv.managed_session(server.target, config=config) as sess, sess.as_default():
I get this error on the chief worker (full stack trace at the bottom):
tensorflow.python.framework.errors_impl.UnavailableError: Trying to connect an http1.x server
Everything still appears to be fine on the parameter server, which reports:
E1106 11:26:32.844686639 5543 ev_epoll1_linux.c:1051] grpc epoll fd: 8
2017-11-06 11:26:32.851773: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:12222}
2017-11-06 11:26:32.851863: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12223}
2017-11-06 11:26:32.856802: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12222
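I only get this error with the new TensorFlow v1.4 built from source (I see the same problem when installing from pip). Everything works fine in v1.3. Does anyone know whether a breaking change was made, I assume in how TensorFlow works with gRPC?
I wonder whether this has something to do with HTTP/2 versus HTTP/1? As far as I can tell, gRPC runs over HTTP/2 with protobuf, and the message suggests it is trying to connect to an HTTP/1.x server, but that still doesn't explain why it breaks when upgrading from v1.3 to v1.4.
Also, does anyone know what the error
UnavailableError: Trying to connect an http1.x server
refers to, or what a possible workaround might be?
I am using RedHat Linux and trying to run distributed training across processes on the same local host ... not even going over the network. Any ideas would be much appreciated, and hopefully this also helps others who run into the same problem.
Full stack trace: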
E1106 11:28:24.383745692 5787 ev_epoll1_linux.c:1051] grpc epoll fd: 8
2017-11-06 11:28:24.391084: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12222}
2017-11-06 11:28:24.391185: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12223}
2017-11-06 11:28:24.392285: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12223
2017-11-06 11:28:37.875632: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: Trying to connect an http1.x server
Traceback (most recent call last):
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1293, in _run_fn
    self._extend_graph()
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1354, in _extend_graph
    self._session, graph_def.SerializeToString(), status)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: Trying to connect an http1.x server
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/pycharm-community-2017.2.3/helpers/pydev/pydevd.py", line 1599, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/opt/pycharm-community-2017.2.3/helpers/pydev/pydevd.py", line 1026, in run
    pydev_imports.execfile(file, globals, locals) # execute the script
  File "/opt/pycharm-community-2017.2.3/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "worker.py", line 426, in <module>
    main()
  File "worker.py", line 418, in main
    run(args, server)
  File "worker.py", line 174, in run
    sess.run(trainer.sync)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: Trying to connect an http1.x server
If you follow @NoahEisen's suggestion and run
export GRPC_VERBOSITY="DEBUG"
you will see more information, like this:
E1108 17:37:57.085195825 17711 ev_epoll1_linux.c:1051] grpc epoll fd: 5
D1108 17:37:57.085309439 17711 ev_posix.c:111] Using polling engine: epoll1
D1108 17:37:57.085380147 17711 dns_resolver.c:301] Using native dns resolver
I1108 17:37:57.085819333 17711 socket_utils_common_posix.c:223] Disabling AF_INET6 sockets because ::1 is not available.
I1108 17:37:57.086001584 17711 tcp_server_posix.c:322] Failed to add :: listener, the environment may not support IPv6: {"created":"@1510180677.085876868","description":"OS Error","errno":97,"file":"external/grpc/src/core/lib/iomgr/socket_utils_common_posix.c","file_line":256,"os_error":"Address family not supported by protocol","syscall":"socket","target_address":"[::]:12223"}
2017-11-08 17:37:57.092525: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12222}
2017-11-08 17:37:57.092648: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12223}
2017-11-08 17:37:57.093435: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12223
D1108 17:38:02.607109518 17830 http_proxy.c:70] userinfo found in proxy URI
I1108 17:38:02.611335569 17807 http_connect_handshaker.c:304] Connecting to server 127.0.0.1:12222 via HTTP proxy ipv4:xx.xx.xx.xx:xxxx
2017-11-08 17:38:02.617814: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: Trying to connect an http1.x server
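I am behind a proxy, but I am only trying to do distributed training on localhost. For some reason it tries to connect through the proxy, even though the IP 127.0.0.1 should be equivalent to localhost, right? In particular, note this part: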
Connecting to server 127.0.0.1:12222 via HTTP proxy ipv4:xx.xx.xx.xx:xxxx
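I guess this was laziness in my Python code. If I explicitly change ps in the cluster spec to "localhost" instead of the IP 127.0.0.1, everything seems to work again in TF 1.4, because it no longer tries to reach the local host through my proxy server (which, I believe, only speaks HTTP/1.x).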
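For anyone hitting the same thing, here is a rough sketch of the workaround in Python. It makes two assumptions that are not confirmed above: that the gRPC version bundled with TF 1.4 reads the usual http_proxy / no_proxy environment variables when deciding whether to use a proxy, and that adding both "localhost" and "127.0.0.1" to no_proxy is therefore enough to keep loopback traffic off the proxy. The helper name ensure_no_proxy is made up for illustration; the ports are the ones from the logs.

```python
import os


def ensure_no_proxy(hosts, environ=None):
    """Append each host to no_proxy/NO_PROXY unless it is already listed.

    Hypothetical helper: it only edits environment variables and must run
    before TensorFlow (and therefore gRPC) opens any channel.
    """
    env = os.environ if environ is None else environ
    # Keep whatever exclusions are already configured.
    listed = [h.strip() for h in env.get("no_proxy", "").split(",") if h.strip()]
    for host in hosts:
        if host not in listed:
            listed.append(host)
    value = ",".join(listed)
    env["no_proxy"] = value
    env["NO_PROXY"] = value
    return value


# Use the same spelling of the loopback host for every task, so that ps and
# worker are treated the same way by any proxy exclusion rule.
cluster = {
    "ps": ["localhost:12222"],
    "worker": ["localhost:12223"],
}

# Assumption: gRPC honours no_proxy, so this keeps both spellings direct.
ensure_no_proxy(["localhost", "127.0.0.1"])

# The dict above would then be passed to tf.train.ClusterSpec(cluster).
```

Only the switch to "localhost" in the cluster spec is what I actually verified; the no_proxy part is a guess at why the two spellings behaved differently.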
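@PeteWaren - does this constitute an actual bug in TensorFlow or gRPC? Should localhost and 127.0.0.1 be treated as equivalent? Either way, the way this is handled has changed from TF 1.3 to TF 1.4.
Thanks for all the help, everyone.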