Google PubSub returning google.gax.errors.GaxError: GaxError RPC failed caused by ... StatusCode.UNAVAILABLE

We are trying to do a simple publish to an existing topic after an event occurs on one of our distributed systems.

The code looks like this:

try:
  dat = data.encode('utf-8')
  topic.publish(dat)
except:
  <code to recover>

If we catch everything with a bare except and print the traceback, we get:

google.gax.errors.GaxError: GaxError(RPC failed, caused by <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, {"created":"@1478711654.067744009","description":"Secure read failed","file":"src/core/lib/security/transport/secure_endpoint.c","file_line":157,"grpc_status":14,"referenced_errors":[{"created":"@1478711654.067706801","description":"EOF","file":"src/core/lib/iomgr/tcp_posix.c","file_line":235}]})>

(The full error is below.)

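Looking at http://gcloud-python.readthedocs.io/en/latest/pubsub-topic.html#google.cloud.pubsub.topic.Topic.publish, this GAX error does not seem to be something we should expect from publish. However, if we catch the error and retry with exponential backoff, it usually works the second time.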

I found this discussion, which describes a potential bug in _gax_python, but it does not seem to be related. Any ideas about what we might be doing wrong here?

Full error:

Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pp/pp/pp/process/uploader.py", line 145, in upload_thread
    topic.publish(byte_string)
  File "/home/pp/.virtualenvs/cv/lib/python3.5/site-packages/google/cloud/pubsub/topic.py", line 257, in publish
    message_ids = api.topic_publish(self.full_name, [message_data])
  File "/home/pp/.virtualenvs/cv/lib/python3.5/site-packages/google/cloud/pubsub/_gax.py", line 165, in topic_publish
    options=options)
  File "/home/pp/.virtualenvs/cv/lib/python3.5/site-packages/google/cloud/gapic/pubsub/v1/publisher_api.py", line 289, in publish
    return self._publish(request, options)
  File "/home/pp/.virtualenvs/cv/lib/python3.5/site-packages/google/gax/api_callable.py", line 481, in inner
    return api_caller(api_call, this_settings, request)
  File "/home/pp/.virtualenvs/cv/lib/python3.5/site-packages/google/gax/api_callable.py", line 158, in inner
    return a_func(request, **kwargs)
  File "/home/pp/.virtualenvs/cv/lib/python3.5/site-packages/google/gax/api_callable.py", line 434, in inner
    errors.create_error('RPC failed', cause=exception))
  File "/home/pp/.virtualenvs/cv/lib/python3.5/site-packages/future/utils/__init__.py", line 419, in raise_with_traceback
    raise exc.with_traceback(traceback)
  File "/home/pp/.virtualenvs/cv/lib/python3.5/site-packages/google/gax/api_callable.py", line 430, in inner
    return a_func(*args, **kwargs)
  File "/home/pp/.virtualenvs/cv/lib/python3.5/site-packages/google/gax/api_callable.py", line 64, in inner
    return a_func(*updated_args, **kwargs)
  File "/home/pp/.virtualenvs/cv/lib/python3.5/site-packages/grpc/_channel.py", line 481, in __call__
    return _end_unary_response_blocking(state, False, deadline)
  File "/home/pp/.virtualenvs/cv/lib/python3.5/site-packages/grpc/_channel.py", line 432, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
google.gax.errors.GaxError: GaxError(RPC failed, caused by <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, {"created":"@1478711654.067744009","description":"Secure read failed","file":"src/core/lib/security/transport/secure_endpoint.c","file_line":157,"grpc_status":14,"referenced_errors":[{"created":"@1478711654.067706801","description":"EOF","file":"src/core/lib/iomgr/tcp_posix.c","file_line":235}]})>

The relevant discussion you are looking for appears to be issue 2683, "Frequent gRPC StatusCode.UNAVAILABLE errors".

You are not doing anything wrong. Catching the exception and retrying seems to be the most appropriate workaround for now.
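For anyone who wants the retry in one place, here is a minimal sketch of catch-and-retry with exponential backoff. The names call_with_backoff and looks_unavailable are made up for this example and are not part of any Google library. The check on the exception text is an assumption based on the traceback in the question, where the GaxError message contains "StatusCode.UNAVAILABLE".

```python
import random
import time


def call_with_backoff(func, is_retryable, max_attempts=5,
                      base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call func(); on a retryable exception, wait and try again.

    The delay doubles after each failed attempt (capped at max_delay) and is
    scaled by a random factor so parallel workers do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            # Give up on non-retryable errors or once attempts are exhausted.
            if attempt == max_attempts - 1 or not is_retryable(exc):
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * random.uniform(0.5, 1.0))


def looks_unavailable(exc):
    # Assumption: the error text names the gRPC status, as in the traceback
    # shown in the question.
    return 'StatusCode.UNAVAILABLE' in str(exc)
```

With the code from the question, the call would be something like call_with_backoff(lambda: topic.publish(dat), looks_unavailable). Any other exception is re-raised immediately, so real bugs are not hidden by the retry loop.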

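If the topic is a global variable, it stops producing the errors. Make the topic a class variable and instantiate it only once, that is, call this line only once: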

topic = pubsub.Client().topic(name)

Also, this seems to apply only to Python 2.7. In Python 3.6, retrying eases the pain somewhat.
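A rough sketch of the "instantiate it only once" idea, in case it helps. TopicHolder is a hypothetical helper written for this example, not something from the pubsub library. It takes a factory function so the topic is built on first use and then shared, and it uses a lock because the traceback in the question shows publish being called from a worker thread.

```python
import threading


class TopicHolder(object):
    """Creates the topic once and hands the same object to every caller."""

    _topic = None
    _lock = threading.Lock()

    @classmethod
    def get(cls, factory):
        # The lock keeps two threads from both running the factory
        # on first use.
        with cls._lock:
            if cls._topic is None:
                cls._topic = factory()
            return cls._topic
```

Usage would be something like topic = TopicHolder.get(lambda: pubsub.Client().topic(name)). Later calls return the cached object and ignore the factory.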

Disabling gRPC works for Python 3.6. This can be done by setting an environment variable:

ENV GOOGLE_CLOUD_DISABLE_GRPC=true

I managed to find a "not so pretty" workaround: use a policy that replicates the deadline_exceeded handling in google.cloud.pubsub_v1.subscriber.policy.thread.Policy.on_exception.

from google.cloud.pubsub_v1.subscriber.policy.thread import Policy
import grpc

class UnavailableHackPolicy(Policy):
    def on_exception(self, exception):
        """
        There is an issue on the grpc channel that raises an UNAVAILABLE exception now and
        then. Until that issue is fixed, we need to protect our consumer thread from breaking.
        https://github.com/GoogleCloudPlatform/google-cloud-python/issues/2683
        """
        unavailable = grpc.StatusCode.UNAVAILABLE
        if getattr(exception, 'code', lambda: None)() in [unavailable]:
            print("¡OrbitalHack! - {}".format(exception))
            return
        return super(UnavailableHackPolicy, self).on_exception(exception)

For receiving messages, I have code like this:

subscriber = pubsub.SubscriberClient(policy_class=UnavailableHackPolicy)
subscription_path = subscriber.subscription_path(project, subscription_name)
subscriber.subscribe(subscription_path, callback=callback, flow_control=flow_control)

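The problem is that when the resource is truly unavailable, we will not notice. However, until the gRPC development team manages to solve this issue, we will keep using this workaround.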
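One possible way to narrow that blind spot, as a sketch only: swallow UNAVAILABLE only while the run of consecutive failures is short, and let the exception through once it gets long. ConsecutiveFailureGate is a hypothetical helper written for this example, and the threshold of 5 is an arbitrary value, not a recommendation from the library.

```python
class ConsecutiveFailureGate(object):
    """Tolerates a limited run of failures, then signals it is time to escalate."""

    def __init__(self, max_consecutive=5):
        self.max_consecutive = max_consecutive
        self.consecutive = 0

    def record_success(self):
        # Any successfully handled message resets the streak.
        self.consecutive = 0

    def should_swallow(self):
        # Count this failure; swallow it only while the streak is short.
        self.consecutive += 1
        return self.consecutive <= self.max_consecutive
```

The idea would be to call should_swallow() inside on_exception before returning early, fall through to the parent on_exception when it returns False, and call record_success() from the message callback. I have not tested this against the real subscriber, so treat it as a starting point.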