Why shouldn't you run Kubernetes pods for longer than an hour from Composer?

The Cloud Composer documentation explicitly states:

Due to an issue with the Kubernetes Python client library, your Kubernetes pods should be designed to take no more than an hour to run.

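However, it gives no further context, and I couldn't find a clearly relevant issue on the Kubernetes Python client project.

To test it, I ran a pod for two hours and saw no problems. What issue creates this limitation, and how does it manifest?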

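I'm not very familiar with either Cloud Composer or the Kubernetes Python client library ecosystem, but sorting the GitHub issue tracker by most commented shows this open issue near the top of the list: https://github.com/kubernetes-client/python/issues/492

It sounds like there is a token expiry problem: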

@yliaog this is an issue for us, as we are running kubernetes pods as batch processes and tracking the state of the pods with a static client. Once the client object is initialized, it does no refresh, and therefore any job that takes longer than 60 minutes will fail. Looking through python-base, it seems like we could make a wrapper class that generates a new client (or refreshes the config) every n minutes, or checks status prior to every call (as @mvle suggested). The best fix would be in swagger-codegen, but a temporary solution would probably be very useful for a lot of people.

- @flylo, https://github.com/kubernetes-client/python/issues/492#issuecomment-376581140
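The wrapper idea mentioned in that comment (build a new client every n minutes) could be sketched roughly as follows. This is only an illustration, not code from the Kubernetes client or from Airflow: `RefreshingClient`, `factory`, `max_age_seconds` and `clock` are hypothetical names, and the sketch assumes that rebuilding the client is enough to pick up a fresh token.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class RefreshingClient(Generic[T]):
    """Hypothetical wrapper: rebuilds the client once it is older than max_age_seconds."""

    def __init__(
        self,
        factory: Callable[[], T],
        max_age_seconds: float = 30 * 60,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._factory = factory          # builds a brand-new client (assumed to re-authenticate)
        self._max_age = max_age_seconds  # keep this well below the one-hour token lifetime
        self._clock = clock              # injectable so the behaviour can be tested without sleeping
        self._client: Optional[T] = None
        self._created_at = 0.0

    def get(self) -> T:
        """Return the cached client, rebuilding it first if it is missing or too old."""
        now = self._clock()
        if self._client is None or now - self._created_at >= self._max_age:
            self._client = self._factory()
            self._created_at = now
        return self._client
```

With the Kubernetes Python client, `factory` might be a function that calls `kubernetes.config.load_kube_config()` and returns `kubernetes.client.CoreV1Api()`, and every status check would then go through `wrapper.get()` instead of a long-lived client object. Whether reloading the config really yields a fresh token depends on the auth provider and the client version, so treat this as a workaround to try rather than a confirmed fix.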

There is more insight here too:

Currently, long-running jobs on GKE always eventually fail with a 404 error (https://bitbucket.org/snakemake/snakemake/issues/932/long-running-jobs-on-kubernetes-fail). We believe that the problem is in the Kubernetes client, as we determined that although _refresh_gcp_token is being called when the token is expired, the next API call still fails with a 404 error.

You can see here that Snakemake uses the Kubernetes Python client.

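https://issues.apache.org/jira/browse/AIRFLOW-3253 is the cause (hopefully my fix will be merged soon). As others have suggested, this affects anyone using the Kubernetes Python client with GCP authentication. If you authenticate with a Kubernetes service account, you should not have a problem.

If you authenticate through gcloud with a GCP service account (for example, when using GKEPodOperator), you will typically see this problem in jobs that take longer than an hour, because the authentication token expires after one hour.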