Kubernetes MLflow Service Pod Connection

I have deployed a build of mlflow as a pod in my Kubernetes cluster. I can port-forward to the mlflow UI, and now I am trying to test it. To do this, I run the following test in a Jupyter notebook running in another pod in the same cluster.

import mlflow

print("Setting Tracking Server")
tracking_uri = "http://mlflow-tracking-server.default.svc.cluster.local:5000"

mlflow.set_tracking_uri(tracking_uri)

print("Logging Artifact")
mlflow.log_artifact('/home/test/mlflow-example-artifact.png')

print("DONE")

When I run this, I get:

ConnectionError: HTTPConnectionPool(host='mlflow-tracking-server.default.svc.cluster.local', port=5000): Max retries exceeded with url: /api/2.0/mlflow/runs/get? (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object>: Failed to establish a new connection: [Errno 111] Connection refused'))

The way I deployed the mlflow pod is shown below, in YAML and Docker:

Yaml:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-tracking-server
  namespace: default
spec:
  selector:
    matchLabels:
      app: mlflow-tracking-server
  replicas: 1
  template:
    metadata:
      labels:
        app: mlflow-tracking-server
    spec:
      containers:
      - name: mlflow-tracking-server
        image: <ECR_IMAGE>
        ports:
        - containerPort: 5000
        env:
        - name: AWS_MLFLOW_BUCKET
          value: <S3_BUCKET>
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: aws-secret
              key: AWS_ACCESS_KEY_ID
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: aws-secret
              key: AWS_SECRET_ACCESS_KEY

---
apiVersion: v1
kind: Service
metadata:
  name: mlflow-tracking-server
  namespace: default
  labels:
    app: mlflow-tracking-server
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
spec:
  externalTrafficPolicy: Local
  type: LoadBalancer
  selector:
    app: mlflow-tracking-server
  ports:
    - name: http
      port: 5000
      targetPort: http

With the Dockerfile calling a script that runs the mlflow server command — mlflow server --default-artifact-root ${AWS_MLFLOW_BUCKET} --host 0.0.0.0 --port 5000 — I cannot connect to the service I created with that mlflow pod.

I have tried using the tracking URI http://mlflow-tracking-server.default.svc.cluster.local:5000, and I have tried using the service's EXTERNAL-IP:5000, but nothing I try can connect and log using the service. Am I missing something in deploying my mlflow server pod to my Kubernetes cluster?

Your mlflow-tracking-server Service should be of type ClusterIP, not LoadBalancer.

Both pods are in the same Kubernetes cluster, so there is no reason to use the LoadBalancer Service type.
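A sketch of the same Service reduced to the default ClusterIP type might look like the following. Names, namespace, selector, and port are taken from the question's manifest; note also that the original Service uses targetPort: http, which refers to a container port *name*, but the container only declares containerPort: 5000 without a name — so the sketch uses the numeric port instead:

```yaml
# Sketch: the question's Service as a plain ClusterIP Service.
# The NLB annotation and externalTrafficPolicy only apply to LoadBalancer
# Services and are dropped here.
apiVersion: v1
kind: Service
metadata:
  name: mlflow-tracking-server
  namespace: default
  labels:
    app: mlflow-tracking-server
spec:
  type: ClusterIP
  selector:
    app: mlflow-tracking-server
  ports:
    - name: http
      port: 5000
      # Numeric targetPort: the Deployment's containerPort 5000 is unnamed,
      # so "targetPort: http" would never resolve to it.
      targetPort: 5000
```

With this in place, http://mlflow-tracking-server.default.svc.cluster.local:5000 resolves and routes entirely inside the cluster.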

For some parts of your application (for example, frontends) you may want to expose a Service onto an external IP address, that’s outside of your cluster. Kubernetes ServiceTypes allow you to specify what kind of Service you want. The default is ClusterIP.

Type values and their behaviors are:

  • ClusterIP: Exposes the Service on a cluster-internal IP. Choosing this value makes the Service only reachable from within the cluster. This is the default ServiceType.

  • NodePort: Exposes the Service on each Node’s IP at a static port (the NodePort). A ClusterIP Service, to which the NodePort Service routes, is automatically created. You’ll be able to contact the NodePort Service, from outside the cluster, by requesting <NodeIP>:<NodePort>.

  • LoadBalancer: Exposes the Service externally using a cloud provider’s load balancer. NodePort and ClusterIP Services, to which the external load balancer routes, are automatically created.
  • ExternalName: Maps the Service to the contents of the externalName field (e.g. foo.bar.example.com), by returning a CNAME record with its value. No proxying of any kind is set up.

kubernetes.io

So, to oversimplify: you cannot reach the mlflow URI from the jupyterhub pod. What I would do here is check the jupyterhub pod's proxy settings. If .svc is not in NO_PROXY, you have to add it. The detailed reason is that you are accessing the internal .svc mlflow URL as if it were on the open internet, when in reality your mlflow URI is only reachable from inside the cluster. If adding .svc does not work because there is no proxy in play, we can dig into it more deeply. The way to check the proxy is 'kubectl get po $JHPODNAME -n $JHNamespace -o yaml'.
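If the notebook pod does route traffic through an HTTP proxy, the exclusion could be added to its container spec along these lines. This is only an illustrative fragment — the surrounding pod spec and the other NO_PROXY entries are assumptions, not taken from the question:

```yaml
# Hypothetical fragment of the jupyter pod's container spec:
# exclude in-cluster .svc addresses from any HTTP(S) proxy.
# Both upper- and lower-case variants are set, since tools differ
# in which spelling they honor.
env:
- name: NO_PROXY
  value: ".svc,.svc.cluster.local,localhost,127.0.0.1"
- name: no_proxy
  value: ".svc,.svc.cluster.local,localhost,127.0.0.1"
```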