MLflow Kubernetes Pod Deployment
I am trying to create a Kubernetes pod that runs an MLflow tracking server and stores the MLflow artifacts in a designated S3 location. Below is what I am trying to deploy.
Dockerfile:
FROM python:3.7.0
RUN pip install mlflow==1.0.0
RUN pip install boto3
RUN pip install awscli --upgrade --user
ENV AWS_MLFLOW_BUCKET aws_mlflow_bucket
ENV AWS_ACCESS_KEY_ID aws_access_key_id
ENV AWS_SECRET_ACCESS_KEY aws_secret_access_key
COPY run.sh /
ENTRYPOINT ["/run.sh"]
# docker build -t seedjeffwan/mlflow-tracking-server:1.0.0 .
# 1.0.0 is current mlflow version
run.sh:
#!/bin/sh
set -e
if [ -z $FILE_DIR ]; then
  echo >&2 "FILE_DIR must be set"
  exit 1
fi
if [ -z $AWS_MLFLOW_BUCKET ]; then
  echo >&2 "AWS_MLFLOW_BUCKET must be set"
  exit 1
fi
if [ -z $AWS_ACCESS_KEY_ID ]; then
  echo >&2 "AWS_ACCESS_KEY_ID must be set"
  exit 1
fi
if [ -z $AWS_SECRET_ACCESS_KEY ]; then
  echo >&2 "AWS_SECRET_ACCESS_KEY must be set"
  exit 1
fi
mkdir -p $FILE_DIR && mlflow server \
  --backend-store-uri $FILE_DIR \
  --default-artifact-root s3://${AWS_MLFLOW_BUCKET} \
  --host 0.0.0.0 \
  --port 5000
mlflow.yaml:
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-tracking-server
  namespace: default
spec:
  selector:
    matchLabels:
      app: mlflow-tracking-server
  replicas: 1
  template:
    metadata:
      labels:
        app: mlflow-tracking-server
    spec:
      containers:
      - name: mlflow-tracking-server
        image: seedim/mlflow-tracker-service:v1
        ports:
        - containerPort: 5000
        env:
        # FILE_DIR can not be mount dir, MLFLOW need a empty dir but mount dir has lost+found
        - name: FILE_DIR
          value: /mnt/mlflow/manifest
        - name: AWS_MLFLOW_BUCKET
          value: <aws_s3_bucket>
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: aws-secret
              key: AWS_ACCESS_KEY_ID
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: aws-secret
              key: AWS_SECRET_ACCESS_KEY
        volumeMounts:
        - mountPath: /mnt/mlflow
          name: mlflow-manifest-storage
      volumes:
      - name: mlflow-manifest-storage
        persistentVolumeClaim:
          claimName: mlflow-manifest-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: mlflow-tracking-server
  namespace: default
  labels:
    app: mlflow-tracking-server
spec:
  ports:
  - port: 5000
    protocol: TCP
  selector:
    app: mlflow-tracking-server
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: mlflow-manifest-pvc
  namespace: default
spec:
  storageClassName: gp2
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
I then build the Docker image, save it to the minikube environment, and try to run the image in a Kubernetes pod.
When I try this, the pod started from the image gets a CrashLoopBackOff error, and the pod created from the YAML gets 'pod has unbound immediate PersistentVolumeClaims'.
I am trying to follow the information here (https://github.com/aws-samples/eks-kubeflow-workshop/blob/master/notebooks/07_Experiment_Tracking/07_02_MLFlow.ipynb).
Am I doing something wrong in this case?
Thanks
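The issue here is related to the Persistent Volume Claim, which your minikube cluster does not provision.
You will need to decide whether to switch to a platform-managed Kubernetes service, stick with minikube and satisfy the Persistent Volume Claim manually, or use an alternative solution.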
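As a rough illustration of satisfying the claim manually on minikube, the sketch below defines a statically provisioned PersistentVolume that mlflow-manifest-pvc could bind to. The volume name and the hostPath directory are assumptions made for this example and do not come from the question or the linked guide. The storageClassName, capacity, and access mode are chosen only to match what the claim requests.

```yaml
# Hypothetical PersistentVolume for a minikube cluster.
# The name and hostPath are illustrative assumptions.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mlflow-manifest-pv          # assumed name
spec:
  storageClassName: gp2             # matches the class requested by mlflow-manifest-pvc
  capacity:
    storage: 2Gi                    # at least the 2Gi the claim asks for
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /data/mlflow-manifest     # assumed directory on the minikube node
    type: DirectoryOrCreate
```

After applying a volume like this, `kubectl get pvc mlflow-manifest-pvc` should show whether the claim has moved from Pending to Bound. Another option, under the same caveats, is to change or drop `storageClassName: gp2` in the claim so that minikube's own default storage class handles it.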
The simplest option would be to use Helm charts for MLflow installation, like this or this.
The first Helm chart lists its requirements:
Prerequisites
- Kubernetes cluster 1.10+
- Helm 2.8.0+
- PV provisioner support in the underlying infrastructure.
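Just like the guide you followed, this one requires PV provisioner support.
So by switching to EKS, you will most likely have an easier time deploying MLflow with S3 artifact storage.
If you wish to continue with minikube, you will need to modify the Helm chart values, or the YAML files from the guide you linked, to be compatible with your manually provisioned PV. It may also need permission configuration for S3.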
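On the S3 side, the Deployment in the question reads both AWS credentials from a Secret named aws-secret, but the question does not show that Secret being created. A minimal sketch of what it might look like follows. The values are placeholders, and the credentials would still need IAM permissions on the target bucket.

```yaml
# Sketch of the Secret referenced by the Deployment's secretKeyRef entries.
# The key names match the Deployment; the values are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: aws-secret
  namespace: default
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "<your_access_key_id>"
  AWS_SECRET_ACCESS_KEY: "<your_secret_access_key>"
```

Using `stringData` lets the values be written in plain text, and Kubernetes stores them base64-encoded. If the Secret is missing, the container cannot start, so it is worth confirming with `kubectl get secret aws-secret` before debugging further.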
The second Helm chart has the following limitation/feature:
Known limitations of this Chart
I've created this Chart to use it in a production-ready environment in my company. We are using MLFlow with a Postgres backend store.
Therefore, the following capabilities have been left out of the Chart:
- Using persistent volumes as a backend store.
- Using other database engines like MySQL or SQLServer.
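You can try installing it on minikube. This setup would keep the tracking data in a remote database, since the backend store holds run metadata rather than the artifact files themselves. It would still need tweaking to connect to S3.
In any case, minikube is still a lightweight Kubernetes distribution meant mainly for learning, so you will eventually hit another limitation if you stick with it for too long.
Hope it helps.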