Kfserving -- Error When Defining storageUri

I am trying to deploy a very basic Sklearn model with Kfserving. This is the yaml file:

apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  default:
    predictor:
      sklearn:
        storageUri: file://./storage_dir

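Note that because our corporate environment cannot access Google Cloud Storage, for now I am just using one of my local folders as the storageUri, and I have model.joblib stored in that folder.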

After deploying with kubectl apply -f sklearn.yaml -n kfserving-test, I get the following error when checking kubectl describe revision sklearn-iris-predictor-default-fj5qt -n kfserving-test:

Status:
  Conditions:
    Last Transition Time:  2020-12-16T22:51:38Z
    Message:               The target is not receiving traffic.
    Reason:                NoTraffic
    Severity:              Info
    Status:                False
    Type:                  Active
    Last Transition Time:  2020-12-16T22:51:37Z
    Message:               Container failed with: [I 201216 22:50:07 storage:35] Copying contents of /mnt/models to local
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/sklearnserver/sklearnserver/__main__.py", line 33, in <module>
    model.load()
  File "/sklearnserver/sklearnserver/model.py", line 36, in load
    model_file = next(path for path in paths if os.path.exists(path))
StopIteration

    Reason:                ExitCode1
    Status:                False
    Type:                  ContainerHealthy
    Last Transition Time:  2020-12-16T22:51:38Z
    Message:               Initial scale was never achieved
    Reason:                ProgressDeadlineExceeded
    Status:                False
    Type:                  Ready
    Last Transition Time:  2020-12-16T22:51:38Z
    Message:               Initial scale was never achieved
    Reason:                ProgressDeadlineExceeded
    Status:                False
    Type:                  ResourcesAvailable
  Container Statuses:
    Image Digest:       gcr.docker.prod.walmart.com/kfserving/sklearnserver@sha256:d2553d3f2a6ba7b50736028e6dbdfb35e90ca40ee7aa5cbe0e0b66fec1695f16
    Name:               kfserving-container
  Image Digest:         gcr.docker.prod.walmart.com/kfserving/sklearnserver@sha256:d2553d3f2a6ba7b50736028e6dbdfb35e90ca40ee7aa5cbe0e0b66fec1695f16
  Log URL:              http://localhost:8001/api/v1/namespaces/knative-monitoring/services/kibana-logging/proxy/app/kibana#/discover?_a=(query:(match:(kubernetes.labels.knative-dev%2FrevisionUID:(query:'e6fee737-b9b8-4091-96a5-660dbf4082f8',type:phrase))))
  Observed Generation:  1
  Service Name:         sklearn-iris-predictor-default-fj5qt
Events:
  Type     Reason         Age    From                 Message
  ----     ------         ----   ----                 -------
  Warning  InternalError  2m16s  revision-controller  failed to update deployment "sklearn-iris-predictor-default-fj5qt-deployment": Operation cannot be fulfilled on deployments.apps "sklearn-iris-predictor-default-fj5qt-deployment": the object has been modified; please apply your changes to the latest version and try again

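The exception suggests that the model file could not be loaded or transferred, and I am wondering what I did wrong with the storageUri parameter. It should be a relative path to the model file, right? (Reference: https://github.com/kubeflow/kfserving/blob/master/python/kfserving/docs/V1alpha2SKLearnSpec.md)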

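KFServing injects a second container, called the storage_initializer, into the predictor's pod (in your case the SKLearn predictor). Its role is to download the model file from the storageUri and copy it to a location inside the pod, so that the predictor itself does not have to handle that task.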

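Using file:// in storageUri was convenient for testing while KFServing was being built, but it requires the files to already be present locally inside the pod. A folder on your own machine is not visible from the pod, so the server finds no model file there, which is what the StopIteration in your log indicates.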
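To make the StopIteration in the traceback concrete, here is a minimal sketch that mimics the failing line from sklearnserver/model.py. The list of candidate file names is an assumption made for illustration (only model.joblib is confirmed by the question), and find_model_file and describe_lookup are hypothetical helpers, not part of KFServing.

```python
import os
import tempfile

# Assumed candidate names for illustration; the real list is defined
# inside sklearnserver/model.py and may differ.
CANDIDATE_NAMES = ["model.joblib", "model.pkl", "model.pickle"]


def find_model_file(model_dir):
    """Mimic the lookup on the failing line of the traceback."""
    paths = [os.path.join(model_dir, name) for name in CANDIDATE_NAMES]
    # next() on a generator that yields nothing raises StopIteration.
    return next(path for path in paths if os.path.exists(path))


def describe_lookup(model_dir):
    """Return a short message instead of crashing when nothing is found."""
    try:
        return "found " + os.path.basename(find_model_file(model_dir))
    except StopIteration:
        return "no model file found"


with tempfile.TemporaryDirectory() as model_dir:
    # Empty directory: same situation as a pod without the model mounted.
    empty_result = describe_lookup(model_dir)
    # Create a placeholder model.joblib and look again.
    open(os.path.join(model_dir, "model.joblib"), "wb").close()
    populated_result = describe_lookup(model_dir)

print(empty_result)
print(populated_result)
```

The first lookup fails because the directory is empty, which is the situation inside the pod when the path behind file:// only exists on the machine that ran kubectl.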

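If you cannot access cloud-based storage such as gs:// or s3://, you can use one of the alternative options, such as a plain URI or pvc://, to serve the model file from within your local kubernetes cluster. You can find examples here.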
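As a rough sketch of the pvc:// option, the snippet below builds the same InferenceService manifest as in the question, but with a pvc://<pvc-name>/<path> storageUri, and prints it as JSON (kubectl apply -f accepts JSON as well as yaml). The PVC name models-pvc and the sub-path sklearn-iris are hypothetical placeholders, and build_inference_service is a helper written only for this example. It assumes a PersistentVolumeClaim containing model.joblib already exists in the namespace.

```python
import json


def build_inference_service(name, storage_uri):
    """Build a v1alpha2 InferenceService manifest as a plain dict."""
    return {
        "apiVersion": "serving.kubeflow.org/v1alpha2",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "default": {
                "predictor": {
                    "sklearn": {"storageUri": storage_uri},
                },
            },
        },
    }


# Hypothetical PVC name and sub-path; replace with the claim that
# actually holds model.joblib in your cluster.
manifest = build_inference_service(
    "sklearn-iris", "pvc://models-pvc/sklearn-iris"
)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Redirecting the output to a file such as sklearn.json and running kubectl apply -f sklearn.json -n kfserving-test would then deploy it the same way as the original yaml.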