How can you specify the local path of InputPath or OutputPath in Kubeflow Pipelines

Question

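I've started using Kubeflow Pipelines to run data processing, training and prediction for a machine learning project, and I'm using InputPath and OutputPath to pass large files between components.

I'd like to know how, if it's possible, I can set the path that OutputPath looks for a file in within a component, and the location that InputPath loads a file into within a component.

Currently, the code stores files in predetermined locations (e.g. data/my_data.csv). It would be ideal if I could 'tell' InputPath/OutputPath that this is the file it should copy, instead of having to rename all the files to match what OutputPath expects, as in the minimal example below.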

from kfp import dsl
from kfp.components import OutputPath, create_component_from_func

@dsl.pipeline(name='test_pipeline')
def pipeline():
    pp = create_component_from_func(func=pre_process_data)()
    # use pp.outputs['pre_processed']...

def pre_process_data(pre_processed_path: OutputPath('csv')):
    import os

    print('do some processing which saves file to data/pre_processed.csv')

    # want to avoid this:
    print('move files to OutputPath locations...')
    os.rename('data/pre_processed.csv', pre_processed_path)

Naturally, I would rather not update the code to adhere to Kubeflow Pipelines naming conventions, as that seems like very bad practice to me.

Thanks!

Answer 1

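Update: see ark-kun's comment. The approach in my original answer is deprecated and should not be used. It is better to let Kubeflow Pipelines specify where you should store your pipeline's artifacts.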

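For lightweight components (such as the one in your example), Kubeflow Pipelines builds the container image for your component and specifies the paths for inputs and outputs (based on the types you use to annotate your component function). I'd recommend using those paths directly, instead of writing to one location and then renaming the file. The Kubeflow Pipelines samples follow this pattern.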
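As a rough illustration of writing to the supplied path directly, here is a minimal sketch. It uses only the standard library so it can be run locally. The sample rows and the temporary path in the `__main__` block are placeholders, not part of the original question, and the parameter is annotated `str` here where a KFP v1 lightweight component would use `OutputPath('csv')`.

```python
def pre_process_data(pre_processed_path: str):
    # In a KFP v1 lightweight component the annotation would be
    # OutputPath('csv') rather than str; the body would stay the same.
    # Imports live inside the function because lightweight components
    # are built from the function's source alone.
    import csv
    import os

    rows = [["id", "value"], [1, 10], [2, 20]]  # placeholder data

    # Defensive step: harmless if the parent directory already exists.
    parent = os.path.dirname(pre_processed_path)
    if parent:
        os.makedirs(parent, exist_ok=True)

    # Write straight to the supplied path: no fixed location, no rename.
    with open(pre_processed_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


if __name__ == "__main__":
    import os
    import tempfile

    # Local stand-in for the path Kubeflow Pipelines would inject at run time.
    injected = os.path.join(tempfile.mkdtemp(), "outputs", "pre_processed", "data")
    pre_process_data(injected)
    with open(injected) as f:
        print(f.read())
```

If the existing processing code hard-codes data/pre_processed.csv, one option is to give it an output-path parameter that defaults to that location, so runs outside Kubeflow keep working unchanged.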

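For reusable components, you define the pipeline inputs and outputs as part of the YAML specification for the component. In that case, you can specify your preferred location for the output files. That said, reusable components take a bit more effort to create, since you need to build a Docker container image and write a component specification in YAML.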
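For a reusable component, the program inside the container usually receives the output location as a command-line argument. The sketch below shows what such an entry point might look like, in line with the update above about letting Kubeflow Pipelines choose the path. The `--pre-processed-path` flag name and the sample rows are hypothetical; the assumption is that the component YAML passes the value through an `{outputPath: pre_processed}` placeholder in its `args` list.

```python
import argparse
import csv
import os


def main(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical pre-processing step")
    # Assumed to be filled in by the component YAML, e.g. through an
    # {outputPath: pre_processed} placeholder in its args list.
    parser.add_argument("--pre-processed-path", required=True)
    args = parser.parse_args(argv)

    # Create the parent directory in case the runtime has not done so.
    parent = os.path.dirname(args.pre_processed_path)
    if parent:
        os.makedirs(parent, exist_ok=True)

    # Placeholder rows standing in for the real processing result.
    with open(args.pre_processed_path, "w", newline="") as f:
        csv.writer(f).writerows([["id", "value"], [1, 10]])
    return args.pre_processed_path


if __name__ == "__main__":
    main()
```

With this shape the script never needs to know a Kubeflow-specific file name; it writes wherever it is told, and the same flag can point at data/pre_processed.csv for local runs.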

kubeflow-pipelines