How to pass dependency files to sagemaker SKLearnProcessor and use it in Pipeline?

Question

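I need to import functions from different Python scripts and use them inside the preprocessing.py file. I can't find a way to pass the dependent files to the SKLearnProcessor object, so I get a ModuleNotFoundError.

Code: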

from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     instance_type='ml.m5.xlarge',
                                     instance_count=1)


sklearn_processor.run(code='preprocessing.py',
                      inputs=[ProcessingInput(
                        source=input_data,
                        destination='/opt/ml/processing/input')],
                      outputs=[ProcessingOutput(output_name='train_data',
                                                source='/opt/ml/processing/train'),
                               ProcessingOutput(output_name='test_data',
                                                source='/opt/ml/processing/test')],
                      arguments=['--train-test-split-ratio', '0.2']
                     )

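I would like to pass dependent_files = ['file1.py', 'file2.py', 'requirements.txt'] so that preprocessing.py has access to all the dependent modules.

I also need to install the libraries listed in requirements.txt.

Can you share a workaround or the right way to do this?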

Update 25-11-2021:

Q1. (Answered, but looking for a solution using FrameworkProcessor)

Here, the get_run_args function handles the dependencies, source_dir and code parameters by using FrameworkProcessor. Is there any way to set these parameters from ScriptProcessor, SKLearnProcessor, or any other Processor?

Q2.

Could you also point to some reference for using our Processor as a sagemaker.workflow.steps.ProcessingStep and then using that step in a sagemaker.workflow.pipeline.Pipeline?

For a Pipeline, is a sagemaker-project mandatory, or can we create the Pipeline directly without any SageMaker project?

Answer 1

This isn't supported in SKLearnProcessor. You would need to package your dependencies in a docker image and create a custom Processor (e.g. a ScriptProcessor with the image_uri of the docker image you created).
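
As a rough sketch of that approach (not a tested recipe), the constructor arguments for a ScriptProcessor could be assembled as below. The helper names script_processor_kwargs and build_script_processor are made up for illustration, and the instance settings are simply copied from the question. The image URI has to point at an image you have built and pushed yourself.

```python
def script_processor_kwargs(image_uri, role):
    """Collect constructor arguments for sagemaker.processing.ScriptProcessor.

    image_uri is assumed to reference a custom docker image that already
    contains file1.py, file2.py and the packages from requirements.txt.
    """
    return {
        "image_uri": image_uri,
        "command": ["python3"],  # interpreter used to run the script passed to run(code=...)
        "role": role,
        "instance_type": "ml.m5.xlarge",
        "instance_count": 1,
    }


def build_script_processor(image_uri, role):
    """Hypothetical helper: create the processor from the arguments above."""
    # Imported lazily so the kwargs helper can be used without the SDK installed.
    from sagemaker.processing import ScriptProcessor

    return ScriptProcessor(**script_processor_kwargs(image_uri, role))
```

The returned processor would then be used like the SKLearnProcessor in the question, e.g. processor.run(code='preprocessing.py', inputs=[...], outputs=[...]).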

Answer 2

There are a few options for doing this.

One really simple way is to put all the additional files in a folder, for example:

.
├── my_package
│   ├── file1.py
│   ├── file2.py
│   └── requirements.txt
└── preprocessing.py

Then send the whole folder as another input under the same /opt/ml/processing/input/code/ directory, for example:

from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor = SKLearnProcessor(
    framework_version="0.20.0",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

sklearn_processor.run(
    code="preprocessing.py",  # <- this gets uploaded as /opt/ml/processing/input/code/preprocessing.py
    inputs=[
        ProcessingInput(source=input_data, destination='/opt/ml/processing/input'),
        # Send my_package as /opt/ml/processing/input/code/my_package/
        ProcessingInput(source='my_package/', destination="/opt/ml/processing/input/code/my_package/")
    ],
    outputs=[
        ProcessingOutput(output_name="train_data", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test_data", source="/opt/ml/processing/test"),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
)

What happens is that sagemaker-python-sdk puts the script you pass as code="preprocessing.py" under /opt/ml/processing/input/code/, and my_package/ ends up in that same directory.
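
With that layout, a plain `from my_package import file1` inside preprocessing.py should normally resolve, because Python puts the directory of the script being run on sys.path. If you would rather not rely on that, here is a minimal sketch that adds the code directory explicitly. The helper name import_from_code_dir is hypothetical, and the default path assumes the destination used above.

```python
import importlib
import os
import sys


def import_from_code_dir(module_name, code_dir="/opt/ml/processing/input/code"):
    """Import module_name after making sure code_dir is on sys.path."""
    code_dir = os.path.abspath(code_dir)
    if code_dir not in sys.path:
        # Put it first so the uploaded modules win over same-named installed ones.
        sys.path.insert(0, code_dir)
    return importlib.import_module(module_name)
```

Inside preprocessing.py this would be used as file1 = import_from_code_dir("my_package.file1"), after which the functions defined in file1.py are available as attributes of file1.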

Edit:

For requirements.txt, you can add the following to your preprocessing.py:

import sys
import subprocess

subprocess.check_call([
    sys.executable, "-m", "pip", "install", "-r",
    "/opt/ml/processing/input/code/my_package/requirements.txt",
])
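
A slightly more defensive variant of the same idea, as a sketch: wrap the install in a small function that skips pip when the file is missing, so the script can still run locally without the container layout. The function name install_requirements is made up for illustration.

```python
import os
import subprocess
import sys


def install_requirements(path):
    """Run pip install -r path if the file exists; return True if pip was run."""
    if not os.path.isfile(path):
        print(f"No requirements file at {path}, skipping install")
        return False
    # check_call raises CalledProcessError if pip exits with a non-zero status.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", path])
    return True
```

Call it at the very top of preprocessing.py, before importing any third-party package listed in requirements.txt, e.g. install_requirements("/opt/ml/processing/input/code/my_package/requirements.txt").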

