How do we do Batch Inferencing on Azure ML Service with Parameterized Dataset/DataPath input?

The ParallelRunStep documentation suggests the following:

Name the input dataset (this yields a DatasetConsumptionConfig object):

# iris_data is a Datastore and iris_ds_name the input's name (both defined earlier in the docs sample)
path_on_datastore = iris_data.path('iris/')
input_iris_ds = Dataset.Tabular.from_delimited_files(path=path_on_datastore, validate=False)
named_iris_ds = input_iris_ds.as_named_input(iris_ds_name)

which is then simply passed in as an input:

distributed_csv_iris_step = ParallelRunStep(
    name='example-iris',
    inputs=[named_iris_ds],
    output=output_folder,
    parallel_run_config=parallel_run_config,
    arguments=['--model_name', 'iris-prs'],
    allow_reuse=False
)

The documentation for submitting a dataset input as a parameter suggests the following: the input is a DatasetConsumptionConfig object,

tabular_dataset = Dataset.Tabular.from_delimited_files('https://dprepdata.blob.core.windows.net/demo/Titanic.csv')
tabular_pipeline_param = PipelineParameter(name="tabular_ds_param", default_value=tabular_dataset)
tabular_ds_consumption = DatasetConsumptionConfig("tabular_dataset", tabular_pipeline_param)

which is then passed in both arguments and inputs:

train_step = PythonScriptStep(
    name="train_step",
    script_name="train_with_dataset.py",
    arguments=["--param2", tabular_ds_consumption],
    inputs=[tabular_ds_consumption],
    compute_target=compute_target,
    source_directory=source_directory)
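
Inside train_with_dataset.py, the named input can then be fetched from the run context. A minimal sketch, assuming the script simply materializes the dataset as a pandas DataFrame:

# train_with_dataset.py (sketch)
from azureml.core import Run

run = Run.get_context()
# "tabular_dataset" is the name given to DatasetConsumptionConfig above.
dataset = run.input_datasets["tabular_dataset"]
df = dataset.to_pandas_dataframe()
print(df.head())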

When submitting with a new parameter, we create a new Dataset instance:

iris_tabular_ds = Dataset.Tabular.from_delimited_files('some_link')

and then submit it like this:

pipeline_run_with_params = experiment.submit(pipeline, pipeline_parameters={'tabular_ds_param': iris_tabular_ds})

But how do we combine the two? How do we pass the dataset input as a parameter to ParallelRunStep?

If we create a DatasetConsumptionConfig object like this:

tabular_dataset = Dataset.Tabular.from_delimited_files('https://dprepdata.blob.core.windows.net/demo/Titanic.csv')
tabular_pipeline_param = PipelineParameter(name="tabular_ds_param", default_value=tabular_dataset)
tabular_ds_consumption = DatasetConsumptionConfig("tabular_dataset", tabular_pipeline_param)

and pass it as a parameter into ParallelRunStep, it raises an error.

References:

  1. Notebook with Dataset Input Parameter
  2. ParallelRunStep Notebook

AML ParallelRunStep GA is a managed solution to scale up and out large ML workloads, including batch inference, training, and big-data processing. Please check the documents below for details.

• Overview doc: Run batch inference using ParallelRunStep

• Sample notebooks

• AI Show: How to do Batch Inference using AML ParallelRunStep

• Blog: Batch Inference in Azure Machine Learning

For the input, we create Dataset instances:

tabular_ds1 = Dataset.Tabular.from_delimited_files('some_link')
tabular_ds2 = Dataset.Tabular.from_delimited_files('some_link')

ParallelRunStep produces an output file; we use the PipelineData class to create a folder that stores this output:

from azureml.pipeline.core import Pipeline, PipelineData

output_dir = PipelineData(name="inferences", datastore=def_data_store)
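
For reference, def_data_store above is typically the workspace's default datastore. A minimal sketch, assuming a workspace config.json is available locally:

from azureml.core import Workspace

# Assumption: the workspace config file has been downloaded locally.
ws = Workspace.from_config()
def_data_store = ws.get_default_datastore()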

ParallelRunStep relies on the ParallelRunConfig class to hold details about the environment, the entry script, the output file name, and other required definitions:

from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import ParallelRunStep, ParallelRunConfig

parallel_run_config = ParallelRunConfig(
    source_directory=scripts_folder,
    entry_script=script_file,
    mini_batch_size=PipelineParameter(name="batch_size_param", default_value="5"),
    error_threshold=10,
    output_action="append_row",
    append_row_file_name="mnist_outputs.txt",
    environment=batch_env,
    compute_target=compute_target,
    process_count_per_node=PipelineParameter(name="process_count_param", default_value=2),
    node_count=2
)
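
The entry script referenced by entry_script must implement init() and run(mini_batch). A minimal sketch, assuming a scikit-learn model registered as "iris-prs" and loaded with joblib (both assumptions; for a TabularDataset input, mini_batch arrives as a pandas DataFrame):

# Hypothetical entry script (script_file above); model name and format are assumptions.
import joblib
import pandas as pd
from azureml.core.model import Model

model = None

def init():
    # Runs once per worker process: resolve and load the registered model.
    global model
    model_path = Model.get_model_path("iris-prs")
    model = joblib.load(model_path)

def run(mini_batch):
    # mini_batch is a pandas DataFrame for TabularDataset inputs.
    # With output_action="append_row", each returned row is appended
    # to the file named by append_row_file_name.
    predictions = model.predict(mini_batch)
    return pd.DataFrame({"prediction": predictions})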

The input to ParallelRunStep is created with the following code:

from azureml.data.dataset_consumption_config import DatasetConsumptionConfig

tabular_pipeline_param = PipelineParameter(name="tabular_ds_param", default_value=tabular_ds1)
tabular_ds_consumption = DatasetConsumptionConfig("tabular_dataset", tabular_pipeline_param)

The PipelineParameter lets us run the pipeline against different datasets. ParallelRunStep consumes it as an input:

parallelrun_step = ParallelRunStep(
    name="some-name",
    parallel_run_config=parallel_run_config,
    inputs=[ tabular_ds_consumption ],
    output=output_dir,
    allow_reuse=False
)
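
The step is then assembled into a Pipeline and submitted; without pipeline_parameters, the run simply uses the PipelineParameter's default_value (tabular_ds1). A minimal sketch, assuming ws is the Workspace object and the experiment name is illustrative:

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace=ws, steps=[parallelrun_step])
experiment = Experiment(ws, "tabular-prs-demo")  # experiment name is an assumption

# No pipeline_parameters passed: this run uses default_value (tabular_ds1).
pipeline_run_1 = experiment.submit(pipeline)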

Submitting with another dataset:

pipeline_run_2 = experiment.submit(
    pipeline,
    pipeline_parameters={"tabular_ds_param": tabular_ds2}
)

There is currently a bug: DatasetConsumptionConfig and PipelineParameter cannot be reused.
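
Until that is fixed, a practical pattern is to construct a fresh PipelineParameter and DatasetConsumptionConfig for each new pipeline rather than sharing existing instances. A hedged sketch (the helper name is ours, not part of the SDK):

from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
from azureml.pipeline.core import PipelineParameter

def make_dataset_input(param_name, default_dataset):
    # Illustrative helper: build new objects each time instead of reusing,
    # to sidestep the reuse bug mentioned above.
    param = PipelineParameter(name=param_name, default_value=default_dataset)
    return DatasetConsumptionConfig(f"{param_name}_input", param)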