How do we do Batch Inferencing on Azure ML Service with Parameterized Dataset/DataPath input?
The ParallelRunStep documentation suggests the following:
Name the input Dataset (DatasetConsumptionConfig class):
path_on_datastore = iris_data.path('iris/')
input_iris_ds = Dataset.Tabular.from_delimited_files(path=path_on_datastore, validate=False)
named_iris_ds = input_iris_ds.as_named_input(iris_ds_name)
which is then just passed as the input:
distributed_csv_iris_step = ParallelRunStep(
name='example-iris',
inputs=[named_iris_ds],
output=output_folder,
parallel_run_config=parallel_run_config,
arguments=['--model_name', 'iris-prs'],
allow_reuse=False
)
The documentation for submitting a Dataset input as a pipeline parameter suggests the following:
The input is a DatasetConsumptionConfig class element:
tabular_dataset = Dataset.Tabular.from_delimited_files('https://dprepdata.blob.core.windows.net/demo/Titanic.csv')
tabular_pipeline_param = PipelineParameter(name="tabular_ds_param", default_value=tabular_dataset)
tabular_ds_consumption = DatasetConsumptionConfig("tabular_dataset", tabular_pipeline_param)
which is passed in both arguments and inputs:
train_step = PythonScriptStep(
name="train_step",
script_name="train_with_dataset.py",
arguments=["--param2", tabular_ds_consumption],
inputs=[tabular_ds_consumption],
compute_target=compute_target,
source_directory=source_directory)
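For completeness, here is a minimal sketch (not from the original docs) of how train_with_dataset.py could read that input at runtime; the key "tabular_dataset" matches the name given to DatasetConsumptionConfig above, everything else is illustrative:
# train_with_dataset.py (sketch)
import argparse
from azureml.core import Run

parser = argparse.ArgumentParser()
# --param2 carries a reference to the dataset; we rely on run.input_datasets below instead.
parser.add_argument("--param2", dest="dataset_ref", default=None)
args, _ = parser.parse_known_args()

run = Run.get_context()
# "tabular_dataset" is the name passed to DatasetConsumptionConfig.
dataset = run.input_datasets["tabular_dataset"]
df = dataset.to_pandas_dataframe()
print("Loaded", len(df), "rows")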
When submitting with a new parameter, we create a new Dataset class instance:
iris_tabular_ds = Dataset.Tabular.from_delimited_files('some_link')
and then submit it like this:
pipeline_run_with_params = experiment.submit(pipeline, pipeline_parameters={'tabular_ds_param': iris_tabular_ds})
But how do we combine the two: how do we pass a Dataset input as a parameter to the ParallelRunStep?
If we create a DatasetConsumptionConfig class element like this:
tabular_dataset = Dataset.Tabular.from_delimited_files('https://dprepdata.blob.core.windows.net/demo/Titanic.csv')
tabular_pipeline_param = PipelineParameter(name="tabular_ds_param", default_value=tabular_dataset)
tabular_ds_consumption = DatasetConsumptionConfig("tabular_dataset", tabular_pipeline_param)
and pass it as an argument into the ParallelRunStep, it raises an error.
References:
AML ParallelRunStep GA is a managed solution to scale up and out large ML workloads, including batch inference, training, and big-data processing. Please see the docs below for details.
• Overview doc: Run batch inference using ParallelRunStep
For the input we create Dataset class instances:
tabular_ds1 = Dataset.Tabular.from_delimited_files('some_link')
tabular_ds2 = Dataset.Tabular.from_delimited_files('some_link')
ParallelRunStep produces one output file; we use the PipelineData class to create a folder which will store this output:
from azureml.pipeline.core import Pipeline, PipelineData
output_dir = PipelineData(name="inferences", datastore=def_data_store)
ParallelRunStep relies on the ParallelRunConfig class for details about the environment, entry script, output file name and other necessary definitions:
from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import ParallelRunStep, ParallelRunConfig
parallel_run_config = ParallelRunConfig(
source_directory=scripts_folder,
entry_script=script_file,
mini_batch_size=PipelineParameter(name="batch_size_param", default_value="5"),
error_threshold=10,
output_action="append_row",
append_row_file_name="mnist_outputs.txt",
environment=batch_env,
compute_target=compute_target,
process_count_per_node=PipelineParameter(name="process_count_param", default_value=2),
node_count=2
)
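For reference, the script referenced by entry_script has to follow the ParallelRunStep contract: an optional init() plus a run(mini_batch) whose return values are appended to mnist_outputs.txt because output_action="append_row". A minimal sketch, with model loading and scoring left as placeholders:
# entry script sketch for ParallelRunConfig (model and columns are placeholders)
def init():
    # Runs once per worker process; load the model or other shared state here.
    global model
    model = None  # e.g. joblib.load(Model.get_model_path("some-model-name"))

def run(mini_batch):
    # For a TabularDataset input, mini_batch arrives as a pandas DataFrame.
    results = []
    for _, row in mini_batch.iterrows():
        # Placeholder scoring; replace with model.predict(...) on the real features.
        results.append(str(row.to_dict()))
    # With output_action="append_row", each returned element becomes a row in the output file.
    return results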
The input to the ParallelRunStep is created with the following code:
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
tabular_pipeline_param = PipelineParameter(name="tabular_ds_param", default_value=tabular_ds1)
tabular_ds_consumption = DatasetConsumptionConfig("tabular_dataset", tabular_pipeline_param)
The PipelineParameter helps us run the pipeline with different datasets. The ParallelRunStep consumes it as an input:
parallelrun_step = ParallelRunStep(
name="some-name",
parallel_run_config=parallel_run_config,
inputs=[ tabular_ds_consumption ],
output=output_dir,
allow_reuse=False
)
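The pipeline assembly itself is not shown above; a minimal sketch, assuming a Workspace loaded from a local config.json and a hypothetical experiment name (the first submission runs with the PipelineParameter default, tabular_ds1):
from azureml.core import Experiment, Workspace
from azureml.pipeline.core import Pipeline

ws = Workspace.from_config()  # assumes a config.json next to the notebook/script
pipeline = Pipeline(workspace=ws, steps=[parallelrun_step])
experiment = Experiment(ws, "tabular-batch-inference")  # hypothetical experiment name
pipeline_run_1 = experiment.submit(pipeline)  # uses the default dataset, tabular_ds1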
To use another dataset:
pipeline_run_2 = experiment.submit(pipeline,
pipeline_parameters={"tabular_ds_param": tabular_ds2}
)
There is currently a bug: DatasetConsumptionConfig and PipelineParameter cannot be reused.