Azure ML Studio ML Pipeline - Exception: No temp file found
I have run the ML Pipeline experiment successfully and published the Azure ML Pipeline without issue. However, when I run the following directly after a successful run and publish (i.e. I run all cells in one pass using Jupyter), the test fails!
import requests
from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()

rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={"ExperimentName": "***redacted***",
                               "ParameterAssignments": {"process_count_per_node": 6}})
run_id = response.json()["Id"]
Here is the error in azureml-logs/70_driver_log.txt:
[2020-12-10T17:17:50.124303] The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 900.0 seconds
3 items cleaning up...
Cleanup took 0.20258069038391113 seconds
Traceback (most recent call last):
  File "driver/amlbi_main.py", line 48, in <module>
    main()
  File "driver/amlbi_main.py", line 44, in main
    JobStarter().start_job()
  File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/job_starter.py", line 52, in start_job
    job.start()
  File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/job.py", line 105, in start
    master.wait()
  File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/master.py", line 301, in wait
    file_helper.start()
  File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/file_helper.py", line 206, in start
    self.analyze_source()
  File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/file_helper.py", line 69, in analyze_source
    raise Exception(message)
Exception: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.
Here is the error in logs/sys/warning.txt:
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://eastus.experiments.azureml.net/execution/v1.0/subscriptions/***redacted***/resourceGroups/***redacted***/providers/Microsoft.MachineLearningServices/workspaces/***redacted***/experiments/***redacted-experiment-name***/runs/***redacted-run-id***/telemetry
[...]
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url:
the same URL as above.
Next...
When I wait a few minutes and rerun the following code/cell:
interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()

rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={"ExperimentName": "***redacted***",
                               "ParameterAssignments": {"process_count_per_node": 2}})
run_id = response.json()["Id"]
It completes successfully!? Huh? (I changed the process count here, but I don't think that makes a difference.) Also, there are no user errors in the logs.
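Since warning.txt shows HTTP 429 throttling and waiting before resubmitting did work, I could wrap the submission in a simple retry with backoff, something like the sketch below (the attempt count and delay are arbitrary placeholders, not recommended values):

import time

def submit_with_backoff(endpoint, headers, body, attempts=3, delay_seconds=120):
    # Hypothetical helper: retries the same POST used above a few times,
    # sleeping between attempts in case the service is throttling requests.
    for attempt in range(attempts):
        response = requests.post(endpoint, headers=headers, json=body)
        if response.status_code == 200:
            return response.json()["Id"]
        time.sleep(delay_seconds)  # back off before trying again
    response.raise_for_status()

run_id = submit_with_backoff(rest_endpoint, auth_header,
                             {"ExperimentName": "***redacted***",
                              "ParameterAssignments": {"process_count_per_node": 2}})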
Any ideas about what might be going on here?
Thanks in advance for any insight you can offer, and happy coding! :)
========== UPDATE #1: ==========
Running on one file with roughly 300k rows. Sometimes the job works, and sometimes it doesn't. We have tried many versions with different configuration settings, and all of them result in a failure at some point. Changed the sklearn models to use n_jobs=1. We are scoring text data for NLP work.
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import ParallelRunConfig

default_ds = ws.get_default_datastore()

# output dataset
output_dir = OutputFileDatasetConfig(destination=(default_ds, 'model/results')).register_on_complete(name='model_inferences')

# location of scoring script
experiment_folder = 'model_pipeline'

rit = 60*60*24

parallel_run_config = ParallelRunConfig(
    source_directory=experiment_folder,
    entry_script="score.py",
    mini_batch_size="5",
    error_threshold=10,
    output_action="append_row",
    environment=batch_env,
    compute_target=compute_target,
    node_count=5,
    run_invocation_timeout=rit,
    process_count_per_node=1
)
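For completeness, this config is attached to a ParallelRunStep roughly like the sketch below; the step name, the input dataset variable (input_ds), and the pipeline wiring are placeholders from my setup, not anything special:

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import ParallelRunStep

# "input_ds" stands in for the registered dataset being scored; it is not
# defined in this post.
parallel_step = ParallelRunStep(
    name="batch-score",                      # placeholder step name
    parallel_run_config=parallel_run_config,
    inputs=[input_ds.as_named_input("input_ds")],
    output=output_dir,                       # the OutputFileDatasetConfig above
    allow_reuse=False
)

pipeline = Pipeline(workspace=ws, steps=[parallel_step])
pipeline_run = Experiment(ws, "***redacted***").submit(pipeline)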
Our next test was going to be putting each row of data into its own file. I tried this with just 30 rows, i.e. 30 files each containing a single record to score, and still got the same error. This time I changed the error threshold to 1.
2020-12-17 02:26:16,721|ParallelRunStep.ProgressSummary|INFO|112|The ParallelRunStep processed all mini batches. There are 6 mini batches with 30 items. Processed 6 mini batches containing 30 items, 30 succeeded, 0 failed. The error threshold is 1.
2020-12-17 02:26:16,722|ParallelRunStep.Telemetry|INFO|112|Start concatenating.
2020-12-17 02:26:17,202|ParallelRunStep.FileHelper|ERROR|112|No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.
2020-12-17 02:26:17,368|ParallelRunStep.Telemetry|INFO|112|Run status: Running
2020-12-17 02:26:17,495|ParallelRunStep.Telemetry|ERROR|112|Exception occurred executing job: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause..
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/job.py", line 105, in start
    master.wait()
  File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/master.py", line 301, in wait
    file_helper.start()
  File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/file_helper.py", line 206, in start
    self.analyze_source()
  File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/file_helper.py", line 69, in analyze_source
    raise Exception(message)
Exception: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.
On the runs that do complete, only some of the records are returned. One time I think the number of records returned was 25 or 23, another time it was 15.
========== UPDATE #2: December 17, 2020 ==========
I removed one of my models (my model is a weighted blend of 15 models). I even cleaned up my text fields, removing all tabs, newlines, and commas. Now I'm scoring 30 files, each with one record, and the job sometimes completes, but it doesn't return 30 records. Other times it returns an error, and I still get the "No temp file found" error.
I think I may have answered my own question. I believe the problem is with OutputFileDatasetConfig. Once I switched back to using PipelineData, everything started working again. I guess Azure wasn't kidding when they say OutputFileDatasetConfig is still experimental.
What I still don't understand is how we are supposed to pick up the results of an ML Studio pipeline from a Data Factory pipeline without OutputFileDatasetConfig. PipelineData writes its results to a folder based on the child step run id, so how is Data Factory supposed to know where to find them? Would love to hear any feedback anyone might have. Thanks :)
== UPDATE ==
To pick up the results of an ML Studio pipeline from a Data Factory pipeline, check out Pick up Results From ML Studio Pipeline in Data Factory Pipeline.
== UPDATE #2 ==
https://github.com/Azure/azure-sdk-for-python/issues/16568#issuecomment-781526789
Hi @yeamusic21 , thank you for your feedback, in current version, OutputDatasetConfig can't work with ParallelRunStep, we are working on fixing it.