joblib.dump() fails when saving a model to a temporary datastore in AMLS

I am training a model with AMLS. I have a training pipeline in which step 1 trains a model and then saves the output to the temporary datastore model_folder using

os.makedirs(output_folder, exist_ok=True)
output_path = output_folder + "/model.pkl"
joblib.dump(value=model, filename=output_path)

Step 2 loads the model and registers it. The model folder is defined in the pipeline as

model_folder = PipelineData("model_folder", datastore=ws.get_default_datastore())

However, step 1 fails when it tries to save the model, with the following ServiceError:

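Failed to upload outputs due to Exception: Microsoft.RelInfra.Common.Exceptions.OperationFailedException: Cannot upload output xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx. ---> Microsoft.WindowsAzure.Storage.StorageException: This request is not authorized to perform this operation using this permission.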

How can I resolve this? Earlier in my code I interact with the default datastore without any problem using

default_ds = ws.get_default_datastore()
default_ds.upload_files(...)

My 70_driver_log.txt is as follows:

[2020-08-25T04:03:27.315114] Entering context manager injector.
[context_manager_injector.py] Command line Options: Namespace(inject=['ProjectPythonPath:context_managers.ProjectPythonPath', 'RunHistory:context_managers.RunHistory', 'TrackUserError:context_managers.TrackUserError'], invocation=['train_word2vec.py', '--output_folder', '/mnt/batch/tasks/shared/LS_root/jobs/aiworkspace/azureml/xxxxx/mounts/workspaceblobstore/azureml/xxxxx/model_folder', '--model_type', 'WO', '--training_field', 'task_title', '--regex', '1', '--stopword_removal', '1', '--tokenize_basic', '0', '--remove_punctuation', '0', '--autocorrect', '0', '--lemmatization', '1', '--word_vector_length', '152', '--model_learning_rate', '0.025', '--model_min_count', '0', '--model_window', '7', '--num_epochs', '10'])
Starting the daemon thread to refresh tokens in background for process with pid = 113
Entering Run History Context Manager.
Current directory:  /mnt/batch/tasks/shared/LS_root/jobs/aiworkspace/azureml/xxxxx/mounts/workspaceblobstore/azureml/xxxxx
Preparing to call script [ train_word2vec.py ] with arguments: ['--output_folder', '/mnt/batch/tasks/shared/LS_root/jobs/aiworkspace/azureml/xxxxx/mounts/workspaceblobstore/azureml/xxxxx/model_folder', '--model_type', 'WO', '--training_field', 'task_title', '--regex', '1', '--stopword_removal', '1', '--tokenize_basic', '0', '--remove_punctuation', '0', '--autocorrect', '0', '--lemmatization', '1', '--word_vector_length', '152', '--model_learning_rate', '0.025', '--model_min_count', '0', '--model_window', '7', '--num_epochs', '10']
After variable expansion, calling script [ train_word2vec.py ] with arguments: ['--output_folder', '/mnt/batch/tasks/shared/LS_root/jobs/aiworkspace/azureml/xxxxx/mounts/workspaceblobstore/azureml/xxxxx/model_folder', '--model_type', 'WO', '--training_field', 'task_title', '--regex', '1', '--stopword_removal', '1', '--tokenize_basic', '0', '--remove_punctuation', '0', '--autocorrect', '0', '--lemmatization', '1', '--word_vector_length', '152', '--model_learning_rate', '0.025', '--model_min_count', '0', '--model_window', '7', '--num_epochs', '10']

Script type = None
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
OUTPUT FOLDER: /mnt/batch/tasks/shared/LS_root/jobs/aiworkspace/azureml/xxxxx/mounts/workspaceblobstore/azureml/xxxxx/model_folder
Loading SQL data...
Loading abbreviation data...
/azureml-envs/azureml_xxxxx/lib/python3.6/site-packages/pandas/core/indexing.py:1783: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item_labels[indexer[info_axis]]] = value
Pre-processing data...
Succesfully pre-processed the the text data
Training Word2Vec model...
Saving the model...
Starting the daemon thread to refresh tokens in background for process with pid = 113


The experiment completed successfully. Finalizing run...
[2020-08-25T04:03:52.293994] TimeoutHandler __init__
[2020-08-25T04:03:52.294149] TimeoutHandler __enter__
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.44109439849853516 seconds
[2020-08-25T04:03:52.818991] TimeoutHandler __exit__
2020/08/25 04:04:00 logger.go:293: Process Exiting with Code:  0

My argparse arguments include

parser.add_argument('--output_folder', type=str, dest='output_folder', default="output_folder", help='output folder')

Some thoughts:

  1. As @drum suggested, this is a permissions error.
  2. Your ArgumentParser has a small typo.
  3. Do you get the same error if you use os.path.join(output_folder, 'model.pkl')?
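
The third suggestion can be tried with a save step along these lines. This is only a minimal sketch, not the asker's actual script: resolve_model_path and parse_args are hypothetical helper names, the model object is a placeholder, and the standard-library pickle module stands in for joblib.dump so the snippet runs without extra packages.

```python
import argparse
import os
import pickle
import tempfile


def resolve_model_path(output_folder, filename="model.pkl"):
    """Create the output folder if needed and return the full model path."""
    os.makedirs(output_folder, exist_ok=True)
    # os.path.join avoids hand-built separators such as output_folder + "/model.pkl".
    return os.path.join(output_folder, filename)


def parse_args(argv=None):
    """Parse the same --output_folder argument shown in the question."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_folder", type=str, dest="output_folder",
                        default="output_folder", help="output folder")
    return parser.parse_args(argv)


# Demo run against a throwaway directory instead of the mounted datastore.
demo_root = tempfile.mkdtemp()
args = parse_args(["--output_folder", os.path.join(demo_root, "model_folder")])
output_path = resolve_model_path(args.output_folder)

model = {"placeholder": "model"}  # hypothetical stand-in for the trained model
with open(output_path, "wb") as handle:
    # In the real step this would be: joblib.dump(value=model, filename=output_path)
    pickle.dump(model, handle)
```

The driver log in the question shows "Saving the model..." followed by exit code 0, so the local write appears to succeed and the error is raised later, when the outputs are uploaded. A path change alone may therefore not remove the error.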

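I solved this by adding my AMLS workspace to the 'Storage Blob Data Contributor' role on the AMLS default storage account. This role normally seems to be added by default, but that did not happen in my case.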
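
One possible way to make such a role assignment is the Azure CLI command az role assignment create. The sketch below only assembles and prints that command and does not run it. The principal ID and storage account scope are placeholders that must be replaced with real values, build_role_assignment_command is a hypothetical helper name, and the sketch assumes the workspace's identity is the principal that needs the role.

```python
import shlex


def build_role_assignment_command(principal_id, storage_scope):
    """Return the Azure CLI arguments for granting blob data access."""
    return [
        "az", "role", "assignment", "create",
        "--assignee", principal_id,                  # placeholder: identity of the workspace
        "--role", "Storage Blob Data Contributor",   # role named in the answer above
        "--scope", storage_scope,                    # placeholder: storage account resource ID
    ]


command = build_role_assignment_command(
    principal_id="<workspace-principal-id>",
    storage_scope="<storage-account-resource-id>",
)

# Print a shell-safe version of the command for review before running it by hand.
print(" ".join(shlex.quote(part) for part in command))
```

The same assignment can also be made in the Azure portal under the storage account's access control (IAM) settings.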