AzureML create dataset from datastore with multiple files - path not valid

Question

I am trying to create a dataset in Azure ML where the data source is multiple files (e.g. images) in Blob Storage. How do you do that correctly?

Here is the error I get when I follow the documented approach in the UI:

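When I create the dataset in the UI, selecting the blob storage and a directory given either as just dirname or as dirname/**, the files cannot be found in the Explorer tab, which shows the error ScriptExecution.StreamAccess.NotFound: The provided path is not valid or the files could not be accessed. When I try to download the data with the code snippet from the Consume tab, I get this error: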

from azureml.core import Workspace, Dataset

# set variables 

workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='teststar')
dataset.download(target_path='.', overwrite=False)

Error Message: ScriptExecutionException was caused by StreamAccessException.
  StreamAccessException was caused by NotFoundException.
    Found no resources for the input provided: 'https://mystoragename.blob.core.windows.net/data/testdata/**'

When I select just one of the files instead of dirname or dirname/**, everything works. Does AzureML actually support datasets consisting of multiple files?

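Here is my setup:

I have a Blob storage account with one container, data. It contains a directory testdata with the files testfile1.txt and testfile2.txt.

In AzureML I created a datastore, testdatastore, and selected the data container of my storage account for it.

Then, in Azure ML, I create a dataset from the datastore, choosing a file dataset and the datastore above. I can then browse the files, select a folder, and tick the option to include files in subdirectories. This produces the path testdata/**, which does not work, as described above.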

I get the same problem when I create the dataset and datastore in Python:

import azureml.core
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()

datastore = Datastore(ws, "mydatastore")

datastore_paths = [(datastore, 'testdata')]
test_ds = Dataset.File.from_files(path=datastore_paths)
test_ds.register(ws, "testpython")

Answer 1

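Datasets definitely support multiple files, so your problem is almost certainly the permissions given when the "mydatastore" datastore was created (I suspect you created it with a SAS token). To access anything other than individual files, the datastore needs list permission. This would not be an issue if you had registered the datastore with an account key, but it can be a limitation of an access token. The second part of "The provided path is not valid or the files could not be accessed" refers to potential permission issues. You can also verify that the folder/** syntax works by creating a dataset from the default blob store that the ML workspace provisioned for you.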
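
If the datastore was indeed registered with a SAS token, one quick way to test the permission theory is to look at the token's sp (signed permissions) field, where r stands for read and l for list. The following is only a minimal sketch: the helper names sas_permissions and missing_permissions are made up for illustration, the token shown is a placeholder, and it inspects only the sp field, not expiry or resource-type restrictions.

```python
from urllib.parse import parse_qs


def sas_permissions(sas_token: str) -> set:
    """Return the permission letters found in the 'sp' field of a SAS token.

    Accepts a bare token, a token with a leading '?', or a full URL
    that carries the token as its query string.
    """
    query = sas_token.split("?", 1)[-1]
    values = parse_qs(query).get("sp", [""])
    return set(values[0])


def missing_permissions(sas_token: str, required: str = "rl") -> set:
    """Return the required permission letters that the token does not grant.

    The default 'rl' (read + list) is an assumption about what a folder
    or folder/** path needs; adjust it for your own case.
    """
    return set(required) - sas_permissions(sas_token)


# Placeholder token for illustration only (not a real signature).
example_token = "?sv=2020-08-04&ss=b&srt=sco&sp=r&se=2030-01-01T00:00:00Z&sig=PLACEHOLDER"
print("Missing permissions:", missing_permissions(example_token))
```

A token that grants only r would explain why a single file can be read while testdata/** fails. In that case, either issue a new token that includes l or register the datastore with the account key.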

Answer 2

I uploaded and registered the files with this script, and everything works as expected.

from azureml.core import Datastore, Dataset, Workspace

import logging

logger = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)

datastore_name = "mydatastore"
dataset_path_on_disk = "./data/images_greyscale"
dataset_path_in_datastore = "images_greyscale"

azure_dataset_name = "images_grayscale"
azure_dataset_description = "dataset transformed into the coco format and into grayscale images"


workspace = Workspace.from_config()
datastore = Datastore.get(workspace, datastore_name=datastore_name)

logger.info("Uploading data...")
datastore.upload(
    src_dir=dataset_path_on_disk, target_path=dataset_path_in_datastore, overwrite=False
)
logger.info("Uploading data done.")

logger.info("Registering dataset...")
datastore_path = [(datastore, dataset_path_in_datastore)]
dataset = Dataset.File.from_files(path=datastore_path)
dataset.register(
    workspace=workspace,
    name=azure_dataset_name,
    description=azure_dataset_description,
    create_new_version=True,
)
logger.info("Registering dataset done.")
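
To confirm that a registered dataset really resolves to several files, it may help to list its paths afterwards. This is a rough sketch rather than part of the original script: list_dataset_files and check_registered_dataset are hypothetical helper names, min_files=2 is an arbitrary threshold, and it assumes the azureml-core SDK with a local workspace config file, as in the script above. It relies on FileDataset.to_path(), which returns the file paths the dataset points to.

```python
def list_dataset_files(dataset):
    """Return the sorted file paths that a FileDataset resolves to."""
    return sorted(dataset.to_path())


def check_registered_dataset(name, min_files=2):
    """Fetch a registered dataset by name and make sure it has enough files.

    The SDK is imported lazily so that list_dataset_files can be used
    (and tested) without azureml-core installed.
    """
    from azureml.core import Dataset, Workspace

    workspace = Workspace.from_config()
    dataset = Dataset.get_by_name(workspace, name=name)
    files = list_dataset_files(dataset)
    if len(files) < min_files:
        raise RuntimeError(
            f"Dataset {name!r} resolved to {len(files)} file(s); "
            "check the path and the datastore permissions."
        )
    return files
```

For example, check_registered_dataset("images_grayscale") should return the list of uploaded files. If the datastore credentials lack list permission, the call will most likely fail with the same StreamAccess error as in the question.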
