How to use Azure Data Lake Store as an input data set for Azure ML?

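I am moving data into Azure Data Lake Store and processing it with Azure Data Lake Analytics. The data is in XML form, and I am reading it through the XML Extractor. Now I want to access this data from Azure ML, and it looks like Azure Data Lake Store is not directly supported at the moment.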

What are the possible ways to use Azure Data Lake Store with Azure ML?

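As you noted, Azure Data Lake Store is not currently a supported source. That said, Azure Data Lake Analytics can also write data out to Azure Blob Storage, so one approach is to process the data in U-SQL and then stage the results in Blob Storage, where Azure Machine Learning can read them. Once Azure ML supports Data Lake Store, you can switch over.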
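
As a rough illustration of the staging step, the sketch below builds the U-SQL OUTPUT statement that writes a rowset to Blob Storage as CSV. The helper name `usql_output_to_blob`, the rowset name `@rows`, and the account, container, and path values are all hypothetical placeholders, and the sketch assumes the Blob Storage account has already been linked to the Data Lake Analytics account as a data source.

```python
def usql_output_to_blob(rowset, account, container, blob_path):
    """Build a U-SQL OUTPUT statement that writes a rowset to Blob Storage as CSV.

    All arguments are placeholders supplied by the caller; nothing here
    talks to Azure, it only produces script text.
    """
    if not rowset.startswith("@"):
        raise ValueError("U-SQL rowset variables start with '@'")
    # U-SQL addresses Blob Storage with URIs of the form
    # wasb://<container>@<account>.blob.core.windows.net/<path>
    uri = "wasb://{}@{}.blob.core.windows.net/{}".format(
        container, account, blob_path.lstrip("/")
    )
    # outputHeader: true writes a header row, which most CSV readers expect.
    return 'OUTPUT {}\nTO "{}"\nUSING Outputters.Csv(outputHeader: true);'.format(
        rowset, uri
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(usql_output_to_blob("@rows", "mystorageaccount", "staging", "ml/input.csv"))
```

The resulting statement would go at the end of the U-SQL script that extracts and shapes the XML data; the CSV it produces can then be read from Blob Storage on the Azure ML side.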


import os

from azureml.core import Dataset, Datastore, Workspace

# Assumption: the workspace is loaded from a local config.json.
workspace = Workspace.from_config()
# Placeholder: the name the datastore is (or will be) registered under.
adlsgen2_datastore_name = "<datastore name>"

account_name = os.getenv("ADLSGEN2_ACCOUNTNAME_62", "<storage account name>")  # ADLS Gen2 account name
tenant_id = os.getenv("ADLSGEN2_TENANT_62", "")  # tenant ID of the service principal
client_id = os.getenv("ADLSGEN2_CLIENTID_62", "")  # client ID of the service principal
client_secret = os.getenv("ADLSGEN2_CLIENT_SECRET_62", "")  # secret of the service principal

try:
    # Reuse the datastore if it is already registered in the workspace.
    adlsgen2_datastore = Datastore.get(workspace, adlsgen2_datastore_name)
    print("Found ADLS Gen2 datastore with name: %s" % adlsgen2_datastore_name)
except Exception:
    # The lookup failed, so register the datastore with the service principal.
    adlsgen2_datastore = Datastore.register_azure_data_lake_gen2(
        workspace=workspace,
        datastore_name=adlsgen2_datastore_name,
        filesystem="fs",  # name of the ADLS Gen2 filesystem
        account_name=account_name,
        tenant_id=tenant_id,
        client_id=client_id,
        client_secret=client_secret,
    )

# Load a delimited file from the datastore into a pandas DataFrame.
datastore_paths = [(adlsgen2_datastore, "path to data.csv")]
dataset = Dataset.Tabular.from_delimited_files(path=datastore_paths)
df = dataset.to_pandas_dataframe()
print(df.head())

# Register the DataFrame as a named tabular dataset in the workspace.
dataset = Dataset.Tabular.register_pandas_dataframe(
    df, adlsgen2_datastore, "<DataSetStep>", show_progress=True
)
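
Because the snippet above falls back to empty strings when the environment variables are unset, a registration attempt with missing credentials may fail in a confusing way. A minimal sketch of a pre-flight check follows; the helper `missing_credentials` is hypothetical, and only the variable names come from the snippet.

```python
import os

# Environment variable names used by the snippet above.
REQUIRED_VARS = (
    "ADLSGEN2_ACCOUNTNAME_62",
    "ADLSGEN2_TENANT_62",
    "ADLSGEN2_CLIENTID_62",
    "ADLSGEN2_CLIENT_SECRET_62",
)


def missing_credentials(env=None):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_credentials()
    print("Missing:", ", ".join(missing) if missing else "none")
```

Calling this before the datastore is registered makes it clear which values still need to be set.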

Reference: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-data-transfer.ipynb