AzureML: Dataset Profile fails when parquet file is empty

Question

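I created a tabular dataset using the Azure ML Python API. The data in question is a set of parquet files in Azure Data Lake Gen 2, spread across multiple partitions (about 10K parquet files, each roughly 330 KB). When I trigger the dataset's "Generate Profile" operation, it throws the following error while processing an empty parquet file, and profile generation then stops.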

User program failed with ExecutionError: 
Error Code: ScriptExecution.StreamAccess.Validation
Validation Error Code: NotSupported
Validation Target: ParquetFile
Failed Step: 77866d0a-8243-4d3d-8bc6-599d466488dd
Error Message: ScriptExecutionException was caused by StreamAccessException.
  Failed to read Parquet file at: <my_blob_path>/20211217.parquet
    Current parquet file is not supported.
      Exception of type 'Thrift.Protocol.TProtocolException' was thrown.
| session_id=6be4db0b-bdc1-4dd6-b8a6-6e9466f7bc54

By "empty parquet file" I mean that reading the individual file with pandas (pd.read_parquet) yields an empty DataFrame (df.empty == True).

Any suggestions for avoiding this error would be appreciated.
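
To make that definition concrete, here is a minimal sketch for finding which files read back as empty. The helper name `find_empty_parquet` and its `read` parameter are hypothetical, introduced only for illustration. By default it uses `pd.read_parquet`, which needs pandas plus a parquet engine such as pyarrow.

```python
from typing import Callable, Iterable, List, Optional


def find_empty_parquet(paths: Iterable[str],
                       read: Optional[Callable] = None) -> List[str]:
    """Return the paths whose parquet file loads as an empty DataFrame."""
    if read is None:
        # Imported lazily so the helper can also be run with a stub reader.
        import pandas as pd
        read = pd.read_parquet
    # Same check as in the question: df.empty is True when the frame
    # has no rows (or no columns).
    return [path for path in paths if read(path).empty]
```

On an affected version, the returned list shows which files would trip the profiler, for example `find_empty_parquet(["20211217.parquet"])` for a local copy of the file named in the error.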

Update: the issue has been fixed in the following versions:

azureml-dataprep: 3.0.1
azureml-core: 1.40.0
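
A quick way to check whether an environment already has these versions is sketched below. It uses only the standard library. The helper names (`parse_version`, `has_fix`, `installed_versions`) are hypothetical, and the version parsing is deliberately simple (major.minor.patch only), so treat it as a rough check rather than a full version comparison.

```python
import re
from importlib import metadata

# Minimum versions taken from the update above.
FIXED_VERSIONS = {"azureml-dataprep": (3, 0, 1), "azureml-core": (1, 40, 0)}


def parse_version(text):
    """Turn '1.40.0' into (1, 40, 0), reading only leading digits of each part."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:  # pad short versions such as '1.40'
        parts.append(0)
    return tuple(parts)


def has_fix(installed):
    """installed maps package name -> version string; True if all minimums are met."""
    return all(
        name in installed and parse_version(installed[name]) >= minimum
        for name, minimum in FIXED_VERSIONS.items()
    )


def installed_versions():
    """Look up the installed versions of the two packages, skipping missing ones."""
    found = {}
    for name in FIXED_VERSIONS:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found
```

Calling `has_fix(installed_versions())` in the environment that runs the profile job indicates whether both packages are at or above the fixed versions.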

Answer 1

Error Code: ScriptExecution.StreamAccess.Validation

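The above error occurs because you do not have access to ADLS.

You can create an Azure App Identity and assign it read permission on ADLS. Then register ADLS as a datastore in the workspace using the client ID and secret of the app identity. Once these steps are complete, your code will be able to access the datastore.

Reference - https://docs.microsoft.com/en-us/azure/machine-learning/how-to-network-security-overview#configure-a-datastore-to-use-managed-identity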
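
The registration step described in this answer could look roughly like the sketch below, assuming the azureml-core SDK and its `Datastore.register_azure_data_lake_gen2` method. The environment variable names and the two helper functions are hypothetical and chosen only for illustration. Whether missing access is actually the cause of the error in the question is this answer's assumption.

```python
import os


def adls_gen2_settings(env=os.environ):
    """Collect registration arguments from environment variables (hypothetical names)."""
    return {
        "datastore_name": env["ADLS_DATASTORE_NAME"],
        "filesystem": env["ADLS_FILESYSTEM"],      # ADLS Gen2 container name
        "account_name": env["ADLS_ACCOUNT_NAME"],  # storage account name
        "tenant_id": env["AZURE_TENANT_ID"],
        "client_id": env["AZURE_CLIENT_ID"],       # app identity's client ID
        "client_secret": env["AZURE_CLIENT_SECRET"],
    }


def register_adls_gen2(workspace, env=os.environ):
    """Register the ADLS Gen2 account as a datastore in the given workspace."""
    # Imported lazily: this call requires the azureml-core package.
    from azureml.core import Datastore

    return Datastore.register_azure_data_lake_gen2(
        workspace=workspace, **adls_gen2_settings(env)
    )
```

With a workspace loaded via `Workspace.from_config()`, `register_adls_gen2(ws)` would register the datastore; the app identity still needs read permission on the storage account for reads to succeed.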

Answer 2

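Thanks for the report. This is a bug in handling parquet files that contain columns but an empty row set. The issue has been fixed, and the fix will be included in the next release.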

python

azure

azure-machine-learning-service

azureml-python-sdk