Handling partitioned data in Azure?

Question

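I have some containers in ADLS (Gen2), and each container holds multiple folders. I want a mechanism that scans these folders, infers their schemas, detects partitions, and updates them in a data catalog. How can I achieve this in Azure?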

Sample:

- container1
---table1-folder
-----10-12-1970
-------files1.parquet
-------files2.parquet
-------files3.parquet
-----10-13-1970
-------files1.parquet
-------files2.parquet
-------files3.parquet
-----10-14-1970
-------files1.parquet
-------files2.parquet
---table2-folder
-----zipcode1
-------files1.parquet
-------files2.parquet
-------files3.parquet
-----zipcode2
-------files1.parquet
-------files2.parquet

...

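So I expect the catalog to contain two tables (table1 and table2). table1 would be partitioned by date (3 dates in this example) and expose the underlying data. Likewise, table2 would have two partitions along with their underlying data.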

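In the AWS world, I can run a Glue crawler that crawls these files, infers the schema and partitions, and populates the Glue Data Catalog, which I can later query through Athena. What is the Azure equivalent for achieving something similar?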

Answer 1

I would suggest looking at Azure Synapse Analytics serverless SQL. You can create a view that reads the folders and performs partition elimination if you follow this approach:

-- If you do not have a Master Key on your DW you will need to create one
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>' ;

GO

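-- Credential that authenticates as the workspace managed identity.
-- Note: the serverless SQL pool documentation spells this identity 'Managed Identity';
-- try that value if 'Managed Service Identity' is rejected.
CREATE DATABASE SCOPED CREDENTIAL msi_cred
WITH IDENTITY = 'Managed Service Identity' ;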

GO

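-- Data source pointing at the root of container1.
-- Note: TYPE = HADOOP is the dedicated SQL pool form; serverless SQL pools are
-- documented without a TYPE option, so drop that line if it is rejected.
CREATE EXTERNAL DATA SOURCE ds_container1
WITH
  ( TYPE = HADOOP ,
    LOCATION = 'abfss://container1@mystorageaccount.dfs.core.windows.net' ,
    CREDENTIAL = msi_cred
  ) ;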

GO

CREATE VIEW Table2
AS SELECT *, f.filepath(1) AS [zipcode]
FROM
    OPENROWSET(
        BULK 'table2-folder/*/*.parquet',
        DATA_SOURCE = 'ds_container1',
        FORMAT='PARQUET'
    ) AS f
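
The same pattern should carry over to the date-partitioned folder from the question. The following is a minimal sketch, assuming the `ds_container1` data source above already exists; the view name `Table1` and the column alias `load_date` are illustrative choices and not part of the original answer. The two queries at the end show how a filter on the `filepath()`-derived column is meant to restrict the scan to matching folders.

```sql
-- Hypothetical companion view for table1-folder (names are assumptions).
-- filepath(1) returns the folder name matched by the first wildcard.
CREATE VIEW Table1
AS SELECT *, f.filepath(1) AS [load_date]
FROM
    OPENROWSET(
        BULK 'table1-folder/*/*.parquet',
        DATA_SOURCE = 'ds_container1',
        FORMAT = 'PARQUET'
    ) AS f;

GO

-- Filtering on the partition column should let the pool skip the other date folders.
SELECT COUNT(*) AS row_count
FROM Table1
WHERE [load_date] = '10-12-1970';

-- Same idea against the view from the answer, reading only the zipcode1 folder.
SELECT COUNT(*) AS row_count
FROM Table2
WHERE [zipcode] = 'zipcode1';
```

Note that the partition value comes back as a string taken from the folder name, so a date such as 10-12-1970 would need an explicit conversion before any date arithmetic.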

Then set up Azure Purview as your data catalog and have it index your Synapse serverless SQL pool.

azure

azure-data-factory

azure-data-lake

azure-databricks

azure-purview