CETAS times out for large tables in Synapse Serverless SQL

I am trying to create a new external table from an existing external table in an Azure Synapse Serverless SQL pool, using a CETAS statement (CREATE EXTERNAL TABLE AS SELECT * FROM &lt;table&gt;). The source table is a very large external table built over roughly 30 GB of parquet data stored in ADLS Gen2, and the query always times out after about 30 minutes. I have tried premium storage, and also most, if not all, of the suggestions made here, but it did not help and the query still times out. The error I get in Synapse Studio is:

Statement ID: {550AF4B4-0F2F-474C-A502-6D29BAC1C558} | Query hash: 0x2FA8C2EFADC713D | Distributed request ID: {CC78C7FD-ED10-4CEF-ABB6-56A3D4212A5E}. Total size of data scanned is 0 megabytes, total size of data moved is 0 megabytes, total size of data written is 0 megabytes. Query timeout expired.

The core use case is that, assuming I only have the external table name, I want to create a copy of the data over which that external table is created, in Azure storage itself.

Is there a way around this timeout, or a better approach to the problem?
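For context, a minimal sketch of the kind of CETAS statement in question (the data source, file format, location, and table names here are placeholders; the real external data source and file format must already exist):

```sql
-- Hypothetical names throughout.
CREATE EXTERNAL TABLE dbo.MyTableCopy
WITH (
    LOCATION = 'copy/mytable/',          -- output folder under the data source root
    DATA_SOURCE = MyAdlsDataSource,      -- external data source pointing at ADLS Gen2
    FILE_FORMAT = ParquetFileFormat      -- external file format defined as PARQUET
)
AS
SELECT * FROM dbo.MyLargeExternalTable;  -- ~30 GB of parquet; times out after ~30 min
```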

This is a serverless limitation.

Query timeout expired

The error Query timeout expired is returned if the query executed more than 30 minutes on serverless SQL pool. This is a limit of serverless SQL pool that cannot be changed. Try to optimize your query by applying best practices, or try to materialize parts of your queries using CETAS. Check whether a concurrent workload is running on the serverless pool, because the other queries might take the resources. In that case you might split the workload on multiple workspaces.

Self-help for serverless SQL pool - Query Timeout Expired
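One workaround consistent with that guidance is to materialize the table in slices, so that each CETAS statement finishes within the 30-minute limit. A sketch, assuming a filterable column exists (here `year` is a hypothetical partitioning column, and all other names are placeholders):

```sql
-- Sketch: split the copy into several smaller CETAS statements,
-- each writing to its own folder and each staying under the 30-minute limit.
CREATE EXTERNAL TABLE dbo.MyTableCopy_2021
WITH (
    LOCATION = 'copy/mytable/2021/',
    DATA_SOURCE = MyAdlsDataSource,
    FILE_FORMAT = ParquetFileFormat
)
AS SELECT * FROM dbo.MyLargeExternalTable WHERE year = 2021;

CREATE EXTERNAL TABLE dbo.MyTableCopy_2022
WITH (
    LOCATION = 'copy/mytable/2022/',
    DATA_SOURCE = MyAdlsDataSource,
    FILE_FORMAT = ParquetFileFormat
)
AS SELECT * FROM dbo.MyLargeExternalTable WHERE year = 2022;

-- A view can then present the slices as a single object again:
CREATE VIEW dbo.MyTableCopyAll AS
SELECT * FROM dbo.MyTableCopy_2021
UNION ALL
SELECT * FROM dbo.MyTableCopy_2022;
```

How finely to slice depends on how close a single full-table CETAS is to the limit; the filter column must cover every row exactly once or the copy will be incomplete or duplicated.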

The core use case is that assuming I only have the external table name, I want to create a copy of the data over which that external table is created in Azure storage itself.

This is easy to do with a Data Factory copy job, a Spark job, or AzCopy.