Clustered indexes in Synapse Dedicated Pool and row storage

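I am trying to understand indexing in Azure Synapse, but I am a bit confused by some of the index types.

The Clustered Columnstore Index looks to me a bit like Apache Parquet, with row groups and column chunks. Heap tables have no index on the data, so that seems clear too.

But what about clustered and nonclustered indexes? The documentation defines them as: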

  Clustered indexes may outperform clustered columnstore tables when a single row needs to be quickly retrieved. For queries where a single or very few row lookup is required to perform with extreme speed, consider a clustered index or nonclustered secondary index. The disadvantage to using a clustered index is that only queries that benefit are the ones that use a highly selective filter on the clustered index column. To improve filter on other columns, a nonclustered index can be added to other columns. However, each index that is added to a table adds both space and processing time to loads.

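Here are my questions:

  1. Does it mean they're more like the indexes from SQL Server? I mean, the clustered index would order the data by one column and store it as rows? And the nonclustered would be an extra sorted index storing only references to the rows?
  2. If my assumption about row-based format is correct, does it mean the clustered index is not performant for analytical queries?
  3. What happens if we create a table with both columnstore and clustered indexes? Is the data duplicated, once in columnar format and once in row format?

I found some links about the topic, but I still doubt whether they apply to Synapse: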

Bartosz,

Does it mean they're more like the indexes from SQL Server? I mean, the clustered index would order the data by one column and store it as rows? And the non clustered would be an extra sorted index storing only references to the rows?

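Your definition of clustered and nonclustered indexes is correct, with one slight difference. As in traditional SQL Server, the leaf level of the clustered index holds the actual data rows. In summary, the physical organization of data rows in Synapse/PDW will be one of:

  • Clustered columnstore - the data is not sorted, and row segments can have overlapping min-max ranges.

  • Clustered columnstore with ORDER - the data is sorted, so row segments do not overlap and segment skipping is optimal.

  • Heap - this is row format.

  • Clustered index - this is the SQL Server clustered index, where the leaf/data portion is sorted.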
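
The four layouts above correspond roughly to the following table options in a dedicated SQL pool. This is a minimal sketch: the table names, the columns (SaleId, SaleYear, Amount) and the ROUND_ROBIN distribution are assumptions made for illustration and do not come from the question.

```sql
-- Hypothetical tables; names, columns and distribution are illustrative only.

-- 1. Clustered columnstore (the default): rows are compressed into
--    segments in load order, so min-max ranges may overlap.
CREATE TABLE dbo.Sales_cci
(
    SaleId   int           NOT NULL,
    SaleYear int           NOT NULL,
    Amount   decimal(18,2) NOT NULL
)
WITH ( DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX );

-- 2. Ordered clustered columnstore: rows are sorted on SaleYear before
--    compression, which helps segment elimination.
CREATE TABLE dbo.Sales_cci_ordered
(
    SaleId   int           NOT NULL,
    SaleYear int           NOT NULL,
    Amount   decimal(18,2) NOT NULL
)
WITH ( DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX ORDER (SaleYear) );

-- 3. Heap: unsorted rowstore with no index.
CREATE TABLE dbo.Sales_heap
(
    SaleId   int           NOT NULL,
    SaleYear int           NOT NULL,
    Amount   decimal(18,2) NOT NULL
)
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );

-- 4. Clustered (rowstore) index: rows are stored sorted by SaleYear.
CREATE TABLE dbo.Sales_ci
(
    SaleId   int           NOT NULL,
    SaleYear int           NOT NULL,
    Amount   decimal(18,2) NOT NULL
)
WITH ( DISTRIBUTION = ROUND_ROBIN, CLUSTERED INDEX (SaleYear) );
```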

If my assumption about row-based format is correct, does it mean the clustered index is not performant for the analytical queries, doesn't it?

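A clustered index will be efficient if your query selects a contiguous range of values, for example: select * from table where year between 2005 and 2007. If your projection/select includes all or most of the table's columns, a row/heap table is efficient. If you have wide tables and select only a few columns, columnstore organization is efficient.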
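
To make the distinction concrete, here are the two query shapes side by side. This is only a sketch against a hypothetical table dbo.Sales with assumed columns SaleYear and Amount; which layout actually wins depends on the data and should be measured.

```sql
-- Hypothetical table dbo.Sales; column names are assumptions.

-- Contiguous range on the key, all columns returned:
-- a rowstore clustered index on SaleYear can seek straight to the range.
SELECT *
FROM dbo.Sales
WHERE SaleYear BETWEEN 2005 AND 2007;

-- Few columns out of a wide table, aggregated over many rows:
-- a columnstore table only has to read the SaleYear and Amount segments.
SELECT SaleYear, SUM(Amount) AS TotalAmount
FROM dbo.Sales
GROUP BY SaleYear;
```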

What happens if we create a table with both Columnstore and Clustered Indexes? The data is duplicated, once for the columnar format, once for the row format?

If you have a columnstore index, you won't be able to create a clustered index. For example, given this table:

CREATE TABLE MyTable   
  (  
    mycolumnnn1 nvarchar,  
    mycolumn2 nvarchar COLLATE Frisian_100_CS_AS )  
WITH ( CLUSTERED COLUMNSTORE INDEX )  
;

creating a clustered index on it will fail with the following error:

create clustered index idx1 on Mytable(mycolumnnn1)
Msg 1902, Level 16, State 1, Line 8
Cannot create more than one clustered index on table 'Mytable'. Drop the existing clustered index 'ClusteredIndex_d79fca6646664ddea0d5983cbb17a8ae' before creating another.
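
If a rowstore clustered index is what you actually want, one common approach in a dedicated SQL pool is to rebuild the table with CTAS. Alternatively, a nonclustered index can be added on top of the columnstore table, as the documentation quoted in the question suggests. A minimal sketch, where MyTable_ci, idx_nc and the ROUND_ROBIN distribution are hypothetical choices made for illustration:

```sql
-- Option A: copy the data into a new table organized as a clustered index.
-- MyTable_ci is a hypothetical name; the distribution is an assumption.
CREATE TABLE MyTable_ci
WITH ( DISTRIBUTION = ROUND_ROBIN, CLUSTERED INDEX (mycolumnnn1) )
AS
SELECT mycolumnnn1, mycolumn2
FROM MyTable;

-- Option B: keep the clustered columnstore table and add a
-- nonclustered (secondary) index for selective lookups on another column.
CREATE INDEX idx_nc ON MyTable (mycolumn2);
```

In both cases the row-format structure is an additional copy of (part of) the data, so it costs extra space and load time.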