When forcing a query onto non-interleaved indexes, will there be any data locality benefit?

Assume the following schema:

CREATE TABLE Foo (
    primaryId STRING(64) NOT NULL,
    secondaryId STRING(64) NOT NULL,
    extraData STRING(80),
    active BOOL NOT NULL
) PRIMARY KEY (primaryId, secondaryId);

CREATE TABLE Bar (
    primaryId STRING(64) NOT NULL,
    secondaryId STRING(64) NOT NULL,
    barId STRING(64) NOT NULL
) PRIMARY KEY (primaryId, secondaryId, barId),
INTERLEAVE IN PARENT Foo ON DELETE CASCADE;

CREATE TABLE Baz (
    primaryId STRING(64) NOT NULL,
    secondaryId STRING(64) NOT NULL,
    barId STRING(64) NOT NULL,
    bazId STRING(64) NOT NULL,
    extraData STRING(80)
) PRIMARY KEY (primaryId, secondaryId, barId, bazId),
INTERLEAVE IN PARENT Bar ON DELETE CASCADE;

CREATE INDEX foo_primaryId_active ON Foo (primaryId, active);
CREATE INDEX baz_bazId ON Baz (bazId);

We have 3 tables: Foo, Bar, and Baz. Bar is interleaved in Foo, and Baz is interleaved in Bar, along with 2 non-interleaved indexes.

Given the following query, we force both the FROM and the JOIN onto the indexes rather than onto the tables themselves.

SELECT
    baz.primaryId,
    baz.secondaryId,
    baz.bazId,
    baz.extraData
FROM
    Baz@{FORCE_INDEX=baz_bazId} AS baz
JOIN
    Foo@{FORCE_INDEX=foo_primaryId_active} AS foo
ON
    foo.primaryId = baz.primaryId AND foo.secondaryId = baz.secondaryId
WHERE
    baz.bazId = @bazId -- using the baz_bazId index to query on the bazId
    AND foo.active = true

Does this query get any data locality benefit when the indexes are forced? What if we later add a 4th table, Zap, and interleave it in Foo:

CREATE TABLE Zap (
    primaryId STRING(64) NOT NULL,
    secondaryId STRING(64) NOT NULL,
    bazId STRING(64) NOT NULL,
    extraData STRING(80)
) PRIMARY KEY (primaryId, secondaryId, bazId),
INTERLEAVE IN PARENT Foo ON DELETE CASCADE;

CREATE INDEX zap_bazId ON Zap (bazId);

and adjust the query above to include a 3rd JOIN:

JOIN
    Zap@{FORCE_INDEX=zap_bazId} AS zap
ON
    zap.bazId = @bazId AND zap.primaryId = foo.primaryId
WHERE
    baz.bazId = @bazId -- using the baz_bazId index to query on the bazId
    AND foo.active = true
    AND zap.extraData IS NULL

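Would we gain any data locality benefit here, given that we are querying entirely through non-interleaved indexes? The column in our zap.extraData IS NULL predicate is not stored in the index itself, so the query may need to go back to the Zap table to check it.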

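If there is no data locality benefit to querying through non-interleaved indexes, could we drop the extra zap_bazId index and change the Zap table instead, since we know we will be querying specifically by bazId for the data it holds: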

CREATE TABLE Zap (
    bazId STRING(64) NOT NULL,
    primaryId STRING(64) NOT NULL,
    secondaryId STRING(64) NOT NULL,
    extraData STRING(80)
) PRIMARY KEY (bazId, primaryId);

The modified query becomes:

JOIN
    Zap AS zap -- using a table; aka the implicit PRIMARY_KEY index
ON 
    zap.bazId = @bazId AND zap.primaryId = foo.primaryId
WHERE
    baz.bazId = @bazId AND -- using the baz_bazId index to query on the bazId
    foo.active = true AND
    zap.extraData IS NULL

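Now, we lose the CASCADE DELETE here, so it may still be worthwhile to interleave, create that additional index, and store zap.extraData in the index so that the query does not have to go back to the Zap table to pull that information.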

The question remains: does data locality come into play when querying/joining exclusively against non-interleaved indexes?

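As I understand from the documentation, if the indexes are not interleaved and you query/join through them, data locality does not matter. You should only need to interleave an index if you intend to query through that index.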

In any case, as you explained, if you are interested in ON DELETE CASCADE, you can keep interleaving the table, since that cannot be done without interleaving.

Clarifications:

Given a table with columns primaryId and secondaryId where the primary key of the table is primaryId. Creating a secondary index on secondaryId excludes it from being interleaved into the table.

Yes.

If the indexes are not interleaved, there is no data locality at play

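It depends on the query. A join between a non-interleaved index and its base table is not local. You should consider a STORING clause in the index to avoid that join. A join between a table and its parent will be local.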
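To make the STORING suggestion concrete, here is a minimal sketch against the question's schema. The index names baz_bazId_storing and zap_bazId_storing are hypothetical and do not appear in the question. The idea is that an index which also stores extraData can serve both the SELECT list and the IS NULL predicate on its own. A secondary index already carries the base table's primary key columns, so only extraData has to be stored explicitly.

```sql
-- Hypothetical covering variants of the question's indexes.
CREATE INDEX baz_bazId_storing ON Baz (bazId) STORING (extraData);
CREATE INDEX zap_bazId_storing ON Zap (bazId) STORING (extraData);

-- The three-way query, now reading only columns the forced indexes contain.
SELECT
    baz.primaryId,
    baz.secondaryId,
    baz.bazId,
    baz.extraData
FROM
    Baz@{FORCE_INDEX=baz_bazId_storing} AS baz
JOIN
    Foo@{FORCE_INDEX=foo_primaryId_active} AS foo
ON
    foo.primaryId = baz.primaryId AND foo.secondaryId = baz.secondaryId
JOIN
    Zap@{FORCE_INDEX=zap_bazId_storing} AS zap
ON
    zap.bazId = @bazId AND zap.primaryId = foo.primaryId
WHERE
    baz.bazId = @bazId
    AND foo.active = true
    AND zap.extraData IS NULL
```

Stored columns are duplicated in the index, so this trades extra storage and write cost for fewer back-joins. Whether the back-joins actually disappear for a given query is worth confirming in the query plan.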

The query explanation dashboard is a useful tool for showing how Cloud Spanner executes a particular query. Using it, we can analyze the queries above.

  • There is a distributed join between baz_bazId and Baz, and another distributed join with foo_primaryId_active.

    SELECT
        baz.primaryId,
        baz.secondaryId,
        baz.bazId,
        baz.extraData
    FROM
        Baz@{FORCE_INDEX=baz_bazId} AS baz
    JOIN
        Foo@{FORCE_INDEX=foo_primaryId_active} AS foo
    ON
        foo.primaryId = baz.primaryId AND foo.secondaryId = baz.secondaryId
    WHERE
        baz.bazId = @bazId -- using the baz_bazId index to query on the bazId
        AND foo.active = true
    

  • A distributed join between Zap and zap_bazId is added on top of the other distributed joins.

    SELECT
        baz.primaryId,
        baz.secondaryId,
        baz.bazId,
        baz.extraData
    FROM
        Baz@{FORCE_INDEX=baz_bazId} AS baz
    JOIN
        Foo@{FORCE_INDEX=foo_primaryId_active} AS foo
    ON
        foo.primaryId = baz.primaryId AND foo.secondaryId = baz.secondaryId
    JOIN
        Zap@{FORCE_INDEX=zap_bazId} AS zap
    ON
        zap.bazId = @bazId AND zap.primaryId = foo.primaryId
    WHERE
        baz.bazId = @bazId -- using the baz_bazId index to query on the bazId
        AND foo.active = true
        AND zap.extraData IS NULL
    

  • This one uses the table Zap2 (a non-interleaved version of Zap) and does not need the distributed join between Zap and zap_bazId from the second query.

    SELECT
        baz.primaryId,
        baz.secondaryId,
        baz.bazId,
        baz.extraData
    FROM
        Baz@{FORCE_INDEX=baz_bazId} AS baz
    JOIN
        Foo@{FORCE_INDEX=foo_primaryId_active} AS foo
    ON
        foo.primaryId = baz.primaryId AND foo.secondaryId = baz.secondaryId
    JOIN
        Zap2 AS zap -- using a table; aka the implicit PRIMARY_KEY index
    ON
        zap.bazId = @bazId AND zap.primaryId = foo.primaryId
    WHERE
        baz.bazId = @bazId AND -- using the baz_bazId index to query on the bazId
        foo.active = true AND
        zap.extraData IS NULL
    

Spanner will handle all the related network I/O re: the data splits.

Yes.

If indexes can be interleaved there would be a benefit but the keys in those interleaved indexes have to be shared (like with any interleaved table). The docs for locality tradeoffs: "Focus on getting the desired locality for the most important root entities and most common access patterns, and let less frequent or less performance sensitive distributed operations happen when they need to."

Yes.
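As an illustration of the shared-key requirement, here is a sketch of an interleaved index on the question's schema. The index name baz_by_foo_bazId is hypothetical. The index key begins with Foo's primary key (primaryId, secondaryId), which is what allows the index entries to be interleaved with the corresponding Foo row.

```sql
-- Hypothetical interleaved index: the key must start with the parent's
-- primary key columns (primaryId, secondaryId), followed by bazId.
CREATE INDEX baz_by_foo_bazId
    ON Baz (primaryId, secondaryId, bazId),
    INTERLEAVE IN Foo;
```

An index like this should mainly help lookups that already know primaryId and secondaryId. A lookup by bazId alone, as in the question, cannot use the shared key prefix, which is the kind of tradeoff the quoted documentation describes.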