When forcing a query onto non-interleaved indexes, will there be any data locality benefit?
Assume the following schema:
CREATE TABLE Foo (
primaryId STRING(64) NOT NULL,
secondaryId STRING(64) NOT NULL,
extraData STRING(80),
active BOOL NOT NULL
) PRIMARY KEY (primaryId, secondaryId);
CREATE TABLE Bar (
primaryId STRING(64) NOT NULL,
secondaryId STRING(64) NOT NULL,
barId STRING(64) NOT NULL
) PRIMARY KEY (primaryId, secondaryId, barId),
INTERLEAVE IN PARENT Foo ON DELETE CASCADE;
CREATE TABLE Baz (
primaryId STRING(64) NOT NULL,
secondaryId STRING(64) NOT NULL,
barId STRING(64) NOT NULL,
bazId STRING(64) NOT NULL,
extraData STRING(80)
) PRIMARY KEY (primaryId, secondaryId, barId, bazId),
INTERLEAVE IN PARENT Bar ON DELETE CASCADE;
CREATE INDEX foo_primaryId_active ON foo (primaryId, active);
CREATE INDEX baz_bazId ON Baz (bazId);
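We have 3 tables: Foo, Bar, and Baz. Bar is interleaved in Foo and Baz is interleaved in Bar, and there are 2 non-interleaved indexes.
Given the following query, we force both the FROM and the JOIN onto indexes; neither reads the base tables explicitly.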
SELECT
baz.primaryId,
baz.secondaryId,
baz.bazId,
baz.extraData
FROM
Baz@{FORCE_INDEX=baz_bazId} AS baz
JOIN
Foo@{FORCE_INDEX=foo_primaryId_active} AS foo
ON
foo.primaryId = baz.primaryId AND foo.secondaryId = baz.secondaryId
WHERE
baz.bazId = @bazId -- using the baz_bazId index to query on the bazId
AND foo.active = true
Is there any data locality benefit for this query when the indexes are forced?
Suppose we later add a 4th table, Zap, and interleave it in Foo:
CREATE TABLE Zap (
primaryId STRING(64) NOT NULL,
secondaryId STRING(64) NOT NULL,
bazId STRING(64) NOT NULL,
extraData STRING(80)
) PRIMARY KEY (primaryId, secondaryId, bazId),
INTERLEAVE IN PARENT Foo ON DELETE CASCADE;
CREATE INDEX zap_bazId ON Zap (bazId);
And we adjust the query above to include a 3rd JOIN:
JOIN
Zap@{FORCE_INDEX=zap_bazId} AS zap
ON
zap.bazId = @bazId AND zap.primaryId = foo.primaryId
WHERE
baz.bazId = @bazId -- using the baz_bazId index to query on the bazId
AND foo.active = true
AND zap.extraData IS NULL
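Would we gain any data locality benefit here, given that we are querying entirely through non-interleaved indexes? The column used by our zap.extraData IS NULL predicate is not stored in the index itself, so the query may need to go back to the Zap table to check it.
If there is no data locality benefit when querying on non-interleaved indexes, could we drop the extra zap_bazId index and change the Zap table instead, since we know we will query it exclusively by bazId to get the data it holds: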
CREATE TABLE Zap (
bazId STRING(64) NOT NULL,
primaryId STRING(64) NOT NULL,
secondaryId STRING(64) NOT NULL,
extraData STRING(80)
) PRIMARY KEY (bazId, primaryId);
The revised query becomes:
JOIN
Zap AS zap -- using a table; aka the implicit PRIMARY_KEY index
ON
zap.bazId = @bazId AND zap.primaryId = foo.primaryId
WHERE
baz.bazId = @bazId AND -- using the baz_bazId index to query on the bazId
foo.active = true AND
zap.extraData IS NULL
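Now, we lose the CASCADE DELETE here, so it may still be worthwhile to interleave, create that additional index, and store zap.extraData in the index so that the query does not have to go back to the Zap table to fetch that information.
The question remains: when querying/joining exclusively against non-interleaved indexes, does data locality come into play?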
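As I understand from the documentation, if the index is not interleaved and you query/join through the index, data locality does not matter. You should only need to interleave the index if you plan to query through it.
In any case, as you explained, if you are interested in the ON DELETE CASCADE clause, you can keep interleaving the table, since it cannot be done without interleaving.
Clarifications: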
Given a table with columns primaryId and secondaryId where the primary key of the table is primaryId. Creating a secondary index on secondaryId excludes it from being interleaved into the table.
Yes.
If the indexes are not interleaved, there is no data locality at play
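It depends on the query. A join between a non-interleaved index and its base table is not local. You should consider a STORING clause in the index to avoid that join. A join between a table and its parent will be local.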
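As a minimal sketch of the STORING suggestion, assuming the interleaved Zap table from the question: the index name zap_bazId_storing is hypothetical and not part of the original schema. Because extraData is stored in the index, the zap.extraData IS NULL predicate can in principle be evaluated from the index alone; whether the back-join to Zap actually disappears should be confirmed in the query plan.

```sql
-- Hypothetical covering index: same key as zap_bazId, plus a stored copy of extraData.
-- The base table's primary key columns (primaryId, secondaryId) are part of every
-- secondary index already, so they do not need to be listed in STORING.
CREATE INDEX zap_bazId_storing ON Zap (bazId) STORING (extraData);

-- The three-way join, reading Zap only through the covering index.
SELECT
  baz.primaryId,
  baz.secondaryId,
  baz.bazId,
  baz.extraData
FROM
  Baz@{FORCE_INDEX=baz_bazId} AS baz
JOIN
  Foo@{FORCE_INDEX=foo_primaryId_active} AS foo
ON
  foo.primaryId = baz.primaryId AND foo.secondaryId = baz.secondaryId
JOIN
  Zap@{FORCE_INDEX=zap_bazId_storing} AS zap
ON
  zap.bazId = @bazId AND zap.primaryId = foo.primaryId
WHERE
  baz.bazId = @bazId
  AND foo.active = true
  AND zap.extraData IS NULL
```

The trade-off is extra storage and write cost for the stored column, in exchange for not reading the base table on this access path.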
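The query explanation dashboard is a useful tool that shows how Cloud Spanner executes a particular query. We can use it to analyze the queries above.
The first query has a distributed join between baz_bazId and Baz, and another distributed join with foo_primaryId_active.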
SELECT
baz.primaryId,
baz.secondaryId,
baz.bazId,
baz.extraData
FROM
Baz@{FORCE_INDEX=baz_bazId} AS baz
JOIN
Foo@{FORCE_INDEX=foo_primaryId_active} AS foo
ON
foo.primaryId = baz.primaryId AND foo.secondaryId = baz.secondaryId
WHERE
baz.bazId = @bazId -- using the baz_bazId index to query on the bazId
AND foo.active = true
The second query adds a distributed join between Zap and zap_bazId, on top of the other distributed joins.
SELECT
baz.primaryId,
baz.secondaryId,
baz.bazId,
baz.extraData
FROM
Baz@{FORCE_INDEX=baz_bazId} AS baz
JOIN
Foo@{FORCE_INDEX=foo_primaryId_active} AS foo
ON
foo.primaryId = baz.primaryId AND foo.secondaryId = baz.secondaryId
JOIN
Zap@{FORCE_INDEX=zap_bazId} AS zap
ON
zap.bazId = @bazId AND zap.primaryId = foo.primaryId
WHERE
baz.bazId = @bazId -- using the baz_bazId index to query on the bazId
AND foo.active = true
AND zap.extraData IS NULL
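The third query uses table Zap2 (a non-interleaved version of Zap) and does not need the distributed join between Zap and zap_bazId that appears in the second query.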
SELECT
baz.primaryId,
baz.secondaryId,
baz.bazId,
baz.extraData
FROM
Baz@{FORCE_INDEX=baz_bazId} AS baz
JOIN
Foo@{FORCE_INDEX=foo_primaryId_active} AS foo
ON
foo.primaryId = baz.primaryId AND foo.secondaryId = baz.secondaryId
JOIN
Zap2 AS zap -- using a table; aka the implicit PRIMARY_KEY index
ON
zap.bazId = @bazId AND zap.primaryId = foo.primaryId
WHERE
baz.bazId = @bazId AND -- using the baz_bazId index to query on the bazId
foo.active = true AND
zap.extraData IS NULL
Spanner will handle all the related network I/O re: the data splits.
Yes.
If indexes can be interleaved there would be a benefit but the keys in those interleaved indexes have to be shared (like with any interleaved table). The docs for locality tradeoffs: "Focus on getting the desired locality for the most important root entities and most common access patterns, and let less frequent or less performance sensitive distributed operations happen when they need to."
Yes.
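To illustrate the shared-key requirement for interleaved indexes, here is a hedged sketch against the schema from the question. The index name baz_by_bar_extraData is hypothetical. The index key has to start with the primary key of the table it is interleaved in (here Bar), which is why an index keyed on bazId alone, such as baz_bazId, cannot be interleaved.

```sql
-- Hypothetical interleaved index: the key begins with Bar's primary key
-- (primaryId, secondaryId, barId), so its entries are co-located with the
-- matching Bar row. It helps lookups scoped to a known Bar row, not lookups
-- by bazId alone.
CREATE INDEX baz_by_bar_extraData
  ON Baz (primaryId, secondaryId, barId, extraData),
  INTERLEAVE IN Bar;
```

This matches the locality trade-off quoted above: the index gains locality only for access patterns that already supply the parent key.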