Understanding Cassandra's internal data storage

Question

I have this table:

create table comment_by_post
(
    postId uuid,
    userId uuid,
    cmntId timeuuid,
    cmntTxt text,   
    cmntBy text,
    time bigint, 
    primary key ((postId, userId),cmntId)
)

Here is how the table's data is stored internally:

RowKey: 4978f728-0f96-11e5-a6c0-1697f925ec7b:4978f728-0f96-12e5-a6c0-1697f92e537a
=> (name=d3f02a30-126f-11e5-879b-e700f669bcfc:, value=, timestamp=1434270721107000)
=> (name=d3f02a30-126f-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e743434, timestamp=1434270721107000)
-------------------
RowKey: 4978f728-0f96-11e5-a6c0-1697f925ec7b:4978f728-0f96-12e5-a6c0-1697f92eec7a
=> (name=465fee30-126f-11e5-879b-e700f669bcfc:, value=, timestamp=1434270483603000)
=> (name=465fee30-126f-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7432, timestamp=1434270483603000)
=> (name=4ba89f40-126f-11e5-879b-e700f669bcfc:, value=, timestamp=1434270492468000)
=> (name=4ba89f40-126f-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7431, timestamp=1434270492468000)
=> (name=504a61f0-126f-11e5-879b-e700f669bcfc:, value=, timestamp=1434270500239000)
=> (name=504a61f0-126f-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7433, timestamp=1434270500239000)
-------------------
RowKey: 4978f728-0f96-11e5-a6c0-1697f925ec7b:4978f728-0f96-12e5-a6c0-1697f92e237a
=> (name=cd1e8f30-126f-11e5-879b-e700f669bcfc:, value=, timestamp=1434270709667000)
=> (name=cd1e8f30-126f-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7433, timestamp=1434270709667000)

If I instead use primary key (postId, userId,cmntId), it looks like this:

RowKey: 4978f728-0f96-11e5-a6c0-1697f925ec7b
=> (name=4978f728-0f96-12e5-a6c0-1697f92eec7a:971da150-1260-11e5-879b-e700f669bcfc:, value=, timestamp=1434264176613000)

=> (name=4978f728-0f96-12e5-a6c0-1697f92eec7a:971da150-1260-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7431, timestamp=1434264176613000)

=> (name=4978f728-0f96-12e5-a6c0-1697f92eec7a:a0d4a900-1260-11e5-879b-e700f669bcfc:, value=, timestamp=1434264192912000)

=> (name=4978f728-0f96-12e5-a6c0-1697f92eec7a:a0d4a900-1260-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7432, timestamp=1434264192912000)

=> (name=4978f728-0f96-12e5-a6c0-1697f92eec7a:a5d94c30-1260-11e5-879b-e700f669bcfc:, value=, timestamp=1434264201331000)

Why is that, and what are the benefits of each?

Answer 1

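The first primary key uses postId and userId as the partition key and cmntId as the clustering column. Note that the value used for the RowKey contains the values of both postId and userId, separated by a colon (:). The value of the clustering column is then used as the prefix of the name of each cell within the row.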

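In your second example, the primary key lacks the inner parentheses around the partition key. They may be omitted, but it is usually better to include them, because they make explicit which parts of the primary key are used for partitioning and which for clustering. When the extra parentheses are left out, only the first column is used as the partition key (visible in the RowKey value in cassandra-cli). All subsequent columns are treated as clustering columns, which we can verify by looking at the cell names.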
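The mapping described above can be sketched in a few lines of Python. This is only a toy model of what cassandra-cli displays for this table, not Cassandra's actual storage code; the function name thrift_layout and the dictionary representation of a row are assumptions made for illustration. The UUIDs are taken from the question's first dump.

```python
def thrift_layout(row, partition_cols, clustering_cols):
    """Return (row_key, cell_names) roughly as cassandra-cli shows them.

    Hypothetical helper: a simplified model, not a Cassandra API.
    """
    # Partition key columns are joined with ':' to form the RowKey.
    row_key = ":".join(str(row[c]) for c in partition_cols)
    # Clustering column values become the prefix of every cell name.
    prefix = ":".join(str(row[c]) for c in clustering_cols) + ":"
    key_cols = set(partition_cols) | set(clustering_cols)
    # Unquoted CQL column names are stored lower-cased; null columns get no cell.
    regular = sorted(
        name.lower()
        for name, value in row.items()
        if name not in key_cols and value is not None
    )
    # The first cell (empty suffix, empty value) is the row marker.
    return row_key, [prefix] + [prefix + name for name in regular]


row = {
    "postId": "4978f728-0f96-11e5-a6c0-1697f925ec7b",
    "userId": "4978f728-0f96-12e5-a6c0-1697f92e537a",
    "cmntId": "d3f02a30-126f-11e5-879b-e700f669bcfc",
    "cmntTxt": "cmnt44",
}

# primary key ((postId, userId), cmntId)
key_a, cells_a = thrift_layout(row, ["postId", "userId"], ["cmntId"])
# primary key (postId, userId, cmntId)
key_b, cells_b = thrift_layout(row, ["postId"], ["userId", "cmntId"])
```

With the first definition, key_a is the two UUIDs joined by a colon and the cell names start with the cmntId. With the second, key_b is the postId alone and each cell name starts with userId:cmntId.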

Answer 2

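Christopher has already explained how the partition key columns are concatenated to produce the row key used for storage, so I won't rehash that (no pun intended). I will, however, explain the advantages and disadvantages of the two approaches.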

PRIMARY KEY (postId, userId,cmntId)

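With this PRIMARY KEY, your data is partitioned by postId and clustered by userId and cmntId. This means that all comments on a post are stored together on disk under that postId, sorted first by userId and then by cmntId.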

The advantage here is query flexibility. You can query all comments on a post, or all comments by a specific user on a post.

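The disadvantage is that, compared with the other solution, you are more likely to run into unbounded row growth. If the total number of columns (cells) under a single postId exceeded 2 billion, you would hit the maximum amount of data that can be stored per postId. But the odds of storing that much comment data for one post are low, so you should be fine.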
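To put the 2 billion figure in terms of comments, here is a rough back-of-the-envelope sketch. It assumes, as the cassandra-cli dumps in the question suggest, that each comment costs one marker cell plus one cell per regular column that is actually set; the function name is hypothetical.

```python
MAX_CELLS_PER_PARTITION = 2_000_000_000  # the limit cited in this answer


def max_comments_per_partition(regular_columns_set):
    # Assumption: one row-marker cell plus one cell per non-null regular column.
    cells_per_comment = 1 + regular_columns_set
    return MAX_CELLS_PER_PARTITION // cells_per_comment


only_text = max_comments_per_partition(1)   # only cmntTxt set, as in the dumps
all_three = max_comments_per_partition(3)   # cmntTxt, cmntBy and time all set
```

Under these assumptions, a single postId could hold about one billion comments when only cmntTxt is set, and about 500 million when all three regular columns are set.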

PRIMARY KEY ((postId, userId),cmntId)

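This solution stores comment data together under a row key that concatenates postId and userId (sorted by cmntId), which helps rule out unbounded row growth. That is its advantage over your other solution.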

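The disadvantage is the loss of query flexibility, because you now need to provide both postId and userId with every query. This PRIMARY KEY definition simply does not support querying comments by postId alone, because Cassandra CQL requires you to provide the complete partition key in a query.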
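The trade-off between the two definitions can be illustrated with a small Python toy model. It is a sketch under simplifying assumptions, not how Cassandra is implemented: the ToyTable class, its methods, and the short ids (p1, u1, c1, ...) are all hypothetical, and clustering keys are sorted as plain strings rather than by timeuuid time order.

```python
from collections import defaultdict


class ToyTable:
    """Toy model: a partition can only be reached with its full partition key."""

    def __init__(self, partition_cols, clustering_cols):
        self.partition_cols = list(partition_cols)
        self.clustering_cols = list(clustering_cols)
        # partition key tuple -> {clustering key tuple -> row dict}
        self.partitions = defaultdict(dict)

    def insert(self, row):
        pk = tuple(row[c] for c in self.partition_cols)
        ck = tuple(row[c] for c in self.clustering_cols)
        self.partitions[pk][ck] = row

    def select(self, **where):
        # Every partition key column must be restricted.
        missing = [c for c in self.partition_cols if c not in where]
        if missing:
            raise ValueError(f"partition key columns missing: {missing}")
        # Clustering columns may only be restricted as a leading prefix.
        given = [c for c in self.clustering_cols if c in where]
        if given != self.clustering_cols[:len(given)]:
            raise ValueError("clustering columns must be restricted in order")
        pk = tuple(where[c] for c in self.partition_cols)
        prefix = tuple(where[c] for c in given)
        rows = self.partitions.get(pk, {})
        return [rows[ck] for ck in sorted(rows) if ck[:len(prefix)] == prefix]


comments = [
    {"postId": "p1", "userId": "u1", "cmntId": "c1"},
    {"postId": "p1", "userId": "u1", "cmntId": "c2"},
    {"postId": "p1", "userId": "u2", "cmntId": "c3"},
]

wide = ToyTable(["postId"], ["userId", "cmntId"])        # PRIMARY KEY (postId, userId, cmntId)
composite = ToyTable(["postId", "userId"], ["cmntId"])   # PRIMARY KEY ((postId, userId), cmntId)
for comment in comments:
    wide.insert(comment)
    composite.insert(comment)

all_on_post = wide.select(postId="p1")                   # all comments on the post
by_user = wide.select(postId="p1", userId="u1")          # one user's comments on the post
by_user_composite = composite.select(postId="p1", userId="u1")
```

In this model, composite.select(postId="p1") raises a ValueError, mirroring the CQL requirement to supply the whole partition key, while the same data is split across two smaller partitions instead of one.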
