Cassandra modelling: I have a billion digital codes of some kind to store. Should I use wide rows (CQL with a clustering key)?

Question

I am currently working on a Cassandra data model. I have billions of digital codes of some kind (hnm_code) to store, like this:

create table hnm (
    create_batch_id int, // A creation batch can generate up to 1 million codes.
    hnm_code text,       // Cardinality: billions
    product_name text,
    primary key (hnm_code)
);

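The cardinality of create_batch_id is relatively small compared to that of hnm_code. However, what I want is to be able to look up a record using only a single hnm_code value (the create_batch_id is unknown at the time of the query). Should I use wide rows (CQL with a clustering key), like this?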

create table hnm_with_cluster_key (
    create_batch_id int,
    hnm_code text,
    product_name text,
    primary key (create_batch_id, hnm_code)
);

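Thanks! It would be great if you could advise me on how to achieve good performance for this query at massive scale, and an even distribution of hnm_code.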

Answer 1

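Unlike SQL databases, Cassandra uses the first column of the primary key as the partition key. In my opinion, it is better for the partition key not to be unique, so the second design is better.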

Answer 2

what I want is that I should be able to use a value of a single hnm_code column to inquire that record

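In Cassandra, you should design your models to match your query patterns, so this requirement settles the matter. The first solution, with hnm_code as the partition key, will satisfy it.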

the create_batch_id is unknown at the time of query

If you were to use the second solution, with PRIMARY KEY (create_batch_id, hnm_code), you would need to know (and provide) the create_batch_id at query time.
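
As a minimal sketch of that difference, these are the lookups each of the two table definitions from the question can serve. The literal values ('1234' and 42) are made up for illustration.

```cql
-- Table 1: hnm_code is the partition key, so a lookup by code alone works.
SELECT * FROM hnm WHERE hnm_code = '1234';

-- Table 2: create_batch_id is the partition key, so it must be supplied.
SELECT * FROM hnm_with_cluster_key
 WHERE create_batch_id = 42 AND hnm_code = '1234';

-- Table 2 without the partition key: Cassandra rejects this query
-- unless ALLOW FILTERING is appended.
SELECT * FROM hnm_with_cluster_key WHERE hnm_code = '1234';
```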

It would be nice if you could advise me on how can I achieve good performance on massive this query, and evenly distribution of hnm_code?

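Cassandra rows are distributed by the hashed value of the partition key. Therefore, the higher the cardinality of that key, the more evenly your data will be distributed across the cluster. In addition, Cassandra is designed to perform very well on partition key lookups, so your query should be very fast.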
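
If you want to see the hashed values for yourself, CQL's token() function returns the partitioner's hash of a partition key. The query below is only an illustrative sketch against the first table from the question; the LIMIT is arbitrary.

```cql
-- token() returns the partitioner's hash of the partition key;
-- that hash determines which nodes own the row.
SELECT hnm_code, token(hnm_code) FROM hnm LIMIT 5;
```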

In addition, with the 2nd table definition, my query looks like this: select * from hnm_with_cluster_key where hnm_code='1234' allow filtering;

With billions of CQL rows, using the ALLOW FILTERING directive will not perform well. I strongly advise against it.

Now I suppose maybe I just need these 2 tables both, One for select a single hnm_code row by a single condition hnm_code = $hnm_code, one for select a creation batch of hnm_codes by create_batch_id = $batch_id, but I resent this duplication, considering that billions of rows is doubled.

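This is really the crux of your problem. Cassandra simply does not support that kind of query flexibility. Supporting multiple queries from a single table design is often not feasible. If you need to support queries by create_batch_id, then you will need both tables. Neither model will support the other's query well.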

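Yes, data duplication/redundancy may go against everything we were taught in school about normalization. But Cassandra is not designed to work with fully normalized models. Last year I wrote an article for Planet Cassandra that discusses some of these trade-offs: Escaping Disco-Era Data Modeling.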

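Essentially, while massive data duplication is not something anyone really wants to do, it can be a necessary trade-off when designing a high-performing Cassandra model.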
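
As a rough sketch of what keeping both tables could look like, each code can be written to both of them in one logged batch, so that the two copies do not drift apart. This assumes the two table definitions from the question; the values are made up for illustration.

```cql
-- Write the same code to both tables in one logged batch, so that
-- either both inserts are eventually applied or neither is.
BEGIN BATCH
    INSERT INTO hnm (hnm_code, create_batch_id, product_name)
    VALUES ('1234', 42, 'example product');
    INSERT INTO hnm_with_cluster_key (create_batch_id, hnm_code, product_name)
    VALUES (42, '1234', 'example product');
APPLY BATCH;

-- Each query then goes to the table designed for it.
SELECT * FROM hnm WHERE hnm_code = '1234';
SELECT * FROM hnm_with_cluster_key WHERE create_batch_id = 42;
```

A logged batch adds some write overhead, so whether it is worth it here is a judgment call that depends on how strictly the two tables must agree.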

modeling

cassandra