Cassandra CQL where clause with multiple collection values?

My data model:

tid                                  | codes        | raw          | type
-------------------------------------+--------------+--------------+------
a64fdd60-1bc4-11e5-9b30-3dca08b6a366 | {12, 34, 53} | {sdafb=safd} |  cmd

CREATE TABLE MyTable (
tid       TIMEUUID,
type      TEXT,
codes     SET<INT>,
raw       TEXT,
PRIMARY KEY (tid)
);
CREATE INDEX ON myTable (codes);

How can I query the table to return rows based on multiple set values?

This works:

select * from logData where codes contains 34;

But I want to fetch rows based on multiple set values, and none of these work:

select * from logData where codes contains 34, 12; or 
select * from logData where codes contains 34 and 12; or
select * from logData where codes contains {34, 12};

Please advise.

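The data model you are using is very inefficient. Collections are meant for fetching a set of data for a given primary key, not the other way around. If the reverse lookup is what you need, you will have to rethink the model itself.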

I would suggest creating a separate column for each value you use in the set, and then using those columns as a composite primary key.

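Do you really want to fetch all log entries based on codes alone? That could be a fairly large data set. Realistically, wouldn't you be looking at a specific date or date range? I would key on that, then filter by code, or even filter the codes entirely on the client side.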
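As a rough sketch of the client-side filtering suggested above, assume the rows for a date range have already been fetched and are available as plain dicts. The `rows` sample and the `has_all_codes` helper are hypothetical and not part of any driver API:

```python
def has_all_codes(row, wanted):
    """Return True if the row's codes set contains every wanted code."""
    return set(wanted) <= set(row["codes"])

# Hypothetical rows, standing in for the result of a date-range query.
rows = [
    {"tid": "a", "codes": {12, 34, 53}},
    {"tid": "b", "codes": {34}},
    {"tid": "c", "codes": {12, 53}},
]

# Keep only the rows whose codes include both 34 and 12.
matches = [r["tid"] for r in rows if has_all_codes(r, {34, 12})]
print(matches)
```

This only stays cheap if the date-range query itself returns a bounded number of rows.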

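If you have many codes and you index the collection, you may end up with a very high-cardinality index, which will cause you problems. Whether you maintain your own lookup table or use an index, remember that you essentially have a "table" where the pk is the value, and each "row" holds the rows that match that value. If that looks unacceptably large, then it is.

I would suggest revisiting the requirements. Again... do you really need all log entries that match a particular combination of codes?

If you really do need to analyze everything, then I would suggest using Spark to run the job. You can run a Spark job in which each node processes the data stored on that same node; compared with doing full table processing in the application, this will significantly reduce the impact.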

If I create your table structure and insert a similar row to yours above, I can check for multiple values in the codes collection like this:

aploetz@cqlsh:stackoverflow2> SELECT * FROM mytable 
    WHERE codes CONTAINS 34 
      AND codes CONTAINS 12
      ALLOW FILTERING;

 tid                                  | codes        | raw          | type
--------------------------------------+--------------+--------------+------
 2569f270-1c06-11e5-92f0-21b264d4c94d | {12, 34, 53} | {sdafb=safd} |  cmd

(1 rows)

Now as others have mentioned, let me also tell you why this is a terrible idea...

With a secondary index on the collection (and with the cardinality appearing to be fairly high), every node will have to be checked for each query. The idea with Cassandra is to query by partition key as often as possible, so that you only have to hit one node per query. Apple's Richard Low wrote a great article called "The sweet spot for Cassandra secondary indexes". It should make you rethink the way you use secondary indexes.

Secondly, the only way I could get Cassandra to accept this query was to use ALLOW FILTERING. This means that the only way Cassandra can apply all of your filtering criteria (the WHERE clause) is to pull back every row and individually filter out the rows that do not meet your criteria. Horribly inefficient. To be clear, the ALLOW FILTERING directive is something that you should never use.

In any case, if codes are something that you will need to query by, then you should design an additional query table with codes as a part of the PRIMARY KEY.
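One possible shape for such a query table is one row per (code, tid) pair, so that each code lookup is a single-partition read and the "contains all of these codes" check becomes an intersection on the client. The sketch below only simulates that idea in memory; the `mytable_by_code` table, the `by_code` dict, and both helper functions are assumptions made for illustration, not something from the original schema:

```python
from collections import defaultdict

# Hypothetical query table, one row per (code, tid) pair:
#   CREATE TABLE mytable_by_code (
#       code int, tid timeuuid, raw text, type text,
#       PRIMARY KEY (code, tid));
# The dict below stands in for that table: partition key -> set of tids.
by_code = defaultdict(set)

def write_entry(tid, codes):
    """Fan one log entry out into one row per code."""
    for code in codes:
        by_code[code].add(tid)

def tids_with_all(codes):
    """Read one partition per code, then intersect the results client-side."""
    partitions = [by_code.get(code, set()) for code in codes]
    return set.intersection(*partitions) if partitions else set()

write_entry("t1", {12, 34, 53})
write_entry("t2", {34})
print(tids_with_all([34, 12]))
```

The trade-off is write amplification: every entry is written once per code it carries.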

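I know I'm late to the party. IMO, a few minor changes to the model would be enough to achieve the desired result. What can be done is to have as many rows as there are members of the power set of the set being queried.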

CREATE TABLE data_points_ks.mytable (
    codes frozen<set<int>>,
    tid timeuuid,
    raw text,
    type text,
    PRIMARY KEY (codes, tid)
) WITH CLUSTERING ORDER BY (tid ASC);

INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {12}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {34}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {12, 34}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {53}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {12, 53}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {34, 53}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {12, 34, 53}, '{sdafb=safd}', 'cmd');

 tid                                  | codes        | raw          | type
--------------------------------------+--------------+--------------+------
 8ae81763-1142-11e8-846c-cd9226c29754 |     {34, 53} | {sdafb=safd} |  cmd
 8746adb3-1142-11e8-846c-cd9226c29754 |     {12, 53} | {sdafb=safd} |  cmd
 fea77062-1142-11e8-846c-cd9226c29754 |         {34} | {sdafb=safd} |  cmd
 70ebb790-1142-11e8-846c-cd9226c29754 |     {12, 34} | {sdafb=safd} |  cmd
 6c39c843-1142-11e8-846c-cd9226c29754 |         {12} | {sdafb=safd} |  cmd
 65a954f3-1142-11e8-846c-cd9226c29754 |         null | {sdafb=safd} |  cmd
 03c60433-1143-11e8-846c-cd9226c29754 |         {53} | {sdafb=safd} |  cmd
 82f68d70-1142-11e8-846c-cd9226c29754 | {12, 34, 53} | {sdafb=safd} |  cmd

The following queries are then sufficient, with no filtering required.

SELECT * FROM mytable 
WHERE codes = {12, 34};

SELECT * FROM mytable 
WHERE codes = {34};
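The eight INSERT statements in this approach are the power set of {12, 34, 53}. A minimal sketch of how an application might enumerate those subsets before writing them is shown here; the `power_set` and `cql_set_literal` helpers are hypothetical names introduced for illustration:

```python
from itertools import chain, combinations

def power_set(codes):
    """Return every subset of codes, from the empty set up to the full set."""
    items = sorted(codes)
    subsets = chain.from_iterable(
        combinations(items, size) for size in range(len(items) + 1)
    )
    return [set(s) for s in subsets]

def cql_set_literal(subset):
    """Render a Python set of ints as a CQL set literal such as {12, 34}."""
    return "{" + ", ".join(str(c) for c in sorted(subset)) + "}"

subsets = power_set({12, 34, 53})
for s in subsets:
    # Each literal corresponds to the codes value of one INSERT.
    print(cql_set_literal(s))
```

A set of n codes produces 2^n rows, so this only remains practical while the sets are small.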