Cassandra - How to group by latest timestamp

I've seen some related topics here, but it's still not clear to me how to group by the latest row value using Cassandra 4.0.1.

Suppose my table looks like this:

CREATE TABLE simple_search (
    engine text,
    term text,
    time bigint,
    rank bigint,
    url text,
    domain text,
    pagenum bigint,
    descr text,
    display_url text,
    title text,
    type text,
    PRIMARY KEY ((domain), term , time , engine, url , pagenum)
) WITH CLUSTERING ORDER BY (term DESC, time DESC,  engine DESC , url DESC);

My data looks like this:

SELECT time, rank, term  from search_by_domain_termsV2 where domain ='zerotoappstore.com' 



time ,    rank, term 
1633297772, 105,  avfoundation swift
1633315263, 112,  best ide
1633332881, 119,  best ide
1633365856, 50,   developing an app cost
1633375273, 36,   developing an app cost

After grouping, I want:

time ,    rank, term 
1633297772, 105,  avfoundation swift
1633332881, 119,  best ide
1633375273, 36,   developing an app cost

If I do this:

SELECT max(time) , rank, term  from search_by_domain_termsV2 where domain ='zerotoappstore.com'  GROUP BY term;

it gives me the correct maximum time value, but not the correct rank and term:

1633297772  105 avfoundation swift
1633332881  112 best ide
1633375273  50  developing an app cost

Is it possible to group by term and take the maximum of time?

@VitalyT,

First, if we don't specify pagenum as part of the CLUSTERING ORDER BY clause of the CREATE TABLE statement, we get an error like the following:

InvalidRequest: Error from server: code=2200 [Invalid query] message="Clustering key columns must exactly match columns in CLUSTERING ORDER BY directive"

So it has to look like the following:

CREATE TABLE IF NOT EXISTS simple_search(
...
PRIMARY KEY (domain, term, time, engine, url, pagenum)
) WITH CLUSTERING ORDER BY (term DESC, time DESC, engine DESC, url DESC, pagenum [ASC|DESC]);

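Next, here is the sample data with 5 rows. Note that I've assumed some values for the engine, url and pagenum columns, since they were not provided in the original question: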

SELECT * FROM simple_search ;
 domain             | term                   | time       | engine  | url  | pagenum | descr | display_url | rank | title | type
--------------------+------------------------+------------+---------+------+---------+-------+-------------+------+-------+------
 zerotoappstore.com | developing an app cost | 1633375273 | engine5 | url5 |       5 |  null |        null |   36 |  null | null
 zerotoappstore.com | developing an app cost | 1633365856 | engine4 | url4 |       4 |  null |        null |   50 |  null | null
 zerotoappstore.com |               best ide | 1633332881 | engine3 | url3 |       3 |  null |        null |  119 |  null | null
 zerotoappstore.com |               best ide | 1633315263 | engine2 | url2 |       2 |  null |        null |  112 |  null | null
 zerotoappstore.com |     avfoundation swift | 1633297772 | engine1 | url1 |       1 |  null |        null |  105 |  null | null

(5 rows)

If we apply MAX(time) without any GROUP BY, we get the following result:

SELECT MAX(time),rank,term FROM simple_search WHERE domain = 'zerotoappstore.com';

 system.max(time) | rank | term
------------------+------+------------------------
       1633375273 |   36 | developing an app cost

(1 rows)

Now, let's see what happens if we include a GROUP BY term clause in the exact same SELECT statement:

SELECT MAX(time), rank, term FROM simple_search WHERE domain = 'zerotoappstore.com' GROUP BY term;
 system.max(time) | rank | term
------------------+------+------------------------
       1633375273 |   36 | developing an app cost
       1633332881 |  119 |               best ide
       1633297772 |  105 |     avfoundation swift

(3 rows)

What if we drop the MAX aggregate function on the time column, since the time values are already stored in descending order? We get the following:

SELECT time,rank,term FROM simple_search WHERE domain = 'zerotoappstore.com' GROUP BY term;

 time       | rank | term
------------+------+------------------------
 1633375273 |   36 | developing an app cost
 1633332881 |  119 |               best ide
 1633297772 |  105 |     avfoundation swift

(3 rows)

Is this the result you want? Also see the corresponding documentation for the specific conditions that apply.
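To illustrate why the clustering order matters here: as far as I understand the CQL documentation, a column selected without an aggregate function in a GROUP BY query returns the first value encountered in each group. The small Python sketch below only simulates that rule on the five sample rows; it does not talk to Cassandra. The function name group_by_term and the time_descending flag are made up for illustration, and the idea that the table in the question stores time in ascending order is my assumption, not something the question confirms.

```python
from itertools import groupby

# Sample rows from the question: (term, time, rank)
ROWS = [
    ("avfoundation swift", 1633297772, 105),
    ("best ide", 1633315263, 112),
    ("best ide", 1633332881, 119),
    ("developing an app cost", 1633365856, 50),
    ("developing an app cost", 1633375273, 36),
]


def group_by_term(rows, time_descending):
    """Simulate SELECT MAX(time), rank, term ... GROUP BY term on one partition.

    Hypothetical helper, not a Cassandra API. Rows are put in clustering
    order (term DESC, then time ASC or DESC). MAX(time) is a real aggregate
    over the group, while rank is taken from the first row of the group.
    """
    # Sort by time, then stable-sort by term DESC so time order is kept per term.
    ordered = sorted(rows, key=lambda r: r[1], reverse=time_descending)
    ordered = sorted(ordered, key=lambda r: r[0], reverse=True)

    result = []
    for term, group in groupby(ordered, key=lambda r: r[0]):
        group = list(group)
        first_row = group[0]  # the row a non-aggregated column is read from
        result.append((max(r[1] for r in group), first_row[2], term))
    return result


if __name__ == "__main__":
    print("time DESC:")
    for row in group_by_term(ROWS, time_descending=True):
        print(row)
    print("time ASC (assumed layout of the table in the question):")
    for row in group_by_term(ROWS, time_descending=False):
        print(row)
```

With time stored in descending order, the first row of each group is also the latest one, so rank lines up with MAX(time) (36, 119, 105). With ascending order, the simulation returns ranks 50, 112 and 105 next to the maximum times, which are the same mismatched ranks the question reports.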