CQL (Cassandra) - Select only the rows with the maximum value in one of the columns
I need to find the rows with a given stationid, a time1 at or after a specified time, and the maximum time2.
The table is created like this:
CREATE TABLE forec (
stationid int,
time1 timestamp,
time2 timestamp,
value double,
PRIMARY KEY ((stationid), time1, time2)
) WITH CLUSTERING ORDER BY (time1 DESC)
Suppose the data in the table looks like this:
+------------+-----------------------+----------------------+--------+
| stationid | time1 | time2 | value |
+------------+-----------------------+----------------------+--------+
| 1 | 2020-10-21 06:00:00 | 2020-10-21 05:00:00 | 1 |
| 1 | 2020-10-21 06:00:00 | 2020-10-21 04:00:00 | 2 |
| 1 | 2020-10-21 06:00:00 | 2020-10-21 03:00:00 | 3 |
| 1 | 2020-10-21 05:00:00 | 2020-10-21 04:00:00 | 4 |
| 1 | 2020-10-21 05:00:00 | 2020-10-21 03:00:00 | 5 |
| 1 | 2020-10-21 04:00:00 | 2020-10-21 02:00:00 | 6 |
+------------+-----------------------+----------------------+--------+
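I want to query:
Give me all rows with stationid = 1 and time1 >= 2020-10-21 05:00:00, keeping for each time1 only the row whose time2 is the maximum. The query should return the following rows: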
+------------+-----------------------+----------------------+--------+
| stationid | time1 | time2 | value |
+------------+-----------------------+----------------------+--------+
| 1 | 2020-10-21 06:00:00 | 2020-10-21 05:00:00 | 1 |
| 1 | 2020-10-21 05:00:00 | 2020-10-21 04:00:00 | 4 |
+------------+-----------------------+----------------------+--------+
I know I can query like this:
SELECT * FROM forec WHERE stationid = 1 AND time1 >= '2020-10-21 05:00:00';
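and then filter the results on the client side (keeping only the rows with the maximum time2), but I would like to know whether this can be done more efficiently, by filtering the results on the Cassandra side.
Or should I change the table model?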
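For reference, the client-side filtering described above could look roughly like the following Python sketch. It assumes the rows have already been fetched and are available as dictionaries keyed by column name; the helper names latest_per_time1 and ts are hypothetical and only serve the illustration.

```python
from datetime import datetime


def latest_per_time1(rows):
    """Keep, for each time1, the row with the largest time2."""
    best = {}
    for row in rows:
        current = best.get(row["time1"])
        if current is None or row["time2"] > current["time2"]:
            best[row["time1"]] = row
    # Newest time1 first, matching the table's clustering order.
    return sorted(best.values(), key=lambda r: r["time1"], reverse=True)


def ts(text):
    # Hypothetical helper: parse the timestamp format used in the sample data.
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


# The rows the SELECT above would return for the sample data.
rows = [
    {"stationid": 1, "time1": ts("2020-10-21 06:00:00"), "time2": ts("2020-10-21 05:00:00"), "value": 1.0},
    {"stationid": 1, "time1": ts("2020-10-21 06:00:00"), "time2": ts("2020-10-21 04:00:00"), "value": 2.0},
    {"stationid": 1, "time1": ts("2020-10-21 06:00:00"), "time2": ts("2020-10-21 03:00:00"), "value": 3.0},
    {"stationid": 1, "time1": ts("2020-10-21 05:00:00"), "time2": ts("2020-10-21 04:00:00"), "value": 4.0},
    {"stationid": 1, "time1": ts("2020-10-21 05:00:00"), "time2": ts("2020-10-21 03:00:00"), "value": 5.0},
]

result = latest_per_time1(rows)
for row in result:
    print(row["time1"], row["time2"], row["value"])
```

This keeps one row per time1 but still transfers every matching row to the client, which is the cost the question is trying to avoid.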
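Edit: According to the Cassandra documentation, "If a column is selected without an aggregate function, in a statement with a GROUP BY, the first value encountered in each group will be returned." Therefore the following example only works if time2 is stored in DESC order. The table in the question declares only time1 DESC, so time2 defaults to ASC; the clustering order would have to be (time1 DESC, time2 DESC) for the value column below to come from the row with the maximum time2.
If you are using a recent version of Cassandra (for example 3.11.x), you can use GROUP BY to do something like this: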
SELECT
stationid,
time1,
max(time2) AS max_time2,
value
FROM
forec
WHERE
stationid = 1
AND
time1 >= '2020-10-21 05:00:00'
GROUP BY time1;
You get:
cqlsh:test> SELECT stationid, time1, max(time2) as max_time2, value FROM forec WHERE stationid = 1 AND time1 >= '2020-10-21 05:00:00' GROUP BY time1;
stationid | time1 | max_time2 | value
-----------+---------------------------------+---------------------------------+-------
1 | 2020-10-21 06:00:00.000000+0000 | 2020-10-21 05:00:00.000000+0000 | 1
1 | 2020-10-21 05:00:00.000000+0000 | 2020-10-21 04:00:00.000000+0000 | 4
(2 rows)
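Note that this scans your partition, so keep an eye on partition size, especially since you are using timestamps as clustering columns.
Solution using UDA/UDFs:
State function: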
CREATE OR REPLACE FUNCTION curValState (state tuple<timestamp, double>, time timestamp, value double)
CALLED ON NULL INPUT
RETURNS tuple<timestamp, double>
LANGUAGE java AS '
  // Ignore rows where either input is null.
  if (time != null && value != null) {
    if (state == null) {
      // First usable row of the group: build the (time, value) tuple.
      com.datastax.driver.core.TupleType tupleType = com.datastax.driver.core.TupleType.of(
          com.datastax.driver.core.ProtocolVersion.NEWEST_SUPPORTED,
          com.datastax.driver.core.CodecRegistry.DEFAULT_INSTANCE,
          com.datastax.driver.core.DataType.timestamp(),
          com.datastax.driver.core.DataType.cdouble());
      state = tupleType.newValue(time, value);
    } else if (state.getTimestamp(0).compareTo(time) < 0) {
      // A later time was seen: keep it together with its value.
      state.setTimestamp(0, time);
      state.setDouble(1, value);
    }
  }
  return state;
';
Final function:
CREATE OR REPLACE FUNCTION finalVal ( state tuple<timestamp, double> ) CALLED ON NULL INPUT RETURNS double LANGUAGE java AS 'if (state == null) return null; return state.getDouble(1);';
Aggregate function:
CREATE OR REPLACE AGGREGATE valueatlatesttime (timestamp, double) SFUNC curValState STYPE tuple<timestamp, double> FINALFUNC finalVal INITCOND null;
Query:
SELECT
stationid,
time1,
max(time2) AS max_time2,
valueatlatesttime(time2, value) AS value_at_max_time2
FROM
forec
WHERE
stationid = 1
AND
time1 >= '2020-10-21 05:00:00'
GROUP BY time1;
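To make the aggregate's behaviour easier to follow, here is a small Python sketch of the same fold. It only illustrates the semantics and is not how Cassandra executes the UDA: the Python names mirror the CQL functions above, the state is modelled as a plain (time, value) tuple, and returning None for a group with no usable rows is an assumption of this sketch.

```python
from datetime import datetime
from functools import reduce


def cur_val_state(state, time, value):
    # Mirrors curValState: skip null inputs, keep the pair with the latest time.
    if time is not None and value is not None:
        if state is None or state[0] < time:
            state = (time, value)
    return state


def final_val(state):
    # Mirrors finalVal: extract the value from the (time, value) state.
    return None if state is None else state[1]


def value_at_latest_time(pairs):
    # Fold the state function over every (time2, value) pair in one group.
    state = reduce(lambda s, p: cur_val_state(s, p[0], p[1]), pairs, None)
    return final_val(state)


# The time1 = 2020-10-21 06:00:00 group from the sample data, in ascending time2 order.
group = [
    (datetime(2020, 10, 21, 3), 3.0),
    (datetime(2020, 10, 21, 4), 2.0),
    (datetime(2020, 10, 21, 5), 1.0),
]
result = value_at_latest_time(group)
print(result)
```

Because the state function compares times explicitly, the result does not depend on the order in which rows of a group are visited, unlike the plain GROUP BY approach.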