Time Series schema design in Cassandra

Question

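All,

We are doing a POC for an IoT-based application. The database of choice is Cassandra. We will receive time series data from devices installed on vehicles. The main attributes of the time series data are as follows: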

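TimeStamp :- the date and time at which the data was received
DeviceId :- the unique ID of the device installed on the vehicle
Latitude :- the current latitude of the vehicle
Longitude :- the current longitude of the vehicle
Speed :- the speed of the vehicle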

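We are planning to use month and year as the partition key, and device ID and timestamp as the clustering keys. Is this the best way to fetch data with the following kinds of queries?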

Retrieve the data for a device with the DeviceId between a start date and end date
Retrieve the data for all devices between a start date and end date

Thanks in advance

Answer 1

Not sure, but how about using ELK/Elasticsearch as your time series database...

Answer 2

Data modeling in Cassandra is best done with a query-driven approach. See this blog post for the "Rules" to follow when modeling for Cassandra.

Rule 1: Spread Data Evenly Around the Cluster

Rule 2: Minimize the Number of Partitions Read

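You gave 2 queries in your question that differ only in scope. One asks for data in a time range for a given device ID; the other asks for data in a time range regardless of device ID.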

Retrieve the data for a device with the DeviceId between a start date and end date

Retrieve the data for all devices between a start date and end date

The query your table(s) should support looks like this:

What are the latitude, longitude and speed of device x during time period y?
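As a rough illustration of that query against the layout proposed in the question (month and year as the partition key, device ID and timestamp as clustering columns), here is a minimal Python sketch. The table name `vehicle_readings`, the column names, the `'YYYY-MM'` bucket format and both helper functions are assumptions made up for this example, not something the question or the linked posts specify. The CQL is only held in strings; actually running it would need a driver session, which is not shown.

```python
from datetime import datetime

# Hypothetical table mirroring the layout proposed in the question:
# one partition per month, rows clustered by device and then by timestamp.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS vehicle_readings (
    year_month text,
    device_id  text,
    ts         timestamp,
    latitude   double,
    longitude  double,
    speed      double,
    PRIMARY KEY ((year_month), device_id, ts)
)
"""

# Equality on the partition key, range operators only on a clustering column.
SELECT_DEVICE_RANGE = (
    "SELECT ts, latitude, longitude, speed FROM vehicle_readings "
    "WHERE year_month = ? AND device_id = ? AND ts >= ? AND ts <= ?"
)


def month_buckets(start, end):
    """Return every 'YYYY-MM' bucket touched by the inclusive range [start, end]."""
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets


def device_range_queries(device_id, start, end):
    """Build one (statement, parameters) pair per month partition in the range."""
    return [
        (SELECT_DEVICE_RANGE, (bucket, device_id, start, end))
        for bucket in month_buckets(start, end)
    ]


if __name__ == "__main__":
    for stmt, params in device_range_queries(
        "device-42", datetime(2015, 11, 15), datetime(2016, 2, 3)
    ):
        print(params[0], stmt)
```

A date range that spans several months turns into one query per month bucket, because the partition key can only be matched by equality; the range operators stay on the `ts` clustering column.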

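Consider the number of data points when partitioning. What is the normal time frame? Is it by minute, hour, day, week, or month? That time frame should help determine how writes and partitioning are handled. If you partition by month and year, that works as long as there are no more than 2 billion sensor readings per month. See this SO answer for a detailed explanation of partitioning around limits.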
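To make the sizing question concrete, here is a back-of-the-envelope Python sketch that estimates how many readings land in one month-and-year partition and compares the result with the 2 billion figure quoted above. The device counts and the 10-second reporting interval are invented for illustration, and the helper names are hypothetical.

```python
# Ceiling taken from the answer above: roughly 2 billion readings per partition.
PARTITION_LIMIT = 2_000_000_000

SECONDS_PER_DAY = 86_400


def readings_per_month(devices, interval_seconds, days=31):
    """Estimate readings written to a single month partition shared by all devices."""
    per_device = (days * SECONDS_PER_DAY) // interval_seconds
    return devices * per_device


def fits_in_month_partition(devices, interval_seconds, days=31):
    """True if the estimated monthly volume stays within the quoted ceiling."""
    return readings_per_month(devices, interval_seconds, days) <= PARTITION_LIMIT


if __name__ == "__main__":
    # Assumed fleet sizes, each device reporting every 10 seconds.
    for fleet in (1_000, 10_000):
        total = readings_per_month(fleet, 10)
        print(fleet, total, fits_in_month_partition(fleet, 10))
```

Under these assumed numbers, 1,000 devices stay well below the ceiling while 10,000 devices exceed it, which would suggest a finer time bucket (day or hour) or adding the device ID to the partition key.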

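Understanding partitioning is key to enabling ranged result sets. See the following excerpt from "Deep look at the CQL WHERE clause".

You will not be able to use the <, > operators on the partition key. (ALLOW FILTERING can work around this, but do not make it part of your core schema design.) The operators have to be used on clustering columns.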

Cassandra distributes the partition across the nodes using the selected partitioner. As only the ByteOrderedPartitioner keeps an ordered distribution of data, Cassandra does not support >, >=, <= and < operator directly on the partition key.

Instead, it allows you to use the >, >=, <= and < operator on the partition key through the use of the token function.

SELECT * FROM numberOfRequests
    WHERE token(cluster, date) > token('cluster1', '2015-06-03')
    AND token(cluster, date) <= token('cluster1', '2015-06-05')
    AND time = '12:00';
