Cassandra: TTL vs dynamic tables vs large amount of deletes

Question

I basically have a data table like this (one partition key, id, and one serialized value, serialized_value):

CREATE TABLE keyspace.data (
    id bigint,
    serialized_value blob,
    PRIMARY KEY (id)
) WITH caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
  AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy', 'enabled': 'true'}
  AND compression = {'class': 'LZ4Compressor'};

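The use case involves maintaining multiple versions of the data (the serialized_value for a given id).

Every day, I have to push a new version of the data to Cassandra. Each push involves 100 million rows/partitions.

Of course, I do not need to keep every version of the data, only the last 4 days (that is, the four most recent version_id values).

I have identified three solutions: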

Solution 1: TTL

The idea is to set a TTL at insertion time. This way, the oldest versions of the data are removed automatically, without any tombstone-related problems.

pros :

no read performance penalty (?)

no problem related to tombstones

cons :

if ingestion fails for several days in a row, I may lose all the data in the Cassandra cluster because of the automatic TTL deletion
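For Solution 1, a minimal sketch of what the insert could look like. It assumes the TTL equals the 4-day retention window from the question and that the statement is later prepared and bound by a driver; the helper names `ttl_seconds` and `insert_with_ttl_cql` are hypothetical, only `USING TTL` is actual CQL.

```python
from datetime import timedelta

# Retention window stated in the question: the last 4 days.
RETENTION = timedelta(days=4)


def ttl_seconds(retention: timedelta = RETENTION) -> int:
    """Convert the retention window to the whole seconds that CQL expects."""
    return int(retention.total_seconds())


def insert_with_ttl_cql(table: str = "keyspace.data") -> str:
    """Build an INSERT statement whose cells expire after the retention window.

    The '?' markers are bind placeholders for a prepared statement.
    """
    return (
        f"INSERT INTO {table} (id, serialized_value) "
        f"VALUES (?, ?) USING TTL {ttl_seconds()}"
    )


if __name__ == "__main__":
    print(insert_with_ttl_cql())
```

A longer TTL than the strict 4 days would leave some margin for the ingestion failures listed under cons, at the cost of extra disk usage.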

Solution 2: dynamic tables

The table creation becomes:

CREATE TABLE keyspace.data_{version_id} (
    id bigint,
    serialized_value blob,
    PRIMARY KEY (id)
) ...;

The table name includes the version_id.

pros :

the table (corresponding to a version) is easy to delete

no read performance penalty

no problem related to tombstones

cons :

dynamically adding a table to the cluster might need all the nodes to be up every time.

a bit more difficult to handle client side (query specific table name, instead of the same one)
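To give an idea of the client-side handling for Solution 2, here is a rough sketch of the table-name bookkeeping. It assumes version_id is an integer that sorts chronologically (for example a yyyymmdd date) and that the client already knows which versions exist; `table_name`, `create_table_cql` and `drop_statements` are hypothetical helpers, not part of any driver.

```python
from typing import Iterable, List

KEEP_VERSIONS = 4  # the question keeps the four most recent versions


def table_name(version_id: int) -> str:
    """Name of the per-version table, following the data_{version_id} pattern."""
    return f"keyspace.data_{version_id}"


def create_table_cql(version_id: int) -> str:
    """CREATE statement for one version (table options omitted for brevity)."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name(version_id)} ("
        "id bigint, serialized_value blob, PRIMARY KEY (id))"
    )


def drop_statements(existing: Iterable[int], keep: int = KEEP_VERSIONS) -> List[str]:
    """DROP statements for every version older than the `keep` newest ones."""
    ordered = sorted(existing)
    expired = ordered[:-keep] if keep > 0 else ordered
    return [f"DROP TABLE IF EXISTS {table_name(v)}" for v in expired]


if __name__ == "__main__":
    versions = [20240101, 20240102, 20240103, 20240104, 20240105]
    print(create_table_cql(20240106))
    print(drop_statements(versions))
```

Readers would then query `table_name(version_id)` instead of a fixed table, which is the extra client-side work mentioned above.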

Solution 3: large amount of deletes

In this case, all data stays in a single table, and version_id is added to the primary key.

CREATE TABLE keyspace.data (
    version_id int,
    id bigint,
    serialized_value blob,
    PRIMARY KEY ((version_id,id))
) ...;

pros :

only one single table to create and maintain, for the entire application lifecycle

cons :

a read performance penalty may occur because of the large number of tombstones

problems related to tombstones, because a large amount of data needs to be deleted in order to purge everything related to an old version_id

the delete will only match the exact partition key, so it will generate partition tombstones and NOT cell tombstones. Even so, I am worried about the performance of doing that.
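To illustrate the volume involved in Solution 3: because the partition key is (version_id, id), each delete has to name both columns, so purging one version means one statement per id. The sketch below only prepares the bind parameters in chunks; it assumes the client can enumerate the ids of the old version, and `delete_params` and the chunk size are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# One partition-level delete; both partition key columns must be bound.
DELETE_CQL = "DELETE FROM keyspace.data WHERE version_id = ? AND id = ?"


def delete_params(
    version_id: int, ids: Iterable[int], chunk_size: int = 1000
) -> Iterator[List[Tuple[int, int]]]:
    """Yield chunks of (version_id, id) bind parameters for DELETE_CQL.

    Each tuple produces one partition tombstone when executed.
    """
    it = iter(ids)
    while True:
        chunk = [(version_id, i) for i in islice(it, chunk_size)]
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    for chunk in delete_params(7, range(5), chunk_size=2):
        print(DELETE_CQL, chunk)
```

With 100 million ids per version, this amounts to 100 million deletes per day, which is where the performance worry comes from.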

What would be the best way to achieve this? :-)

Answer 1

It would be best to cluster the data by a date or timestamp sorted in descending order, while still setting a TTL. For example:

CREATE TABLE ks.blobs_by_id (
    id bigint,
    version timestamp,
    serialized_value blob,
    PRIMARY KEY (id, version)
) WITH CLUSTERING ORDER BY (version DESC);

If you have a default TTL on the table, older versions expire automatically, so when you retrieve rows with the following:

SELECT ... FROM blobs_by_id WHERE id = ? LIMIT 4

only the 4 most recent rows are returned (in descending order), and you do not iterate over the deleted rows. Cheers!
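As a rough sketch of the default TTL mentioned above, the table option could be set as follows. The 4-day value is taken from the question and is an assumption here, and `default_ttl_cql` is a hypothetical helper; `default_time_to_live` is the actual table option, expressed in seconds.

```python
def default_ttl_cql(table: str = "ks.blobs_by_id", days: int = 4) -> str:
    """Build an ALTER TABLE statement that sets the table-level default TTL."""
    seconds = days * 24 * 60 * 60
    return f"ALTER TABLE {table} WITH default_time_to_live = {seconds}"


if __name__ == "__main__":
    print(default_ttl_cql())
```

Rows written without an explicit `USING TTL` then inherit this value, so the writer does not need to set it on every insert.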

Tags: cassandra, datastax-enterprise, datastax