Why do I sometimes have 10,000+ tombstones when I don't do DELETEs?

Question

When I run a repair on my Cassandra nodes, I sometimes see a lot of tombstone-related log entries. The error looks like this:

org.apache.cassandra.db.filter.TombstoneOverwhelmingException: Scanned over 100001 tombstone rows during query 'SELECT * FROM my_keyspace.table_foo WHERE token(<my params>) >= token(<my params>) AND token(<my params>) <= 2988334221698479200 LIMIT 2147385647' (last scanned row partition key was ((<my params>), 7c650d21-797e-4476-93d5-b1248e187f22)); query aborted

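I read here that inserting a tombstone is the way a record gets marked as deleted. However, I don't see any code in this project that runs a delete on this table - just a read and an insert. What am I missing - how can I prevent these TombstoneOverwhelmingExceptions?

Here is the table definition: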

CREATE TABLE my_keyspace.table_foo(
    foo1 text,
    year int,
    month int,
    foo2 text,
    PRIMARY KEY ((foo1, year, month), foo2)
) WITH CLUSTERING ORDER BY (foo2 ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND default_time_to_live = 6912000
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND speculative_retry = '99PERCENTILE';

Answer 1

@anthony, here is my perspective.

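As a first step, avoid getting tombstones into the table in the first place.
Use the full primary key in the read path so that reads don't have to scan tombstones. Data modeling is key here: design the table around the access patterns needed on the read side.
We can tune min_threshold and set it to 2 so that compaction runs more aggressively and clears tombstones sooner.
Similarly, we can tune the common compaction options (for example, setting unchecked_tombstone_compaction to true, or other properties/options) to evict them faster.
I encourage you to look at a similar question and its documented answers here.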
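
As a rough sketch of the tuning and read-path suggestions above, assuming the table keeps SizeTieredCompactionStrategy: the ALTER TABLE statement replaces the whole compaction map, so the class and max_threshold are repeated from the original definition. The literal values in the SELECT are made up for illustration.

```cql
-- Compact once 2 similar-sized SSTables exist (instead of 4), and allow
-- single-SSTable tombstone compactions without the overlap pre-check.
ALTER TABLE my_keyspace.table_foo
WITH compaction = {
    'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
    'min_threshold': '2',
    'max_threshold': '32',
    'unchecked_tombstone_compaction': 'true'
};

-- Read with the full primary key (partition key plus clustering column)
-- so the query targets one row rather than scanning a range.
-- 'example', 2024, 5 and 'some-value' are hypothetical values.
SELECT * FROM my_keyspace.table_foo
WHERE foo1 = 'example' AND year = 2024 AND month = 5 AND foo2 = 'some-value';
```

A lower min_threshold trades extra compaction I/O for faster tombstone cleanup, so it is worth watching disk and CPU load after changing it.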

Answer 2

However, I don't see any code in this project that runs a delete on this table - just a read and an insert.

The code may not be running DELETEs, but the table definition tells Cassandra to expire anything once it is 80 days old (6912000 seconds). TTLs create tombstones.

AND default_time_to_live = 6912000
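
To make the effect of that setting concrete, here is a small sketch using made-up values. A write that does not specify a TTL picks up the table default, so the two inserts below should expire at the same age. Once a row's TTL elapses it is treated as a tombstone, and it can only be purged by compaction after gc_grace_seconds (864000 seconds, i.e. 10 days, in this table) has also passed.

```cql
-- With default_time_to_live = 6912000 on the table, a plain insert such as
INSERT INTO my_keyspace.table_foo (foo1, year, month, foo2)
VALUES ('example', 2024, 5, 'some-value');

-- expires at the same age as one with an explicit TTL:
INSERT INTO my_keyspace.table_foo (foo1, year, month, foo2)
VALUES ('example', 2024, 5, 'some-value')
USING TTL 6912000;  -- 6912000 s / 86400 s per day = 80 days
```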

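So, the idea behind TTLs in a time-series model is that the rows are typically clustered in descending order by timestamp. What ends up happening is that most use cases tend to care only about the most recent data, and the descending order by timestamp causes the tombstones to end up at the "bottom" of the partition, where they are rarely (if ever) queried.

To create that effect, you will need to create a new table with a definition like this: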

PRIMARY KEY ((foo1, year, month), created_time, foo2)
) WITH CLUSTERING ORDER BY (created_time DESC, foo2 ASC)
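
Filled out into a complete statement, that definition might look like the sketch below. The table name table_foo_by_time is hypothetical, the timestamp type for created_time is an assumption (a timeuuid would also work), and the values in the query are made up.

```cql
-- Hypothetical new table; created_time's type is an assumption.
CREATE TABLE my_keyspace.table_foo_by_time (
    foo1 text,
    year int,
    month int,
    created_time timestamp,
    foo2 text,
    PRIMARY KEY ((foo1, year, month), created_time, foo2)
) WITH CLUSTERING ORDER BY (created_time DESC, foo2 ASC)
    AND default_time_to_live = 6912000;

-- Newest rows come first within the partition, so a LIMIT query can stop
-- before it reaches the older, expired rows at the end.
SELECT created_time, foo2
FROM my_keyspace.table_foo_by_time
WHERE foo1 = 'example' AND year = 2024 AND month = 5
LIMIT 100;
```

Existing data would have to be copied into the new table by the application, since the clustering order of an existing table cannot be changed in place.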

cassandra