Dealing with uncompactable/overlapping sstables in Cassandra

Question

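We have a new cluster running Cassandra 2.2.14, and have left compactions to "sort themselves out". This is in our UAT environment, so load is low. We run STCS.

We are seeing forever-growing tombstones. I understand that compaction will eventually take care of the data once an sstable becomes eligible for compaction. That does not happen often enough for us, so I enabled some settings as a test (I am aware they are aggressive; this is purely for testing):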

'tombstone_compaction_interval': '120', 
'unchecked_tombstone_compaction': 'true', 
'tombstone_threshold': '0.2', 
'min_threshold': '2'

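This did result in some compactions occurring, but the number of tombstones dropped was low, and the ratio did not go below the threshold (0.2). After these settings were applied, this is what I see from sstablemetadata: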

Estimated droppable tombstones: 0.3514636277302944
Estimated droppable tombstones: 0.0
Estimated droppable tombstones: 6.007563159628437E-5

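Note that this is only one CF, and there are far worse CFs (90% tombstones, etc.). I am using this one as an example, but all CFs show the same symptoms.

Table stats: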

                SSTable count: 3
                Space used (live): 3170892738
                Space used (total): 3170892738
                Space used by snapshots (total): 3170892750
                Off heap memory used (total): 1298648
                SSTable Compression Ratio: 0.8020960426857765
                Number of keys (estimate): 506775
                Memtable cell count: 4
                Memtable data size: 104
                Memtable off heap memory used: 0
                Memtable switch count: 2
                Local read count: 2161
                Local read latency: 14.531 ms
                Local write count: 212
                Local write latency: NaN ms
                Pending flushes: 0
                Bloom filter false positives: 0
                Bloom filter false ratio: 0.00000
                Bloom filter space used: 645872
                Bloom filter off heap memory used: 645848
                Index summary off heap memory used: 192512
                Compression metadata off heap memory used: 460288
                Compacted partition minimum bytes: 61
                Compacted partition maximum bytes: 5839588
                Compacted partition mean bytes: 8075
                Average live cells per slice (last five minutes): 1.0
                Maximum live cells per slice (last five minutes): 1
                Average tombstones per slice (last five minutes): 124.0
                Maximum tombstones per slice (last five minutes): 124

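The obvious answer here would be that the tombstones were not eligible for removal.

gc_grace_seconds is set to 10 days and has not been changed. I dumped one of the sstables to JSON, and I can see tombstones dating back to April 2019: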

{"key": "353633393435353430313436373737353036315f657370a6215211e68263740a8cc4fdec",
 "cells": [["d62cf4f420fb11e6a92baabbb43c0a93",1566793260,1566793260977489,"d"],
           ["d727faf220fb11e6a67702e5d23e41ec",1566793260,1566793260977489,"d"],
           ["d7f082ba20fb11e6ac99efca1d29dc3f",1566793260,1566793260977489,"d"],
           ["d928644a20fb11e696696e95ac5b1fdd",1566793260,1566793260977489,"d"],
           ["d9ff10bc20fb11e69d2e7d79077d0b5f",1566793260,1566793260977489,"d"],
           ["da935d4420fb11e6a960171790617986",1566793260,1566793260977489,"d"],
           ["db6617c020fb11e6925271580ce42b57",1566793260,1566793260977489,"d"],
           ["dc6c40ae20fb11e6b1163ce2bad9d115",1566793260,1566793260977489,"d"],
           ["dd32495c20fb11e68f7979c545ad06e0",1566793260,1566793260977489,"d"],
           ["ddd7d9d020fb11e6837dd479bf59486e",1566793260,1566793260977489,"d"]]},
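As a quick sanity check on the dump above, the second field of each tombstone cell (1566793260) can be read as the local deletion time in epoch seconds; that reading is an assumption about the sstable2json layout in 2.x. A minimal Python sketch, assuming gc_grace_seconds is the stated 10 days, shows when such a tombstone stops being protected. The `now` value used at the end is hypothetical:

```python
from datetime import datetime, timedelta, timezone

GC_GRACE_SECONDS = 10 * 24 * 3600  # 10 days, as configured on the table


def purge_eligible_at(local_deletion_time, gc_grace=GC_GRACE_SECONDS):
    """Earliest moment (UTC) at which gc_grace_seconds no longer protects the tombstone."""
    deleted = datetime.fromtimestamp(local_deletion_time, tz=timezone.utc)
    return deleted + timedelta(seconds=gc_grace)


def past_gc_grace(local_deletion_time, now, gc_grace=GC_GRACE_SECONDS):
    """True if the tombstone is older than gc_grace_seconds at time `now`."""
    return now >= purge_eligible_at(local_deletion_time, gc_grace)


if __name__ == "__main__":
    sample = 1566793260  # value taken from the dump above
    print(purge_eligible_at(sample).isoformat())
    # Hypothetical "now", chosen only for illustration
    print(past_gc_grace(sample, datetime(2019, 10, 1, tzinfo=timezone.utc)))
```

Being past gc_grace_seconds is necessary for a tombstone to be dropped, but it is not sufficient on its own.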

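So I do not believe gc_grace_seconds is the issue here. I ran a manual user-defined compaction on every Data.db file in the column family folder (single Data.db file only, one at a time). The compactions ran, but there was very little change to the tombstone values. The old data still remains.

I can confirm that a repair ran yesterday. I can also confirm that repairs have been running regularly, with no issues showing in the logs.

So repairs are fine. Compactions are fine. All I can think of is overlapping SSTables.

The final test was to run a full compaction on the column family. I performed a user-defined compaction (not nodetool compact) on the 3 SSTables using JMXterm. This resulted in a single SSTable file, with the following: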

Estimated droppable tombstones: 9.89886650537452E-6

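If I look for the example EPOCH from above (1566793260), it is not visible. Nor is the key. So it was compacted out, or Cassandra did something. The total number of lines containing a tombstone ("d") flag is 1317 out of the 120 million line dump, and the EPOCH values are all within 10 days. Good.

So I assume the E-6 value is a very small percentage and sstablemetadata is having trouble displaying it. So, success, right? But it took a full compaction to remove the old tombstones. As far as I am aware, a full compaction is only a last-ditch manoeuvre.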
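On the notation: the E-6 suffix is scientific notation, so 9.89886650537452E-6 means 9.89886650537452 x 10^-6, a ratio of roughly 0.00001 (about 0.001%), rather than a display problem. A small Python sketch, using only the values quoted in this post, parses the sstablemetadata lines and compares each ratio with the 0.2 tombstone_threshold set earlier:

```python
TOMBSTONE_THRESHOLD = 0.2  # the tombstone_threshold configured for the test

# Lines as printed by sstablemetadata in this post
lines = [
    "Estimated droppable tombstones: 0.3514636277302944",
    "Estimated droppable tombstones: 0.0",
    "Estimated droppable tombstones: 6.007563159628437E-5",
    "Estimated droppable tombstones: 9.89886650537452E-6",
]


def droppable_ratio(line):
    """Parse the ratio after the colon; float() understands E notation."""
    return float(line.split(":", 1)[1])


ratios = [droppable_ratio(line) for line in lines]
for ratio in ratios:
    side = "above" if ratio > TOMBSTONE_THRESHOLD else "below"
    print(f"{ratio:.10f} ({ratio * 100:.6f}%) is {side} the threshold")
```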

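My questions are:

How can I determine if overlapping sstables is my issue here? I can't see any other reason why the data would not compact out unless it is overlapping related.
How can I resolve overlapping sstables without performing a full compaction? I am afraid this is simply going to reoccur in a few weeks' time. I don't want to get stuck having to perform full compactions regularly to keep tombstones at bay.
What are the reasons for the creation of overlapping sstables? Is this a data design problem, or some other issue?

Cheers.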

Answer 1

To answer your questions:

How can I determine if overlapping sstables is my issue here? I can't see any other reason why the data would not compact out unless it is overlapping related.

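If the tombstones were not generated by TTL, then more often than not the tombstones and the data they shadow end up in different sstables. With STCS on a cluster that has a low write volume, compactions are triggered rarely, which causes tombstones to stay around for a long time. If you have the partition key of a tombstone, running nodetool getsstables -- <keyspace> <table> <key> on a node will return all the sstables on that node that contain the key. You can dump the sstable contents to confirm.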
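To make the overlap problem concrete, here is a simplified Python model of the check compaction performs before dropping a tombstone. It is a sketch under assumptions, not Cassandra's code: the `SSTable` class and its fields are hypothetical, and the real implementation consults bloom filters and per-sstable minimum timestamps rather than exact key sets. The idea is that a tombstone past gc_grace_seconds can only be dropped if no sstable outside the current compaction may hold older data for the same partition.

```python
from dataclasses import dataclass, field


@dataclass
class SSTable:
    """Hypothetical stand-in for an sstable: its keys and oldest write timestamp."""
    name: str
    keys: set = field(default_factory=set)
    min_timestamp: int = 0  # oldest write timestamp in the file (microseconds)


def can_drop_tombstone(key, tombstone_ts, compacting, all_sstables, past_gc_grace=True):
    """Return True if the tombstone may be purged by this compaction.

    The tombstone must be past gc_grace_seconds, and every sstable that is NOT
    part of the compaction but contains the key must hold only data newer than
    the tombstone. Otherwise dropping it could resurrect the shadowed data.
    """
    if not past_gc_grace:
        return False
    compacting_names = {s.name for s in compacting}
    for sstable in all_sstables:
        if sstable.name in compacting_names:
            continue
        if key in sstable.keys and sstable.min_timestamp <= tombstone_ts:
            return False  # an overlapping sstable may still hold shadowed data
    return True


if __name__ == "__main__":
    old_data = SSTable("a-1-Data.db", {"k1"}, min_timestamp=100)
    tombstones = SSTable("a-2-Data.db", {"k1"}, min_timestamp=500)
    everything = [old_data, tombstones]
    # Single-sstable compaction of the tombstone file: blocked by the overlap
    print(can_drop_tombstone("k1", 500, [tombstones], everything))
    # Compacting both files together: the tombstone can go
    print(can_drop_tombstone("k1", 500, everything, everything))
```

This is consistent with what the question describes: single-file user-defined compactions left the tombstones in place, while compacting all three sstables together removed them.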

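How can I resolve overlapping sstables, without performing a full compaction? I am afraid this is simply going to reoccur in a few weeks time. I don't want to get stuck having to perform full compactions regularly to keep tombstones at bay.

There is a new option, "nodetool compact -s", which performs a major compaction and splits the output into 4 sstables of different sizes. This solves the problem with the earlier major compaction, which created a single large sstable. If the droppable tombstone ratio is as high as 80-90%, the resulting sstables will be even smaller, since the majority of the tombstones will have been purged.

In newer versions of Cassandra (3.10+), there is a new tool, nodetool garbagecollect, to clean up tombstones. However, this tool has limitations: not all kinds of tombstones can be removed by it.

In summary, for a situation with overlapping sstables and a low volume of activity (and therefore infrequent compactions), you either have to find all the related sstables and run a user-defined compaction on them, or run a major compaction with "-s". https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/tools/nodetool/toolsCompact.html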
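For the user-defined route, a minimal Python sketch of how the JMXterm command line could be assembled once the related sstables are known (for example from nodetool getsstables). The MBean name and the forceUserDefinedCompaction operation, as well as the jmxterm `run -b` syntax, are written from memory and should be checked against your Cassandra and JMXterm versions; the file names in the usage line are hypothetical:

```python
# Assumed MBean name for Cassandra's compaction manager; verify on your version
COMPACTION_MBEAN = "org.apache.cassandra.db:type=CompactionManager"


def user_defined_compaction_command(data_files):
    """Build a jmxterm 'run' line that compacts the given Data.db files together."""
    files = [str(f) for f in data_files]
    if not files:
        raise ValueError("at least one Data.db file is required")
    bad = [f for f in files if not f.endswith("-Data.db")]
    if bad:
        raise ValueError(f"not sstable data files: {bad}")
    # The operation is assumed to take one comma-separated string of files
    return f"run -b {COMPACTION_MBEAN} forceUserDefinedCompaction {','.join(files)}"


if __name__ == "__main__":
    # Hypothetical file names, for illustration only
    print(user_defined_compaction_command(["la-1-big-Data.db", "la-2-big-Data.db"]))
```

Passing all the overlapping files in one call matters: compacting them one at a time, as tried in the question, leaves each tombstone blocked by the files that were not included.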

What are the reasons for the creation of overlapping sstables? Is this a data design problem, or some other issue?

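Fast-growing tombstones usually indicate a data modelling problem: check whether the application is inserting nulls, periodically deleting data, or using collections and overwriting them instead of appending. If your data is time series, check whether it makes sense to use TTL and TWCS.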


cassandra

tombstone