Cassandra: Finding partition keys
We are currently testing Cassandra with the following table schema:
CREATE TABLE coreglead_v2.stats_by_site_user (
d_tally text, -- ex.: '2016-01', '2016-02', etc..
site_id int,
d_date timestamp,
site_user_id int,
accepted counter,
error counter,
impressions_negative counter,
impressions_positive counter,
rejected counter,
revenue counter,
reversals_rejected counter,
reversals_revenue counter,
PRIMARY KEY (d_tally, site_id, d_date, site_user_id)
) WITH CLUSTERING ORDER BY (site_id ASC, d_date ASC, site_user_id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
For testing purposes we wrote a Python script that randomizes data across the 2016 calendar year (12 months in total). We expect our partition key to be the d_tally column, and accordingly we expect the number of keys to be 12 (from '2016-01' through '2016-12').
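The original script isn't shown, but a minimal sketch of this kind of load generator might look like the following (the contact point, value ranges, and column choices are illustrative assumptions; it relies on the cassandra-driver package):

    # Minimal load-generator sketch (assumptions: local node at 127.0.0.1,
    # illustrative value ranges). Requires: pip install cassandra-driver
    import random
    from datetime import datetime
    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('coreglead_v2')

    # Counter columns can only be changed via UPDATE ... SET col = col + n;
    # the partition key d_tally is the 'YYYY-MM' month bucket.
    update = session.prepare("""
        UPDATE stats_by_site_user
        SET accepted = accepted + 1,
            impressions_positive = impressions_positive + 1,
            revenue = revenue + ?
        WHERE d_tally = ? AND site_id = ? AND d_date = ? AND site_user_id = ?
    """)

    for _ in range(1000):  # scale this up for a multi-million-write run
        month = random.randint(1, 12)
        d_date = datetime(2016, month, random.randint(1, 28))
        session.execute(update, (random.randint(1, 2000),   # revenue delta
                                 '2016-%02d' % month,       # d_tally
                                 random.randint(1, 5),      # site_id
                                 d_date,
                                 random.randint(1, 2000000)))  # site_user_id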
Running nodetool cfstats shows us the following:
Table: stats_by_site_user
SSTable count: 4
Space used (live): 131977793
Space used (total): 131977793
Space used by snapshots (total): 0
Off heap memory used (total): 89116
SSTable Compression Ratio: 0.18667406304929424
Number of keys (estimate): 24
Memtable cell count: 120353
Memtable data size: 23228804
Memtable off heap memory used: 0
Memtable switch count: 10
Local read count: 169
Local read latency: 1.938 ms
Local write count: 4912464
Local write latency: 0.066 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 128
Bloom filter off heap memory used: 96
Index summary off heap memory used: 76
Compression metadata off heap memory used: 88944
Compacted partition minimum bytes: 5839589
Compacted partition maximum bytes: 43388628
Compacted partition mean bytes: 16102786
Average live cells per slice (last five minutes): 102.91627247589237
Maximum live cells per slice (last five minutes): 103
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
What confuses us is the "Number of keys (estimate): 24" line. Looking at our schema, and given that our test data (more than 5 million writes) consists solely of 2016 data, where does the estimate of 24 keys come from?
Here is a sample of our data:
d_tally | site_id | d_date | site_user_id | accepted | error | impressions_negative | impressions_positive | rejected | revenue | reversals_rejected | reversals_revenue
---------+---------+--------------------------+--------------+----------+-------+----------------------+----------------------+----------+---------+--------------------+-------------------
2016-01 | 1 | 2016-01-01 00:00:00+0000 | 240054 | 1 | null | null | 1 | null | 553 | null | null
2016-01 | 1 | 2016-01-01 00:00:00+0000 | 1263968 | 1 | null | null | 1 | null | 1093 | null | null
2016-01 | 1 | 2016-01-01 00:00:00+0000 | 1267841 | 1 | null | null | 1 | null | 861 | null | null
2016-01 | 1 | 2016-01-01 00:00:00+0000 | 1728725 | 1 | null | null | 1 | null | 425 | null | null
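As a sanity check (a hedged sketch, not part of the original post), CQL can list the distinct partition keys directly, which should confirm that only 12 distinct d_tally values actually exist:

    # Listing distinct partition keys (assumes a local node and the
    # cassandra-driver package).
    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('coreglead_v2')
    rows = session.execute('SELECT DISTINCT d_tally FROM stats_by_site_user')
    print(sorted(row.d_tally for row in rows))  # expect ['2016-01', ..., '2016-12']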
The number of keys is an estimate (although it should be quite close). It takes a sketch of the data from each SSTable and merges them together to estimate the cardinality (HyperLogLog). Unfortunately no equivalent exists for the memtable, so the memtable's cardinality is added on top of the SSTable estimate. This means partitions that live in both the memtables and the SSTables are double counted. That is why you see 24 instead of 12.
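A toy model of that arithmetic (Cassandra's real implementation is in Java with its own HyperLogLog library; this sketch just uses the datasketch package to show why merging deduplicates across SSTables but not the memtable):

    # Toy model: 4 SSTable sketches over the same 12 partitions merge to ~12,
    # but the memtable's count is added unmerged, giving ~24.
    # Requires: pip install datasketch
    from datasketch import HyperLogLog

    months = ['2016-%02d' % m for m in range(1, 13)]  # the 12 d_tally values

    merged = HyperLogLog()
    for _ in range(4):  # "SSTable count: 4"
        sstable_sketch = HyperLogLog()
        for key in months:
            sstable_sketch.update(key.encode('utf8'))
        merged.merge(sstable_sketch)  # HLL merge deduplicates across SSTables

    sstable_estimate = merged.count()  # ~12
    memtable_partitions = 12           # no sketch here, so it is added as-is
    print(round(sstable_estimate) + memtable_partitions)  # ~24

If the memtable is flushed (nodetool flush), its contents land in a new SSTable with its own sketch, so after a flush the estimate should drop back toward 12.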