Using SAMPLE in ClickHouse seems to read all rows and more bytes. Is this expected or due to sub-optimal table definition?

I was hoping to trade accuracy for speed by using SAMPLE. Some queries do run faster, but query_log shows that all rows are still being read and that more bytes are read than without SAMPLE. I thought SAMPLE would cause fewer rows to be read.

My questions:

  1. Is using SAMPLE expected to cause more rows and bytes to be read?

  2. Is the increased reading due to a sub-optimal table definition that can be corrected?

I am using ClickHouse version 20.5.3 revision 54435.

Table definition:

CREATE TABLE default.table_one
(
  `datestamp` Date,
  `timestamp` UInt64,
  `sample_hash` UInt64,
  `value` UInt32,
  ...
)
ENGINE = MergeTree()
PARTITION BY datestamp
ORDER BY (datestamp, timestamp, sample_hash)
SAMPLE BY sample_hash
SETTINGS index_granularity = 8192

Query without SAMPLE

SELECT
  avg(value)
FROM default.table_one;

query_duration_ms: 166
rows_read: 100,000,000
read_bytes: 800,000,000

Query with SAMPLE

SELECT
  avg(value)
FROM default.table_one
SAMPLE 0.1;

query_duration_ms: 358
rows_read: 100,000,000
read_bytes: 1,600,000,000
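
These numbers come from system.query_log, pulled roughly like this (a sketch; it assumes query logging is enabled and matches on the query text):

SELECT
  query,
  query_duration_ms,
  read_rows,
  read_bytes
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query LIKE '%FROM default.table_one%'
ORDER BY event_time DESC
LIMIT 10;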

Short answer: yes, it is expected. CH has to read one extra column, sample_hash.
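
To see where the extra bytes come from: SAMPLE 0.1 is, roughly speaking, turned into a range condition on the sampling expression, so sample_hash must be read for every row. A sketch of an approximately equivalent query (not the literal plan CH builds):

select avg(value)
from default.table_one
where sample_hash < toUInt64(0.1 * 18446744073709551615);

The extra ~800 MB in read_bytes matches the 8 bytes of sample_hash per row over 100 million rows.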

Long answer: sampling is hard. It is useful if you have 100 billion rows per day and 400 servers. It helps a lot with GROUP BYs. It does not help with filtering, because in your case it cannot be used together with the primary index. Yandex designed sampling for themselves. They enforce partition-key / primary-key usage (force_index_by_date / force_primary_key), so a query like yours is simply impossible on their systems, and that is why sampling helps them even in terms of disk reads.
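
For illustration, those guard rails are just settings; with them enabled, a full-scan query is rejected with an exception instead of being executed (a sketch; the exact error text depends on the version):

set force_index_by_date = 1, force_primary_key = 1;

select avg(value) from default.table_one;
-- throws an exception, because neither the partition key (datestamp)
-- nor the primary key is used to prune any data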

That is why I do not use sampling on my systems.

But

ORDER BY (datestamp, timestamp, sample_hash)

Also, an ORDER BY like this is of no use at all. The whole table is designed wrong. Putting datestamp at the front of the index makes no sense, because the table is partitioned by datestamp, so every partition contains only a single datestamp value.

timestamp in the index prefix is an even bigger problem, because putting a highly cardinal column at the start of the primary index is very unwise.
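
You can check the one-datestamp-per-partition point directly against the part metadata; a sketch (min_date / max_date in system.parts are populated here because the partition key is a Date column):

select partition, min(min_date), max(max_date)
from system.parts
where database = 'default' and table = 'table_one' and active
group by partition
order by partition;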

So. I could build a synthetic example and show how sampling works. But is there any point?

CREATE TABLE table_one
( timestamp UInt64,
  transaction_id UInt64,
  banner_id UInt16,
  value UInt32
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(toDateTime(timestamp))
ORDER BY (banner_id, toStartOfHour(toDateTime(timestamp)),  cityHash64(transaction_id))
SAMPLE BY cityHash64(transaction_id)
SETTINGS index_granularity = 8192


insert into table_one select 1602809234+intDiv(number,100000), number, number%991, toUInt32(rand())
from numbers(10000000000);



select banner_id, sum(value), count(value), max(value)
from table_one 
group by banner_id format Null;

0 rows in set. Elapsed: 11.490 sec. Processed 10.00 billion rows, 60.00 GB (870.30 million rows/s., 5.22 GB/s.)

select banner_id, sum(value), count(value), max(value)
from table_one SAMPLE 0.01
group by banner_id format Null;

0 rows in set. Elapsed: 1.316 sec. Processed 452.67 million rows, 6.34 GB (343.85 million rows/s., 4.81 GB/s.)



select banner_id, sum(value), count(value), max(value)
from table_one 
WHERE banner_id = 42
group by banner_id format Null;

0 rows in set. Elapsed: 0.020 sec. Processed 10.30 million rows, 61.78 MB (514.37 million rows/s., 3.09 GB/s.)

select banner_id, sum(value), count(value), max(value)
from table_one SAMPLE 0.01
WHERE banner_id = 42
group by banner_id format Null;

0 rows in set. Elapsed: 0.008 sec. Processed 696.32 thousand rows, 9.75 MB (92.49 million rows/s., 1.29 GB/s.)




select banner_id, toStartOfHour(toDateTime(timestamp)) hr, sum(value), count(value), max(value)
from table_one 
group by banner_id, hr format Null;
0 rows in set. Elapsed: 36.660 sec. Processed 10.00 billion rows, 140.00 GB (272.77 million rows/s., 3.82 GB/s.)

select banner_id, toStartOfHour(toDateTime(timestamp)) hr, sum(value), count(value), max(value)
from table_one SAMPLE 0.01
group by banner_id, hr format Null;
0 rows in set. Elapsed: 3.741 sec. Processed 452.67 million rows, 9.96 GB (121.00 million rows/s., 2.66 GB/s.)




select count()
from table_one 
where value = 666 format Null;
1 rows in set. Elapsed: 6.056 sec. Processed 10.00 billion rows, 40.00 GB (1.65 billion rows/s., 6.61 GB/s.)

select count()
from table_one  SAMPLE 0.01
where value = 666 format Null;
1 rows in set. Elapsed: 1.214 sec. Processed 452.67 million rows, 5.43 GB (372.88 million rows/s., 4.47 GB/s.)

The hard part: below is an example of how a high-cardinality column in the primary index affects this. Same table, same data, but instead of

ORDER BY (banner_id, toStartOfHour(toDateTime(timestamp)), cityHash64(transaction_id))

I used

ORDER BY (banner_id, timestamp, cityHash64(transaction_id))

CREATE TABLE table_one
( timestamp UInt64,
  transaction_id UInt64,
  banner_id UInt16,
  value UInt32
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(toDateTime(timestamp))
ORDER BY (banner_id, timestamp, cityHash64(transaction_id))
SAMPLE BY cityHash64(transaction_id)
SETTINGS index_granularity = 8192

insert into table_one select 1602809234+intDiv(number,100000), number, number%991, toUInt32(rand())
from numbers(10000000000);



select banner_id, sum(value), count(value), max(value)
from table_one 
group by banner_id format Null;
0 rows in set. Elapsed: 11.196 sec. Processed 10.00 billion rows, 60.00 GB (893.15 million rows/s., 5.36 GB/s.)

select banner_id, sum(value), count(value), max(value)
from table_one SAMPLE 0.01
group by banner_id format Null;
0 rows in set. Elapsed: 24.378 sec. Processed 10.00 billion rows, 140.00 GB (410.21 million rows/s., 5.74 GB/s.)



select banner_id, sum(value), count(value), max(value)
from table_one 
WHERE banner_id = 42
group by banner_id format Null;
0 rows in set. Elapsed: 0.022 sec. Processed 10.27 million rows, 61.64 MB (459.28 million rows/s., 2.76 GB/s.)

select banner_id, sum(value), count(value), max(value)
from table_one SAMPLE 0.01
WHERE banner_id = 42
group by banner_id format Null;
0 rows in set. Elapsed: 0.037 sec. Processed 10.27 million rows, 143.82 MB (275.16 million rows/s., 3.85 GB/s.)



select banner_id, toStartOfHour(toDateTime(timestamp)) hr, sum(value), count(value), max(value)
from table_one 
group by banner_id, hr format Null;
0 rows in set. Elapsed: 21.663 sec. Processed 10.00 billion rows, 140.00 GB (461.62 million rows/s., 6.46 GB/s.)


select banner_id, toStartOfHour(toDateTime(timestamp)) hr, sum(value), count(value), max(value)
from table_one SAMPLE 0.01
group by banner_id, hr format Null;
0 rows in set. Elapsed: 26.697 sec. Processed 10.00 billion rows, 220.00 GB (374.57 million rows/s., 8.24 GB/s.)



select count()
from table_one 
where value = 666 format Null;
0 rows in set. Elapsed: 7.679 sec. Processed 10.00 billion rows, 40.00 GB (1.30 billion rows/s., 5.21 GB/s.)

select count()
from table_one  SAMPLE 0.01
where value = 666 format Null;
0 rows in set. Elapsed: 21.668 sec. Processed 10.00 billion rows, 120.00 GB (461.51 million rows/s., 5.54 GB/s.)

The highly cardinal column timestamp makes it impossible to use a range lookup in the index against cityHash64(transaction_id). CH has to read a 0.01 slice out of every mark, so every mark is still touched. This is expected behaviour, and the same would be true for any database or any sorted list.
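
On newer ClickHouse releases (EXPLAIN with indexes = 1 appeared after 20.5) you can see this directly in how many granules survive index analysis; a sketch:

explain indexes = 1
select count()
from table_one SAMPLE 0.01
where banner_id = 42;
-- with ORDER BY (banner_id, toStartOfHour(...), cityHash64(transaction_id))
-- the PrimaryKey step keeps only a small fraction of the granules;
-- with ORDER BY (banner_id, timestamp, cityHash64(transaction_id))
-- it keeps practically all granules for banner_id = 42, because the sampling
-- range on cityHash64(transaction_id) sits behind a near-unique timestamp.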

Now CH reads all rows with SAMPLE 0.01 just as it does without sampling.