How to efficiently join two huge tables by nearest timestamp?

Question

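I have two huge tables, A and B. Table A has around 500 million rows of time-series data. Table B has around 10 million rows of time-series data. To simplify, we can assume they consist of the following columns:

Table A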

factory	machine	timestamp_1	part	supplement
1	1	2022-01-01 23:54:01	1	1
1	1	2022-01-01 23:54:05	1	2
1	1	2022-01-01 23:54:10	1	3
...	...	...	...	...

Table B

machine	timestamp_2	measure
1	2022-01-01 23:54:00	0
1	2022-01-01 23:54:07	10
1	2022-01-01 23:54:08	0
...	...	...

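I want to create a table C that is the result of "joining" the two tables: each value of timestamp_1 in table A is matched to the closest value of timestamp_2 in table B for which measure is 0, and the match must also be for the same factory and machine. I only need this for the rows of table A where part = 1. For the small example above, the resulting table C would have the same number of rows as A and would look like: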

Table C

machine	timestamp_1	time_since_measure_0
1	2022-01-01 23:54:01	1
1	2022-01-01 23:54:05	5
1	2022-01-01 23:54:10	2
...	...	...

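Some important things that also need to be taken into account:

Table A has an index on columns (factory, machine, timestamp_1, part, supplement). That index is very useful for other queries unrelated to this one. Table B has an index on columns (machine, timestamp_2, measure).
Table A is a compressed TimescaleDB hypertable partitioned by (factory, timestamp_1). This is also because of other queries. Table B is a vanilla PostgreSQL table.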

I created table C with the following statement:

create table C (
    machine int4 not null,
    timestamp_1 timestamptz,
    time_since_measure_0 interval,
    -- the constraint (and its backing index) needs a name different from the table's
    constraint c_pkey primary key (machine, timestamp_1)
);

Then I tried this code to select the data and insert it into table C:

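insert into C (machine, timestamp_1, time_since_measure_0) (
    select
        -- factory is not selected here because table C has no factory column
        machine,
        timestamp_1,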
        timestamp_1  - (
            select timestamp_2
            from B
            where 
                A.machine = B.machine
                and B.measure = 0 
                and B.timestamp_2 <= A.timestamp_1
            order by B.timestamp_2 desc
            limit 1
        ) as "time_since_measure_0"
    from A
    where A.part = 1
)

However, this seems to take a very long time. I know I am dealing with very large tables, but am I missing something, or how could I optimize this?

Answer 1

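Since we of course don't have access to your tables and you haven't posted the query plan, it's hard to do more than make some general observations. The indexes you describe don't appear to be useful for this query. Looking at your query, it seems to me that you need to add the following indexes: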

Table A
  Index on (machine, timestamp_1)

Table B
  Index on (machine, measure, timestamp_2)

Give that a try and see what happens.
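
As a rough sketch, the two indexes suggested above could be created as shown below, followed by a way to look at the plan on a small slice of A before running the full insert. The index names and the one-day time window are illustrative assumptions, not something taken from your schema, and how a new index behaves on already-compressed TimescaleDB chunks is not covered here.

```sql
-- Hypothetical index names; the column lists are the ones suggested above.
CREATE INDEX a_machine_ts_idx ON A (machine, timestamp_1);
CREATE INDEX b_machine_measure_ts_idx ON B (machine, measure, timestamp_2);

-- Check the plan on a small, assumed time window instead of all 500M rows.
-- Note that EXPLAIN ANALYZE actually executes the query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    a.machine,
    a.timestamp_1,
    a.timestamp_1 - (
        SELECT m.timestamp_2
        FROM B AS m
        WHERE m.machine = a.machine
          AND m.measure = 0
          AND m.timestamp_2 <= a.timestamp_1
        ORDER BY m.timestamp_2 DESC
        LIMIT 1
    ) AS time_since_measure_0
FROM A AS a
WHERE a.part = 1
  AND a.timestamp_1 >= '2022-01-01'
  AND a.timestamp_1 <  '2022-01-02';
```

If the index on B is being used, the plan for the subquery should show an index scan (ideally an index-only scan) on it rather than a sequential scan of B.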

Answer 2

What you want is called an "as-of join": it joins each timestamp to the nearest value in the other table.

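Some time-series databases, such as ClickHouse, support this directly. This is the only way to make it fast. It is quite similar to a merge join, with a few modifications: the engine must scan both tables in timestamp order and join each row to the row with the nearest value instead of the row with an equal value.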
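
For illustration only, this is roughly what such a query could look like in ClickHouse's SQL dialect, assuming tables A and B had been loaded there with the same column names as in the question (which is an assumption, since the question is about PostgreSQL/TimescaleDB). It is a sketch of the syntax, not a tested migration.

```sql
-- ClickHouse dialect, not PostgreSQL.
-- The ON clause has an equality part (machine) and one inequality,
-- which selects the closest B row at or before each A row.
SELECT
    a.machine,
    a.timestamp_1,
    dateDiff('second', b.timestamp_2, a.timestamp_1) AS seconds_since_measure_0
FROM A AS a
ASOF LEFT JOIN
(
    -- Only rows with measure = 0 are candidates for the match.
    SELECT machine, timestamp_2
    FROM B
    WHERE measure = 0
) AS b
ON a.machine = b.machine AND a.timestamp_1 >= b.timestamp_2
WHERE a.part = 1;
```

With a LEFT join, rows of A that have no earlier measure = 0 row in B are still returned; by default ClickHouse fills the missing side with default values rather than NULL unless join_use_nulls is enabled, so those rows would need separate handling.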

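I've looked into it briefly, and TimescaleDB doesn't seem to support it, but this post shows a workaround using a lateral join and a covering index. This is likely to have similar performance to your query, since it will use a nested loop and an index-only scan to pick the nearest value for each row.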
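
A minimal sketch of the lateral-join form, written against the tables from the question. It is not taken from the linked post. It assumes B has an index that covers the lookup, for example on (machine, measure, timestamp_2), and the aliases a, m and nearest are just illustrative names.

```sql
-- Assumes a covering index on B, e.g. on (machine, measure, timestamp_2),
-- so that each lookup can be answered by an index-only scan.
INSERT INTO C (machine, timestamp_1, time_since_measure_0)
SELECT
    a.machine,
    a.timestamp_1,
    a.timestamp_1 - nearest.timestamp_2
FROM A AS a
LEFT JOIN LATERAL (
    -- Most recent measure = 0 row at or before this row of A.
    SELECT m.timestamp_2
    FROM B AS m
    WHERE m.machine = a.machine
      AND m.measure = 0
      AND m.timestamp_2 <= a.timestamp_1
    ORDER BY m.timestamp_2 DESC
    LIMIT 1
) AS nearest ON true
WHERE a.part = 1;
```

The LEFT JOIN ... ON true keeps rows of A that have no matching row in B, with a NULL interval, which matches the behaviour of the scalar subquery in the question.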


postgresql

timescaledb