Strange behaviour with a CTE involving two joins
This post has been completely rewritten to make the problem easier to understand.
Setup
PostgreSQL 9.5 running on Ubuntu Server 14.04 LTS.
Data model
I have dataset tables in which I store data (time series) separately; all of these tables must share the same structure:
CREATE TABLE IF NOT EXISTS %s(
Id SERIAL NOT NULL,
ChannelId INTEGER NOT NULL,
GranulityIdIn INTEGER,
GranulityId INTEGER NOT NULL,
TimeValue TIMESTAMP NOT NULL,
FloatValue FLOAT DEFAULT(NULL),
Status BIGINT DEFAULT(NULL),
QualityCodeId INTEGER NOT NULL,
DataArray FLOAT[] DEFAULT(NULL),
DataCount BIGINT DEFAULT(NULL),
Performance FLOAT DEFAULT(NULL),
StepCount INTEGER NOT NULL DEFAULT(0),
TableRegClass regclass NOT NULL,
Updated TIMESTAMP NOT NULL,
Tags TEXT[] DEFAULT(NULL),
--
CONSTRAINT PK_%s PRIMARY KEY(Id),
CONSTRAINT FK_%s_Channel FOREIGN KEY(ChannelId) REFERENCES scientific.Channel(Id),
CONSTRAINT FK_%s_GranulityIn FOREIGN KEY(GranulityIdIn) REFERENCES quality.Granulity(Id),
CONSTRAINT FK_%s_Granulity FOREIGN KEY(GranulityId) REFERENCES quality.Granulity(Id),
CONSTRAINT FK_%s_QualityCode FOREIGN KEY(QualityCodeId) REFERENCES quality.QualityCode(Id),
CONSTRAINT UQ_%s UNIQUE(QualityCodeId, ChannelId, GranulityId, TimeValue)
);
CREATE INDEX IDX_%s_Channel ON %s USING btree(ChannelId);
CREATE INDEX IDX_%s_Quality ON %s USING btree(QualityCodeId);
CREATE INDEX IDX_%s_Granulity ON %s USING btree(GranulityId) WHERE GranulityId > 2;
CREATE INDEX IDX_%s_TimeValue ON %s USING btree(TimeValue);
This definition comes from a FUNCTION, so %s stands for the dataset name. The UNIQUE constraint ensures that there cannot be duplicate records within a given dataset. A record in such a dataset is a value (floatvalue) for a given channel (channelid), sampled at a given time (timevalue) over a given time interval (granulityid), with a given quality (qualitycodeid). Whatever the value is, there cannot be duplicates of (channelid, timevalue, granulityid, qualitycodeid).
Records in a dataset look like this:
1;25;;1;"2015-01-01 00:00:00";0.54;160;6;"";;;0;"datastore.rtu";"2016-05-07 16:38:29.28106";""
2;25;;1;"2015-01-01 00:30:00";0.49;160;6;"";;;0;"datastore.rtu";"2016-05-07 16:38:29.28106";""
3;25;;1;"2015-01-01 01:00:00";0.47;160;6;"";;;0;"datastore.rtu";"2016-05-07 16:38:29.28106";""
I also have another satellite table in which I store the significant figures for each channel; this parameter can change over time. I store it as follows:
CREATE TABLE SVPOLFactor (
Id SERIAL NOT NULL,
ChannelId INTEGER NOT NULL,
StartTimestamp TIMESTAMP NOT NULL,
Factor FLOAT NOT NULL,
UnitsId VARCHAR(8) NOT NULL,
--
CONSTRAINT PK_SVPOLFactor PRIMARY KEY(Id),
CONSTRAINT FK_SVPOLFactor_Units FOREIGN KEY(UnitsId) REFERENCES Units(Id),
CONSTRAINT UQ_SVPOLFactor UNIQUE(ChannelId, StartTimestamp)
);
When significant figures are defined for a channel, a row is added to this table; the factor then applies from that date on. The first record for a channel always has the sentinel value '-infinity'::TIMESTAMP, meaning that the factor applies from the beginning of time. Subsequent rows must have properly defined start values. If there are no rows for a given channel, the significant-figure factor is taken to be unity.
Records in this table look like this:
123;277;"-infinity";0.1;"_C"
124;1001;"-infinity";0.01;"-"
125;1001;"2014-03-01 00:00:00";0.1;"-"
126;1001;"2014-06-01 00:00:00";1;"-"
127;1001;"2014-09-01 00:00:00";10;"-"
5001;5181;"-infinity";0.1;"ug/m3"
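The intended lookup, then, is "the latest factor whose StartTimestamp is not after the sample time, defaulting to 1 when a channel has no rows". A minimal Python sketch of that rule (my illustration, not part of the original post; None stands in for '-infinity'::TIMESTAMP):

```python
from datetime import datetime

# Rows shaped like SVPOLFactor: (ChannelId, StartTimestamp, Factor).
# None stands in for '-infinity'::TIMESTAMP (applies from the beginning).
SVPOL_FACTORS = [
    (1001, None, 0.01),
    (1001, datetime(2014, 3, 1), 0.1),
    (1001, datetime(2014, 6, 1), 1.0),
    (1001, datetime(2014, 9, 1), 10.0),
    (277, None, 0.1),
]

def factor_at(channel_id, t, rows=SVPOL_FACTORS):
    """Latest factor whose StartTimestamp <= t; 1.0 if the channel has no rows."""
    applicable = [
        (start, factor)
        for cid, start, factor in rows
        if cid == channel_id and (start is None or start <= t)
    ]
    if not applicable:
        return 1.0  # no row: the significant-figure factor is unity
    # None ('-infinity') sorts first, so the last element has the greatest start.
    applicable.sort(key=lambda r: (r[0] is not None, r[0] or datetime.min))
    return applicable[-1][1]
```

For example, `factor_at(1001, datetime(2014, 4, 15))` picks the 2014-03-01 row (0.1), and a channel with no rows at all falls back to 1.0.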
Goal
My goal is to perform a comparison audit between two datasets filled by different processes. To achieve it, I must:
- compare records between the datasets and assess their differences;
- check whether the difference between similar records falls within the significant figures.
For this purpose, I wrote the following query, whose behaviour I do not understand:
WITH
-- Join records before records (regard to uniqueness constraint) from datastore templated tables in order to make audit comparison:
S0 AS (
SELECT
A.ChannelId
,A.GranulityIdIn AS gidInRef
,B.GranulityIdIn AS gidInAudit
,A.GranulityId AS GranulityId
,A.QualityCodeId
,A.TimeValue
,A.FloatValue AS xRef
,B.FloatValue AS xAudit
,A.StepCount AS scRef
,B.StepCount AS scAudit
,A.DataCount AS dcRef
,B.DataCount AS dcAudit
,round(A.Performance::NUMERIC, 4) AS pRef
,round(B.Performance::NUMERIC, 4) AS pAudit
FROM
datastore.rtu AS A JOIN datastore.audit0 AS B USING(ChannelId, GranulityId, QualityCodeId, TimeValue)
),
-- Join before SVPOL factors in order to determine decimal factor applied to records:
S1 AS (
SELECT
DISTINCT ON(ChannelId, TimeValue)
S0.*
,SF.Factor::NUMERIC AS svpolfactor
,COALESCE(-log(SF.Factor), 0)::INTEGER AS k
FROM
S0 LEFT JOIN settings.SVPOLFactor AS SF ON ((S0.ChannelId = SF.ChannelId) AND (SF.StartTimestamp <= S0.TimeValue))
ORDER BY
ChannelId, TimeValue, StartTimestamp DESC
),
-- Audit computation:
S2 AS (
SELECT
S1.*
,xaudit - xref AS dx
,(xaudit - xref)/NULLIF(xref, 0) AS rdx
,round(xaudit*pow(10, k))*pow(10, -k) AS xroundfloat
,round(xaudit::NUMERIC, k) AS xroundnum
,0.5*pow(10, -k) AS epsilon
FROM S1
)
SELECT
*
,ABS(dx) AS absdx
,ABS(rdx) AS absrdx
,(xroundfloat - xref) AS dxroundfloat
,(xroundnum - xref) AS dxroundnum
,(ABS(dx) - epsilon) AS dxeps
,(ABS(dx) - epsilon)/epsilon AS rdxeps
,(xroundfloat - xroundnum) AS dfround
FROM
S2
ORDER BY
k DESC
,ABS(rdx) DESC
,ChannelId;
The query may be somewhat hard to follow; roughly, I expect it to:
- join the data from the two datasets on the uniqueness constraint, so as to compare similar records and compute their differences (S0);
- for each difference, find the significant figure applicable at that timestamp (the LEFT JOIN in S1);
- compute some other useful statistics (S2 and the final SELECT).
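For reference (my illustration, not from the post): with the sample factor 0.1, the k, epsilon, and rounding columns of S2 work out as follows in Python:

```python
from math import log10

factor = 0.1                      # an SVPOL factor from the sample data
k = int(round(-log10(factor)))    # number of decimal places implied by the factor
epsilon = 0.5 * 10 ** -k          # half a unit in the last significant digit
xaudit = 0.47
xroundfloat = round(xaudit * 10 ** k) * 10 ** -k   # audited value rounded to k decimals
```

So factor 0.1 gives k = 1, epsilon = 0.05, and 0.47 rounds to 0.5, matching the `xroundfloat` expression in the query.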
Problem
When I run the query above, I am missing rows. For example: channelid=123 with granulityid=4 has 12 records in each table (datastore.rtu and datastore.audit0), but when I run the whole query and store the result in a MATERIALIZED VIEW, fewer than 12 of these rows are present. I started investigating to understand why records were missing, and I ran into a strange behaviour of the WHERE clause. Running EXPLAIN ANALYZE on the query gives:
"Sort (cost=332212.76..332212.77 rows=1 width=232) (actual time=6042.736..6157.235 rows=61692 loops=1)"
" Sort Key: s2.k DESC, (abs(s2.rdx)) DESC, s2.channelid"
" Sort Method: external merge Disk: 10688kB"
" CTE s0"
" -> Merge Join (cost=0.85..332208.25 rows=1 width=84) (actual time=20.408..3894.071 rows=63635 loops=1)"
" Merge Cond: ((a.qualitycodeid = b.qualitycodeid) AND (a.channelid = b.channelid) AND (a.granulityid = b.granulityid) AND (a.timevalue = b.timevalue))"
" -> Index Scan using uq_rtu on rtu a (cost=0.43..289906.29 rows=3101628 width=52) (actual time=0.059..2467.145 rows=3102319 loops=1)"
" -> Index Scan using uq_audit0 on audit0 b (cost=0.42..10305.46 rows=98020 width=52) (actual time=0.049..108.138 rows=98020 loops=1)"
" CTE s1"
" -> Unique (cost=4.37..4.38 rows=1 width=148) (actual time=4445.865..4509.839 rows=61692 loops=1)"
" -> Sort (cost=4.37..4.38 rows=1 width=148) (actual time=4445.863..4471.002 rows=63635 loops=1)"
" Sort Key: s0.channelid, s0.timevalue, sf.starttimestamp DESC"
" Sort Method: external merge Disk: 5624kB"
" -> Hash Right Join (cost=0.03..4.36 rows=1 width=148) (actual time=4102.842..4277.641 rows=63635 loops=1)"
" Hash Cond: (sf.channelid = s0.channelid)"
" Join Filter: (sf.starttimestamp <= s0.timevalue)"
" -> Seq Scan on svpolfactor sf (cost=0.00..3.68 rows=168 width=20) (actual time=0.013..0.083 rows=168 loops=1)"
" -> Hash (cost=0.02..0.02 rows=1 width=132) (actual time=4102.002..4102.002 rows=63635 loops=1)"
" Buckets: 65536 (originally 1024) Batches: 2 (originally 1) Memory Usage: 3841kB"
" -> CTE Scan on s0 (cost=0.00..0.02 rows=1 width=132) (actual time=20.413..4038.078 rows=63635 loops=1)"
" CTE s2"
" -> CTE Scan on s1 (cost=0.00..0.07 rows=1 width=168) (actual time=4445.910..4972.832 rows=61692 loops=1)"
" -> CTE Scan on s2 (cost=0.00..0.05 rows=1 width=232) (actual time=4445.934..5312.884 rows=61692 loops=1)"
"Planning time: 1.782 ms"
"Execution time: 6201.148 ms"
And I know that I should get 67106 rows.
At the time of writing, I know that S0 returns the correct number of rows, so the problem must lie in the subsequent CTEs.
What I find strange is that this query:
EXPLAIN ANALYZE
WITH
S0 AS (
SELECT * FROM datastore.audit0
),
S1 AS (
SELECT
DISTINCT ON(ChannelId, TimeValue)
S0.*
,SF.Factor::NUMERIC AS svpolfactor
,COALESCE(-log(SF.Factor), 0)::INTEGER AS k
FROM
S0 LEFT JOIN settings.SVPOLFactor AS SF ON ((S0.ChannelId = SF.ChannelId) AND (SF.StartTimestamp <= S0.TimeValue))
ORDER BY
ChannelId, TimeValue, StartTimestamp DESC
)
SELECT * FROM S1 WHERE Channelid=123 AND GranulityId=4 -- POST-FILTERING
returns 10 rows:
"CTE Scan on s1 (cost=24554.34..24799.39 rows=1 width=196) (actual time=686.211..822.803 rows=10 loops=1)"
" Filter: ((channelid = 123) AND (granulityid = 4))"
" Rows Removed by Filter: 94890"
" CTE s0"
" -> Seq Scan on audit0 (cost=0.00..2603.20 rows=98020 width=160) (actual time=0.009..26.092 rows=98020 loops=1)"
" CTE s1"
" -> Unique (cost=21215.99..21951.14 rows=9802 width=176) (actual time=590.337..705.070 rows=94900 loops=1)"
" -> Sort (cost=21215.99..21461.04 rows=98020 width=176) (actual time=590.335..665.152 rows=99151 loops=1)"
" Sort Key: s0.channelid, s0.timevalue, sf.starttimestamp DESC"
" Sort Method: external merge Disk: 12376kB"
" -> Hash Left Join (cost=5.78..4710.74 rows=98020 width=176) (actual time=0.143..346.949 rows=99151 loops=1)"
" Hash Cond: (s0.channelid = sf.channelid)"
" Join Filter: (sf.starttimestamp <= s0.timevalue)"
" -> CTE Scan on s0 (cost=0.00..1960.40 rows=98020 width=160) (actual time=0.012..116.543 rows=98020 loops=1)"
" -> Hash (cost=3.68..3.68 rows=168 width=20) (actual time=0.096..0.096 rows=168 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 12kB"
" -> Seq Scan on svpolfactor sf (cost=0.00..3.68 rows=168 width=20) (actual time=0.006..0.045 rows=168 loops=1)"
"Planning time: 0.385 ms"
"Execution time: 846.179 ms"
whereas the next one returns the correct number of rows:
EXPLAIN ANALYZE
WITH
S0 AS (
SELECT * FROM datastore.audit0
WHERE Channelid=123 AND GranulityId=4 -- PRE FILTERING
),
S1 AS (
SELECT
DISTINCT ON(ChannelId, TimeValue)
S0.*
,SF.Factor::NUMERIC AS svpolfactor
,COALESCE(-log(SF.Factor), 0)::INTEGER AS k
FROM
S0 LEFT JOIN settings.SVPOLFactor AS SF ON ((S0.ChannelId = SF.ChannelId) AND (SF.StartTimestamp <= S0.TimeValue))
ORDER BY
ChannelId, TimeValue, StartTimestamp DESC
)
SELECT * FROM S1
with this plan:
"CTE Scan on s1 (cost=133.62..133.86 rows=12 width=196) (actual time=0.580..0.598 rows=12 loops=1)"
" CTE s0"
" -> Bitmap Heap Scan on audit0 (cost=83.26..128.35 rows=12 width=160) (actual time=0.401..0.423 rows=12 loops=1)"
" Recheck Cond: ((channelid = 123) AND (granulityid = 4))"
" Heap Blocks: exact=12"
" -> BitmapAnd (cost=83.26..83.26 rows=12 width=0) (actual time=0.394..0.394 rows=0 loops=1)"
" -> Bitmap Index Scan on idx_audit0_channel (cost=0.00..11.12 rows=377 width=0) (actual time=0.055..0.055 rows=377 loops=1)"
" Index Cond: (channelid = 123)"
" -> Bitmap Index Scan on idx_audit0_granulity (cost=0.00..71.89 rows=3146 width=0) (actual time=0.331..0.331 rows=3120 loops=1)"
" Index Cond: (granulityid = 4)"
" CTE s1"
" -> Unique (cost=5.19..5.28 rows=12 width=176) (actual time=0.576..0.581 rows=12 loops=1)"
" -> Sort (cost=5.19..5.22 rows=12 width=176) (actual time=0.576..0.576 rows=12 loops=1)"
" Sort Key: s0.channelid, s0.timevalue, sf.starttimestamp DESC"
" Sort Method: quicksort Memory: 20kB"
" -> Hash Right Join (cost=0.39..4.97 rows=12 width=176) (actual time=0.522..0.552 rows=12 loops=1)"
" Hash Cond: (sf.channelid = s0.channelid)"
" Join Filter: (sf.starttimestamp <= s0.timevalue)"
" -> Seq Scan on svpolfactor sf (cost=0.00..3.68 rows=168 width=20) (actual time=0.006..0.022 rows=168 loops=1)"
" -> Hash (cost=0.24..0.24 rows=12 width=160) (actual time=0.446..0.446 rows=12 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 6kB"
" -> CTE Scan on s0 (cost=0.00..0.24 rows=12 width=160) (actual time=0.403..0.432 rows=12 loops=1)"
"Planning time: 0.448 ms"
"Execution time: 4.510 ms"
So the problem seems to lie in S1. There are no significant figures defined for channelid = 123, so the LEFT JOIN should not generate extra rows for these records. But that does not explain why some rows go missing.
Questions
- What am I doing wrong in this query?
I use a LEFT JOIN to keep the correct cardinality when fetching the significant figures, so it cannot remove records; everything after that is plain arithmetic.
- How can pre-filtering return more rows than post-filtering?
This sounds buggy to me. Without a WHERE clause, all records (or combinations) are generated (I know that a JOIN is a kind of WHERE clause) and then the computations are performed. When I do not add an extra WHERE (the original query), I miss rows, as in the example. When I add a WHERE clause to filter, the results differ (which could be fine if post-filtering returned more records than pre-filtering, but it is the opposite).
Any constructive answer pointing out my mistakes and my misunderstandings of this query is welcome. Thank you.
These are actually two logically different queries, because DISTINCT ON(ChannelId, TimeValue) ... ORDER BY ChannelId, TimeValue, StartTimestamp and WHERE Channelid=123 AND GranulityId=4 are applied in a different order. Take a look at
create table sample(
distinctkey int,
orderkey int,
valkey int
);
insert into sample (distinctkey,orderkey,valkey)
select 1,10,150
union all
select 1,20,100;
and two queries similar to yours:
select distinctkey, orderkey, valkey
from (
select distinct on(distinctkey) distinctkey, orderkey, valkey
from sample
order by distinctkey, orderkey) t
where distinctkey = 1 and valkey = 100;
returns no rows, while
select distinct on(distinctkey) distinctkey, orderkey, valkey
from (
select distinctkey, orderkey,valkey
from sample
where distinctkey = 1 and valkey = 100) t
order by distinctkey, orderkey;
returns 1 row.
Likewise, your queries may return different numbers of rows depending on the data. You should pick the one logic that matches the task at hand.
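The order-of-operations difference can be reproduced outside the database. This Python sketch (an illustrative analogue, not part of the answer) emulates DISTINCT ON as "sort, then keep the first row per key" and applies the filter either after or before it:

```python
# Rows of the sample table: (distinctkey, orderkey, valkey)
rows = [(1, 10, 150), (1, 20, 100)]

def distinct_on(rows, key, order):
    """Emulate PostgreSQL DISTINCT ON: sort, then keep the first row per key."""
    out, seen = [], set()
    for row in sorted(rows, key=order):
        k = key(row)
        if k not in seen:
            seen.add(k)
            out.append(row)
    return out

order_by = lambda r: (r[0], r[1])            # ORDER BY distinctkey, orderkey
distinct_key = lambda r: r[0]                # DISTINCT ON (distinctkey)
pred = lambda r: r[0] == 1 and r[2] == 100   # WHERE distinctkey = 1 AND valkey = 100

# Post-filtering: DISTINCT ON keeps (1, 10, 150), then WHERE drops it -> no rows.
post_filtered = [r for r in distinct_on(rows, distinct_key, order_by) if pred(r)]
# Pre-filtering: WHERE keeps only (1, 20, 100), which then survives DISTINCT ON.
pre_filtered = distinct_on([r for r in rows if pred(r)], distinct_key, order_by)
```

Exactly as with the two SQL queries, `post_filtered` is empty while `pre_filtered` contains one row.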
What is happening
You are probably missing rows because of the DISTINCT ON clause in S1. It appears you are using it to select only the latest applicable row from SVPOLFactor. However, you wrote DISTINCT ON(ChannelId, TimeValue), while the unique rows coming out of S0 may also differ in GranulityId and/or QualityCodeId. So, for example, if you have rows in both rtu and audit0 with these columns:
Id | ChannelId | GranulityId | TimeValue | QualityCodeid
----|-----------+-------------+---------------------+---------------
1 | 123 | 4 | 2015-01-01 00:00:00 | 2
2 | 123 | 5 | 2015-01-01 00:00:00 | 2
then S0, without any WHERE filtering, will return rows for both of them, because they differ in GranulityId. But one of the two will be removed by the DISTINCT ON clause in S1, since they have the same values of ChannelId and TimeValue. Worse, because you sort only by ChannelId and TimeValue, which row is kept and which is discarded is not determined by anything in your query: it is down to pure chance!
In your "post-filtering" example with WHERE ChannelId = 123 AND GranulityId = 4, both rows make it into S0. Then, depending on an order you cannot really control, the DISTINCT ON in S1 may filter out row 1 instead of row 2. Row 2 is then removed by the WHERE clause at the end, leaving neither row: the bug in the DISTINCT ON clause lets row 2, which you never even wanted to see, eliminate row 1 in an intermediate query.
In your "pre-filtering" example, you filter row 2 out in S0 before it can interfere with row 1, so row 1 makes it into the final result.
The fix
One way to stop these rows from being excluded is to extend the DISTINCT ON and ORDER BY clauses to include GranulityId and QualityCodeId:
DISTINCT ON(ChannelId, TimeValue, GranulityId, QualityCodeId)
-- ...
ORDER BY ChannelId, TimeValue, GranulityId, QualityCodeId, StartTimestamp DESC
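As a sanity check on why the wider key helps, here is a small Python sketch (my illustrative analogue) of DISTINCT ON applied to the two colliding rows from the table above:

```python
# The two colliding rows: (Id, ChannelId, GranulityId, TimeValue, QualityCodeId)
rows = [
    (1, 123, 4, "2015-01-01 00:00:00", 2),
    (2, 123, 5, "2015-01-01 00:00:00", 2),
]

def distinct_on(rows, key):
    """Keep the first row per key (rows assumed already ordered as required)."""
    seen, out = set(), []
    for row in rows:
        if key(row) not in seen:
            seen.add(key(row))
            out.append(row)
    return out

narrow = distinct_on(rows, key=lambda r: (r[1], r[3]))              # ChannelId, TimeValue
wide = distinct_on(rows, key=lambda r: (r[1], r[3], r[2], r[4]))    # + GranulityId, QualityCodeId
```

The narrow key collapses the two rows into one (discarding one arbitrarily), while the widened key keeps both.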
Of course, if you filter the results of S0 such that some of these columns share a single value, you can omit those columns from the DISTINCT ON. In your example where you pre-filter S0 by ChannelId and GranulityId, that could be:
DISTINCT ON(TimeValue, QualityCodeId)
-- ...
ORDER BY TimeValue, QualityCodeId, StartTimestamp DESC
But I doubt you would save much time by doing so, so keeping all of these columns is probably safest, in case you one day change the query again and forget to change the DISTINCT ON.
I would also like to point out the PostgreSQL documentation's warning about DISTINCT ON (emphasis mine):
A set of rows for which all the [DISTINCT ON
] expressions are equal are considered duplicates, and only the first row of the set is kept in the output. Note that the "first row" of a set is unpredictable unless the query is sorted on enough columns to guarantee a unique ordering of the rows arriving at the DISTINCT
filter. (DISTINCT ON
processing occurs after ORDER BY
sorting.)
The DISTINCT ON
clause is not part of the SQL standard and is sometimes considered bad style because of the potentially indeterminate nature of its results. With judicious use of GROUP BY
and subqueries in FROM
, this construct can be avoided, but it is often the most convenient alternative.
You already got the right answer; this is just an addition. If you compute the start/end of each period in a derived table, the join returns exactly one row and you do not need DISTINCT ON (which is probably also more efficient):
...
FROM S0 LEFT JOIN
 (
  SELECT *,
         -- the next StartTimestamp = end of the current period,
         -- or +infinity for the last one
         COALESCE(LEAD(StartTimestamp)
                    OVER (PARTITION BY ChannelId
                          ORDER BY StartTimestamp),
                  '+infinity') AS EndTimestamp
  FROM SVPOLFactor AS t
 ) AS SF
 ON  (S0.ChannelId = SF.ChannelId)
 AND (S0.TimeValue >= SF.StartTimestamp)
 AND (S0.TimeValue < SF.EndTimestamp)
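The LEAD trick can also be sketched in Python (my illustration, with datetime.max standing in for '+infinity'): each factor row becomes a half-open validity interval [start, end), so any TimeValue matches exactly one row per channel and no DISTINCT ON is needed:

```python
from datetime import datetime

# (ChannelId, StartTimestamp, Factor); None stands in for '-infinity'
factors = [
    (1001, None, 0.01),
    (1001, datetime(2014, 3, 1), 0.1),
    (1001, datetime(2014, 6, 1), 1.0),
]

def with_end(rows):
    """Emulate LEAD(StartTimestamp) OVER (PARTITION BY ChannelId ORDER BY StartTimestamp):
    each row's end is the next row's start within its channel, or +infinity for the last."""
    by_channel = {}
    for cid, start, factor in sorted(rows, key=lambda r: (r[0], r[1] or datetime.min)):
        by_channel.setdefault(cid, []).append((start or datetime.min, factor))
    out = []
    for cid, spans in by_channel.items():
        for i, (start, factor) in enumerate(spans):
            end = spans[i + 1][0] if i + 1 < len(spans) else datetime.max
            out.append((cid, start, end, factor))
    return out

intervals = with_end(factors)
t = datetime(2014, 4, 15)
# Half-open intervals: a TimeValue matches exactly one factor row per channel.
match = [f for cid, s, e, f in intervals if cid == 1001 and s <= t < e]
```

With the sample rows, the timestamp 2014-04-15 falls in exactly one interval and yields the single factor 0.1, mirroring the one-row join in the SQL above.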