Extrapolate daily historical values from a table that only records when a value changes (PostgreSQL 9.3)
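I have a table that records a row each time a location's score changes.
score_history:
- id int PK (uuid auto incrementing int)
- happened_at timestamp (when the score changed)
- location_id int FK (the location that the value is for)
- score float (the new score)
This was done with an eye towards efficiency and being able to simply retrieve a list of changes for a given location, and it serves that purpose well.
I'm trying to output the data in a very redundant format to help load it into a rigid external system. The external system expects one row for each location * each date. The goal is to represent the last score value for each location on each date. So if the score changed 3 times within a given date, only the score closest to midnight would be considered that day's closing score. I imagine this is similar to the challenge of creating a close-of-business inventory level fact table.
I have a handy star schema style date dimension table with one row per date, fully covering this sample period and well into the future.
That table looks like
dw_dim_date:
- date date PK
- a bunch of other columns such as week number, is_us_holiday, etc.
So, if I had only 3 records in the score_history table...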
1, 2019-01-01:10:13:01, 100, 5.0
2, 2019-01-05:20:00:01, 100, 5.8
3, 2019-01-05:23:01:22, 100, 6.2
The desired output would be:
2019-01-01, 100, 5.0
2019-01-02, 100, 5.0
2019-01-03, 100, 5.0
2019-01-04, 100, 5.0
2019-01-05, 100, 6.2
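For anyone who wants to reproduce this, the setup can be sketched roughly as follows. The column types, the serial key and the January-only date range are assumptions made for illustration; the real tables have more columns.

```sql
-- Assumed DDL: types are guesses based on the column descriptions above.
CREATE TABLE score_history (
    id          serial PRIMARY KEY,
    happened_at timestamp NOT NULL,
    location_id int NOT NULL,
    score       float NOT NULL
);

CREATE TABLE dw_dim_date (
    "date" date PRIMARY KEY
    -- other columns (week number, is_us_holiday, ...) omitted
);

-- The three sample records.
INSERT INTO score_history (happened_at, location_id, score) VALUES
    ('2019-01-01 10:13:01', 100, 5.0),
    ('2019-01-05 20:00:01', 100, 5.8),
    ('2019-01-05 23:01:22', 100, 6.2);

-- One row per calendar day for January 2019 (the real dimension covers far more).
INSERT INTO dw_dim_date ("date")
SELECT d::date
FROM generate_series('2019-01-01'::timestamp,
                     '2019-01-31'::timestamp,
                     interval '1 day') AS g(d);
```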
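3 requirements:
- One row per day per location, even if there are no score records for that day.
- If there are score records for that day, the last one before midnight should be the score value for the row. In the event of a tie, the greater of the two should "win".
- If there are zero score records for that day, the score should be the most recent previous score.
I've been chasing my tail with subqueries and window functions.
Because I'm hesitant to post without showing something I've tried, I'll share this trainwreck, which produces output but makes no sense...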
SELECT dw_dim_date.date,
(SELECT score
FROM score_history
WHERE score_history.happened_at::DATE < dw_dim_date.date
OR score_history.happened_at::DATE = dw_dim_date.date
ORDER BY score_history.id desc limit 1) as last_score
FROM dw_dim_date
WHERE dw_dim_date.date > '2019-06-01'
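Thanks for any guidance, or for pointers to other questions to read.
I think you could try something like this. The main things I changed are wrapping the timestamps in DATE() and using another SO answer for the date finder: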
SELECT
dw_dim_date.date,
(
SELECT
score
FROM
score_history
WHERE
DATE(score_history.happened_at) <= dw_dim_date.date
ORDER BY
score_history.happened_at DESC
LIMIT
1
) as last_score
FROM
dw_dim_date
WHERE
dw_dim_date.date >= DATE('2019-01-01')
This uses the SQL approach from here to find the closest past data to the requested date: PostgreSQL return exact or closest date to queried date
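One thing to watch: the query above never mentions location_id, so with more than one location it would return the latest score across all of them. Below is a rough sketch of one way to carry the location through and apply the tie-break rule from the question. The aliases d, loc and sh are mine, and it has not been run against the real data.

```sql
SELECT d."date",
       loc.location_id,
       (SELECT sh.score
        FROM score_history sh
        WHERE sh.location_id = loc.location_id
          AND sh.happened_at < d."date" + 1          -- anything before the next midnight
        ORDER BY sh.happened_at DESC, sh.score DESC  -- same timestamp: larger score wins
        LIMIT 1) AS last_score
FROM dw_dim_date d
CROSS JOIN (SELECT DISTINCT location_id FROM score_history) loc
WHERE d."date" >= DATE '2019-01-01'
ORDER BY loc.location_id, d."date";
```

Dates before a location's first record come back with a NULL last_score. Because the date dimension runs into the future, an upper bound on d."date" is probably wanted as well.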
WITH
max_per_day_location AS (
SELECT
SH.happened_at::DATE as day,
SH.location_id,
max(SH.happened_at) as happened_at
FROM
score_history SH
GROUP BY
SH.happened_at::DATE,
SH.location_id
),
date_location AS (
SELECT DISTINCT
DD."date",
SH.location_id
FROM
dw_dim_date DD,
max_per_day_location SH
),
value_partition AS (
SELECT
DD."date",
DD.location_id,
SH.score,
SH.happened_at,
MPD.happened_at as hap2,
sum(case when score is null then 0 else 1 end) OVER
(PARTITION BY DD.location_id ORDER BY "date", SH.happened_at desc) AS value_partition
FROM
date_location DD
LEFT JOIN score_history SH
ON DD."date" = SH.happened_at::DATE
AND DD.location_id = SH.location_id
LEFT join max_per_day_location MPD
ON SH.happened_at = MPD.happened_at
WHERE NOT (MPD.happened_at IS NULL
AND
SH.happened_at IS NOT NULL)
ORDER BY
DD."date"
),
final AS (
SELECT
"date",
location_id,
first_value(score) over w
FROM
value_partition
WINDOW w AS (PARTITION BY location_id, value_partition
ORDER BY happened_at rows between unbounded preceding and unbounded following)
order by "date"
)
SELECT DISTINCT * FROM final ORDER BY location_id, date
;
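I'm sure there are simpler ways to do this.
I have an SQLFiddle with some test data here:
http://sqlfiddle.com/#!17/9d122/1
The main thing that makes this work is the "value partition", which gives access to the previous non-null value. More information:
How do I efficiently select the previous non-null value?
The date_location subquery produces one row per day per location_id, since that is the base "row level" needed in the output.
The max_per_day_location subquery filters out the earlier entries for location/day combinations that have multiple scores, keeping only the last one of the day.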
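The "value partition" trick is easier to see in isolation. Here is a minimal sketch on made-up inline data (the daily rows and the names grp and filled_score are hypothetical, not part of the query above): a running count of non-null scores gives each score and the empty days after it the same group number, and the group's single non-null value is then spread across the group.

```sql
WITH daily AS (
    -- Hypothetical input: one row per day, score is NULL on days without a change.
    SELECT * FROM (VALUES
        (DATE '2019-01-01', 5.0),
        (DATE '2019-01-02', NULL),
        (DATE '2019-01-03', NULL),
        (DATE '2019-01-04', 6.2)
    ) AS v(day, score)
),
grouped AS (
    SELECT day,
           score,
           -- count() skips NULLs, so the counter only advances on days that have a score
           count(score) OVER (ORDER BY day) AS grp
    FROM daily
)
SELECT day,
       -- each group holds exactly one non-NULL score, which max() picks out
       max(score) OVER (PARTITION BY grp) AS filled_score
FROM grouped
ORDER BY day;
```

The first three days all end up with 5.0 and the fourth with 6.2. The full query does the same thing, just partitioned by location_id as well.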
You can achieve it with a correlated subquery and LATERAL:
SELECT sub.date, sub.location_id, score
FROM (SELECT * FROM dw_dim_date
CROSS JOIN (SELECT DISTINCT location_id FROM score_history) s
WHERE date >= '2019-01-01'::date) sub
,LATERAL(SELECT score FROM score_history sc
WHERE sc.happened_at::date <= sub.date
AND sc.location_id = sub.location_id
ORDER BY happened_at DESC LIMIT 1) l
,LATERAL(SELECT MIN(happened_at::date) m1, MAX(happened_at::date) m2
FROM score_history sc
WHERE sc.location_id = sub.location_id) lm
WHERE sub.date BETWEEN lm.m1 AND lm.m2
ORDER BY location_id, date;
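How it works:
1) s (a cross join of all dates with every location_id)
2) l (selects the score for each location)
3) lm (selects the min/max date per location for filtering)
4) WHERE filters the dates to the available range; it can be relaxed if needed
The simplest solution is probably: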
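As an illustration of relaxing step 4, here is a sketch that carries the last known score forward up to today instead of stopping at each location's last change. Capping at current_date is my assumption about what "relaxed" might mean here. The lm subquery is no longer needed: the comma join with LATERAL already drops dates before a location's first record, because l returns no row for them.

```sql
SELECT sub."date", sub.location_id, l.score
FROM (SELECT d."date", s.location_id
      FROM dw_dim_date d
      CROSS JOIN (SELECT DISTINCT location_id FROM score_history) s
      -- assumed upper bound: stop at today rather than at the last recorded change
      WHERE d."date" BETWEEN DATE '2019-01-01' AND current_date) sub
   , LATERAL (SELECT sc.score
              FROM score_history sc
              WHERE sc.location_id = sub.location_id
                AND sc.happened_at::date <= sub."date"
              ORDER BY sc.happened_at DESC
              LIMIT 1) l
ORDER BY sub.location_id, sub."date";
```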
select dw_dim_date.date, location_id, score
from dw_dim_date, score_history S1
where happened_at::date <= dw_dim_date.date and
not exists (select *
from score_history S2
where S2.happened_at::date <= dw_dim_date.date and
S1.happened_at< S2.happened_at and
S1.location_id = S2.location_id)
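This computes the Cartesian product of dates and score history rows, then for each date and location keeps the score for which no later score exists (within the period up to that date). I would suggest starting with this, since it is probably the easiest to maintain, and only moving to a more complex solution if it turns out not to be efficient enough (with proper indexes).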
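On the "proper indexes" point, here is a sketch of what that might look like. The index name is made up, and whether the planner actually uses the index depends on the data. The query is the same anti-join, but it compares the raw timestamp against the next midnight instead of casting happened_at to a date, since a plain index on happened_at cannot serve a predicate on happened_at::date.

```sql
-- Assumed index name; meant to support "latest row per location up to a point in time".
CREATE INDEX score_history_location_happened_idx
    ON score_history (location_id, happened_at DESC);

SELECT d."date", s1.location_id, s1.score
FROM dw_dim_date d
JOIN score_history s1
  ON s1.happened_at < d."date" + 1            -- before the next midnight
WHERE NOT EXISTS (SELECT 1
                  FROM score_history s2
                  WHERE s2.location_id = s1.location_id
                    AND s2.happened_at < d."date" + 1
                    AND s2.happened_at > s1.happened_at)
ORDER BY s1.location_id, d."date";
```

If two rows for a location share the same timestamp, both survive the NOT EXISTS, so the "larger one wins" rule would still need handling.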
A fiddle (on dbfiddle rather than SQL Fiddle) is at https://dbfiddle.uk/?rdbms=postgres_9.4&fiddle=3c2e4ae49cbc43f7840b942d223be119