Extrapolate daily historical values from a table that only records when a value changes (PostgreSQL 9.3)

Question

I have a table that records a row each time a location's score changes.

score_history:

id int PK (uuid auto-incrementing int)
happened_at timestamp (when the score changed)
location_id int FK (the location the value is for)
score float (the new score)

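This was done with an eye towards efficiency and being able to simply retrieve the list of changes for a given location, and it serves that purpose well.

I am trying to output the data in a very redundant format to help load it into a rigid external system. The external system expects one row for each location * each date. The goal is to represent the last score value for each location on each date. So if the score changed 3 times within a given date, only the score closest to midnight is considered that day's closing score. I imagine this is similar to the challenge of building a close-of-business inventory level fact table.

I have a handy star-schema date dimension table with one row per date, fully covering this sample period and well into the future.

That table looks like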

dw_dim_date:

date date PK
a bunch of other columns such as week number, is_us_holiday, etc.

So, if I had only 3 records in the score_history table...

1, 2019-01-01:10:13:01, 100, 5.0
2, 2019-01-05:20:00:01, 100, 5.8
3, 2019-01-05:23:01:22, 100, 6.2

The desired output would be:

2019-01-01, 100, 5.0 
2019-01-02, 100, 5.0 
2019-01-03, 100, 5.0
2019-01-04, 100, 5.0 
2019-01-05, 100, 6.2
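
To reproduce the example, a minimal setup sketch might look like the following. The column types, the omitted foreign key on location_id, and the January-only date range are assumptions, and the timestamps are written in the standard YYYY-MM-DD HH:MI:SS form:

```sql
-- Assumed DDL: the real tables may have more columns and constraints.
CREATE TABLE score_history (
    id          serial PRIMARY KEY,
    happened_at timestamp NOT NULL,
    location_id int NOT NULL,          -- FK target table is not shown in the question
    score       float NOT NULL
);

CREATE TABLE dw_dim_date (
    date date PRIMARY KEY              -- other dimension columns omitted
);

-- The three sample score changes for location 100.
INSERT INTO score_history (happened_at, location_id, score) VALUES
    ('2019-01-01 10:13:01', 100, 5.0),
    ('2019-01-05 20:00:01', 100, 5.8),
    ('2019-01-05 23:01:22', 100, 6.2);

-- One row per calendar day; the range here is only for illustration.
INSERT INTO dw_dim_date (date)
SELECT g.d::date
FROM generate_series('2019-01-01'::timestamp,
                     '2019-01-31'::timestamp,
                     '1 day') AS g(d);
```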

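3 requirements:

One row per location per day, even if no score was recorded that day.
If scores were recorded that day, the last one before midnight should be the score value for the row. In the event of a tie, the greater of the two should "win".
If no scores were recorded that day, the score should be the most recent previous score.

I have been chasing my tail with subqueries and window functions.

Because I'm hesitant to post without showing something I tried, I'll share this trainwreck, which produces output but makes no sense...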

SELECT dw_dim_date.date,
       (SELECT score 
        FROM score_history 
        WHERE score_history.happened_at::DATE < dw_dim_date.date 
           OR score_history.happened_at::DATE = dw_dim_date.date 
        ORDER BY score_history.id desc limit 1) as last_score
FROM dw_dim_date
WHERE dw_dim_date.date > '2019-06-01'

Thanks for any guidance or pointers to other questions to read.

Answer 1

I think you could try something like this. The main things I changed were wrapping things in DATE() and using another SO answer for the date finder:

SELECT
  dw_dim_date.date,
  (
    SELECT
      score
    FROM
      score_history
    WHERE
      DATE(score_history.happened_at) <= dw_dim_date.date
    ORDER BY
      score_history.happened_at DESC
    LIMIT
      1
  ) as last_score
FROM
  dw_dim_date
WHERE
  dw_dim_date.date >= DATE('2019-01-01')

This uses the SQL approach from here to find the closest past data to the requested date: PostgreSQL return exact or closest date to queried date
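
Note that the query above returns a single score per date regardless of location. A sketch of a per-location variant follows, assuming the set of locations can be taken from score_history itself and that a secondary sort on score is an acceptable way to let the greater value win a timestamp tie. The aliases d, l and sh and the date range are only illustrative:

```sql
SELECT d.date,
       l.location_id,
       (SELECT sh.score
        FROM score_history sh
        WHERE sh.location_id = l.location_id        -- restrict to this row's location
          AND sh.happened_at::date <= d.date        -- on or before this date
        ORDER BY sh.happened_at DESC, sh.score DESC -- latest first; greater score wins ties
        LIMIT 1) AS last_score
FROM dw_dim_date d
CROSS JOIN (SELECT DISTINCT location_id FROM score_history) l
WHERE d.date BETWEEN DATE '2019-01-01' AND DATE '2019-01-05'
ORDER BY l.location_id, d.date;
```

Dates that fall before a location's first recorded score come back with a NULL last_score in this form.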

Answer 2

WITH
max_per_day_location AS (
SELECT
    SH.happened_at::DATE as day,
    SH.location_id,
    max(SH.happened_at) as happened_at
FROM
    score_history SH
GROUP BY
    SH.happened_at::DATE,
    SH.location_id
),
date_location AS (
SELECT DISTINCT
    DD."date",
    SH.location_id
FROM
    dw_dim_date DD,
    max_per_day_location SH
),
value_partition AS (
SELECT
    DD."date",
    DD.location_id,
    SH.score,
    SH.happened_at,
    MPD.happened_at as hap2,
    sum(case when score is null then 0 else 1 end) OVER
    (PARTITION BY DD.location_id ORDER BY "date", SH.happened_at desc) AS value_partition
FROM
    date_location DD
    LEFT JOIN score_history SH
    ON DD."date" = SH.happened_at::DATE
    AND DD.location_id = SH.location_id
    LEFT join max_per_day_location MPD
    ON SH.happened_at = MPD.happened_at
WHERE NOT (MPD.happened_at IS NULL
           AND
           SH.happened_at IS NOT NULL)
ORDER BY
    DD."date"
),
final AS (
SELECT
    "date",
    location_id,
    first_value(score) over w
FROM
    value_partition
WINDOW w AS (PARTITION BY location_id, value_partition
             ORDER BY happened_at rows between unbounded preceding and unbounded following)
order by "date"
)
SELECT DISTINCT * FROM final ORDER BY location_id, date
;

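I'm sure there are simpler ways to do this.

I have a SQLFiddle with some test data here: http://sqlfiddle.com/#!17/9d122/1

The main thing that makes this work is the "value partition", which gives access to the previous non-null value. More information: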

How do I efficiently select the previous non-null value?
https://dba.stackexchange.com/questions/156068/using-window-function-to-carry-forward-first-non-null-value-in-a-partition

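The date_location subquery generates one row per day per location_id, since that is the base "row level" needed in the output.

The max_per_day_location subquery is used to filter out the earlier entries for location/day combinations that have multiple scores, keeping only the last one of the day.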
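
The "value partition" idea can be seen in isolation in the sketch below. It uses hypothetical inline data for a single location (the daily and grouped CTE names are made up for illustration) and a running count of non-null scores in place of the sum(case ...) expression, which has the same effect:

```sql
WITH daily(day, score) AS (
    -- Hypothetical input: one row per day, NULL where no score was recorded.
    VALUES (DATE '2019-01-01', 5.0),
           (DATE '2019-01-02', NULL),
           (DATE '2019-01-03', NULL),
           (DATE '2019-01-04', NULL),
           (DATE '2019-01-05', 6.2)
),
grouped AS (
    SELECT day,
           score,
           -- count(score) skips NULLs, so the running count only increases
           -- on days that actually have a score.
           count(score) OVER (ORDER BY day) AS grp
    FROM daily
)
SELECT day,
       -- Each group contains exactly one non-null score; max() spreads it
       -- over the following NULL days in the same group.
       max(score) OVER (PARTITION BY grp) AS score
FROM grouped
ORDER BY day;
```

With this input, days 2019-01-02 through 2019-01-04 share a group with 2019-01-01 and so pick up its score of 5.0.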

Answer 3

You can achieve it by using a correlated subquery and LATERAL:

SELECT sub.date, sub.location_id, score
FROM (SELECT * FROM dw_dim_date
      CROSS JOIN (SELECT DISTINCT location_id FROM score_history) s
      WHERE date >= '2019-01-01'::date) sub
,LATERAL(SELECT score FROM score_history sc 
         WHERE sc.happened_at::date <= sub.date
           AND sc.location_id = sub.location_id
         ORDER BY happened_at DESC, score DESC LIMIT 1) l  -- on a happened_at tie, the greater score wins
,LATERAL(SELECT MIN(happened_at::date) m1, MAX(happened_at::date) m2 
         FROM score_history sc
         WHERE sc.location_id = sub.location_id) lm
WHERE sub.date BETWEEN lm.m1 AND lm.m2
ORDER BY location_id, date;

db<>fiddle demo

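How it works:

1) s (a cross join of all dates with each location_id)

2) l (selects the score for each location)

3) lm (selects the min/max date per location for filtering)

4) WHERE filters the dates to the available range; it can be relaxed if needed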
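
As one way of relaxing the range in step 4, the sketch below drops the lm subquery and carries each location's last score forward to the current date. The current_date upper bound and the aliases d and sc are assumptions, not part of the original answer. Because the LATERAL join is an inner join, dates before a location's first score return no row from l and drop out on their own:

```sql
SELECT sub.date, sub.location_id, l.score
FROM (SELECT d.date, s.location_id
      FROM dw_dim_date d
      CROSS JOIN (SELECT DISTINCT location_id FROM score_history) s
      WHERE d.date <= current_date) sub           -- assumed upper bound
CROSS JOIN LATERAL (
      SELECT sc.score
      FROM score_history sc
      WHERE sc.happened_at::date <= sub.date
        AND sc.location_id = sub.location_id
      ORDER BY sc.happened_at DESC, sc.score DESC  -- greater score wins a timestamp tie
      LIMIT 1) l
ORDER BY sub.location_id, sub.date;
```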

Answer 4

The simplest solution is probably:

    select dw_dim_date.date, location_id, score
    from dw_dim_date, score_history S1
    where happened_at::date  <= dw_dim_date.date and 
          not exists (select * 
                      from score_history S2 
                      where S2.happened_at::date  <= dw_dim_date.date and 
                            S1.happened_at< S2.happened_at and
                            S1.location_id = S2.location_id)

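This computes the Cartesian product between the dates and the score history, and then for each date and location takes the score for which no later score exists (within the date range). I'd suggest starting with this one, since it is probably the easiest to maintain, and only moving to a more complex solution if it is not efficient enough (with appropriate indexes).

A fiddle is available at https://dbfiddle.uk/?rdbms=postgres_9.4&fiddle=3c2e4ae49cbc43f7840b942d223be119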
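
As a rough sketch of what "appropriate indexes" could mean here: a composite index on location and timestamp, combined with a filter that compares the raw timestamp instead of casting it to a date, should let the planner use the index for a single-location lookup. The index name is hypothetical, and whether this helps on a given data set is something to check with EXPLAIN:

```sql
-- Hypothetical index: equality column first, then the sort columns.
CREATE INDEX score_history_loc_happened_idx
    ON score_history (location_id, happened_at DESC, score DESC);

-- Closing score for one location and one day. "happened_at < day + 1"
-- selects the same rows as "happened_at::date <= day" without wrapping
-- the indexed column in a cast.
SELECT score
FROM score_history
WHERE location_id = 100
  AND happened_at < DATE '2019-01-05' + 1
ORDER BY happened_at DESC, score DESC
LIMIT 1;
```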

sql

postgresql

data-warehouse

postgresql-9.3