填补组内日期和变量的空白 - PostgreSQL
Fill gaps of dates and variable within group - PostgreSQL
我的价格 table 在两个主要变量、日期 (sales_date) 和销售渠道 (channel) 中存在差距。我需要为所有可能的 skus (ean) 和客户 (id_client) 的组合填补这些空白。
此时我已经能够填写日期和频道,但在某些情况下,多个频道在同一日期共享,在那些 "bizarre" 情况下,我的方法是复制所有内容。
表格
create table prices_master (
id_price serial primary key,
sales_date date,
ean varchar(15),
id_client int,
channel varchar(15),
price float
);
create table channels_master (
id_channel serial primary key,
channel varchar(15)
);
insert into prices_master (sales_date, ean, id_client, channel, price)
values
('2015-07-01', '7506205801143', 7, 'COMERCIAL',47655),
('2015-08-01', '7506205801143', 7, 'COMERCIAL',51655),
('2015-12-01', '7506205801143', 7, 'COMERCIAL', 55667),
('2015-12-01', '7506205801143', 7, 'DISTRIBUIDOR', 35667),
('2015-07-01', '5052197008555', 7, 'DISTRIBUIDOR', 7224),
('2015-10-01', '5052197008555', 7, 'DISTRIBUIDOR', 8224);
insert into channels_master (channel) values
('DISTRIBUIDOR'), ('INSTITUCIONAL'), ('NON_TRADE'), ('COMERCIAL');
我的做法
WITH full_dates AS (
WITH min_max AS (
SELECT min(prm.sales_date) AS min_date, ((max(prm.sales_date))) :: date AS max_date
FROM prices_master prm
)
SELECT generate_series((min_max.min_date) :: timestamp with time zone,
(min_max.max_date) :: timestamp with time zone, '1 mon' :: interval) AS sales_date
FROM min_max),
completechannels AS (
SELECT DISTINCT channel
FROM channels_master
),
temp AS (
SELECT prices_master.sales_date,
prices_master.id_client,
prices_master.ean,
prices_master.channel,
prices_master.price,
lead(
prices_master.sales_date) OVER (PARTITION BY prices_master.id_client, prices_master.ean, prices_master.channel ORDER BY prices_master.sales_date) AS next_sales_date
FROM prices_master
ORDER BY prices_master.id_client, prices_master.ean, prices_master.channel, prices_master.sales_date
)
SELECT (full_dates.sales_date) :: date AS sales_date,
temp.id_client,
temp.ean,
completechannels.channel,
price
FROM full_dates
JOIN temp ON full_dates.sales_date >= temp.sales_date AND
(full_dates.sales_date < temp.next_sales_date OR temp.next_sales_date IS NULL)
JOIN completechannels ON 1=1
ORDER BY temp.id_client, temp.ean, completechannels.channel,
full_dates.sales_date;
我的问题出现在代码 7506205801143 sales_date 2015-12-01 上,因为此代码对两者都有价格频道 DISTRIBUIDOR 和 COMERCIAL 我的方法是复制行:
我的接近结果(不好)
+------------+-----------+---------------+---------------+-------+
| sales_date | id_client | ean | channel | price |
+------------+-----------+---------------+---------------+-------+
| 2015-12-01 | 7 | 7506205801143 | COMERCIAL | 55667 |
| 2015-12-01 | 7 | 7506205801143 | COMERCIAL | 35667 |
| 2015-12-01 | 7 | 7506205801143 | DISTRIBUIDOR | 55667 |
| 2015-12-01 | 7 | 7506205801143 | DISTRIBUIDOR | 35667 |
| 2015-12-01 | 7 | 7506205801143 | INSTITUCIONAL | 35667 |
| 2015-12-01 | 7 | 7506205801143 | INSTITUCIONAL | 55667 |
| 2015-12-01 | 7 | 7506205801143 | NON_TRADE | 55667 |
| 2015-12-01 | 7 | 7506205801143 | NON_TRADE | 35667 |
+------------+-----------+---------------+---------------+-------+
预期结果(良好)
+------------+-----------+---------------+---------------+-------+
| sales_date | id_client | ean | channel | price |
+------------+-----------+---------------+---------------+-------+
| 2015-12-01 | 7 | 7506205801143 | COMERCIAL | 55667 |
| 2015-12-01 | 7 | 7506205801143 | DISTRIBUIDOR | 35667 |
| 2015-12-01 | 7 | 7506205801143 | INSTITUCIONAL | 55667 |
| 2015-12-01 | 7 | 7506205801143 | NON_TRADE | 55667 |
+------------+-----------+---------------+---------------+-------+
对于INSTITUTIONAL和NON_TRADE,最高价可用于填补缺口。
您可以尝试在子查询中通过 sales_date
DESC 使用 ROW_NUMBER
window 函数来获取每个 channel
的最大行数据
然后用coalesce
和MAX
window函数来制作。
查询 1:
WITH pricesCTE as (
SELECT price,sales_date,id_client,ean,cm.channel,ROW_NUMBER() OVER(PARTITION BY cm.channel ORDER BY sales_date DESC) rn
FROM (SELECT DISTINCT channel FROM channels_master) cm
LEFT JOIN prices_master pm on pm.channel = cm.channel
)
SELECT
coalesce(sales_date,MAX(sales_date) OVER(ORDER BY coalesce(price,0) DESC)) sales_date,
coalesce(id_client,MAX(id_client) OVER(ORDER BY coalesce(price,0) DESC)) id_client,
coalesce(ean,MAX(ean) OVER(ORDER BY coalesce(price,0) DESC)) ean,
channel,
coalesce(price,MAX(price) OVER(ORDER BY coalesce(price,0) DESC)) price
FROM
(
select *
from pricesCTE
where rn = 1
) t1
| sales_date | id_client | ean | channel | price |
|------------|-----------|---------------|---------------|-------|
| 2015-12-01 | 7 | 7506205801143 | COMERCIAL | 55667 |
| 2015-12-01 | 7 | 7506205801143 | DISTRIBUIDOR | 35667 |
| 2015-12-01 | 7 | 7506205801143 | INSTITUCIONAL | 55667 |
| 2015-12-01 | 7 | 7506205801143 | NON_TRADE | 55667 |
你会发现这更容易通过将其中的一些翻转过来并考虑主要价格 table 来 overrides。也就是说,您想要构造 "base" table 只是 date
/client
/ean
元组的(最大)价格,以及 忽略频道,稍后再看。
首先,您需要将以下 CTE 添加到您已有的 CTE 中(formatting/naming 更新为我通常的风格):
Maximum_Price_Per_Date AS (
SELECT Date_Range.sales_date, Price_Date_Range.id_client, Price_Date_Range.ean,
MAX(Price_Date_Range.price) AS price
FROM Date_Range
JOIN Price_Date_Range -- aka TEMP in your original query
ON Price_Date_Range.sales_date <= Date_Range.sales_date
AND (Price_Date_Range.next_sales_date > Date_Range.sales_date OR Price_Date_Range.next_sales_date IS NULL)
GROUP BY Date_Range.sales_date, Price_Date_Range.id_client, Price_Date_Range.ean
)
这使得笛卡尔积的集合乘法(JOIN completechannels ON 1=1
- 虽然通常通过 CROSS JOIN
完成)与你一起工作:现在不会有额外的行:
SELECT Maximum_Price_Per_Date.sales_date, Maximum_Price_Per_Date.id_client, Maximum_Price_Per_Date.ean,
Channel.channel,
Maximum_Price_Per_Date.price
FROM Maximum_Price_Per_Date
CROSS JOIN (SELECT DISTINCT channel
FROM Channels_Master) Channel
生成(省略无意义的行):
| sales_date | channel | id_client | ean | price |
|------------|---------|---------------|---------------|-------|
| 2015-12-01 | 7 | 7506205801143 | DISTRIBUIDOR | 55667 |
| 2015-12-01 | 7 | 7506205801143 | COMERCIAL | 55667 |
| 2015-12-01 | 7 | 7506205801143 | NON_TRADE | 55667 |
| 2015-12-01 | 7 | 7506205801143 | INSTITUCIONAL | 55667 |
现在我们只需要 LEFT JOIN
再次(再次)回到 Price_Date_Range
CTE,使用那里的价格,如果存在的话:
-- Note that you should have a Calendar table, which would remove this.
WITH Date_Range AS (
-- You probably should be using an explicit range here, to account for future dates.
WITH Min_Max AS (
SELECT MIN(sales_date) AS min_date, MAX(sales_date) AS max_date
FROM Prices_Master
),
Timezone_Range AS (
SELECT GENERATE_SERIES(min_date, max_date, CAST('1 mon' AS INTERVAL)) AS sales_date
FROM Min_Max
)
SELECT CAST(sales_date AS DATE) AS sales_date
FROM Timezone_Range
),
-- This would really benefit by being a MQT - materialized query table
Price_Date_Range AS (
SELECT sales_date, lead(sales_date) OVER (PARTITION BY id_client, ean, channel ORDER BY sales_date) AS next_sales_date,
id_client, ean, channel, price
FROM Prices_Master
),
Maximum_Price_Per_Date AS (
SELECT Date_Range.sales_date, Price_Date_Range.id_client, Price_Date_Range.ean,
MAX(Price_Date_Range.price) AS price
FROM Date_Range
JOIN Price_Date_Range
ON Price_Date_Range.sales_date <= Date_Range.sales_date
AND (Price_Date_Range.next_sales_date > Date_Range.sales_date OR Price_Date_Range.next_sales_date IS NULL)
GROUP BY Date_Range.sales_date, Price_Date_Range.id_client, Price_Date_Range.ean
)
SELECT Maximum_Price_Per_Date.sales_date, Maximum_Price_Per_Date.id_client, Maximum_Price_Per_Date.ean,
Channel.channel,
COALESCE(Price_Date_Range.price, Maximum_Price_Per_Date.price) AS price
FROM Maximum_Price_Per_Date
CROSS JOIN (SELECT DISTINCT channel
FROM Channels_Master) Channel
LEFT JOIN Price_Date_Range
ON Price_Date_Range.channel = Channel.channel
AND Price_Date_Range.id_client = Maximum_Price_Per_Date.id_client
AND Price_Date_Range.ean = Maximum_Price_Per_Date.ean
AND Price_Date_Range.sales_date <= Maximum_Price_Per_Date.sales_date
AND (Price_Date_Range.next_sales_date > Maximum_Price_Per_Date.sales_date OR Price_Date_Range.next_sales_date IS NULL)
ORDER BY Maximum_Price_Per_Date.sales_date, Maximum_Price_Per_Date.id_client, Maximum_Price_Per_Date.ean, Channel.channel
Fiddle example
(感谢@D-Shih 的设置)
生成(省略无意义的行):
| sales_date | channel | id_client | ean | price |
|------------|---------|---------------|---------------|-------|
| 2015-12-01 | 7 | 7506205801143 | COMERCIAL | 55667 |
| 2015-12-01 | 7 | 7506205801143 | DISTRIBUIDOR | 35667 |
| 2015-12-01 | 7 | 7506205801143 | INSTITUCIONAL | 55667 |
| 2015-12-01 | 7 | 7506205801143 | NON_TRADE | 55667 |
我的价格 table 在两个主要变量、日期 (sales_date) 和销售渠道 (channel) 中存在差距。我需要为所有可能的 skus (ean) 和客户 (id_client) 的组合填补这些空白。
此时我已经能够填写日期和频道,但在某些情况下,多个频道在同一日期共享,在那些 "bizarre" 情况下,我的方法是复制所有内容。
表格
create table prices_master (
id_price serial primary key,
sales_date date,
ean varchar(15),
id_client int,
channel varchar(15),
price float
);
create table channels_master (
id_channel serial primary key,
channel varchar(15)
);
insert into prices_master (sales_date, ean, id_client, channel, price)
values
('2015-07-01', '7506205801143', 7, 'COMERCIAL',47655),
('2015-08-01', '7506205801143', 7, 'COMERCIAL',51655),
('2015-12-01', '7506205801143', 7, 'COMERCIAL', 55667),
('2015-12-01', '7506205801143', 7, 'DISTRIBUIDOR', 35667),
('2015-07-01', '5052197008555', 7, 'DISTRIBUIDOR', 7224),
('2015-10-01', '5052197008555', 7, 'DISTRIBUIDOR', 8224);
insert into channels_master (channel) values
('DISTRIBUIDOR'), ('INSTITUCIONAL'), ('NON_TRADE'), ('COMERCIAL');
我的做法
WITH full_dates AS (
WITH min_max AS (
SELECT min(prm.sales_date) AS min_date, ((max(prm.sales_date))) :: date AS max_date
FROM prices_master prm
)
SELECT generate_series((min_max.min_date) :: timestamp with time zone,
(min_max.max_date) :: timestamp with time zone, '1 mon' :: interval) AS sales_date
FROM min_max),
completechannels AS (
SELECT DISTINCT channel
FROM channels_master
),
temp AS (
SELECT prices_master.sales_date,
prices_master.id_client,
prices_master.ean,
prices_master.channel,
prices_master.price,
lead(
prices_master.sales_date) OVER (PARTITION BY prices_master.id_client, prices_master.ean, prices_master.channel ORDER BY prices_master.sales_date) AS next_sales_date
FROM prices_master
ORDER BY prices_master.id_client, prices_master.ean, prices_master.channel, prices_master.sales_date
)
SELECT (full_dates.sales_date) :: date AS sales_date,
temp.id_client,
temp.ean,
completechannels.channel,
price
FROM full_dates
JOIN temp ON full_dates.sales_date >= temp.sales_date AND
(full_dates.sales_date < temp.next_sales_date OR temp.next_sales_date IS NULL)
JOIN completechannels ON 1=1
ORDER BY temp.id_client, temp.ean, completechannels.channel,
full_dates.sales_date;
我的问题出现在代码 7506205801143 sales_date 2015-12-01 上,因为此代码对两者都有价格频道 DISTRIBUIDOR 和 COMERCIAL 我的方法是复制行:
我的接近结果(不好)
+------------+-----------+---------------+---------------+-------+
| sales_date | id_client | ean | channel | price |
+------------+-----------+---------------+---------------+-------+
| 2015-12-01 | 7 | 7506205801143 | COMERCIAL | 55667 |
| 2015-12-01 | 7 | 7506205801143 | COMERCIAL | 35667 |
| 2015-12-01 | 7 | 7506205801143 | DISTRIBUIDOR | 55667 |
| 2015-12-01 | 7 | 7506205801143 | DISTRIBUIDOR | 35667 |
| 2015-12-01 | 7 | 7506205801143 | INSTITUCIONAL | 35667 |
| 2015-12-01 | 7 | 7506205801143 | INSTITUCIONAL | 55667 |
| 2015-12-01 | 7 | 7506205801143 | NON_TRADE | 55667 |
| 2015-12-01 | 7 | 7506205801143 | NON_TRADE | 35667 |
+------------+-----------+---------------+---------------+-------+
预期结果(良好)
+------------+-----------+---------------+---------------+-------+
| sales_date | id_client | ean | channel | price |
+------------+-----------+---------------+---------------+-------+
| 2015-12-01 | 7 | 7506205801143 | COMERCIAL | 55667 |
| 2015-12-01 | 7 | 7506205801143 | DISTRIBUIDOR | 35667 |
| 2015-12-01 | 7 | 7506205801143 | INSTITUCIONAL | 55667 |
| 2015-12-01 | 7 | 7506205801143 | NON_TRADE | 55667 |
+------------+-----------+---------------+---------------+-------+
对于INSTITUTIONAL和NON_TRADE,最高价可用于填补缺口。
您可以尝试在子查询中通过 sales_date
DESC 使用 ROW_NUMBER
window 函数来获取每个 channel
然后用coalesce
和MAX
window函数来制作。
查询 1:
WITH pricesCTE as (
SELECT price,sales_date,id_client,ean,cm.channel,ROW_NUMBER() OVER(PARTITION BY cm.channel ORDER BY sales_date DESC) rn
FROM (SELECT DISTINCT channel FROM channels_master) cm
LEFT JOIN prices_master pm on pm.channel = cm.channel
)
SELECT
coalesce(sales_date,MAX(sales_date) OVER(ORDER BY coalesce(price,0) DESC)) sales_date,
coalesce(id_client,MAX(id_client) OVER(ORDER BY coalesce(price,0) DESC)) id_client,
coalesce(ean,MAX(ean) OVER(ORDER BY coalesce(price,0) DESC)) ean,
channel,
coalesce(price,MAX(price) OVER(ORDER BY coalesce(price,0) DESC)) price
FROM
(
select *
from pricesCTE
where rn = 1
) t1
| sales_date | id_client | ean | channel | price |
|------------|-----------|---------------|---------------|-------|
| 2015-12-01 | 7 | 7506205801143 | COMERCIAL | 55667 |
| 2015-12-01 | 7 | 7506205801143 | DISTRIBUIDOR | 35667 |
| 2015-12-01 | 7 | 7506205801143 | INSTITUCIONAL | 55667 |
| 2015-12-01 | 7 | 7506205801143 | NON_TRADE | 55667 |
你会发现这更容易通过将其中的一些翻转过来并考虑主要价格 table 来 overrides。也就是说,您想要构造 "base" table 只是 date
/client
/ean
元组的(最大)价格,以及 忽略频道,稍后再看。
首先,您需要将以下 CTE 添加到您已有的 CTE 中(formatting/naming 更新为我通常的风格):
Maximum_Price_Per_Date AS (
SELECT Date_Range.sales_date, Price_Date_Range.id_client, Price_Date_Range.ean,
MAX(Price_Date_Range.price) AS price
FROM Date_Range
JOIN Price_Date_Range -- aka TEMP in your original query
ON Price_Date_Range.sales_date <= Date_Range.sales_date
AND (Price_Date_Range.next_sales_date > Date_Range.sales_date OR Price_Date_Range.next_sales_date IS NULL)
GROUP BY Date_Range.sales_date, Price_Date_Range.id_client, Price_Date_Range.ean
)
这使得笛卡尔积的集合乘法(JOIN completechannels ON 1=1
- 虽然通常通过 CROSS JOIN
完成)与你一起工作:现在不会有额外的行:
SELECT Maximum_Price_Per_Date.sales_date, Maximum_Price_Per_Date.id_client, Maximum_Price_Per_Date.ean,
Channel.channel,
Maximum_Price_Per_Date.price
FROM Maximum_Price_Per_Date
CROSS JOIN (SELECT DISTINCT channel
FROM Channels_Master) Channel
生成(省略无意义的行):
| sales_date | channel | id_client | ean | price |
|------------|---------|---------------|---------------|-------|
| 2015-12-01 | 7 | 7506205801143 | DISTRIBUIDOR | 55667 |
| 2015-12-01 | 7 | 7506205801143 | COMERCIAL | 55667 |
| 2015-12-01 | 7 | 7506205801143 | NON_TRADE | 55667 |
| 2015-12-01 | 7 | 7506205801143 | INSTITUCIONAL | 55667 |
现在我们只需要 LEFT JOIN
再次(再次)回到 Price_Date_Range
CTE,使用那里的价格,如果存在的话:
-- Note that you should have a Calendar table, which would remove this.
WITH Date_Range AS (
-- You probably should be using an explicit range here, to account for future dates.
WITH Min_Max AS (
SELECT MIN(sales_date) AS min_date, MAX(sales_date) AS max_date
FROM Prices_Master
),
Timezone_Range AS (
SELECT GENERATE_SERIES(min_date, max_date, CAST('1 mon' AS INTERVAL)) AS sales_date
FROM Min_Max
)
SELECT CAST(sales_date AS DATE) AS sales_date
FROM Timezone_Range
),
-- This would really benefit by being a MQT - materialized query table
Price_Date_Range AS (
SELECT sales_date, lead(sales_date) OVER (PARTITION BY id_client, ean, channel ORDER BY sales_date) AS next_sales_date,
id_client, ean, channel, price
FROM Prices_Master
),
Maximum_Price_Per_Date AS (
SELECT Date_Range.sales_date, Price_Date_Range.id_client, Price_Date_Range.ean,
MAX(Price_Date_Range.price) AS price
FROM Date_Range
JOIN Price_Date_Range
ON Price_Date_Range.sales_date <= Date_Range.sales_date
AND (Price_Date_Range.next_sales_date > Date_Range.sales_date OR Price_Date_Range.next_sales_date IS NULL)
GROUP BY Date_Range.sales_date, Price_Date_Range.id_client, Price_Date_Range.ean
)
SELECT Maximum_Price_Per_Date.sales_date, Maximum_Price_Per_Date.id_client, Maximum_Price_Per_Date.ean,
Channel.channel,
COALESCE(Price_Date_Range.price, Maximum_Price_Per_Date.price) AS price
FROM Maximum_Price_Per_Date
CROSS JOIN (SELECT DISTINCT channel
FROM Channels_Master) Channel
LEFT JOIN Price_Date_Range
ON Price_Date_Range.channel = Channel.channel
AND Price_Date_Range.id_client = Maximum_Price_Per_Date.id_client
AND Price_Date_Range.ean = Maximum_Price_Per_Date.ean
AND Price_Date_Range.sales_date <= Maximum_Price_Per_Date.sales_date
AND (Price_Date_Range.next_sales_date > Maximum_Price_Per_Date.sales_date OR Price_Date_Range.next_sales_date IS NULL)
ORDER BY Maximum_Price_Per_Date.sales_date, Maximum_Price_Per_Date.id_client, Maximum_Price_Per_Date.ean, Channel.channel
Fiddle example
(感谢@D-Shih 的设置)
生成(省略无意义的行):
| sales_date | channel | id_client | ean | price |
|------------|---------|---------------|---------------|-------|
| 2015-12-01 | 7 | 7506205801143 | COMERCIAL | 55667 |
| 2015-12-01 | 7 | 7506205801143 | DISTRIBUIDOR | 35667 |
| 2015-12-01 | 7 | 7506205801143 | INSTITUCIONAL | 55667 |
| 2015-12-01 | 7 | 7506205801143 | NON_TRADE | 55667 |