Fill the table with data for missing dates (PostgreSQL, Redshift)
I'm trying to fill in daily data for missing dates and can't find an answer, please help.
An example of my daily_table:
url | timestamp_gmt | visitors | hits | other..
-------------------+---------------+----------+-------+-------
www.domain.com/1 | 2016-04-12 | 1231 | 23423 |
www.domain.com/1 | 2016-04-13 | 1374 | 26482 |
www.domain.com/1 | 2016-04-17 | 1262 | 21493 |
www.domain.com/2 | 2016-05-09 | 2345 | 35471 |
Expected result: I want to fill this table with data for every domain and every day, simply copying the data from the previous date:
url | timestamp_gmt | visitors | hits | other..
-------------------+---------------+----------+-------+-------
www.domain.com/1 | 2016-04-12 | 1231 | 23423 |
www.domain.com/1 | 2016-04-13 | 1374 | 26482 |
www.domain.com/1 | 2016-04-14 | 1374 | 26482 | <-added
www.domain.com/1 | 2016-04-15 | 1374 | 26482 | <-added
www.domain.com/1 | 2016-04-16 | 1374 | 26482 | <-added
www.domain.com/1 | 2016-04-17 | 1262 | 21493 |
www.domain.com/2 | 2016-05-09 | 2345 | 35471 |
I could move part of the logic into PHP, but that is undesirable because my table has billions of missing dates.
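Summary:
Over the past few days I found out that:
- Amazon Redshift is based on PostgreSQL version 8, which is why it doesn't support nice commands like JOIN LATERAL
- Redshift also doesn't support generate_series or CTEs
- It does support a simple WITH (thanks @systemjack), but WITH RECURSIVE is not supported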
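For comparison, on stock PostgreSQL 9.3 or later, where JOIN LATERAL and generate_series are both available, the fill could be sketched roughly as below. This is only an illustration and will not run on Redshift as described above; the column names follow daily_table from the question, copied_from is a made-up alias, and the date bounds are just the sample range.

```sql
-- PostgreSQL 9.3+ only: for every url and every day in the range,
-- look up the latest existing row at or before that day.
select u.url,
       d.new_date::date as timestamp_gmt,
       p.timestamp_gmt  as copied_from,   -- hypothetical alias: the source row's date
       p.visitors,
       p.hits
from (select distinct url from daily_table) u
cross join generate_series(timestamp '2016-04-12',
                           timestamp '2016-04-17',
                           interval '1 day') as d(new_date)
join lateral (
    select t.timestamp_gmt, t.visitors, t.hits
    from daily_table t
    where t.url = u.url
      and t.timestamp_gmt <= d.new_date
    order by t.timestamp_gmt desc
    limit 1
) p on true   -- inner join: days before a url's first row are dropped
order by u.url, d.new_date;
```

An index on (url, timestamp_gmt) would let each lateral lookup stop after one row.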
Look at the idea behind this query:
select distinct on (domain, new_date) *
from (
select new_date::date
from generate_series('2016-04-12', '2016-04-17', '1d'::interval) new_date
) s
left join a_table t on date <= new_date
order by domain, new_date, date desc;
new_date | domain | date | visitors | hits
------------+-----------------+------------+----------+-------
2016-04-12 | www.domain1.com | 2016-04-12 | 1231 | 23423
2016-04-13 | www.domain1.com | 2016-04-13 | 1374 | 26482
2016-04-14 | www.domain1.com | 2016-04-13 | 1374 | 26482
2016-04-15 | www.domain1.com | 2016-04-13 | 1374 | 26482
2016-04-16 | www.domain1.com | 2016-04-13 | 1374 | 26482
2016-04-17 | www.domain1.com | 2016-04-17 | 1262 | 21493
(6 rows)
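You have to choose the start and end dates to suit your requirements.
The query may be very expensive (you mentioned billions of gaps), so apply it with care (test it on a smaller subset of the data or run it in stages).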
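One way to keep the work bounded is to generate days only between each domain's first and last date, and to match rows on the domain as well as the date. A sketch under those assumptions (PostgreSQL, same a_table columns as above; first_date, last_date and the outer series bounds are placeholders to adjust):

```sql
select distinct on (b.domain, s.new_date)
       s.new_date, b.domain, t.date, t.visitors, t.hits
from (
    -- per-domain bounds, so no days are generated outside the observed range
    select domain, min(date) as first_date, max(date) as last_date
    from a_table
    group by domain
) b
join (
    select new_date::date
    from generate_series('2016-04-01', '2016-05-31', '1d'::interval) new_date
) s on s.new_date between b.first_date and b.last_date
join a_table t on t.domain = b.domain and t.date <= s.new_date
-- "date desc" makes distinct on keep the closest earlier row
order by b.domain, s.new_date, t.date desc;
```

Adding a filter such as where b.domain = '...' to the bounds subquery is a simple way to run this one domain at a time.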
Without generate_series() you can create your own generator. Here is an interesting example. You can use the views from the cited article in place of generate_series(). For example, if you need the period '2016-04-12' + 5 days:
select distinct on (domain, new_date) *
from (
select '2016-04-12'::date+ n new_date
from generator_16
where n < 6
) s
left join a_table t on date <= new_date
order by domain, new_date, date desc;
You will get the same result as in the first example.
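The linked article is not reproduced here, so as a rough idea of what such a generator might look like: the sketch below builds a 16-row view out of a 4-row one using only UNION ALL and a cross join, so it should not need generate_series(). The name generator_4 and both definitions are assumptions made to match the generator_16 used above, which is assumed to return n from 0 to 15.

```sql
-- Assumed helper: four rows, n = 0..3
create view generator_4 as
select 0 as n union all
select 1 union all
select 2 union all
select 3;

-- 4 x 4 combinations give n = 0..15
create view generator_16 as
select hi.n * 4 + lo.n as n
from generator_4 lo
cross join generator_4 hi;
```

Larger generators can be stacked the same way, for example by cross joining generator_16 with itself for 256 rows.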
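Here's an ugly hack to get Redshift to generate new rows into a table based on dates, in this case. This example limits the output to the previous 30 days. The range can be adjusted or removed. The same approach could be used for minutes, seconds, etc.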
with days as (
select (dateadd(day, -row_number() over (order by true), sysdate::date+'1 day'::interval)) as day
from stv_blocklist limit 30
)
select day from days order by day
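To target a specific time range, replace the sysdate::date+'1 day'::interval expression with a literal for the day after the end of the range you want, and set the limit to the number of days to cover.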
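As an untested illustration of that adjustment, the sketch below uses the literal 2016-04-18 as the day after the end of the range and a limit of 6, which should give the six days from 2016-04-12 through 2016-04-17. The dates are taken from the question's sample; everything else is unchanged from the query above.

```sql
with days as (
    -- row_number() counts 1..6, so the days run backwards from 2016-04-17
    select (dateadd(day, -row_number() over (order by true), '2016-04-18'::date)) as day
    from stv_blocklist limit 6
)
select day from days order by day
```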
The insert would look something like this:
with days as (
select (dateadd(day, -row_number() over (order by true), sysdate::date+'1 day'::interval)) as day
from stv_blocklist limit 30
)
insert into your_table (domain, date) (
select dns.domain, d.day
from days d
cross join (select distinct(domain) from your_table) dns
left join your_table y on y.domain=dns.domain and y.date=d.day
where y.date is null
)
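I wasn't able to test the insert, so it may need some adjustment.
The reference to the stv_blocklist table could be any table that has enough rows to cover the range limit in the with clause; it only serves as a row source for the row_number() window function.
Once you have the date-only rows in place, you can update them with the most recent complete record, like this: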
update your_table set visitors=t.visitors, hits=t.hits
from (
select a.domain, a.date, b.visitors, b.hits
from your_table a
inner join your_table b
on b.domain=a.domain and b.date=(SELECT max(date) FROM your_table where domain=a.domain and hits is not null and date < a.date)
where a.hits is null
) t
where your_table.domain=t.domain and your_table.date=t.date
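This is slow, but should be fine for smaller data sets or a one-off. I was able to test a similar query.
Update: I think this version of the null-filling query should work better, and it takes domain and date into account. I tested a similar version.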
update your_table set visitors=t.prev_visitors, hits=t.prev_hits
from (
select domain, date, hits,
lag(visitors,1) ignore nulls over (partition by domain order by date) as prev_visitors,
lag(hits,1) ignore nulls over (partition by domain order by date) as prev_hits
from your_table
) t
where t.hits is null and your_table.domain=t.domain and your_table.date=t.date
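It should be possible to combine this with the initial population query and do it all in one pass.
An alternative solution that avoids all the "modern" features ;-]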
-- \i tmp.sql
-- NOTE: date and domain are keywords in SQL
CREATE TABLE ztable
( zdomain TEXT NOT NULL
, zdate DATE NOT NULL
, visitors INTEGER NOT NULL DEFAULT 0
, hits INTEGER NOT NULL DEFAULT 0
, PRIMARY KEY (zdomain,zdate)
);
INSERT INTO ztable (zdomain,zdate,visitors,hits) VALUES
('www.domain1.com', '2016-04-12' ,1231 ,23423 )
,('www.domain1.com', '2016-04-13' ,1374 ,26482 )
,('www.domain1.com', '2016-04-17' ,1262 ,21493 )
,('www.domain3.com', '2016-04-14' ,3245 ,53471 ) -- << cheating!
,('www.domain3.com', '2016-04-15' ,2435 ,34571 )
,('www.domain3.com', '2016-04-16' ,2354 ,35741 )
,('www.domain2.com', '2016-05-09' ,2345 ,35471 ) ;
-- Create "Calendar" table with all possible dates
-- from the existing data in ztable.
-- [if there are sufficient different domains
-- in ztable there will be no gaps]
-- [Normally the table would be filled by generate_series()
-- or even a recursive CTE]
-- An extra advantage is that a table can be indexed.
CREATE TABLE date_domain AS
SELECT DISTINCT zdate AS zdate
FROM ztable;
ALTER TABLE date_domain ADD PRIMARY KEY (zdate);
-- SELECT * FROM date_domain;
-- Finding the closest previous record
-- without using window functions or aggregate queries.
SELECT d.zdate, t.zdate, t.zdomain
,t.visitors, t.hits
, (d.zdate <> t.zdate) AS is_fake -- for fun
FROM date_domain d
LEFT JOIN ztable t
ON t.zdate <= d.zdate
AND NOT EXISTS ( SELECT * FROM ztable nx
WHERE nx.zdomain = t.zdomain
AND nx.zdate <= d.zdate -- no later row for this domain ...
AND nx.zdate > t.zdate -- ... between t and the calendar date
)
ORDER BY t.zdomain, d.zdate
;
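As the comments above note, on a PostgreSQL version that does have recursive CTEs (8.4 and later) the calendar could be generated rather than derived from the data, so it has no gaps regardless of which domains exist. A minimal sketch, with calendar as a hypothetical table name; this will not run on Redshift as described in the question:

```sql
-- One row per day from the earliest to the latest date in ztable.
CREATE TABLE calendar AS
WITH RECURSIVE cal(zdate) AS (
    SELECT MIN(zdate) FROM ztable
    UNION ALL
    SELECT zdate + 1              -- date + integer = next day
    FROM cal
    WHERE zdate < (SELECT MAX(zdate) FROM ztable)
)
SELECT zdate FROM cal;
ALTER TABLE calendar ADD PRIMARY KEY (zdate);
```

The table could then stand in for date_domain in the query above.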
I finally finished my task, and I want to share some useful things.
I used this hook instead of generate_series:
WITH date_range AS (
SELECT trunc(current_date - (row_number() OVER ())) AS date
FROM any_table -- any of your tables that has enough rows
LIMIT 365
) SELECT * FROM date_range;
To get the list of URLs that I had to fill with data, I used this:
WITH url_list AS (
SELECT
url AS gapsed_url,
MIN(timestamp_gmt) AS min_date,
MAX(timestamp_gmt) AS max_date
FROM daily_table
WHERE url IN (
SELECT url FROM daily_table GROUP BY url
HAVING count(url) < (MAX(timestamp_gmt) - MIN(timestamp_gmt) + 1)
)
GROUP BY url
) SELECT * FROM url_list;
Then I combined the given data; let's call the result url_mapping:
SELECT t1.*, t2.gapsed_url FROM date_range AS t1 CROSS JOIN url_list AS t2
WHERE t1.date <= t2.max_date AND t1.date >= t2.min_date;
To get the data from the closest earlier date, I did the following:
SELECT sd.*
FROM url_mapping AS um JOIN daily_table AS sd
ON um.gapsed_url = sd.url AND (
sd.timestamp_gmt = (SELECT max(timestamp_gmt) FROM daily_table WHERE url = sd.url AND timestamp_gmt <= um.date)
)
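For readers who want the pieces in one statement, they might be chained as CTEs roughly as follows. This is an untested sketch, not the exact query used: it folds the IN subquery into a HAVING clause, selects the gap date from url_mapping instead of the source row's own date, and adds a filter so that only missing days are returned. The output column list is an assumption.

```sql
WITH date_range AS (
    SELECT trunc(current_date - (row_number() OVER ())) AS date
    FROM any_table               -- any table with at least 365 rows
    LIMIT 365
), url_list AS (
    -- urls whose row count is smaller than their date span, i.e. with gaps
    SELECT url AS gapsed_url,
           MIN(timestamp_gmt) AS min_date,
           MAX(timestamp_gmt) AS max_date
    FROM daily_table
    GROUP BY url
    HAVING count(url) < (MAX(timestamp_gmt) - MIN(timestamp_gmt) + 1)
), url_mapping AS (
    SELECT t1.date, t2.gapsed_url
    FROM date_range AS t1
    CROSS JOIN url_list AS t2
    WHERE t1.date <= t2.max_date AND t1.date >= t2.min_date
)
SELECT um.gapsed_url AS url,
       um.date       AS timestamp_gmt,   -- the day being filled
       sd.visitors,
       sd.hits
FROM url_mapping AS um
JOIN daily_table AS sd
  ON um.gapsed_url = sd.url
 AND sd.timestamp_gmt = (SELECT max(timestamp_gmt) FROM daily_table
                         WHERE url = sd.url AND timestamp_gmt <= um.date)
WHERE sd.timestamp_gmt < um.date         -- keep only days that had no row
```

Note that date_range only reaches back 365 days from today, so the limit has to cover the oldest gap you want to fill.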
Hope it helps someone.