How to include missing data for multiple groupings within the time span?
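I have the query below, which groups study counts by teacher, study year-month, and room for the past 12 months (including the current month). The result I get is correct; however, I would like to include rows with zero counts where data is missing.
I looked at several other related posts but could not get the desired output: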
- Postgres - how to return rows with 0 count for missing data?
- Postgresql group month wise with missing values
- Best way to count records by arbitrary time intervals in Rails+Postgres
Here is the query:
SELECT
upper(trim(t.full_name)) AS teacher
, date_trunc('month', s.study_dt)::date AS study_month
, r.room_code AS room
, COUNT(1) AS study_count
FROM
studies AS s
LEFT OUTER JOIN rooms AS r ON r.id = s.room_id
LEFT OUTER JOIN teacher_contacts AS tc ON tc.id = s.teacher_contact_id
LEFT OUTER JOIN teachers AS t ON t.id = tc.teacher_id
WHERE
s.study_dt BETWEEN now() - interval '13 month' AND now()
AND s.study_dt IS NOT NULL
GROUP BY
teacher
, study_month
, room
ORDER BY
teacher
, study_month
, room;
The output I get:
"teacher","study_month","room","study_count"
"DOE, JOHN","2015-07-01","A1",1
"DOE, JOHN","2015-12-01","A2",1
"DOE, JOHN","2016-01-01","B1",1
"SIMPSON, HOMER","2016-05-01","B2",3
"MOUSE, MICKEY","2015-08-01","A2",1
"MOUSE, MICKEY","2015-11-01","B1",1
"MOUSE, MICKEY","2015-11-01","B2",2
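But I would like every missing combination of year-month and room to show up with a count of 0. For example (first rows only; there are 4 rooms in total: A1, A2, B1, B2):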
"teacher","study_month","room","study_count"
"DOE, JOHN","2015-07-01","A1",1
"DOE, JOHN","2015-07-01","A2",0
"DOE, JOHN","2015-07-01","B1",0
"DOE, JOHN","2015-07-01","B2",0
...
"DOE, JOHN","2015-12-01","A1",1
"DOE, JOHN","2015-12-01","A2",0
"DOE, JOHN","2015-12-01","B1",0
"DOE, JOHN","2015-12-01","B2",0
...
To get the missing year-months, I tried a LEFT OUTER JOIN to a time series, joining on time_range.year_month = study_month, but it did not work.
SELECT date_trunc('month', time_range)::date AS year_month
FROM generate_series(now() - interval '13 month', now() ,'1 month') AS time_range
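So, I would like to know how to 'fill in the gaps' for
a) both year-month and room, and as a bonus:
b) just the year-month.
The reason for this is that the dataset will be fed to a pivot library so that we can get output similar to the following (I could not do this in PG directly):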
teacher,room,2015-07,...,2015-12,...,2016-07,total
"DOE, JOHN",A1,1,...,1,...,0,2
"DOE, JOHN",A2,0,...,0,...,0,0
...and so on...
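You need to generate all the rows using a cross join, then left join to studies and aggregate to get the counts.
The resulting query should look something like this: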
select t.full_name as teacher, d.mon::date as study_month, r.room_code as room,
       count(s.teacher_contact_id) as study_count  -- counts only matched studies
from teachers t cross join
     rooms r cross join
     generate_series(date_trunc('month', now() - interval '13 month'),
                     date_trunc('month', now()),
                     interval '1 month'
                    ) d(mon) left join
     teacher_contacts tc
     on tc.teacher_id = t.id left join
     studies s
     on tc.id = s.teacher_contact_id and
        s.room_id = r.id and  -- tie each study to its room in the grid
        date_trunc('month', s.study_dt) = d.mon
group by t.full_name, d.mon, r.room_code;
Based on some assumptions (to resolve ambiguities in the question), I suggest:
SELECT upper(trim(t.full_name)) AS teacher
, m.study_month
, r.room_code AS room
, count(s.room_id) AS study_count
FROM teachers t
CROSS JOIN generate_series(date_trunc('month', now() - interval '12 month') -- 12!
, date_trunc('month', now())
, interval '1 month') m(study_month)
CROSS JOIN rooms r
LEFT JOIN ( -- parentheses!
studies s
JOIN teacher_contacts tc ON tc.id = s.teacher_contact_id -- INNER JOIN!
) ON tc.teacher_id = t.id
AND s.study_dt >= m.study_month
AND s.study_dt < m.study_month + interval '1 month' -- sargable!
AND s.room_id = r.id
GROUP BY t.id, m.study_month, r.id -- id is PK of respective tables
ORDER BY t.id, m.study_month, r.id;
Major points
Build a grid of all desired combinations with CROSS JOIN, then LEFT JOIN to the existing rows. Related:
- array_agg group by and null
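As a minimal sketch of the grid idea, the query below runs on its own against inlined sample data. The CTE contents and the fixed date range are made up for illustration; they stand in for the real rooms and studies tables and for the now()-based range.

```sql
-- Hypothetical sample data, inlined so the query runs without any tables
WITH rooms(id, room_code) AS (
   VALUES (1, 'A1'), (2, 'A2')
)
, studies(room_id, study_dt) AS (
   VALUES (1, timestamp '2016-05-10 09:00')
        , (1, timestamp '2016-05-20 14:00')
        , (2, timestamp '2016-07-01 08:00')
)
, months(mon) AS (
   SELECT generate_series(timestamp '2016-05-01'
                        , timestamp '2016-07-01'
                        , interval '1 month')
)
SELECT m.mon::date        AS study_month
     , r.room_code        AS room
     , count(s.room_id)   AS study_count   -- counts matched rows only
FROM   months m
CROSS  JOIN rooms r                         -- grid: every month x every room
LEFT   JOIN studies s ON s.room_id  = r.id
                     AND s.study_dt >= m.mon
                     AND s.study_dt <  m.mon + interval '1 month'
GROUP  BY m.mon, r.id, r.room_code
ORDER  BY m.mon, r.room_code;
-- 2016-05-01 | A1 | 2
-- 2016-05-01 | A2 | 0
-- 2016-06-01 | A1 | 0
-- 2016-06-01 | A2 | 0
-- 2016-07-01 | A1 | 0
-- 2016-07-01 | A2 | 1
```

For part b) of the question (gaps by year-month only), dropping the CROSS JOIN to rooms, the room join condition, and the room columns should leave one row per month.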
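In your case, the data comes from a join of several tables, so I use parentheses in the FROM list to LEFT JOIN to the result of the INNER JOIN inside the parentheses.
It would be incorrect to LEFT JOIN to each table separately, because you would include hits on partial matches and potentially get incorrect counts.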
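To make the partial-match problem concrete, here is a small sketch with a made-up miniature schema (temp tables, two teachers, one study). The ON true condition is only a stand-in for the month and room conditions of the real query; the point is that nothing in that join ties the study to the teacher.

```sql
-- Hypothetical miniature schema. Temp tables shadow same-named real tables
-- for this session only and disappear when the session ends.
CREATE TEMP TABLE teachers         (id int PRIMARY KEY, full_name text);
CREATE TEMP TABLE teacher_contacts (id int PRIMARY KEY, teacher_id int);
CREATE TEMP TABLE studies          (id int PRIMARY KEY, teacher_contact_id int);

INSERT INTO teachers         VALUES (1, 'Doe, John'), (2, 'Mouse, Mickey');
INSERT INTO teacher_contacts VALUES (10, 1), (20, 2);
INSERT INTO studies          VALUES (100, 10);   -- the only study is John's

-- Separate LEFT JOINs: the study row survives even where no contact matches
SELECT t.full_name, count(s.id) AS study_count
FROM   teachers t
LEFT   JOIN studies s ON true   -- stand-in for the month / room conditions
LEFT   JOIN teacher_contacts tc ON tc.id = s.teacher_contact_id
                               AND tc.teacher_id = t.id
GROUP  BY t.id
ORDER  BY t.id;
-- Doe, John     | 1
-- Mouse, Mickey | 1   <- partial match, wrong

-- Parenthesized INNER JOIN: only complete study + contact pairs can match
SELECT t.full_name, count(s.id) AS study_count
FROM   teachers t
LEFT   JOIN (
          studies s
          JOIN teacher_contacts tc ON tc.id = s.teacher_contact_id
       ) ON tc.teacher_id = t.id
GROUP  BY t.id
ORDER  BY t.id;
-- Doe, John     | 1
-- Mouse, Mickey | 0
```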
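Assuming referential integrity and working with the PK columns directly, we don't need to join rooms and teachers a second time; they are already part of the grid. But we still have a join of two tables (studies and teacher_contacts). The role of teacher_contacts is unclear to me. Normally, I would expect a direct relationship between studies and teachers. This might be simplified further ...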
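We need to count a non-null column from the outer-joined side (studies) to get the desired counts, like count(s.room_id). Unlike count(*), this skips the NULL-extended rows that the LEFT JOIN produces for empty grid cells.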
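The difference is easy to see side by side. The sketch below uses two invented rooms, of which only A1 has studies:

```sql
-- Hypothetical sample data: room A2 has no studies at all
WITH rooms(id, room_code) AS (VALUES (1, 'A1'), (2, 'A2'))
, studies(room_id)        AS (VALUES (1), (1))
SELECT r.room_code
     , count(*)         AS count_star   -- also counts the NULL-extended row
     , count(s.room_id) AS count_col    -- ignores rows where s.room_id IS NULL
FROM   rooms r
LEFT   JOIN studies s ON s.room_id = r.id
GROUP  BY r.room_code
ORDER  BY r.room_code;
-- A1 | 2 | 2
-- A2 | 1 | 0
```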
To keep this fast on big tables, make sure your predicates are sargable, and add matching indexes.
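The column teacher is hardly (reliably) unique. Operate with a unique ID, preferably the PK (which is faster and simpler, too). I still use teacher in the output to match your desired result. It might be wise to include a unique ID as well, since names can be duplicated.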
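You want:
the past 12 months (including current month).
So start with date_trunc('month', now() - interval '12 month') (not 13). That already rounds the start down to the beginning of the month and does what you want, more accurately than your original query.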
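A quick way to check the boundaries is to count what the series produces. The sketch below swaps now() for a fixed, made-up timestamp (2016-07-15) so the result is reproducible:

```sql
-- timestamp '2016-07-15' is a hypothetical stand-in for now()
SELECT count(*)        AS months
     , min(mon)::date  AS first_month
     , max(mon)::date  AS last_month
FROM   generate_series(date_trunc('month', timestamp '2016-07-15' - interval '12 month')
                     , date_trunc('month', timestamp '2016-07-15')
                     , interval '1 month') AS g(mon);
-- 13 | 2015-07-01 | 2016-07-01
```

Under this reading the series holds 13 month starts: the twelve preceding months plus the current one, which matches the 2015-07 to 2016-07 columns in the desired pivot output. An interval of '13 month' would add one more month at the start.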
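Since you mentioned slow performance: depending on actual table definitions and data distribution, it may be faster to aggregate first and join later, like in this related answer: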
- Postgres - how to return rows with 0 count for missing data?
SELECT upper(trim(t.full_name)) AS teacher
, m.mon AS study_month
, r.room_code AS room
, COALESCE(s.ct, 0) AS study_count
FROM teachers t
CROSS JOIN generate_series(date_trunc('month', now() - interval '12 month') -- 12!
, date_trunc('month', now())
, interval '1 month') m(mon)
CROSS JOIN rooms r
LEFT JOIN ( -- parentheses!
SELECT tc.teacher_id, date_trunc('month', s.study_dt) AS mon, s.room_id, count(*) AS ct
FROM studies s
JOIN teacher_contacts tc ON s.teacher_contact_id = tc.id
WHERE s.study_dt >= date_trunc('month', now() - interval '12 month') -- sargable
GROUP BY 1, 2, 3
) s ON s.teacher_id = t.id
AND s.mon = m.mon
AND s.room_id = r.id
ORDER BY 1, 2, 3;
About your closing remark:
the dataset would be fed to a pivot library ... (could not do this in PG directly)
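Chances are you can use the two-parameter form of crosstab() to produce your desired result directly and with excellent performance, in which case the query above is not needed to begin with. Consider: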
- PostgreSQL Crosstab Query
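For orientation, a rough sketch of the two-parameter form follows. It assumes the tablefunc extension can be installed in your database; the inlined VALUES rows and the three month columns are invented and would be replaced by an aggregate query and the real list of months.

```sql
-- crosstab() is provided by the additional module tablefunc
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT *
FROM   crosstab(
   -- 1st parameter: source rows as (row_name, category, value), sorted by row_name
   $$VALUES ('A1', '2016-05', 2)
          , ('A1', '2016-07', 1)
          , ('A2', '2016-06', 3)
     ORDER  BY 1$$
   -- 2nd parameter: the full list of categories, one output column each
 , $$VALUES ('2016-05'), ('2016-06'), ('2016-07')$$
   ) AS ct (room text, "2016-05" int, "2016-06" int, "2016-07" int);
-- A1 | 2    | NULL | 1
-- A2 | NULL | 3    | NULL
```

Missing cells come back as NULL rather than 0, so wrapping the output columns in COALESCE(col, 0) would be needed to get zeros. The output column list has to be spelled out in the query, so a rolling window of months typically means generating this statement dynamically.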