Group by range of dates from date_start to date_end columns
I have a table with the following structure:
place_id date_start date_end
2826088480774 2017-09-19 2017-09-20
1898375544837 2017-08-01 2017-08-03
1425929142277 2017-09-23 2017-10-03
1013612281863 2016-10-12 2016-10-14
1795296329731 2016-10-13 2016-10-13
695784701956 2017-09-11 2017-11-02
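I want to count how many events each place has in each month (each row is one event). If an event's dates span several months, it should be counted in every affected month.
place_id can repeat, so I wrote the following query: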
SELECT place_id,
       EXTRACT(MONTH FROM date_start) AS month,
       EXTRACT(YEAR FROM date_start) AS year,
       COUNT(*) AS events
FROM Table
GROUP BY place_id, year, month
ORDER BY month, year, events DESC
So I get the following grouped table:
place_id month year events
2826088480774 8 2017 345
1898375544837 8 2017 343
1425929142277 8 2017 344
1013612281863 8 2017 355
1795296329731 8 2017 348
695784701956 8 2017 363
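The problem is that the data is grouped only by date_start, and it is not clear to me how to distribute the counts across all the affected months from date_start to date_end.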
You can use the sequence function to generate the dates between date_start and date_end, then explode the resulting array column, and group and count as you already do:
// Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("EventsTable")

// Expand each event into one row per day, then count the rows per place and month
spark.sql("""
  WITH events AS (
    SELECT place_id,
           explode(event_dates) AS event_date
    FROM (
      SELECT place_id,
             sequence(date_start, date_end, interval 1 day) AS event_dates
      FROM EventsTable
    )
  )
  SELECT place_id,
         month(event_date) AS month,
         year(event_date) AS year,
         count(*) AS events
  FROM events
  GROUP BY 1, 2, 3
  ORDER BY month, year, events DESC
""").show()
//+-------------+-----+----+------+
//|     place_id|month|year|events|
//+-------------+-----+----+------+
//|1898375544837|    8|2017|     3|
//| 695784701956|    9|2017|    20|
//|1425929142277|    9|2017|     8|
//|2826088480774|    9|2017|     2|
//|1013612281863|   10|2016|     3|
//|1795296329731|   10|2016|     1|
//| 695784701956|   10|2017|    31|
//|1425929142277|   10|2017|     3|
//| 695784701956|   11|2017|     2|
//+-------------+-----+----+------+
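One caveat: because the sequence above steps by one day, count(*) counts event days in each month rather than events (the 3 for place 1898375544837 in August 2017 is its three days). If what you want is each event counted once in every month it touches, a variant could step by month instead. The sketch below is my own assumption, not something the query above already does: it truncates both dates to the first day of their month with trunc(..., 'MM') and generates one row per month with interval 1 month. The object name EventsPerMonth, the local SparkSession, the DataFrame built from the question's sample rows, and the CTE name event_months are all illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical standalone driver; in spark-shell only the view and the query are needed.
object EventsPerMonth {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("events-per-month")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample rows from the question; the string columns are cast to DATE.
    val df = Seq(
      (2826088480774L, "2017-09-19", "2017-09-20"),
      (1898375544837L, "2017-08-01", "2017-08-03"),
      (1425929142277L, "2017-09-23", "2017-10-03"),
      (1013612281863L, "2016-10-12", "2016-10-14"),
      (1795296329731L, "2016-10-13", "2016-10-13"),
      (695784701956L,  "2017-09-11", "2017-11-02")
    ).toDF("place_id", "date_start", "date_end")
      .selectExpr(
        "place_id",
        "to_date(date_start) AS date_start",
        "to_date(date_end) AS date_end")

    df.createOrReplaceTempView("EventsTable")

    // One row per (event, affected month): both bounds are truncated to the
    // first of their month so the monthly sequence always reaches the last month.
    spark.sql("""
      WITH event_months AS (
        SELECT place_id,
               explode(sequence(trunc(date_start, 'MM'),
                                trunc(date_end, 'MM'),
                                interval 1 month)) AS month_start
        FROM EventsTable
      )
      SELECT place_id,
             month(month_start) AS month,
             year(month_start) AS year,
             count(*) AS events
      FROM event_months
      GROUP BY 1, 2, 3
      ORDER BY year, month, events DESC
    """).show()

    spark.stop()
  }
}
```

On the six sample rows this should give 1 for every (place_id, month) pair, with 695784701956 appearing in September, October, and November 2017. Stepping by month also keeps the exploded data much smaller than one row per day when events run for a long time.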