Group by range of dates from date_start to date_end columns

Question

I have a table with the following structure:

place_id            date_start       date_end
2826088480774       2017-09-19       2017-09-20
1898375544837       2017-08-01       2017-08-03
1425929142277       2017-09-23       2017-10-03
1013612281863       2016-10-12       2016-10-14
1795296329731       2016-10-13       2016-10-13
695784701956        2017-09-11       2017-11-02

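I want to count how many events (each row is one event) took place at each place in each month. If an event spans several months, it should be counted in every affected month.

place_id can repeat, so I wrote the following query: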

SELECT place_id,
       EXTRACT(MONTH FROM date_start) AS month,
       EXTRACT(YEAR FROM date_start) AS year,
       COUNT(*) AS events
FROM Table
GROUP BY place_id, year, month
ORDER BY month, year, events DESC

This gives me the following grouped table:

place_id         month      year   events
2826088480774       8       2017     345
1898375544837       8       2017     343
1425929142277       8       2017     344
1013612281863       8       2017     355
1795296329731       8       2017     348
695784701956        8       2017     363

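The problem is that the data is grouped only by date_start, and I don't see how to distribute each event across all the months affected between date_start and date_end.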

Answer 1

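You can use the sequence function to generate the dates between date_start and date_end, then explode the resulting array column and group and count as you already do. Because the sequence steps by one day, events in the result is the number of event days that fall in each month for each place: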

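// Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("EventsTable")

// Inner query: sequence(...) builds an array with one date per day of each event.
// CTE: explode(...) turns that array into one row per (place_id, day).
// Outer query: groups those rows by place, month and year and counts them.
spark.sql("""
    WITH events AS (
        SELECT  place_id,
                explode(event_dates) as event_date
        FROM    (
            SELECT  place_id,
                    sequence(date_start, date_end, interval 1 day) as event_dates
            FROM    EventsTable
        )
    )

    SELECT  place_id,
            month(event_date) as month,
            year(event_date)  as year,
            count(*)          as events
    FROM    events
    GROUP BY 1, 2, 3
    ORDER BY month, year, events desc
""").show()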

//+-------------+-----+----+------+
//|     place_id|month|year|events|
//+-------------+-----+----+------+
//|1898375544837|    8|2017|     3|
//| 695784701956|    9|2017|    20|
//|1425929142277|    9|2017|     8|
//|2826088480774|    9|2017|     2|
//|1013612281863|   10|2016|     3|
//|1795296329731|   10|2016|     1|
//| 695784701956|   10|2017|    31|
//|1425929142277|   10|2017|     3|
//| 695784701956|   11|2017|     2|
//+-------------+-----+----+------+
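
If each event should count once per affected month, as the question asks, one possible variant steps the sequence by month instead of by day. The sketch below is only an illustration under a few assumptions: it rebuilds the six sample rows from the question, it assumes date_start and date_end are DATE columns with date_start <= date_end, and the names event_months and month_start are made up for this example. trunc(..., 'MM') moves both endpoints to the first day of their month, so the sequence contains exactly one entry per month the event touches.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the active session in spark-shell or a notebook; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("events-per-month").getOrCreate()
import spark.implicits._

// Sample rows copied from the question (dates still as strings here).
val raw = Seq(
  (2826088480774L, "2017-09-19", "2017-09-20"),
  (1898375544837L, "2017-08-01", "2017-08-03"),
  (1425929142277L, "2017-09-23", "2017-10-03"),
  (1013612281863L, "2016-10-12", "2016-10-14"),
  (1795296329731L, "2016-10-13", "2016-10-13"),
  (695784701956L,  "2017-09-11", "2017-11-02")
).toDF("place_id", "date_start", "date_end")

// Cast the string columns to DATE so sequence() can step over them.
val df = raw.selectExpr(
  "place_id",
  "to_date(date_start) AS date_start",
  "to_date(date_end) AS date_end"
)

df.createOrReplaceTempView("EventsTable")

spark.sql("""
    WITH event_months AS (
        -- One row per (event, affected month): both endpoints are truncated
        -- to the first day of their month and the sequence steps by one month.
        SELECT  place_id,
                explode(sequence(trunc(date_start, 'MM'),
                                 trunc(date_end, 'MM'),
                                 interval 1 month)) AS month_start
        FROM    EventsTable
    )

    SELECT  place_id,
            month(month_start) AS month,
            year(month_start)  AS year,
            count(*)           AS events
    FROM    event_months
    GROUP BY 1, 2, 3
    ORDER BY year, month, events DESC
""").show()
```

With the six sample rows every (place_id, month, year) group has events = 1, because no place_id repeats in the sample: 695784701956 shows up in September, October and November 2017, and 1425929142277 in September and October 2017. On the real table, where place_id repeats, the count is the number of events touching that month.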

sql

apache-spark

apache-spark-sql

databricks