How to calculate occurrence depending on months/years

Question

My table looks like this:

ID | Start      | End
1  | 2010-01-02 | 2010-01-04
1  | 2010-01-22 | 2010-01-24
1  | 2011-01-31 | 2011-02-02
2  | 2012-05-02 | 2012-05-08
3  | 2013-01-02 | 2013-01-03
4  | 2010-09-15 | 2010-09-20
4  | 2010-09-30 | 2010-10-05

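I'm looking for a way to count the occurrences of each ID in each month of a year. The important part is that if a record's end date falls in the month after its start date (within the same year, of course), the occurrence should be counted for both months. For example, the third row for ID 1 is such a case, so the occurrences for this ID should be +1 for January and +1 for February.

So I would like to have it like this: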

Year  | Month | Id | Occurrence
2010  | 01    | 1  | 2
2010  | 09    | 4  | 2
2010  | 10    | 4  | 1
2011  | 01    | 1  | 1
2011  | 02    | 1  | 1
2012  | 05    | 2  | 1
2013  | 01    | 3  | 1

So far I have only created this...

    CREATE TABLE IF NOT EXISTS counts AS
    (SELECT 
    id, 
    YEAR (CAST(Start AS DATE)) AS Year_St,
    MONTH (CAST(Start AS DATE)) AS Month_St,
    YEAR (CAST(End AS DATE)) AS Year_End,
    MONTH (CAST(End AS DATE)) AS Month_End
    FROM source)

And I don't know how to proceed from here. I would appreciate your help. I'm using Spark SQL.

Answer 1

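Try the following strategy to achieve this:

Notes:

I have created a few intermediate tables. If you prefer, you can use subqueries or CTEs instead, depending on your permissions.
I have handled the 2 scenarios you mentioned (counting once or twice).

Query:

First, create a table with a flag that tells whether the start and end dates fall in the same year and month (1 means yes, 2 means no):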

/* Creating a table with flags whether to count the occurrences once or twice */
CREATE TABLE flagged as 
(
  SELECT *, 
  CASE
      WHEN Year_st = Year_end and Month_st = Month_end then 1
      WHEN Year_st = Year_end and Month_st <> Month_end then 2
      Else 0
  end as flag
  FROM
   (
    SELECT 
     id, 
     YEAR (CAST(Start AS DATE)) AS Year_St,
     MONTH (CAST(Start AS DATE)) AS Month_St,
     YEAR (CAST(End AS DATE)) AS Year_End,
     MONTH (CAST(End AS DATE)) AS Month_End
     FROM source
   ) as calc
)

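In the table above, the flag is 1 if the start and end share the same year and month, and 2 if they are in the same year but in different months. If you have more scenarios, you can add more flag categories.

Second, count the occurrences for flag 1. Since the year and month are the same for flag 1, we can pick either date. I took start: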
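Rows whose start and end fall in different years get flag 0 from the CASE expression and are not picked up by the flag 1 or flag 2 steps. A quick sanity check, sketched here against the `flagged` table created above, shows how many rows land in each category:

```sql
-- Count rows per flag value. Any rows with flag = 0 span a year boundary
-- and are ignored by the later flag = 1 / flag = 2 queries.
SELECT flag, COUNT(*) AS row_count
FROM flagged
GROUP BY flag
ORDER BY flag;
```

For the seven sample rows in the question this should report five rows with flag 1 and two rows with flag 2, and no flag 0 rows.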

/* Counting occurrences only for flag 1 */

CREATE TABLE flg1 as (
SELECT distinct id, year_st, month_st, count(*) as occurrence
FROM flagged
where flag=1
GROUP BY id, year_st, month_st
)

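Similarly, count the occurrences for flag 2. Since the two dates fall in different months, we can UNION them before counting to get both dates into the same column: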

/* Counting occurrences only for flag 2 */

CREATE TABLE flg2 as 
(
 SELECT distinct id, year_dt, month_dt, count(*) as occurrence
 FROM 
  (
  select ID, year_st as year_dt, month_st as month_dt FROM flagged where flag=2
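  UNION ALL  -- keep duplicate rows: plain UNION would collapse records of the same ID that share a month into one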
  SELECT ID, year_end as year_dt, month_end as month_dt FROM flagged where flag=2
  ) as unioned
 GROUP BY id, year_dt, month_dt
)
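The choice between UNION and UNION ALL inside this step matters when one ID has two records that cross the same pair of months. A small self-contained sketch with made-up rows (not taken from the question's data) shows the difference:

```sql
-- Two hypothetical flag-2 records for the same ID, both starting in 2010-09.
-- UNION removes duplicate rows, so only one of them survives.
SELECT COUNT(*) AS rows_kept
FROM (
  SELECT 4 AS id, 2010 AS year_dt, 9 AS month_dt
  UNION
  SELECT 4 AS id, 2010 AS year_dt, 9 AS month_dt
) AS u;   -- rows_kept = 1

-- UNION ALL keeps both rows, which is what a count of occurrences needs.
SELECT COUNT(*) AS rows_kept
FROM (
  SELECT 4 AS id, 2010 AS year_dt, 9 AS month_dt
  UNION ALL
  SELECT 4 AS id, 2010 AS year_dt, 9 AS month_dt
) AS u;   -- rows_kept = 2
```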

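Finally, we just need to sum the occurrences from the two flags. Note that we use UNION ALL here to combine the two tables. This is very important because we need to count the duplicates as well: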

/* UNIONING both the final tables and summing the occurrences */

SELECT distinct year, month, id, SUM(occurrence) as occurrence
FROM
 (
  SELECT distinct id, year_st as year, month_st as month, occurrence
  FROM flg1
  
  UNION ALL
  
  SELECT distinct id, year_dt as year, month_dt as month, occurrence
  FROM flg2
 ) as fin_unioned

GROUP BY id, year, month
ORDER BY year, month, id, occurrence desc

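The output of the query above will be your expected output. I know this is not optimized, but it works fine. I will update if I come across an optimization strategy. Please comment if you have questions.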

db<>fiddle link here
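The notes at the top of this answer mention that the intermediate tables could be replaced by subqueries or CTEs. A minimal sketch of that idea as a single statement follows, assuming the same `source` table. The CTE names `calc` and `month_rows` are made up for this example, and the query has not been run on Spark. Like the flag logic, it skips records whose start and end are in different years.

```sql
WITH calc AS (
  -- Split both dates into year and month parts
  SELECT id,
         YEAR(CAST(Start AS DATE))  AS year_st,
         MONTH(CAST(Start AS DATE)) AS month_st,
         YEAR(CAST(End AS DATE))    AS year_end,
         MONTH(CAST(End AS DATE))   AS month_end
  FROM source
),
month_rows AS (
  -- Flag 1 and flag 2 records: one row for the start month
  SELECT id, year_st AS year_dt, month_st AS month_dt
  FROM calc
  WHERE year_st = year_end

  UNION ALL

  -- Flag 2 records only: one more row for the end month
  SELECT id, year_end AS year_dt, month_end AS month_dt
  FROM calc
  WHERE year_st = year_end AND month_st <> month_end
)
SELECT year_dt, month_dt, id, COUNT(*) AS occurrence
FROM month_rows
GROUP BY year_dt, month_dt, id
ORDER BY year_dt, month_dt, id;
```

Counting rows once at the end avoids the separate per-flag counts and the final SUM, at the cost of hiding the flag column that the step-by-step version makes explicit.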

Answer 2

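Not sure whether this works in Spark SQL.

But if a range never spans more than 1 month, you can simply add the extra rows to the count via UNION ALL.
The extra rows are those where the end month is greater than the start month.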

SELECT YearOcc, MonthOcc, Id
, COUNT(*) as Occurrence
FROM 
(
  SELECT Id
  , YEAR(CAST(Start AS DATE)) as YearOcc
  , MONTH(CAST(Start AS DATE)) as MonthOcc
  FROM source
  
  UNION ALL

  SELECT Id
  , YEAR(CAST(End AS DATE)) as YearOcc
  , MONTH(CAST(End AS DATE)) as MonthOcc
  FROM source
  WHERE MONTH(CAST(Start AS DATE)) < MONTH(CAST(End AS DATE))
) q
GROUP BY YearOcc, MonthOcc, Id
ORDER BY YearOcc, MonthOcc, Id

YearOcc | MonthOcc | Id | Occurrence
------: | -------: | -: | ---------:
   2010 |        1 |  1 |          2
   2010 |        9 |  4 |          2
   2010 |       10 |  4 |          1
   2011 |        1 |  1 |          1
   2011 |        2 |  1 |          1
   2012 |        5 |  2 |          1
   2013 |        1 |  3 |          1

db<>fiddle here
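The query above relies on each range touching at most two months of the same year: a range from December to January fails the `MONTH(Start) < MONTH(End)` test, and a range covering three months would miss the middle one. If that assumption does not hold, one possible generalization in Spark SQL is to expand each record into one row per calendar month it touches. This is an untested sketch that assumes Spark 2.4 or later (for `sequence`) and that End is never earlier than Start; the alias `month_start` is made up here.

```sql
SELECT YEAR(month_start)  AS YearOcc,
       MONTH(month_start) AS MonthOcc,
       Id,
       COUNT(*)           AS Occurrence
FROM (
  -- trunc(..., 'MM') gives the first day of the month; sequence() then lists
  -- every month start from the start month through the end month, and
  -- explode() turns that array into one row per month.
  SELECT Id,
         explode(sequence(trunc(CAST(Start AS DATE), 'MM'),
                          trunc(CAST(End AS DATE), 'MM'),
                          interval 1 month)) AS month_start
  FROM source
) q
GROUP BY YEAR(month_start), MONTH(month_start), Id
ORDER BY YearOcc, MonthOcc, Id;
```

For the sample rows, where every range covers at most two months within one year, this should produce the same counts as the table above.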

sql

apache-spark-sql