How to calculate occurrence depending on months/years

My table looks like this:

ID | Start      | End
1  | 2010-01-02 | 2010-01-04
1  | 2010-01-22 | 2010-01-24
1  | 2011-01-31 | 2011-02-02
2  | 2012-05-02 | 2012-05-08
3  | 2013-01-02 | 2013-01-03
4  | 2010-09-15 | 2010-09-20
4  | 2010-09-30 | 2010-10-05

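I'm looking for a way to count the occurrences of each ID for every month of a year. Importantly, if a record's end date falls in the month after its start date (within the same year, of course), the occurrence should be counted for both months. For example, the third row for ID 1 is such a case, so for that row the ID should get +1 for January and +1 for February.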

So I would like to have it in this form:

Year  | Month | Id | Occurrence
2010  | 01    | 1  | 2
2010  | 09    | 4  | 2
2010  | 10    | 4  | 1
2011  | 01    | 1  | 1
2011  | 02    | 1  | 1
2012  | 05    | 2  | 1
2013  | 01    | 3  | 1

So far I have only created this...

    CREATE TABLE IF NOT EXISTS counts AS
    (SELECT 
    id, 
    YEAR (CAST(Start AS DATE)) AS Year_St,
    MONTH (CAST(Start AS DATE)) AS Month_St,
    YEAR (CAST(End AS DATE)) AS Year_End,
    MONTH (CAST(End AS DATE)) AS Month_End
    FROM source)

And I don't know how to go further. I'd appreciate your help. I'm using Spark SQL.

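Try the following strategy to achieve this:

Notes:

  1. I created a few intermediate tables. If you prefer, you can use subqueries or CTEs instead, depending on your permissions.
  2. I have handled the two scenarios you mentioned (counting an occurrence once or twice).

Query:

First, create a table with a flag that tells whether the start and end dates fall in the same year and month (1 means same year and same month, 2 means same year but different months):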

    /* Flag each row: count its occurrence once (1) or twice (2) */
    CREATE TABLE flagged AS
    (
      SELECT *,
        CASE
          WHEN Year_St = Year_End AND Month_St = Month_End THEN 1   -- same year, same month
          WHEN Year_St = Year_End AND Month_St <> Month_End THEN 2  -- same year, different months
          ELSE 0                                                     -- anything else, e.g. different years
        END AS flag
      FROM
      (
        SELECT
          id,
          YEAR (CAST(Start AS DATE)) AS Year_St,
          MONTH (CAST(Start AS DATE)) AS Month_St,
          YEAR (CAST(End AS DATE)) AS Year_End,
          MONTH (CAST(End AS DATE)) AS Month_End
        FROM source
      ) AS calc
    )

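The flag in the table above is 1 if the start and end share the same year and month, and 2 if they are in the same year but in different months. Any other case (for example, a range that crosses a year boundary) gets 0 and is not counted by the queries below. If you have more scenarios, you can add more flag categories.

Second, count the occurrences for flag 1. Since the year and month are the same for start and end under flag 1, we can pick either one. I took start: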

    /* Counting occurrences only for flag 1 */
    CREATE TABLE flg1 AS
    (
      -- GROUP BY already yields one row per (id, year, month), so DISTINCT is not needed
      SELECT id, year_st, month_st, COUNT(*) AS occurrence
      FROM flagged
      WHERE flag = 1
      GROUP BY id, year_st, month_st
    )

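Similarly, count the occurrences for flag 2. Since the two dates fall in different months, we can stack them with UNION ALL to get both dates into the same column before counting. A plain UNION would remove duplicate rows here, so two ranges of the same ID that start and end in the same pair of months would be counted only once:

    /* Counting occurrences only for flag 2 */
    CREATE TABLE flg2 AS
    (
      SELECT id, year_dt, month_dt, COUNT(*) AS occurrence
      FROM
      (
        -- one row for the start month of each flag-2 range
        SELECT id, year_st AS year_dt, month_st AS month_dt FROM flagged WHERE flag = 2
        UNION ALL
        -- one row for the end month of each flag-2 range
        SELECT id, year_end AS year_dt, month_end AS month_dt FROM flagged WHERE flag = 2
      ) AS unioned
      GROUP BY id, year_dt, month_dt
    )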

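Finally, we only need to sum the occurrences from the two flags. Note that we use UNION ALL here to combine the two tables. This is very important, because identical rows from flg1 and flg2 must both be kept (with the sample data, ID 4 in 2010-09 has an occurrence of 1 in each table):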

    /* Combining both tables and summing the occurrences */
    SELECT year, month, id, SUM(occurrence) AS occurrence
    FROM
    (
      SELECT id, year_st AS year, month_st AS month, occurrence
      FROM flg1

      UNION ALL

      SELECT id, year_dt AS year, month_dt AS month, occurrence
      FROM flg2
    ) AS fin_unioned
    GROUP BY id, year, month
    ORDER BY year, month, id, occurrence DESC

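The output of the query above will be your expected output. I know this is not optimized, but it works well. I'll update the answer if I come across a more optimized strategy. Please comment if you have questions.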
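
Note 1 above mentions that CTEs can replace the intermediate tables. The following is a minimal sketch of that idea, not a tuned implementation: it assumes the same `source` table, and the CTE names `calc` and `months` are made up for illustration. `Start` and `End` are wrapped in backticks only as a precaution, since both words are SQL keywords. Like the step-by-step version, it skips ranges whose start and end fall in different years.

```sql
WITH calc AS (
  -- year and month of both ends of each range
  SELECT id,
         YEAR(CAST(`Start` AS DATE))  AS year_st,
         MONTH(CAST(`Start` AS DATE)) AS month_st,
         YEAR(CAST(`End` AS DATE))    AS year_end,
         MONTH(CAST(`End` AS DATE))   AS month_end
  FROM source
),
months AS (
  -- flag 1 and flag 2 rows: every same-year range contributes its start month
  SELECT id, year_st AS year_dt, month_st AS month_dt
  FROM calc
  WHERE year_st = year_end

  UNION ALL

  -- flag 2 rows only: the end month is counted as well
  SELECT id, year_end AS year_dt, month_end AS month_dt
  FROM calc
  WHERE year_st = year_end AND month_st <> month_end
)
SELECT year_dt AS year, month_dt AS month, id, COUNT(*) AS occurrence
FROM months
GROUP BY year_dt, month_dt, id
ORDER BY year_dt, month_dt, id
```

Because every contributing month becomes one row in `months`, a single COUNT(*) replaces the separate per-flag counts and the final SUM.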

db<>fiddle link here

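Not sure whether this works in Spark SQL.

But if a range never spans more than 1 month boundary, you can simply add the extra rows to the count via UNION ALL.
The extra rows are those whose end month is greater than their start month.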

    SELECT YearOcc, MonthOcc, Id
    , COUNT(*) as Occurrence
    FROM
    (
      -- every row counts once for its start month
      SELECT Id
      , YEAR(CAST(Start AS DATE)) as YearOcc
      , MONTH(CAST(Start AS DATE)) as MonthOcc
      FROM source

      UNION ALL

      -- rows that end in a later month count once more for the end month
      SELECT Id
      , YEAR(CAST(End AS DATE)) as YearOcc
      , MONTH(CAST(End AS DATE)) as MonthOcc
      FROM source
      WHERE MONTH(CAST(Start AS DATE)) < MONTH(CAST(End AS DATE))
    ) q
    GROUP BY YearOcc, MonthOcc, Id
    ORDER BY YearOcc, MonthOcc, Id

YearOcc | MonthOcc | Id | Occurrence
------: | -------: | -: | ---------:
   2010 |        1 |  1 |          2
   2010 |        9 |  4 |          2
   2010 |       10 |  4 |          1
   2011 |        1 |  1 |          1
   2011 |        2 |  1 |          1
   2012 |        5 |  2 |          1
   2013 |        1 |  3 |          1

db<>fiddle here
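
The query above relies on a range touching at most two months of the same year. If ranges can be longer or can cross a year boundary (December to January), one possible generalization is to expand each range into one row per calendar month it touches. This is only a sketch under assumptions: it uses Spark SQL's built-in `sequence`, `explode` and `trunc` functions (available in Spark 2.4 and later), it assumes `Start <= End` on every row, and it assumes that every month a range touches should count, including months in the middle of a long range. The alias `month_start` is made up for illustration.

```sql
SELECT YEAR(month_start)  AS YearOcc,
       MONTH(month_start) AS MonthOcc,
       Id,
       COUNT(*)           AS Occurrence
FROM (
  -- one output row per calendar month touched by the range:
  -- truncate both dates to the first day of their month,
  -- then step from one to the other in 1-month increments
  SELECT Id,
         explode(
           sequence(
             trunc(CAST(`Start` AS DATE), 'MM'),
             trunc(CAST(`End` AS DATE), 'MM'),
             interval 1 month
           )
         ) AS month_start
  FROM source
) m
GROUP BY YEAR(month_start), MONTH(month_start), Id
ORDER BY YearOcc, MonthOcc, Id
```

On the sample data no range spans more than two months, so this should give the same rows as the table above. The trade-off is that a range from January to April would now add one occurrence to each of the four months, which may or may not be what is wanted.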