How to calculate occurrences depending on months/years
My table looks like this:
ID | Start | End
1 | 2010-01-02 | 2010-01-04
1 | 2010-01-22 | 2010-01-24
1 | 2011-01-31 | 2011-02-02
2 | 2012-05-02 | 2012-05-08
3 | 2013-01-02 | 2013-01-03
4 | 2010-09-15 | 2010-09-20
4 | 2010-09-30 | 2010-10-05
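I'm looking for a way to count the occurrences of each ID in each month of a year.
The important point is that if a record's end date falls in the month after its start date (in the same year, of course), the occurrence should be counted for both months. The third row for ID 1 is such a case (2011-01-31 to 2011-02-02), so the occurrence for this ID should be +1 for January and +1 for February.
So I would like to get it in this form: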
Year | Month | Id | Occurrence
2010 | 01 | 1 | 2
2010 | 09 | 4 | 2
2010 | 10 | 4 | 1
2011 | 01 | 1 | 1
2011 | 02 | 1 | 1
2012 | 05 | 2 | 1
2013 | 01 | 3 | 1
So far I have only created this:
CREATE TABLE IF NOT EXISTS counts AS
(SELECT
id,
YEAR (CAST(Start AS DATE)) AS Year_St,
MONTH (CAST(Start AS DATE)) AS Month_St,
YEAR (CAST(End AS DATE)) AS Year_End,
MONTH (CAST(End AS DATE)) AS Month_End
FROM source)
And I don't know how to take it further. I would appreciate your help.
I'm using Spark SQL.
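Try the following strategy to achieve this:
Notes:
- I created a few intermediate tables. If you prefer, you can use subqueries or CTEs instead, depending on your permissions.
- I have handled the 2 scenarios you described (counting an occurrence once or twice).
Query:
First, create a table with flags that tell whether the start and end dates fall in the same year and month (1 means yes, 2 means no):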
/* Creating a table with flags whether to count the occurrences once or twice */
CREATE TABLE flagged as
(
SELECT *,
CASE
WHEN Year_st = Year_end and Month_st = Month_end then 1
WHEN Year_st = Year_end and Month_st <> Month_end then 2
Else 0 /* start and end fall in different years; the later steps do not count these rows */
end as flag
FROM
(
SELECT
id,
YEAR (CAST(Start AS DATE)) AS Year_St,
MONTH (CAST(Start AS DATE)) AS Month_St,
YEAR (CAST(End AS DATE)) AS Year_End,
MONTH (CAST(End AS DATE)) AS Month_End
FROM source
) as calc
)
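In the table above, the flag is 1 when the start and end dates share the same year and month, and 2 when they are in the same year but in different months. If you have more scenarios, you can add more flag categories.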
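Before counting, it may be worth checking how many records land in each flag, since records whose start and end fall in different years get flag 0 and are not picked up by the later steps. A minimal sketch, assuming the flagged table above has been created (the column alias records is only illustrative):

```sql
-- Sanity check: number of source records per flag value.
-- flag 0 = start and end in different years (ignored by the later steps).
SELECT flag, COUNT(*) AS records
FROM flagged
GROUP BY flag
ORDER BY flag;
```

On the sample data from the question this should report 5 records with flag 1 and 2 records with flag 2, and no flag 0 rows.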
Second, count the occurrences for flag 1. Since the year and month are the same for flag 1, we can pick either date. I used start:
/* Counting occurrences only for flag 1 */
CREATE TABLE flg1 as (
SELECT id, year_st, month_st, count(*) as occurrence
FROM flagged
where flag=1
GROUP BY id, year_st, month_st
)
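Similarly, count the occurrences for flag 2. Because the two dates fall in different months, we can stack them into the same columns with UNION ALL before counting. UNION ALL is used rather than UNION so that two records of the same ID spanning the same pair of months are each counted:
/* Counting occurrences only for flag 2 */
CREATE TABLE flg2 as
(
SELECT id, year_dt, month_dt, count(*) as occurrence
FROM
(
SELECT ID, year_st as year_dt, month_st as month_dt FROM flagged where flag=2
UNION ALL /* a plain UNION would drop duplicate (id, year, month) rows before they are counted */
SELECT ID, year_end as year_dt, month_end as month_dt FROM flagged where flag=2
) as unioned
GROUP BY id, year_dt, month_dt
)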
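Finally, we just need to sum the occurrences from the two flags. Note that we use UNION ALL here to combine the two tables. This is very important because duplicates must be kept: in the sample data, ID 4 has a row for 2010-09 with occurrence 1 in both flg1 and flg2, and a plain UNION would merge them into one: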
/* UNIONING both the final tables and summing the occurrences */
SELECT distinct year, month, id, SUM(occurrence) as occurrence
FROM
(
SELECT distinct id, year_st as year, month_st as month, occurrence
FROM flg1
UNION ALL
SELECT distinct id, year_dt as year, month_dt as month, occurrence
FROM flg2
) as fin_unioned
GROUP BY id, year, month
ORDER BY year, month, id, occurrence desc
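The output of the above query matches your expected output. I know it is not optimized, but it works fine. If I come across a more optimized strategy, I will update the answer. Please comment if you have questions.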
db<>fiddle link here
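As the notes at the top of this answer mention, the intermediate tables can be replaced by CTEs. The following is a sketch of that idea rather than a tested Spark job: it applies the same rules (same-year records count once for the start month, and once more for the end month when the months differ) in a single statement. The CTE names calc and per_month are made up for illustration, and the backticks around Start and End are an assumption to avoid keyword clashes.

```sql
-- Single-statement sketch of the flag strategy, using CTEs instead of tables.
WITH calc AS (
    SELECT
        id,
        YEAR(CAST(`Start` AS DATE))  AS year_st,
        MONTH(CAST(`Start` AS DATE)) AS month_st,
        YEAR(CAST(`End` AS DATE))    AS year_end,
        MONTH(CAST(`End` AS DATE))   AS month_end
    FROM source
),
per_month AS (
    -- Flag 1 and flag 2 records: one row for the start month.
    SELECT id, year_st AS year, month_st AS month
    FROM calc
    WHERE year_st = year_end
    UNION ALL
    -- Flag 2 records only: one more row for the end month.
    SELECT id, year_end AS year, month_end AS month
    FROM calc
    WHERE year_st = year_end AND month_st <> month_end
)
SELECT year, month, id, COUNT(*) AS occurrence
FROM per_month
GROUP BY year, month, id
ORDER BY year, month, id;
```

Like the table-based version, this leaves out records whose start and end dates are in different years. On the sample data it should return the same seven rows as the expected output in the question.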
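Not sure whether this works in Spark SQL.
But if the ranges never span more than 1 month, you can simply add the extra rows to the count via UNION ALL.
The extras are the rows whose end month is greater than the start month.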
SELECT YearOcc, MonthOcc, Id
, COUNT(*) as Occurrence
FROM
(
SELECT Id
, YEAR(CAST(Start AS DATE)) as YearOcc
, MONTH(CAST(Start AS DATE)) as MonthOcc
FROM source
UNION ALL
SELECT Id
, YEAR(CAST(End AS DATE)) as YearOcc
, MONTH(CAST(End AS DATE)) as MonthOcc
FROM source
WHERE MONTH(CAST(Start AS DATE)) < MONTH(CAST(End AS DATE))
) q
GROUP BY YearOcc, MonthOcc, Id
ORDER BY YearOcc, MonthOcc, Id
YearOcc | MonthOcc | Id | Occurrence
------: | -------: | -: | ---------:
2010 | 1 | 1 | 2
2010 | 9 | 4 | 2
2010 | 10 | 4 | 1
2011 | 1 | 1 | 1
2011 | 2 | 1 | 1
2012 | 5 | 2 | 1
2013 | 1 | 3 | 1
db<>fiddle here
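If ranges can cover more than two months or cross a year boundary, the month comparison above no longer fits. One possible generalization, sketched here on the assumption that Spark 2.4 or later is available (for the built-in sequence function) and that End is never earlier than Start, is to expand each record into one row per calendar month it touches and then count. The alias month_start is made up for illustration. Note that this changes the rule slightly: every month inside the range is counted, not only the first and last.

```sql
-- Expand each record into the first day of every month it touches, then count.
SELECT
    YEAR(month_start)  AS YearOcc,
    MONTH(month_start) AS MonthOcc,
    Id,
    COUNT(*)           AS Occurrence
FROM
(
    SELECT
        Id,
        explode(
            sequence(
                trunc(CAST(`Start` AS DATE), 'MM'),  -- first day of the start month
                trunc(CAST(`End` AS DATE), 'MM'),    -- first day of the end month
                INTERVAL 1 MONTH
            )
        ) AS month_start
    FROM source
) q
GROUP BY YEAR(month_start), MONTH(month_start), Id
ORDER BY YearOcc, MonthOcc, Id;
```

For the sample data in the question, where no range goes past the following month, this should give the same counts as the UNION ALL query above.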