Compare monthly data while retaining daily granularity
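I have the following data, which contains monthly targets for a set of IDs. There is one target per id for each month of 2020. The table is named targets, and its month column holds the month of the year.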
+-------+-------+----+--------+
| month | name | id | target |
+-------+-------+----+--------+
| 1 | Comp1 | 1 | 6000 |
+-------+-------+----+--------+
| 2 | Comp1 | 1 | 6000 |
+-------+-------+----+--------+
| 3 | Comp1 | 1 | 6000 |
+-------+-------+----+--------+
| 1 | Comp2 | 2 | 6000 |
+-------+-------+----+--------+
| 2 | Comp2 | 2 | 6000 |
+-------+-------+----+--------+
| 3 | Comp2 | 2 | 6000 |
+-------+-------+----+--------+
| 1 | Comp3 | 3 | 6000 |
+-------+-------+----+--------+
| 2 | Comp3 | 3 | 6000 |
+-------+-------+----+--------+
| 3 | Comp3 | 3 | 6000 |
+-------+-------+----+--------+
| 1 | Comp4 | 4 | 6000 |
+-------+-------+----+--------+
| 2 | Comp4 | 4 | 6000 |
+-------+-------+----+--------+
| 3 | Comp4 | 4 | 6000 |
+-------+-------+----+--------+
I then have a second table, which contains daily data for the same set of IDs and is updated every day. In my real dataset I have data from 2019-01-01 through today.
+------------+-------+----+--------+--------+
| yyyy_mm_dd | name | id | actual | region |
+------------+-------+----+--------+--------+
| 2019-01-01 | Comp1 | 1 | 1000 | LATAM |
+------------+-------+----+--------+--------+
| 2019-01-01 | Comp1 | 1 | 0 | EU |
+------------+-------+----+--------+--------+
| 2019-01-02 | Comp1 | 1 | 2000 | EU |
+------------+-------+----+--------+--------+
| 2019-01-03 | Comp1 | 1 | 4000 | EU |
+------------+-------+----+--------+--------+
| 2019-01-01 | Comp2 | 2 | 1000 | EU |
+------------+-------+----+--------+--------+
| 2019-01-02 | Comp2 | 2 | 2000 | EU |
+------------+-------+----+--------+--------+
| 2019-01-03 | Comp2 | 2 | 3000 | EU |
+------------+-------+----+--------+--------+
| 2019-01-01 | Comp3 | 3 | 1000 | EU |
+------------+-------+----+--------+--------+
| 2019-01-02 | Comp3 | 3 | 2000 | EU |
+------------+-------+----+--------+--------+
| 2019-01-03 | Comp3 | 3 | 8000 | EU |
+------------+-------+----+--------+--------+
| 2019-01-01 | Comp4 | 4 | 1000 | EU |
+------------+-------+----+--------+--------+
| 2019-01-02 | Comp4 | 4 | 2000 | EU |
+------------+-------+----+--------+--------+
| 2019-02-03 | Comp4 | 4 | 3000 | EU |
+------------+-------+----+--------+--------+
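Based on the two tables above, I want to create a third table with some extra logic. In the end, I want to introduce a new column called payout. This column should always be 0 unless the company has exceeded its monthly target. If the monthly target is met/passed, the payout should be (sum of actual for that month - monthly target for that month) * 1%.
The output data could look like this: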
+------------+-------+----+--------+--------+
| yyyy_mm_dd | name | id | actual | payout |
+------------+-------+----+--------+--------+
| 2020-01-01 | Comp1 | 1 | 1000 | 0 |
+------------+-------+----+--------+--------+
| 2020-01-02 | Comp1 | 1 | 2000 | 0 |
+------------+-------+----+--------+--------+
| 2020-01-03 | Comp1 | 1 | 4000 | 10 |
+------------+-------+----+--------+--------+
| 2020-01-01 | Comp2 | 2 | 1000 | 0 |
+------------+-------+----+--------+--------+
| 2020-01-02 | Comp2 | 2 | 2000 | 0 |
+------------+-------+----+--------+--------+
| 2020-01-03 | Comp2 | 2 | 3000 | 0 |
+------------+-------+----+--------+--------+
| 2020-01-01 | Comp3 | 3 | 1000 | 0 |
+------------+-------+----+--------+--------+
| 2020-01-02 | Comp3 | 3 | 2000 | 0 |
+------------+-------+----+--------+--------+
| 2020-01-03 | Comp3 | 3 | 8000 | 50 |
+------------+-------+----+--------+--------+
| 2020-01-01 | Comp4 | 4 | 1000 | 0 |
+------------+-------+----+--------+--------+
| 2020-01-02 | Comp4 | 4 | 2000 | 0 |
+------------+-------+----+--------+--------+
| 2020-02-03 | Comp4 | 4 | 3000 | 0 |
+------------+-------+----+--------+--------+
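All names/ids in the dataset above have a monthly target of 6000, so there should only be a payout once a name/id exceeds that target within the month. Comp1 and Comp3 both exceeded the monthly target on the third day of January, so they receive a payout from that day until the end of the month. Everything then resets in February, since it is a new month with a new target, and new daily data arrives as the month progresses.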
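To make the rule concrete, here is a minimal sketch of the arithmetic in plain Python, outside of any database. The function name `payouts` and the per-day list are made up for illustration; the numbers are Comp1's January rows from the expected output, already summed per day.

```python
from itertools import accumulate

def payouts(daily_actuals, target):
    """Return one payout per day: 1% of the amount by which the
    month-to-date total exceeds the target, and 0 until it does."""
    return [max(total - target, 0) / 100 for total in accumulate(daily_actuals)]

# Comp1 in January: month-to-date totals are 1000, 3000, 7000.
comp1_january = [1000, 2000, 4000]
print(payouts(comp1_january, 6000))  # [0.0, 0.0, 10.0]
```

The same function gives 50 on the third day for Comp3 (1000, 2000, 8000), which matches the expected output table. A new month would simply start with a fresh list, which is the "reset" described above.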
What I have tried:
SELECT
    agg.yyyy_mm_dd,
    agg.name,
    agg.id,
    CASE WHEN agg.actual >= targets.target THEN ((agg.actual-targets.target)/100) * 1 ELSE 0 END AS payout
FROM(
    SELECT
        sum(x.actual) AS actual,
        x.yyyy_mm_dd,
        x.name,
        x.id
    FROM(
        SELECT
            yyyy_mm_dd,
            name,
            id,
            cast(actual as int) as actual
        FROM
            schema.daily_data
        WHERE
            yyyy_mm_dd >= '2020-01-01' AND (name = 'Comp1' OR name = 'Comp2')
    ) x
    GROUP BY
        2,3,4
) agg
INNER JOIN(
    SELECT
        id,
        month,
        target
    FROM
        schema.targets
) targets ON targets.id = agg.id
GROUP BY
    1,2,3,4
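However, the query above outputs multiple rows per name. This is a result of the daily table containing the same company multiple times per day (which is expected). I thought my grouping would handle this. I also don't think this is the simplest overall solution; I may be overthinking it, and it can probably be done more efficiently.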
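One possible contributor to the extra rows, sketched here under assumptions: the attempt joins to the targets on id only, so before the outer GROUP BY each daily row is paired with every monthly target row for that id. The snippet below uses Python's built-in sqlite3 as a stand-in for Hive (so `strftime` replaces `month()`), with a cut-down copy of the sample data; the helper `count_rows` is made up for illustration. It needs an SQLite build recent enough for the syntax used, which any current Python ships with.

```python
import sqlite3

# SQLite stands in for Hive here; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE targets (month INTEGER, name TEXT, id INTEGER, target INTEGER);
    CREATE TABLE daily_data (yyyy_mm_dd TEXT, name TEXT, id INTEGER, actual INTEGER, region TEXT);
    INSERT INTO targets VALUES
        (1, 'Comp1', 1, 6000), (2, 'Comp1', 1, 6000), (3, 'Comp1', 1, 6000);
    INSERT INTO daily_data VALUES
        ('2020-01-01', 'Comp1', 1, 1000, 'LATAM'),
        ('2020-01-01', 'Comp1', 1,    0, 'EU'),
        ('2020-01-02', 'Comp1', 1, 2000, 'EU');
""")

# One row per company per day, like the inner aggregation of the attempt.
per_day = """
    SELECT yyyy_mm_dd, name, id, SUM(actual) AS actual
    FROM daily_data
    GROUP BY yyyy_mm_dd, name, id
"""

def count_rows(join_condition):
    """Count the rows produced by joining the per-day rows to targets."""
    sql = f"SELECT COUNT(*) FROM ({per_day}) agg JOIN targets t ON {join_condition}"
    return conn.execute(sql).fetchone()[0]

id_only = count_rows("t.id = agg.id")
id_and_month = count_rows(
    "t.id = agg.id AND t.month = CAST(strftime('%m', agg.yyyy_mm_dd) AS INTEGER)"
)
print(id_only, id_and_month)  # 6 2 (2 days x 3 target rows, versus 2 days)
```

With identical targets the outer GROUP BY may collapse those copies again, so this is only part of the picture. The attempt also compares each day's own total with the monthly target rather than the month-to-date total.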
It looks like you want to compare the cumulative sum of actual, per company and per month, against the target. You can do this with a join and window functions:
select
    d.yyyy_mm_dd,
    case when sum(d.actual) over(partition by d.name, t.month order by d.yyyy_mm_dd) > t.target
        then (sum(d.actual) over(partition by d.name, t.month order by d.yyyy_mm_dd) - t.target) / 100.0
        else 0
    end payout
from schema.targets t
inner join schema.daily_data d
    on month(d.yyyy_mm_dd) = t.month
    and d.name = t.name
where
    d.yyyy_mm_dd >= '2020-01-01'
    and d.name in ('Comp1', 'Comp2')
Your request for a running (partial) sum of actual is easily solved with a window function. Unfortunately I don't use Hive, so here is my working Postgres solution:
with t (month, name, id, target) as (values
(1 , 'Comp1', 1 , 6000 ),
(2 , 'Comp1', 1 , 6000 ),
(3 , 'Comp1', 1 , 6000 ),
(1 , 'Comp2', 2 , 6000 ),
(2 , 'Comp2', 2 , 6000 ),
(3 , 'Comp2', 2 , 6000 ),
(1 , 'Comp3', 3 , 6000 ),
(2 , 'Comp3', 3 , 6000 ),
(3 , 'Comp3', 3 , 6000 ),
(1 , 'Comp4', 4 , 6000 ),
(2 , 'Comp4', 4 , 6000 ),
(3 , 'Comp4', 4 , 6000 )
), d (yyyy_mm_dd, name, id, actual, region) as (values
( date '2019-01-01' , 'Comp1' , 1 , 1000 , 'LATAM' ),
( date '2019-01-01' , 'Comp1' , 1 , 0 , 'EU' ),
( date '2019-01-02' , 'Comp1' , 1 , 2000 , 'EU' ),
( date '2019-01-03' , 'Comp1' , 1 , 4000 , 'EU' ),
( date '2019-01-01' , 'Comp2' , 2 , 1000 , 'EU' ),
( date '2019-01-02' , 'Comp2' , 2 , 2000 , 'EU' ),
( date '2019-01-03' , 'Comp2' , 2 , 3000 , 'EU' ),
( date '2019-01-01' , 'Comp3' , 3 , 1000 , 'EU' ),
( date '2019-01-02' , 'Comp3' , 3 , 2000 , 'EU' ),
( date '2019-01-03' , 'Comp3' , 3 , 8000 , 'EU' ),
( date '2019-01-01' , 'Comp4' , 4 , 1000 , 'EU' ),
( date '2019-01-02' , 'Comp4' , 4 , 2000 , 'EU' ),
( date '2019-02-03' , 'Comp4' , 4 , 3000 , 'EU' )
)
select dr.yyyy_mm_dd, dr.name, dr.id, dr.actual,
    case when dr.running_sum < t.target then 0 else (dr.running_sum - t.target) / 100 end as payment
from t
join (
    -- partition by month as well as name so the running sum restarts each month
    select dg.*, sum(actual) over (partition by name, month(yyyy_mm_dd) order by yyyy_mm_dd) as running_sum
    from (
        select yyyy_mm_dd, name, id, sum(actual) as actual
        from d
        group by yyyy_mm_dd, name, id
    ) dg
) dr on dr.name = t.name
    and month(dr.yyyy_mm_dd) = t.month -- edited to hive equivalent of postgres' extract(month from dr.yyyy_mm_dd) = t.month
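There may be different ways to extract the month from a date, but I hope you get the idea.
Another option is to use a windowed SUM function to create a running total, and then use that in a CASE statement to derive the column value.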
SELECT d.yyyy_mm_dd
    ,d.name
    ,d.id
    ,d.actual
    -- t.month in the partition makes the running total restart each month
    ,CASE
        WHEN
            SUM(d.actual)
                OVER (PARTITION BY d.id, t.month ORDER BY d.yyyy_mm_dd ROWS UNBOUNDED PRECEDING) <= t.target
            THEN 0
        ELSE
            (
                SUM(d.actual)
                    OVER (PARTITION BY d.id, t.month ORDER BY d.yyyy_mm_dd ROWS UNBOUNDED PRECEDING) - t.target
            ) * 0.01
    END AS payout
FROM dailies AS d
-- join each daily row to the target row for its id and month
JOIN targets AS t
    ON t.month = MONTH(d.yyyy_mm_dd)
    AND t.id = d.id;
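I'm not 100% sure about the Hive syntax, but this should be very close. Specifically, ROWS UNBOUNDED PRECEDING may not be enough. You may need a FOLLOWING clause in there for the total to be computed correctly.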
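As a rough check of how the frame clause behaves, the sketch below runs both ROWS UNBOUNDED PRECEDING and RANGE UNBOUNDED PRECEDING in SQLite through Python's sqlite3 module. This is SQLite, not Hive, so treat it as an illustration of standard SQL behaviour rather than a statement about Hive. The table `dailies` and its three rows are made up, with one day appearing twice, the way a company can appear once per region in the daily table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dailies (yyyy_mm_dd TEXT, id INTEGER, actual INTEGER);
    INSERT INTO dailies VALUES
        ('2020-01-01', 1, 1000),
        ('2020-01-01', 1,  500),
        ('2020-01-02', 1, 2000);
""")

frame_rows = conn.execute("""
    SELECT yyyy_mm_dd,
           SUM(actual) OVER (PARTITION BY id ORDER BY yyyy_mm_dd
                             ROWS UNBOUNDED PRECEDING)  AS rows_total,
           SUM(actual) OVER (PARTITION BY id ORDER BY yyyy_mm_dd
                             RANGE UNBOUNDED PRECEDING) AS range_total
    FROM dailies
""").fetchall()

for row in frame_rows:
    print(row)
```

In SQLite both frames end at the current row without any FOLLOWING clause. The difference shows up on the duplicated day: ROWS adds one physical row at a time, so one of the two 2020-01-01 rows gets only a partial total, while RANGE treats rows with the same date as peers and gives both of them 1500. Summing to one row per day first avoids the question altogether.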
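I think I now have a working solution. The query below gives the expected output. It could probably be optimized a bit, since it isn't the fastest.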
SELECT
    x.yyyy_mm_dd,
    x.id,
    x.name,
    x.actual,
    x.target,
    x.actual_to_date,
    CASE WHEN x.actual_to_date > x.target THEN ((x.actual_to_date - x.target) /100) * 1 ELSE 0 END AS payout
FROM(
    SELECT
        daily.yyyy_mm_dd,
        daily.id,
        daily.name,
        daily.actual,
        t.target,
        SUM(daily.actual) OVER (PARTITION BY MONTH(daily.yyyy_mm_dd), daily.id ORDER BY daily.yyyy_mm_dd RANGE UNBOUNDED PRECEDING) AS actual_to_date
    FROM(
        SELECT
            yyyy_mm_dd,
            id,
            name,
            sum(cast(actual as int)) as actual
        FROM
            daily_data_table
        WHERE
            yyyy_mm_dd >= '2020-01-01'
        GROUP BY
            1,2,3
    ) daily
    INNER JOIN
        monthly_target_table t
        ON t.id = daily.id AND t.month = month(daily.yyyy_mm_dd)
    WHERE
        daily.name = 'Comp1'
) x
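For anyone who wants to try the approach without a Hive cluster, here is a small end-to-end sketch that runs the same idea in SQLite through Python's sqlite3 module. It is an adaptation, not the Hive query itself: `strftime` replaces `month()`, the data is a subset of the sample rows with 2020 dates, and the window is partitioned by year and month (my assumption, so that the same month from two different years would never share a running total).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE targets (month INTEGER, name TEXT, id INTEGER, target INTEGER);
    CREATE TABLE daily_data (yyyy_mm_dd TEXT, name TEXT, id INTEGER, actual INTEGER, region TEXT);
    INSERT INTO targets VALUES
        (1, 'Comp1', 1, 6000), (2, 'Comp1', 1, 6000),
        (1, 'Comp4', 4, 6000), (2, 'Comp4', 4, 6000);
    INSERT INTO daily_data VALUES
        ('2020-01-01', 'Comp1', 1, 1000, 'LATAM'),
        ('2020-01-01', 'Comp1', 1,    0, 'EU'),
        ('2020-01-02', 'Comp1', 1, 2000, 'EU'),
        ('2020-01-03', 'Comp1', 1, 4000, 'EU'),
        ('2020-01-01', 'Comp4', 4, 1000, 'EU'),
        ('2020-01-02', 'Comp4', 4, 2000, 'EU'),
        ('2020-02-03', 'Comp4', 4, 3000, 'EU');
""")

query = """
    WITH daily AS (
        -- collapse the per-region rows to one row per company per day
        SELECT yyyy_mm_dd, id, name, SUM(actual) AS actual
        FROM daily_data
        WHERE yyyy_mm_dd >= '2020-01-01'
        GROUP BY yyyy_mm_dd, id, name
    ),
    running AS (
        -- month-to-date total, restarted for every company and calendar month
        SELECT daily.*, t.target,
               SUM(daily.actual) OVER (
                   PARTITION BY daily.id, strftime('%Y-%m', daily.yyyy_mm_dd)
                   ORDER BY daily.yyyy_mm_dd
               ) AS actual_to_date
        FROM daily
        JOIN targets t
          ON t.id = daily.id
         AND t.month = CAST(strftime('%m', daily.yyyy_mm_dd) AS INTEGER)
    )
    SELECT yyyy_mm_dd, name, id, actual,
           CASE WHEN actual_to_date > target
                THEN (actual_to_date - target) / 100.0
                ELSE 0 END AS payout
    FROM running
    ORDER BY id, yyyy_mm_dd
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

On this data Comp1 gets a payout of 10 on 2020-01-03 and 0 on the earlier days, and Comp4's February row stays at 0 because its running total starts again at 3000 in the new month.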