SQL query to count rows based on previous values of a different column
I am working in SAS, and I have a table that looks like this:
ID | Time | Main | lag_1 | lag_2
----------------------------------------------------------------------------
A | 01 | 0 | 0 | 1
A | 03 | 0 | 0 | 1
A | 04 | 0 | 0 | 0
A | 10 | 1 | 0 | 0
A | 11 | 1 | 0 | 0
A | 12 | 1 | 0 | 0
B | 02 | 1 | 1 | 1
B | 04 | 0 | 1 | 1
B | 07 | 0 | 0 | 1
B | 10 | 1 | 0 | 0
B | 11 | 1 | 0 | 0
B | 12 | 1 | 0 | 0
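except with many more IDs. The table is sorted by ID and Time. After calculating the total count in the Main column (call it tot), I am trying to calculate two things:
- the total count in the Main column, counting only events where lag_1 was equal to 1 at some time before Main turned 1 (call it tot_1); and
- the same as the first item, but for lag_2 (call it tot_2).
The expected result table would be: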
tot | tot_1 | tot_2
--------------------
7 | 3 | 6
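tot_1 should be 3 (0 from ID = A + 3 from ID = B), and tot_2 should be 6 (3 from ID = A + 3 from ID = B).
I am a complete beginner with this kind of breakdown, so any help is much appreciated.
Edit: I expect tot_2 >= tot_1, because lag_2 is built from the Main events and looks further back in time than lag_1.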
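The counting rule can be pinned down with a brute-force check. This is a minimal Python sketch rather than SAS, and the names `rows` and `count_after_flag` are made up for illustration. It assumes that "before" means a strictly earlier row of the same ID, which is what the expected 0 + 3 split for tot_1 implies.

```python
# Sample rows from the table above: (id, time, main, lag_1, lag_2).
rows = [
    ("A", 1, 0, 0, 1), ("A", 3, 0, 0, 1), ("A", 4, 0, 0, 0),
    ("A", 10, 1, 0, 0), ("A", 11, 1, 0, 0), ("A", 12, 1, 0, 0),
    ("B", 2, 1, 1, 1), ("B", 4, 0, 1, 1), ("B", 7, 0, 0, 1),
    ("B", 10, 1, 0, 0), ("B", 11, 1, 0, 0), ("B", 12, 1, 0, 0),
]


def count_after_flag(rows, lag_col):
    """Sum main over rows that have an earlier row of the same id with the flag set.

    lag_col is the tuple index of the flag: 3 for lag_1, 4 for lag_2.
    """
    total = 0
    for rid, time, main, *_ in rows:
        flagged_earlier = any(
            r[0] == rid and r[1] < time and r[lag_col] == 1 for r in rows
        )
        if flagged_earlier:
            total += main
    return total


tot = sum(r[2] for r in rows)
tot_1 = count_after_flag(rows, 3)
tot_2 = count_after_flag(rows, 4)
print(tot, tot_1, tot_2)  # 7 3 6
```

The quadratic scan is only meant to state the rule unambiguously; it is not a suggestion for large tables.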
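If I understand correctly, you want these sums per id. The key is to compare, within each id, the minimum time at which each condition holds, and then to sum. This is all conditional aggregation:
    select sum(tot) as tot,
           sum(case when time_lag_1 < time_main then tot else 0 end) as tot_1,
           sum(case when time_lag_2 < time_main then tot else 0 end) as tot_2
    from (select id, sum(main) as tot,
                 -- first time each condition holds within the id (null if never)
                 min(case when main = 1 then time end) as time_main,
                 min(case when lag_1 = 1 then time end) as time_lag_1,
                 min(case when lag_2 = 1 then time end) as time_lag_2
          from t
          group by id
         ) t;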
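Comparing per-id minima treats each id as all-or-nothing: either every Main event of the id is counted or none is. If events that occur before the first lag flag should be excluded one by one (as for ID = B in the question, where Main is already 1 at time 02), a row-level comparison is one option. The sketch below is a variation, not the query above: it joins each row to the first time the flag was set for its id. The names `f`, `first_lag_1` and `first_lag_2` are hypothetical, and Python's built-in `sqlite3` is used only so the SQL can be run end to end.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table t (id text, time integer, main integer, lag_1 integer, lag_2 integer)"
)
conn.executemany(
    "insert into t values (?, ?, ?, ?, ?)",
    [
        ("A", 1, 0, 0, 1), ("A", 3, 0, 0, 1), ("A", 4, 0, 0, 0),
        ("A", 10, 1, 0, 0), ("A", 11, 1, 0, 0), ("A", 12, 1, 0, 0),
        ("B", 2, 1, 1, 1), ("B", 4, 0, 1, 1), ("B", 7, 0, 0, 1),
        ("B", 10, 1, 0, 0), ("B", 11, 1, 0, 0), ("B", 12, 1, 0, 0),
    ],
)

# A row counts toward tot_k only if its time is strictly after the first
# time lag_k was 1 for the same id. If the flag was never set, first_lag_k
# is null, the comparison is not true, and the row adds 0.
query = """
select sum(t.main) as tot,
       sum(case when t.time > f.first_lag_1 then t.main else 0 end) as tot_1,
       sum(case when t.time > f.first_lag_2 then t.main else 0 end) as tot_2
from t
join (select id,
             min(case when lag_1 = 1 then time end) as first_lag_1,
             min(case when lag_2 = 1 then time end) as first_lag_2
      from t
      group by id) f
  on f.id = t.id
"""
result = conn.execute(query).fetchone()
print(result)  # (7, 3, 6)
```

The query uses only a join, `group by` and `case`, which may matter in SQL dialects that lack window functions.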
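For the calculation of tot_1 and tot_2:
My first step is to look for rows matching the pattern lag_1 > main (this covers the case you mention, where a record with lag_1 = 1 occurs at some time before main = 1), and likewise lag_2 > main. I label those rows 'grp_lag_1' and 'grp_lag_2' respectively.
Once the records are labelled, I "copy" the label down to the following rows of the same id using max() over (partition by id order by id, time1).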
    select *
          ,max(case when lag_1 > main then 'grp_lag_1' end) over (partition by id order by id, time1) as grp_1
          ,max(case when lag_2 > main then 'grp_lag_2' end) over (partition by id order by id, time1) as grp_2
    from t;
So I get the following result:
+----+-------+------+-------+-------+-----------+-----------+
| id | time1 | main | lag_1 | lag_2 | grp_1 | grp_2 |
+----+-------+------+-------+-------+-----------+-----------+
| A | 01 | 0 | 0 | 1 | | grp_lag_2 |
| A | 03 | 0 | 0 | 1 | | grp_lag_2 |
| A | 04 | 0 | 0 | 0 | | grp_lag_2 |
| A | 10 | 1 | 0 | 0 | | grp_lag_2 |
| A | 11 | 1 | 0 | 0 | | grp_lag_2 |
| A | 12 | 1 | 0 | 0 | | grp_lag_2 |
| B | 02 | 1 | 1 | 1 | | |
| B | 04 | 0 | 1 | 1 | grp_lag_1 | grp_lag_2 |
| B | 07 | 0 | 0 | 1 | grp_lag_1 | grp_lag_2 |
| B | 10 | 1 | 0 | 0 | grp_lag_1 | grp_lag_2 |
| B | 11 | 1 | 0 | 0 | grp_lag_1 | grp_lag_2 |
| B | 12 | 1 | 0 | 0 | grp_lag_1 | grp_lag_2 |
+----+-------+------+-------+-------+-----------+-----------+
After this, summing main over the rows labelled grp_lag_1 gives tot_1, and likewise summing main over the rows labelled grp_lag_2 gives tot_2:
    select sum(main) as tot_cnt
          ,sum(case when grp_1 = 'grp_lag_1' then main end) as tot_1
          ,sum(case when grp_2 = 'grp_lag_2' then main end) as tot_2
    from (
        select *
              ,max(case when lag_1 > main then 'grp_lag_1' end) over (partition by id order by id, time1) as grp_1
              ,max(case when lag_2 > main then 'grp_lag_2' end) over (partition by id order by id, time1) as grp_2
        from t
    ) x;
+---------+-------+-------+
| tot_cnt | tot_1 | tot_2 |
+---------+-------+-------+
| 7 | 3 | 6 |
+---------+-------+-------+
Demo: https://dbfiddle.uk/?rdbms=sqlserver_2012&fiddle=c17be111dbc3c516afa2bc3dcd3c9e1c
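A variation on the windowed query, offered as a sketch rather than as the method above: instead of the lag > main pattern, the frame `rows between unbounded preceding and 1 preceding` looks only at strictly earlier rows of the same id. A row where the lag flag and main are both 1 (B at time 02) is then still seen as a prior flag by later rows. The column names `seen_1` and `seen_2` are hypothetical, and the SQL is run through Python's built-in `sqlite3` only to make it self-contained; window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table t (id text, time1 integer, main integer, lag_1 integer, lag_2 integer)"
)
conn.executemany(
    "insert into t values (?, ?, ?, ?, ?)",
    [
        ("A", 1, 0, 0, 1), ("A", 3, 0, 0, 1), ("A", 4, 0, 0, 0),
        ("A", 10, 1, 0, 0), ("A", 11, 1, 0, 0), ("A", 12, 1, 0, 0),
        ("B", 2, 1, 1, 1), ("B", 4, 0, 1, 1), ("B", 7, 0, 0, 1),
        ("B", 10, 1, 0, 0), ("B", 11, 1, 0, 0), ("B", 12, 1, 0, 0),
    ],
)

# seen_k is 1 when any strictly earlier row of the same id has lag_k = 1.
# On the first row of an id the frame is empty, so seen_k is null there.
query = """
select sum(main) as tot_cnt,
       sum(case when seen_1 = 1 then main else 0 end) as tot_1,
       sum(case when seen_2 = 1 then main else 0 end) as tot_2
from (
    select main,
           max(lag_1) over (partition by id order by time1
                            rows between unbounded preceding and 1 preceding) as seen_1,
           max(lag_2) over (partition by id order by time1
                            rows between unbounded preceding and 1 preceding) as seen_2
    from t
) x
"""
result = conn.execute(query).fetchone()
print(result)  # (7, 3, 6)
```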
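This is easier to do in a data step. That way you can detect the start of a new id and reset the flags that record whether each lag_x variable has been true so far.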
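    data want;
      set have end=eof;
      by id time;
      tot + main;                     /* running total of all Main events */
      /* new id: clear the "lag seen on an earlier row" flags */
      if first.id then call missing(any_lag_1, any_lag_2);
      /* count the event only if the flag was set on an earlier row */
      if any_lag_1 then tot_1 + main;
      if any_lag_2 then tot_2 + main;
      if eof then output;             /* write one summary row at the end */
      /* update the flags after using them, so the current row's lag is not counted */
      any_lag_1 + lag_1;
      any_lag_2 + lag_2;
      keep tot: ;
    run;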
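The order of the statements matters here: the flags are tested before they are updated, so a lag value on the current row never counts as "earlier". A rough Python rendering of the same single pass, as a sketch only (`groupby` stands in for BY-group processing, and the names `seen_1` and `seen_2` are illustrative, not SAS variables):

```python
from itertools import groupby

# Sample rows, already sorted by id and time: (id, time, main, lag_1, lag_2).
rows = [
    ("A", 1, 0, 0, 1), ("A", 3, 0, 0, 1), ("A", 4, 0, 0, 0),
    ("A", 10, 1, 0, 0), ("A", 11, 1, 0, 0), ("A", 12, 1, 0, 0),
    ("B", 2, 1, 1, 1), ("B", 4, 0, 1, 1), ("B", 7, 0, 0, 1),
    ("B", 10, 1, 0, 0), ("B", 11, 1, 0, 0), ("B", 12, 1, 0, 0),
]

tot = tot_1 = tot_2 = 0
for _id, group in groupby(rows, key=lambda r: r[0]):
    seen_1 = seen_2 = False              # reset at the start of each id
    for _, _, main, lag_1, lag_2 in group:
        tot += main
        if seen_1:                       # test the flags first ...
            tot_1 += main
        if seen_2:
            tot_2 += main
        seen_1 = seen_1 or lag_1 == 1    # ... then update them
        seen_2 = seen_2 or lag_2 == 1

print(tot, tot_1, tot_2)  # 7 3 6
```

Swapping the test and the update would also count B's Main event at time 02, giving 4 and 7 instead of 3 and 6.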