Get count of values in different subgroups
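I need to delete some rows from my dataset where speed equals zero and stays at zero for more than N consecutive rows (say N is 2).
The table demo has the following structure: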
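id | car | speed | time |
---|---|---|---|
1 | foo | 0 | 1 |
2 | foo | 0 | 2 |
3 | foo | 0 | 3 |
4 | foo | 1 | 4 |
5 | foo | 1 | 5 |
6 | foo | 0 | 6 |
7 | bar | 0 | 1 |
8 | bar | 0 | 2 |
9 | bar | 5 | 3 |
10 | bar | 5 | 4 |
11 | bar | 5 | 5 |
12 | bar | 5 | 6 |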
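To reproduce the sample data locally, a setup script along these lines could be used. The column types are an assumption, since the question only shows the values, and PostgreSQL is assumed as the dialect.

```sql
-- Hypothetical schema: the types are guessed from the sample values.
CREATE TABLE demo (
    id    integer PRIMARY KEY,
    car   text    NOT NULL,
    speed integer NOT NULL,
    time  integer NOT NULL
);

-- Sample rows exactly as listed in the table above.
INSERT INTO demo (id, car, speed, time) VALUES
    (1,  'foo', 0, 1),
    (2,  'foo', 0, 2),
    (3,  'foo', 0, 3),
    (4,  'foo', 1, 4),
    (5,  'foo', 1, 5),
    (6,  'foo', 0, 6),
    (7,  'bar', 0, 1),
    (8,  'bar', 0, 2),
    (9,  'bar', 5, 3),
    (10, 'bar', 5, 4),
    (11, 'bar', 5, 5),
    (12, 'bar', 5, 6);
```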
I would then like to use a window function to produce a table like this:
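id | car | speed | time | lasting |
---|---|---|---|---|
1 | foo | 0 | 1 | 3 |
2 | foo | 0 | 2 | 3 |
3 | foo | 0 | 3 | 3 |
4 | foo | 1 | 4 | 2 |
5 | foo | 1 | 5 | 2 |
6 | foo | 0 | 6 | 1 |
7 | bar | 0 | 1 | 2 |
8 | bar | 0 | 2 | 2 |
9 | bar | 5 | 3 | 4 |
10 | bar | 5 | 4 | 4 |
11 | bar | 5 | 5 | 4 |
12 | bar | 5 | 6 | 4 |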
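Then I could easily exclude those rows with WHERE NOT (speed = 0 AND lasting > 2).
Here is the code I tried, but it does not return the values I expect, and I suspect that all those nested FROM (SELECT ... FROM (SELECT ... subqueries are not the best practice for this problem: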
SELECT g3.*, count(id) OVER (PARTITION BY car, cumsum ORDER BY id) as num
FROM (SELECT g2.*, sum(grp2) OVER (PARTITION BY car ORDER BY id) AS cumsum
FROM (SELECT g1.*, (CASE ne0 WHEN 0 THEN 0 ELSE 1 END) AS grp2
FROM (SELECT g.*, speed - lag(speed, 1, 0) OVER (PARTITION BY car) AS ne0
FROM (SELECT *, row_number() OVER (PARTITION BY car) AS grp FROM demo) g ) g1 ) g2 ) g3
ORDER BY id;
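The attempt above seems to miss the expected output for two reasons. First, lag() is called over a window with no ORDER BY, so which row counts as "previous" is unspecified. Second, the final count() has ORDER BY id in its window, which makes the default frame end at the current row and so produces a running count (1, 2, 3) instead of the size of the whole run (3, 3, 3). A sketch of the same approach with just those two changes follows; the unused row_number() step is dropped and the alias lasting is borrowed from the expected table.

```sql
-- Sketch only: the asker's approach with two changes, marked below.
SELECT g3.id, g3.car, g3.speed, g3.time,
       -- change 2: no ORDER BY in this window, so every row sees the full run size
       count(*) OVER (PARTITION BY car, cumsum) AS lasting
FROM (SELECT g2.*, sum(grp2) OVER (PARTITION BY car ORDER BY id) AS cumsum
      FROM (SELECT g1.*, (CASE ne0 WHEN 0 THEN 0 ELSE 1 END) AS grp2
            FROM (SELECT d.*,
                         -- change 1: ORDER BY id makes "previous row" well defined
                         speed - lag(speed, 1, 0) OVER (PARTITION BY car ORDER BY id) AS ne0
                  FROM demo d) g1) g2) g3
ORDER BY id;
```

On the sample data this should give lasting values of 3, 3, 3, 2, 2, 1 for foo and 2, 2, 4, 4, 4, 4 for bar.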
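You can use the LAG() window function to check the previous speed value for each row, and the SUM() window function to turn consecutive equal values into groups.
Then the COUNT() window function counts the rows in each group, so you can filter out the rows with speed 0 that belong to groups of more than 2 rows: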
SELECT id, car, speed, time
FROM (
SELECT *, COUNT(*) OVER (PARTITION BY car, grp) counter
FROM (
SELECT *, SUM(flag::int) OVER (PARTITION BY car ORDER BY time) grp
FROM (
SELECT *, speed <> LAG(speed, 1, speed - 1) OVER (PARTITION BY car ORDER BY time) flag
FROM demo
) t
) t
) t
WHERE speed <> 0 OR counter <= 2
ORDER BY id;
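If the goal is the exact table from the question, with the run length exposed as lasting, the same three steps can also be written as CTEs rather than nested subqueries. This is only a sketch: the names flagged, grouped, is_new_run and run_id are made up for illustration, and IS DISTINCT FROM is swapped in for the speed - 1 default so that the first row of each car starts a run without a sentinel value. It assumes time is unique within each car.

```sql
WITH flagged AS (
    SELECT *,
           -- 1 when a new run starts: first row of the car, or speed changed
           CASE WHEN speed IS DISTINCT FROM LAG(speed) OVER (PARTITION BY car ORDER BY time)
                THEN 1 ELSE 0 END AS is_new_run
    FROM demo
),
grouped AS (
    SELECT *,
           -- running total of run starts gives each run its own number per car
           SUM(is_new_run) OVER (PARTITION BY car ORDER BY time) AS run_id
    FROM flagged
)
SELECT id, car, speed, time,
       -- no ORDER BY here, so the count covers the whole run
       COUNT(*) OVER (PARTITION BY car, run_id) AS lasting
FROM grouped
ORDER BY id;
```

Wrapping this query in one more level and adding WHERE NOT (speed = 0 AND lasting > 2) would then give the filtered result.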
See the demo.
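Since the question talks about deleting the rows rather than just hiding them, the same grouping could feed a DELETE. The sketch below assumes PostgreSQL and that id uniquely identifies a row in demo. The CTE name runs and the alias lasting are chosen here for illustration.

```sql
-- Sketch: physically removes zero-speed rows that belong to runs longer than 2.
WITH runs AS (
    SELECT id, speed,
           COUNT(*) OVER (PARTITION BY car, grp) AS lasting
    FROM (
        SELECT *, SUM(flag::int) OVER (PARTITION BY car ORDER BY time) AS grp
        FROM (
            SELECT *, speed <> LAG(speed, 1, speed - 1) OVER (PARTITION BY car ORDER BY time) AS flag
            FROM demo
        ) t
    ) t
)
DELETE FROM demo
WHERE id IN (SELECT id FROM runs WHERE speed = 0 AND lasting > 2);
```

On the sample data this should remove the rows with id 1, 2 and 3. Running the SELECT version first, or wrapping the statement in a transaction, is a reasonable precaution.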