Compare preceding two rows with current and next row using BigQuery
I have data that looks like this:
rno id day val
0 1 1 7
1 1 2 5
2 1 3 10
3 1 4 10
4 1 5 11
5 1 6 11
6 1 7 14
7 1 8 14
20 2 1 5
21 2 2 7
22 2 3 8
23 2 4 8
24 2 5 9
25 2 6 9
26 2 7 13
27 2 8 13
28 2 9 15
29 2 10 15
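I would like to create a new column fake_flag and fill it with the value fake_val based on the following two rules.
Rule 1 - For each val (n), check whether the preceding two rows (n-1, n-2) are constant or decreasing (for example, 7,5 or 5,5 is valid, whereas 5,7 is invalid because it is increasing rather than constant), and take their maximum as the output. For 7,5 the output is 7. For 5,5 the output is 5.
Rule 2 - Check whether the current value (n) and the next value (n+1) are both greater than the Rule 1 output by 3 or more points (>= 3). For example, if the Rule 1 output is 5, we expect to see at least 8 (n), 8 (n+1). It could also be 9,9 or 10,10.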
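To make the two rules concrete, here is a minimal plain-Python sketch of one reading of them, applied to the values of a single id ordered by day. The function name `fake_flags` and the parameter `min_jump` are illustrative assumptions, not part of the question, and the sketch assumes a row needs two preceding rows and one following row to be eligible.

```python
def fake_flags(vals, min_jump=3):
    """Return one boolean per value: True where both rules hold.

    `vals` holds the val column of a single id, ordered by day.
    """
    flags = []
    for n, val in enumerate(vals):
        # A row needs rows n-2, n-1 and n+1 to be evaluated at all.
        if n < 2 or n + 1 >= len(vals):
            flags.append(False)
            continue
        prev2, prev1, nxt = vals[n - 2], vals[n - 1], vals[n + 1]
        # Rule 1: the preceding pair must be constant or decreasing.
        if prev1 > prev2:
            flags.append(False)
            continue
        base = max(prev2, prev1)  # Rule 1 output
        # Rule 2: current and next value are both at least min_jump above base.
        flags.append(val - base >= min_jump and nxt - base >= min_jump)
    return flags


id1 = [7, 5, 10, 10, 11, 11, 14, 14]
id2 = [5, 7, 8, 8, 9, 9, 13, 13, 15, 15]

flagged_days_id1 = [day for day, f in enumerate(fake_flags(id1), start=1) if f]
flagged_days_id2 = [day for day, f in enumerate(fake_flags(id2), start=1) if f]
print(flagged_days_id1)  # [3, 7]
print(flagged_days_id2)  # [7]
```

On the sample data this flags days 3 and 7 for id 1 and day 7 for id 2.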
I expect my output to look like this:
rno id day val fake_flag
0 1 1 7
1 1 2 5
2 1 3 10 fake_val # >= 3 above the max of the preceding 2 rows, and `n` and `n+1` are the same
3 1 4 10
4 1 5 11
5 1 6 11
6 1 7 14 fake_val # >= 3 above the max of the preceding 2 rows, and `n` and `n+1` are the same
7 1 8 14
20 2 1 5
21 2 2 7
22 2 3 8
23 2 4 8
24 2 5 9
25 2 6 9
26 2 7 13 fake_val # >= 3 above the max of the preceding 2 rows, and `n` and `n+1` are the same
27 2 8 13
28 2 9 15
29 2 10 15
This should do what you want. I tested it with dummy data, but if I misunderstood any part, let me know and I can modify it.
SELECT *
  , CASE WHEN
      -- Rule 1: rows n-1 and n-2 are constant or decreasing
      (LAG(val, 1) OVER w <= LAG(val, 2) OVER w) AND
      (val = LEAD(val, 1) OVER w) AND -- n = n + 1, part of rule 2
      -- Can assume row n-2 is the max because it will either be the same as row n-1 or greater than row n-1 for rule 1 to be satisfied
      (val - LAG(val, 2) OVER w >= 3) -- Only have to check the current row val because row n must equal row n + 1 for the first part of rule 2
    THEN 'fake_val' -- I would just have a 1 representing it is true and then 0 if not, but up to you
    ELSE NULL
    END AS fake_flag
FROM Dataset.Table_name
-- Partition by id so comparisons never cross ids; LAG/LEAD do not accept a window frame clause in BigQuery
WINDOW w AS (PARTITION BY id ORDER BY rno)
The following is for BigQuery Standard SQL
#standardSQL
SELECT rno, id, day, val,
  IF(
    val_prev2 >= val_prev1 -- rule 1: preceding two rows are constant or decreasing
    AND (val - GREATEST(val_prev2, val_prev1) >= 3) -- rule 2 for val(n)
    AND (val_next - GREATEST(val_prev2, val_prev1) >= 3), -- rule 2 for val(n+1)
    'fake_val', '' -- a NULL condition (missing neighbor row) falls through to ''
  ) AS fake_flag
FROM (
  SELECT *,
    LAG(val) OVER(PARTITION BY id ORDER BY day) val_prev1,
    LAG(val, 2) OVER(PARTITION BY id ORDER BY day) val_prev2,
    LEAD(val) OVER(PARTITION BY id ORDER BY day) val_next
  FROM `project.dataset.table`
)
Applied to the sample data from your question, the result is:
Row rno id day val fake_flag
1 0 1 1 7
2 1 1 2 5
3 2 1 3 10 fake_val
4 3 1 4 10
5 4 1 5 11
6 5 1 6 11
7 6 1 7 14 fake_val
8 7 1 8 14
9 20 2 1 5
10 21 2 2 7
11 22 2 3 8
12 23 2 4 8
13 24 2 5 9
14 25 2 6 9
15 26 2 7 13 fake_val
16 27 2 8 13
17 28 2 9 15
18 29 2 10 15
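For readers less familiar with window functions, the following plain-Python sketch emulates what LAG and LEAD with PARTITION BY id ORDER BY day compute, including the NULL values at the edges of each partition. The helper `shift` is a hypothetical name introduced for illustration, `None` stands in for NULL, and the flag condition is one reading of the two rules in the question rather than a translation of any particular query.

```python
from itertools import groupby
from operator import itemgetter

# (rno, id, day, val) rows copied from the sample data in the question.
rows = [
    (0, 1, 1, 7), (1, 1, 2, 5), (2, 1, 3, 10), (3, 1, 4, 10),
    (4, 1, 5, 11), (5, 1, 6, 11), (6, 1, 7, 14), (7, 1, 8, 14),
    (20, 2, 1, 5), (21, 2, 2, 7), (22, 2, 3, 8), (23, 2, 4, 8),
    (24, 2, 5, 9), (25, 2, 6, 9), (26, 2, 7, 13), (27, 2, 8, 13),
    (28, 2, 9, 15), (29, 2, 10, 15),
]


def shift(values, offset):
    """Emulate LAG (offset > 0) or LEAD (offset < 0) within one partition.

    Positions with no source row get None, which plays the role of NULL.
    """
    n = len(values)
    return [values[i - offset] if 0 <= i - offset < n else None for i in range(n)]


flagged = []
ordered = sorted(rows, key=itemgetter(1, 2))  # PARTITION BY id ORDER BY day
for _, group in groupby(ordered, key=itemgetter(1)):
    part = list(group)
    vals = [r[3] for r in part]
    prev1, prev2, nxt = shift(vals, 1), shift(vals, 2), shift(vals, -1)
    for row, p1, p2, nx in zip(part, prev1, prev2, nxt):
        if None in (p1, p2, nx):
            continue  # a NULL neighbor means the row cannot be flagged
        # Rule 1 (p2 >= p1, so p2 is the max) and rule 2 for rows n and n+1.
        if p2 >= p1 and row[3] - p2 >= 3 and nx - p2 >= 3:
            flagged.append(row[0])

print(flagged)  # [2, 6, 26]
```

Because each id is its own partition, the first two rows and the last row of every id always have a missing neighbor, so they are never flagged and values from one id are never compared with those of another.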