Compare preceding two rows with current and next row using BigQuery

I have data like the following:

rno id day  val
0   1   1   7
1   1   2   5
2   1   3   10
3   1   4   10
4   1   5   11
5   1   6   11
6   1   7   14
7   1   8   14
20  2   1   5
21  2   2   7
22  2   3   8
23  2   4   8
24  2   5   9
25  2   6   9
26  2   7   13
27  2   8   13
28  2   9   15
29  2   10  15

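I want to create a new column, fake_flag, and fill it with the value fake_val according to the following two rules.

Rule 1 - For each val (n), check whether the preceding two rows (n-2, n-1) are constant or decreasing (for example, 7,5 or 5,5 is valid, while 5,7 is invalid because it is increasing), and take their maximum as the output. For 7,5 the output is 7. For 5,5 the output is 5.

Rule 2 - Check whether the current value (n) and the next value (n+1) both exceed the maximum from rule 1 by 3 or more points (>= 3). For example, if the rule 1 output is 5, we expect to see at least 8 (n), 8 (n+1). It could also be 9,9 or 10,10.

I would like my output to look like this: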

rno id day  val fake_flag
0   1   1   7     
1   1   2   5     
2   1   3   10    fake_val  # val is >= 3 above the max of the preceding 2 rows, and `n` and `n+1` are the same
3   1   4   10     
4   1   5   11
5   1   6   11
6   1   7   14    fake_val  # val is >= 3 above the max of the preceding 2 rows, and `n` and `n+1` are the same
7   1   8   14
20  2   1   5
21  2   2   7
22  2   3   8
23  2   4   8
24  2   5   9
25  2   6   9
26  2   7   13    fake_val  # val is >= 3 above the max of the preceding 2 rows, and `n` and `n+1` are the same
27  2   8   13
28  2   9   15
29  2   10  15
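
To make the two rules concrete, here is a minimal plain-Python sketch of how I read them. It is not BigQuery code, only a reference to check query output against. The function name `flagged_rnos` and the `min_jump` parameter are hypothetical, and the sketch assumes the rules are applied within each id, ordered by day.

```python
from itertools import groupby

# (rno, id, day, val) rows copied from the sample data above.
ROWS = [
    (0, 1, 1, 7), (1, 1, 2, 5), (2, 1, 3, 10), (3, 1, 4, 10),
    (4, 1, 5, 11), (5, 1, 6, 11), (6, 1, 7, 14), (7, 1, 8, 14),
    (20, 2, 1, 5), (21, 2, 2, 7), (22, 2, 3, 8), (23, 2, 4, 8),
    (24, 2, 5, 9), (25, 2, 6, 9), (26, 2, 7, 13), (27, 2, 8, 13),
    (28, 2, 9, 15), (29, 2, 10, 15),
]


def flagged_rnos(rows, min_jump=3):
    """Return the rno of every row that satisfies rule 1 and rule 2."""
    flagged = []
    ordered = sorted(rows, key=lambda r: (r[1], r[2]))  # by id, then day
    for _, group in groupby(ordered, key=lambda r: r[1]):
        g = list(group)
        # A row needs two preceding rows and one following row in the same id.
        for i in range(2, len(g) - 1):
            prev2, prev1 = g[i - 2][3], g[i - 1][3]
            val, nxt = g[i][3], g[i + 1][3]
            if prev1 > prev2:  # rule 1: the preceding pair must not increase
                continue
            base = max(prev2, prev1)  # rule 1 output
            # Rule 2: both n and n+1 are at least min_jump above the base.
            if val - base >= min_jump and nxt - base >= min_jump:
                flagged.append(g[i][0])
    return flagged


print(flagged_rnos(ROWS))  # [2, 6, 26]
```

The three flagged rno values are the same rows marked fake_val in the expected output above.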

This should do what you want. I tested it with dummy data, but if I have misunderstood any part, let me know and I can revise it.

SELECT *
, CASE WHEN
  -- Rule 1: the two preceding rows are constant or decreasing (n-1 <= n-2)
  (LAG(val, 1) OVER w <= LAG(val, 2) OVER w) AND
  (val = LEAD(val, 1) OVER w) AND -- n = n + 1, part of rule 2
  -- Row n-2 can be taken as the max: for rule 1 to hold it is either equal to or greater than row n-1
  (val - LAG(val, 2) OVER w >= 3) -- Only the current row needs checking, because val for row n must equal val for row n + 1
  THEN 'fake_val' -- I would just use 1 for true and 0 otherwise, but that is up to you
  ELSE NULL
  END AS fake_flag
FROM Dataset.Table_name
-- Partition by id so one id's rows are never compared with another id's rows.
-- LAG and LEAD address rows by offset, so no ROWS frame is needed.
WINDOW w AS (PARTITION BY id ORDER BY rno)
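
The comments in the query rely on two shortcuts: when n-1 <= n-2, the max of the preceding pair is simply n-2, and when n equals n+1, testing n alone covers both. As a sanity check, here is a small brute-force sketch in plain Python (not BigQuery) comparing the shortcut form with the spelled-out form. The function names are hypothetical, and it assumes this answer's reading of rule 2, in which n and n+1 must be equal.

```python
from itertools import product


def full_check(prev2, prev1, val, nxt, min_jump=3):
    """Spelled-out form: explicit max, and both n and n+1 tested."""
    base = max(prev2, prev1)
    return (prev1 <= prev2 and val == nxt
            and val - base >= min_jump and nxt - base >= min_jump)


def simplified_check(prev2, prev1, val, nxt, min_jump=3):
    """Shortcut form: n-2 stands in for the max, only n is tested."""
    return prev1 <= prev2 and val == nxt and val - prev2 >= min_jump


def shortcuts_agree(limit=8):
    """Compare both forms on every 4-tuple of values in range(limit)."""
    return all(
        full_check(*combo) == simplified_check(*combo)
        for combo in product(range(limit), repeat=4)
    )


print(shortcuts_agree())  # True
```

Both forms agree on every combination tried, so the shorter condition loses nothing under that reading of the rules.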

Below is for BigQuery Standard SQL

#standardSQL
SELECT rno, id, day, val, 
  IF(IFNULL(val_prev2 >= val_prev1, FALSE)                  -- rule 1: preceding pair is constant or decreasing
    AND (val - GREATEST(val_prev2, val_prev1) >= 3)         -- rule 2 for val(n)
    AND (val_next - GREATEST(val_prev2, val_prev1) >= 3),   -- rule 2 for val(n+1)
    'fake_val', ''
  ) AS fake_flag
FROM (
  SELECT *,
    LAG(val) OVER(PARTITION BY id ORDER BY day) val_prev1,
    LAG(val, 2) OVER(PARTITION BY id ORDER BY day) val_prev2,
    LEAD(val) OVER(PARTITION BY id ORDER BY day) val_next
  FROM `project.dataset.table`
)

If applied to the sample data from your question, the result is

Row rno id  day val fake_flag    
1   0   1   1   7        
2   1   1   2   5        
3   2   1   3   10  fake_val     
4   3   1   4   10       
5   4   1   5   11       
6   5   1   6   11       
7   6   1   7   14  fake_val     
8   7   1   8   14       
9   20  2   1   5        
10  21  2   2   7        
11  22  2   3   8        
12  23  2   4   8        
13  24  2   5   9        
14  25  2   6   9        
15  26  2   7   13  fake_val     
16  27  2   8   13       
17  28  2   9   15       
18  29  2   10  15