Optimizing SQL query - finding a group within a group

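I have a working query and am looking for ideas to optimize it.

Query explanation: within each ID group (visitor_id), find the first row where c_id != 0. Starting from that row, show all subsequent rows within that ID group.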

select t2.*
from (select *, row_number() OVER (PARTITION BY visitor_id ORDER BY date) as row_number
  from "DB"."schema"."table"
  where visitor_id in 
     (select distinct visitor_id 
     from (select * from "DB"."schema"."table" where date >= '2021-08-01' and date <= '2021-08-30')
     where c_id in ('101')
     ) 
   ) as t2
inner join
(select visitor_id, min(rn) as row_number
from
  (select *, row_number() OVER (PARTITION BY visitor_id ORDER BY date) as rn
  from "DB"."schema"."table"
  where visitor_id in 
     (select distinct visitor_id 
     from (select * from "DB"."schema"."table" where date >= '2021-08-01' and date <= '2021-08-30')
     where c_id in ('101')
     ) 
   ) as filtered_table
where c_id != 0
group by visitor_id) as t1

on t2.visitor_id = t1.visitor_id
and t2.row_number >= t1.row_number

So you have a common sub-expression:

select distinct visitor_id 
     from (select * from "DB"."schema"."table" where date >= '2021-08-01' and date <= '2021-08-30')
     where c_id in ('101')

so it can be moved into a CTE and run just once, like:

WITH distinct_visitors AS (
    SELECT DISTINCT visitor_id 
     FROM (SELECT * FROM "DB"."schema"."table" WHERE date >= '2021-08-01' and date <= '2021-08-30')
     where c_id in ('101')
)

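But that sub-select filter is just as valid as a top-level filter, and given that it is an inclusive range filter, BETWEEN states it more directly. BETWEEN is equivalent to the >= / <= pair, so treat this mainly as a readability gain.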

WITH distinct_visitors AS (
   SELECT DISTINCT visitor_id 
    FROM "DB"."schema"."table" 
    WHERE date BETWEEN '2021-08-01' AND '2021-08-30'
        AND c_id IN ('101')
)
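To check that flattening the filter does not change which visitors come back, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for the real warehouse. The `visits` table, its rows, and the integer `c_id` are assumptions made up for illustration; only the shape of the two filters comes from the queries above.

```python
import sqlite3

# Toy stand-in for "DB"."schema"."table"; the table name and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visitor_id INTEGER, date TEXT, c_id INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        (1, "2021-07-31", 101),  # one day before the range
        (2, "2021-08-01", 101),  # lower bound, included
        (3, "2021-08-15", 0),    # in range, but wrong c_id
        (4, "2021-08-30", 101),  # upper bound, included
        (5, "2021-08-31", 101),  # one day after the range
    ],
)

# Original shape: date filter in a sub-select, c_id filter outside it.
NESTED = """
    SELECT DISTINCT visitor_id
    FROM (SELECT * FROM visits WHERE date >= '2021-08-01' AND date <= '2021-08-30')
    WHERE c_id IN (101)
"""

# Flattened shape: both filters at the top level, range written with BETWEEN.
FLAT = """
    SELECT DISTINCT visitor_id
    FROM visits
    WHERE date BETWEEN '2021-08-01' AND '2021-08-30'
        AND c_id IN (101)
"""

def visitor_ids(sql):
    """Run a query and return its visitor_id column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

nested_ids = visitor_ids(NESTED)
flat_ids = visitor_ids(FLAT)
print(nested_ids, flat_ids)
```

Both forms keep the two boundary dates and drop the rows on either side, which is what an inclusive range should do.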

Both uses of that CTE then perform the same ROW_NUMBER operation, so that part can become a CTE as well, and the whole query simplifies to:

WITH rw_rows AS (
    SELECT *, 
        ROW_NUMBER() OVER (PARTITION BY visitor_id ORDER BY date) AS row_number
    FROM "DB"."schema"."table"
    WHERE visitor_id IN (
        SELECT DISTINCT visitor_id 
        FROM "DB"."schema"."table" 
        WHERE date BETWEEN '2021-08-01' AND '2021-08-30'
            AND  c_id in ('101')
    )
)
SELECT t2.*
FROM rw_rows AS t2
JOIN (
    SELECT visitor_id, 
        MIN(row_number) AS row_number
    FROM rw_rows AS filtered_table
    WHERE c_id != 0
    GROUP BY visitor_id
) AS t1
    ON t2.visitor_id = t1.visitor_id
        AND t2.row_number >= t1.row_number

So what we want is to keep every row from the first non-zero c_id onward, which QUALIFY should be able to handle like this:

WITH rw_rows AS (
    SELECT *, 
        ROW_NUMBER() OVER (PARTITION BY visitor_id ORDER BY date) AS row_number
    FROM "DB"."schema"."table"
    WHERE visitor_id IN (
        SELECT DISTINCT visitor_id 
        FROM "DB"."schema"."table" 
        WHERE date BETWEEN '2021-08-01' AND '2021-08-30'
            AND  c_id in ('101')
    )
)
SELECT t2.*,
    MIN(IFF(c_id != 0, row_number, NULL )) OVER (PARTITION BY visitor_id) as min_rn
FROM rw_rows AS t2
QUALIFY t2.row_number >= min_rn

I haven't run this, but it feels like the MIN should also be able to move into the QUALIFY, e.g.:

WITH rw_rows AS (
    SELECT *, 
        ROW_NUMBER() OVER (PARTITION BY visitor_id ORDER BY date) AS row_number
    FROM "DB"."schema"."table"
    WHERE visitor_id IN (
        SELECT DISTINCT visitor_id 
        FROM "DB"."schema"."table" 
        WHERE date BETWEEN '2021-08-01' AND '2021-08-30'
            AND  c_id in ('101')
    )
)
SELECT t2.*
FROM rw_rows AS t2
QUALIFY t2.row_number >= MIN(IFF(c_id != 0, row_number, NULL )) OVER (PARTITION BY visitor_id)

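At this point the CTE is no longer needed, since it is only used once, so it can be moved back inline as a sub-select or left as it is; the two forms are equivalent.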
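Since the window-function rewrite was not run, here is a rough way to sanity-check its logic against the join version on toy data, using Python's built-in sqlite3. This is a sketch under assumptions: SQLite has neither QUALIFY nor IFF, so the window MIN is computed in an extra CTE (`flagged`, a made-up name) with CASE and then filtered with WHERE. The `visits` table, its rows, the `rn` / `first_rn` aliases, and the integer `c_id` are also invented for the example. It says nothing about how Snowflake will plan or cost either query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visitor_id INTEGER, date TEXT, c_id INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        (1, "2021-08-01", 0),    # before visitor 1's first non-zero c_id: dropped
        (1, "2021-08-02", 101),  # first non-zero c_id for visitor 1
        (1, "2021-08-03", 0),    # later row: kept even though c_id is 0
        (2, "2021-08-05", 0),    # visitor 2 never has c_id 101: excluded entirely
        (2, "2021-08-06", 7),
        (3, "2021-07-20", 5),    # first non-zero row, outside the date range but kept
        (3, "2021-08-10", 101),  # this row is what qualifies visitor 3
    ],
)

# Shared CTE: number every row of each qualifying visitor by date.
RW_ROWS = """
    rw_rows AS (
        SELECT visitor_id, date, c_id,
            ROW_NUMBER() OVER (PARTITION BY visitor_id ORDER BY date) AS rn
        FROM visits
        WHERE visitor_id IN (
            SELECT DISTINCT visitor_id
            FROM visits
            WHERE date BETWEEN '2021-08-01' AND '2021-08-30'
                AND c_id IN (101)
        )
    )
"""

# Join version: find each visitor's first non-zero row, then join back.
JOIN_VERSION = f"""
    WITH {RW_ROWS}
    SELECT t2.visitor_id, t2.date, t2.c_id
    FROM rw_rows AS t2
    JOIN (
        SELECT visitor_id, MIN(rn) AS first_rn
        FROM rw_rows
        WHERE c_id != 0
        GROUP BY visitor_id
    ) AS t1
        ON t2.visitor_id = t1.visitor_id
            AND t2.rn >= t1.first_rn
    ORDER BY t2.visitor_id, t2.date
"""

# Window version: same threshold via a windowed MIN, no self-join.
WINDOW_VERSION = f"""
    WITH {RW_ROWS},
    flagged AS (
        SELECT visitor_id, date, c_id, rn,
            MIN(CASE WHEN c_id != 0 THEN rn END)
                OVER (PARTITION BY visitor_id) AS min_rn
        FROM rw_rows
    )
    SELECT visitor_id, date, c_id
    FROM flagged
    WHERE rn >= min_rn
    ORDER BY visitor_id, date
"""

join_rows = conn.execute(JOIN_VERSION).fetchall()
window_rows = conn.execute(WINDOW_VERSION).fetchall()
print(join_rows == window_rows)
for row in window_rows:
    print(row)
```

On this data both versions return the same four rows: the last two rows of visitor 1 and both rows of visitor 3. Note that visitor 3's starting row falls outside the August range, because the date filter only decides which visitors qualify, not which of their rows are numbered.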