Oracle SQL or PL/SQL: select rows by partition whose values have a specific order

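Task: select the sportsmen who took part in at least 2 competitions in a row (the 2 competitions must directly follow one another; with competitions 1-2-3-4-5: 2&4 or 1&3&5 do not qualify, 1&2 is OK, 1&2&3 is OK, 1&2 and 4&5 is OK). Question: find the best way to do this (faster, using fewer resources).

Working tables:

Each competition_id has exactly one hold_date.

Each sportsman_id has only one result per competition_id.

This works for a result table of 25 rows: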

SELECT DISTINCT sportsman_id, sportsman_name, rank, year_of_birth, personal_record, country
FROM
    (
    SELECT sportsman_id, hold_date,
        -- previous competition number of this sportsman, taken in competition order
        LAG (comp_order, 1) OVER (PARTITION BY sportsman_id ORDER BY comp_order) prev_comp_number
        , comp_order
    FROM result
    INNER JOIN
        (
        -- number the distinct competition dates chronologically
        SELECT hold_date, ROW_NUMBER() OVER (ORDER BY hold_date) AS comp_order
        FROM
            (
            SELECT DISTINCT hold_date
            FROM result
            )
        ) USING (hold_date)
    ORDER BY sportsman_id, comp_order
    )
INNER JOIN sportsman USING (sportsman_id)
-- keep rows whose previous result was in the directly preceding competition
WHERE comp_order-prev_comp_number=1
;

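Screenshot of the code with comments:

Sample data:

Result of the code above (= the desired result):

Assume there are millions of rows (thousands of competitions and thousands of sportsmen). How reliable is my code?

I thought about reducing the number of rows by excluding every sportsman_id that appears only once (a sportsman who took part in, i.e. has a result for, only one competition obviously cannot qualify). Something like this (not implemented yet; I don't know how, or more likely when/where, to apply it):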

SELECT re.hold_date, r.sportsman_id
FROM result r
INNER JOIN result re ON (re.sportsman_id=r.sportsman_id)
GROUP BY r.sportsman_id, re.hold_date
HAVING COUNT(r.sportsman_id) > 1
;

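Also, I guess that with LAG I am just duplicating an existing column; is that OK?

Is there a simpler way using PL/SQL? Or is there a function that does part of what my code does?

If you perform a partitioned outer join of the results to the full list of competitions, you get NULL rows wherever a competitor did not attend a competition. You can then use MATCH_RECOGNIZE to compare the rows in order and COUNT how many consecutive competitions they attended, and exclude sportsmen who attended a single competition without attending the one before or after it.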

SELECT sportsman_id
FROM   (
  SELECT sportsman_id,
         c.competition_id,
         c.hold_date,
         NVL2( r.competition_id, 1, 0 ) AS attended
  FROM   ( SELECT DISTINCT
                  competition_id,
                  hold_date
           FROM   result
         ) c
         LEFT OUTER JOIN result r
         PARTITION BY ( r.sportsman_id )
         ON ( c.competition_id = r.competition_id )
)
MATCH_RECOGNIZE (
  PARTITION BY sportsman_id
  ORDER BY hold_date
  MEASURES COUNT(*) AS num_sequential
  ONE ROW PER MATCH
  PATTERN ( ATTENDED_COMP+ )
  DEFINE
    ATTENDED_COMP AS (
      ATTENDED_COMP.attended = 1
    )
)
GROUP BY sportsman_id
HAVING MIN( num_sequential ) > 1;

So, for the sample data:

CREATE TABLE result ( competition_id, sportsman_id, hold_date ) AS
SELECT 1, 1, DATE '2020-01-01' FROM DUAL UNION ALL
SELECT 2, 1, DATE '2020-02-01' FROM DUAL UNION ALL
SELECT 3, 1, DATE '2020-03-01' FROM DUAL UNION ALL
SELECT 4, 1, DATE '2020-04-01' FROM DUAL UNION ALL
SELECT 5, 1, DATE '2020-05-01' FROM DUAL UNION ALL
SELECT 1, 2, DATE '2020-01-01' FROM DUAL UNION ALL
SELECT 2, 2, DATE '2020-02-01' FROM DUAL UNION ALL
SELECT 4, 2, DATE '2020-04-01' FROM DUAL UNION ALL
SELECT 5, 2, DATE '2020-05-01' FROM DUAL UNION ALL
SELECT 2, 3, DATE '2020-02-01' FROM DUAL UNION ALL
SELECT 4, 3, DATE '2020-04-01' FROM DUAL UNION ALL
SELECT 1, 4, DATE '2020-01-01' FROM DUAL UNION ALL
SELECT 3, 4, DATE '2020-03-01' FROM DUAL UNION ALL
SELECT 5, 4, DATE '2020-05-01' FROM DUAL UNION ALL
SELECT 1, 5, DATE '2020-01-01' FROM DUAL UNION ALL
SELECT 2, 5, DATE '2020-02-01' FROM DUAL UNION ALL
SELECT 5, 5, DATE '2020-05-01' FROM DUAL;

The output is:

| SPORTSMAN_ID |
| -----------: |
|            1 |
|            2 |



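If you want the sportsmen who attended any run of consecutive competitions (regardless of whether all of their competitions fall inside such a run), you can change the last line to: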

HAVING MAX( num_sequential ) > 1;

The output is:

| SPORTSMAN_ID |
| -----------: |
|            1 |
|            2 |
|            5 |



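Or, if you want the details of the matched ranges, you can use PATTERN ( ATTENDED_COMP{2,} ) to match only the runs in which the competitor attended two or more consecutive competitions: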

SELECT *
FROM   (
  SELECT sportsman_id,
         c.competition_id,
         c.hold_date,
         NVL2( r.competition_id, 1, 0 ) AS attended
  FROM   ( SELECT DISTINCT
                  competition_id,
                  hold_date
           FROM   result
         ) c
         LEFT OUTER JOIN result r
         PARTITION BY ( r.sportsman_id )
         ON ( c.competition_id = r.competition_id )
)
MATCH_RECOGNIZE (
  PARTITION BY sportsman_id
  ORDER BY hold_date
  MEASURES
    FIRST( competition_id ) AS first_competition_id,
    FIRST( hold_date ) AS first_hold_date,
    LAST( competition_id ) AS last_competition_id,
    LAST( hold_date ) AS last_hold_date
  ONE ROW PER MATCH
  PATTERN ( ATTENDED_COMP{2,} )
  DEFINE
    ATTENDED_COMP AS ( ATTENDED_COMP.attended = 1 )
)

The output is:

| SPORTSMAN_ID | FIRST_COMPETITION_ID | FIRST_HOLD_DATE     | LAST_COMPETITION_ID | LAST_HOLD_DATE      |
| -----------: | -------------------: | :------------------ | ------------------: | :------------------ |
|            1 |                    1 | 2020-01-01 00:00:00 |                   5 | 2020-05-01 00:00:00 |
|            2 |                    1 | 2020-01-01 00:00:00 |                   2 | 2020-02-01 00:00:00 |
|            2 |                    4 | 2020-04-01 00:00:00 |                   5 | 2020-05-01 00:00:00 |
|            5 |                    1 | 2020-01-01 00:00:00 |                   2 | 2020-02-01 00:00:00 |


You can read the table just once by using the Tabibitosan method to group consecutive competitions together: https://www.red-gate.com/simple-talk/sql/t-sql-programming/the-sql-of-gaps-and-islands-in-sequences/

Here you have to use add_months because your competitions are spaced one month apart:

select sportsman_id, min(hold_date) , max(hold_date), comps_in_island
from (
 select  competition_id, sportsman_id, hold_date, island, count(*) over (partition by sportsman_id,island) comps_in_island
 from (
  select  competition_id, sportsman_id, hold_date , add_months(hold_date,-1*row_number() over(partition by sportsman_id order by hold_date)) island
  from    result
 )
)
where comps_in_island > 1
group by sportsman_id, island, comps_in_island;

DB fiddle: https://dbfiddle.uk/?rdbms=oracle_18&fiddle=1b707262722bc555ad851aee029b347a

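Edit: I got confused by some of the data; it looks like what matters is not the date but competition_id. If you have a gapless sequence of competition_id values, this makes things simpler (so competition 65786162213 would be 65.7 billion events after competition 4):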

select sportsman_id, min(competition_id) , max(competition_id), comps_in_island
from (
 select  competition_id, sportsman_id, hold_date, island, count(*) over (partition by sportsman_id,island) comps_in_island
 from (
  -- consecutive competition_ids minus their row number give the same island value
  select  competition_id, sportsman_id, hold_date , competition_id - row_number() over(partition by sportsman_id order by competition_id) island
  from    result
 )
)
where comps_in_island > 1
group by sportsman_id, island, comps_in_island;

Or, if you first need to work out the competition numbers, you only need one extra subquery that uses dense_rank to rank the unique competition_ids:

select sportsman_id, min(competition_id) , max(competition_id), comps_in_island
from (
 select  competition_id, sportsman_id, hold_date, island, count(*) over (partition by sportsman_id,island) comps_in_island
 from (
  select  competition_id, sportsman_id, hold_date , comp_number -row_number() over(partition by sportsman_id order by comp_number) island
  from (  
   select  competition_id, sportsman_id, hold_date , dense_rank() over (partition by null order by competition_id) comp_number
   from    result
  )
 )
)
where comps_in_island > 1
group by sportsman_id, island, comps_in_island;

This does assume that every competition_id you care about has at least one row in result.

If you just want a list of the sportsmen who took part in at least two consecutive competitions, using lag() just once is enough:

-- assumes consecutive competitions have consecutive competition_id values
select distinct sportsman_id
from (
    select sportsman_id, competition_id,
        lag(competition_id) over(partition by sportsman_id order by competition_id) lag_competition_id
    from result r
) r
where competition_id = lag_competition_id + 1;

You can bring in the corresponding sportsman rows with exists:

select s.*
from sportsman s
where exists (
    select 1
    from (
        select sportsman_id, competition_id,
            lag(competition_id) over(partition by sportsman_id order by competition_id) lag_competition_id
        from result r
    ) r
    where r.competition_id = r.lag_competition_id + 1 and r.sportsman_id = s.sportsman_id
);

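You say there is always exactly one date per competition. That date should therefore live in a competition table, not in the result table. You also say the dates don't overlap (no two competitions on the same date); with the date in a competition table, this too could be enforced with a constraint.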
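As a rough sketch of that idea, the date could be moved into its own table along the following lines. The table name competition, the constraint names, and the column types are assumptions made for illustration and do not come from the question; the foreign key can only be added once competition has been populated from the existing result rows.

```sql
-- Hypothetical schema sketch; names and types are assumptions.
CREATE TABLE competition (
  competition_id NUMBER NOT NULL,
  hold_date      DATE   NOT NULL,
  CONSTRAINT pk_competition PRIMARY KEY (competition_id),
  -- no two competitions on the same date
  -- (assumes hold_date is stored without a time-of-day part)
  CONSTRAINT uq_competition_hold_date UNIQUE (hold_date)
);

-- result then references the competition instead of repeating its date
ALTER TABLE result
  ADD CONSTRAINT fk_result_competition
  FOREIGN KEY (competition_id) REFERENCES competition (competition_id);
```

The queries in the rest of this answer still use the data model from the question, with hold_date in result.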

The first step is to get the competitions/dates in order. With your data model:

select distinct hold_date
from result
order by hold_date;

To get this result quickly, provide an index on the date:

create index idx1 on result (hold_date);

You can also number these dates with ROW_NUMBER, or use LAG / LEAD to see each date together with its neighbouring dates.
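For illustration, numbering the dates and looking at their neighbours might look like the sketch below. The aliases comp_order, prev_date and next_date are just names picked here.

```sql
-- Number the distinct competition dates and show each date's neighbours.
SELECT hold_date,
       ROW_NUMBER()    OVER (ORDER BY hold_date) AS comp_order,
       LAG(hold_date)  OVER (ORDER BY hold_date) AS prev_date,
       LEAD(hold_date) OVER (ORDER BY hold_date) AS next_date
FROM   (SELECT DISTINCT hold_date FROM result)
ORDER  BY hold_date;
```

prev_date is NULL for the first competition and next_date is NULL for the last one.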

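Now, the best approach to finding the sportsmen who took part in two consecutive competitions depends heavily on how often sportsmen take part in competitions in general.

  1. If they take part rarely, say, usually just twice, we can simply join and look at the results quickly.
  2. If they take part a lot, say, usually in about half of the events, we want to walk through the events and stop as soon as we find a consecutive pair, rather than read on.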
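The first approach is not spelled out in this answer. One possible reading of it, offered only as a sketch, is to pair every result with the same sportsman's result on the directly following competition date. The aliases d, r and r2 are arbitrary.

```sql
-- Sketch of approach 1: a sportsman qualifies if one of their results
-- has a companion result on the next competition date.
WITH dates AS (
  SELECT hold_date,
         LEAD(hold_date) OVER (ORDER BY hold_date) AS next_date
  FROM   (SELECT DISTINCT hold_date FROM result)
)
SELECT s.*
FROM   sportsman s
WHERE  EXISTS (
  SELECT 1
  FROM   result r
  JOIN   dates  d  ON d.hold_date = r.hold_date
  JOIN   result r2 ON r2.sportsman_id = r.sportsman_id
                  AND r2.hold_date    = d.next_date
  WHERE  r.sportsman_id = s.sportsman_id
);
```

With few results per sportsman, the self-join touches only a handful of rows each, which is why this form should suit the rare-participation case.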

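Here is the query for the second approach. We use a recursive query (as this is how we apply an iterative process in SQL). We start with all sportsmen and the first date. Then we go to the second date and stop for everyone who took part in both. For the rest we look at the third date and again stop for those who took part in the second and the third. And so on.

There should be an index on date and sportsman so that result rows can be looked up quickly. I would even provide two indexes, because I don't know which column is more selective. So, let the DBMS decide.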

create index idx2 on result (hold_date, sportsman_id);
create index idx3 on result (sportsman_id, hold_date);

Here is the query:

with dates as 
(
  select
    hold_date,
    lead(hold_date) over (order by hold_date) as next_date,
    min(hold_date) over (order by hold_date) as min_date
  from (select distinct hold_date from result)
)
, cte (sportsman_id, sportsman_name, rank, year_of_birth, personal_record, country,
       hold_date, next_date, was_in, is_in) as
(
  select
    s.sportsman_id, s.sportsman_name, s.rank, s.year_of_birth,
    s.personal_record, s.country, d.hold_date, d.next_date, 'NO',
    case when r.hold_date is not null then 'YES' else 'NO' end
  from sportsman s
  cross join (select * from dates where hold_date = min_date) d
  left join result r on r.sportsman_id = s.sportsman_id
                     and r.hold_date = d.hold_date
  union all
  select
    s.sportsman_id, s.sportsman_name, s.rank, s.year_of_birth,
    s.personal_record, s.country, d.hold_date, d.next_date, s.is_in,
    case when r.hold_date is not null then 'YES' else 'NO' end
  from cte s
  join dates d on d.hold_date = s.next_date
  left join result r on r.sportsman_id = s.sportsman_id
                     and r.hold_date = d.hold_date
  where not (s.was_in = 'YES' and s.is_in = 'YES')
)
select sportsman_id, sportsman_name, rank, year_of_birth, personal_record, country
from cte
where was_in = 'YES' and is_in = 'YES';