GROUP BY involving CLOB data

Question

There is a join between three tables: test_3, test_2 and test_1.

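test_1 and test_3 are the main tables and have no columns in common; test_2 is the table that joins them. test_1 has sr_id and last_updated_date,
test_2 has sr_id and sm_id, and test_3 has sm_id and sql_statement. test_3 holds the CLOB data that is causing all the problems.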

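I have to find the latest sr_id associated with each sm_id. My idea was to use the aggregate function max(last_updated_date) and group by the other columns. For a number of reasons, that did not work.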

The sql_statement column contains CLOB data.
I also used a join that I am not familiar with.

Any ideas would be helpful.

WITH xx as (
    (select  ANSWER ,sr_id AS ID from test 
    WHERE Q_ID in (SELECT Q_ID FROM test_2 WHERE field_id='LM_LRE_Q6')
    ) 
)
-- end of source data


SELECT t.ID, t1.n, t1.SM_ID,seg_dtls.SEGMENTation_NAME ,to_char(mst.LAST_UPDATED_DATE,'dd-mon-yyyy hh24:mi:ss'),seg_dtls.sql_statement
FROM xx t
CROSS JOIN LATERAL (
        select LEVEL AS n, regexp_substr( t.answer, '\d+',  1, level) as SM_ID
        from dual
        connect by regexp_substr( t.answer, '\d+',  1, level) IS NOT NULL
) t1
left join test_1 mst 
on mst.sr_id=t.id
right join test_3 seg_dtls
on seg_dtls.sm_id=t1.sm_id;

The sample data looks like this:

sr_id   sm_id SEGMENTATION_NAME  LAST_UPDATED_DATE  
1108197 958   test_not_in          05-feb-2017 23:56:59    
1108217 958   test_not_in          14-feb-2017 00:37:39  
1108218 958   test_not_in          14-feb-2017 01:39:50  
1108220 958   test_not_in          14-feb-2017 03:39:07

The expected output is:

1108220 958   test_not_in          14-feb-2017 03:39:07

I have not posted the CLOB data because it is huge. Every row contains CLOB data.

table test_3 contains  
q_id     sr_id  answer   
1009330 1108246 976~feb_24^941~Test_regionwithcountry  
1009330 1108247 941~Test_regionwithcountry_2016^787~Test_Request_28^976~feb_24  
1009330 1108239 972~test_emea  
1009330 1108240 972~test_emea^827~test_with_region_country  
1009330 1108251 981~MSE100579729 testing.

The sample data above is from test_3; its answer column contains the sm_id values, which I have to extract from there.
For example:

941~Test_regionwithcountry_2016^787~Test_Request_28^976~feb_24  
the sm_id is 941,787,976

So I came up with the query posted above.
As for the left join and right join: I need all the sm_id values in test_3, so I used a right join here.

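edit1: The accepted answer gives the SR_ID of the segments with max(last_updated_date).
I need all the SR_IDs. So I used the MINUS operator to get the ones that do not have max(last_updated_date).
I need to append that result set to the output of the accepted answer.

This is what I did to get the other SR_IDs.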

select sr_id,segmentation_name,request_status from (with test_31 (q_id, sr_id, answer) as (
 (SELECT Q_ID,SR_ID,ANSWER FROM test_3 WHERE Q_ID=(SELECT Q_ID FROM test_4 WHERE FIELD_ID='LM_LRE_Q6'))
),
answer_extraction as (
  select q_id, sr_id,
    regexp_substr(regexp_substr(answer, '[^^]+', 1, level),'\d+') as sm_id
  from test_31
  connect by q_id = prior q_id
  and sr_id = prior sr_id
  and prior dbms_random.value is not null
  and regexp_substr(answer, '[^^]+', 1, level) is not null
)
select sr_id,
  sm_id,
  segmentation_name,
  LAST_UPDATED_DATE,
  sql_statement,request_status
from (
  select t1.sr_id,
    t2.sm_id,
    t2.segmentation_name,
    t1.last_updated_date,
    t2.sql_statement,
    t1.request_status

  from test_4 t4
  join answer_extraction t3 on t3.q_id = t4.q_id
  join test_2 t2 on t2.sm_id = t3.sm_id
  join test_1 t1 on t1.sr_id = t3.sr_id
)
)
minus

(select  sr_id,segmentation_name , request_status from (with test_31 (q_id, sr_id, answer) as (
 (SELECT Q_ID,SR_ID,ANSWER FROM test_3 WHERE Q_ID=(SELECT Q_ID FROM test_4 WHERE FIELD_ID='LM_LRE_Q6'))
),
answer_extraction as (
  select q_id, sr_id,
    regexp_substr(regexp_substr(answer, '[^^]+', 1, level), '\d+') as sm_id
  from test_31
  connect by q_id = prior q_id
  and sr_id = prior sr_id
  and prior dbms_random.value is not null
  and regexp_substr(answer, '[^^]+', 1, level) is not null
)
select sr_id,
  segmentation_name,
  sql_statement,
   request_status
from (
  select t1.sr_id,
    t2.sm_id,
    t2.segmentation_name,
    t1.last_updated_date,
    t2.sql_statement,
     t1.request_status,
    max(t1.last_updated_date) over (partition by t2.sm_id) as max_updated_date
  from test_4 t4
  join answer_extraction t3 on t3.q_id = t4.q_id
  join test_2 t2 on t2.sm_id = t3.sm_id
  join test_1 t1 on t1.sr_id = t3.sr_id
)
where last_updated_date = max_updated_date));


Sample data:
The accepted answer gives the following output, which contains the row with max(last_updated_date) for the segment.

1097661 Submitted   o2k lad 30-NOV-15   01-DEC-16   62  CLOB DATA

The query posted above gives the output below, which lists the sr_id values of the segment that have other update dates.

1097621 o2k lad Submitted
1097625 o2k lad Submitted
1097627 o2k lad Submitted
1097632 o2k lad Submitted
1097633 o2k lad Submitted
1097658 o2k lad Pending
1097640 o2k lad Submitted
1097644 o2k lad Submitted
1097646 o2k lad Submitted

Expected output:

  sr_id status     segment_name updated_date sql_statement other_sr_id
1097661 Submitted   o2k lad     30-NOV-15     CLOB DATA 1097618,1097621,1097625,1097627,1097632,1097633,1097658,1097640,1097644,1097646

I want to combine the two queries so that the last column contains all the older sr_id values.
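One possible way to build such an other_sr_id column, shown only as a rough sketch and not run against the real tables, is to rank the rows per sm_id and aggregate the non-latest sr_id values with LISTAGG. This assumes Oracle 11gR2 or later, where LISTAGG is available. The requests CTE is made-up data based on the sample rows above. The CLOB column is deliberately left out, because a CLOB cannot be used in GROUP BY; it could be joined back on sm_id after the aggregation.

```sql
-- Sketch only: "requests" is a hypothetical stand-in for the joined tables.
with requests (sr_id, sm_id, last_updated_date) as (
  select 1108197, 958, timestamp '2017-02-05 23:56:59' from dual
  union all select 1108217, 958, timestamp '2017-02-14 00:37:39' from dual
  union all select 1108218, 958, timestamp '2017-02-14 01:39:50' from dual
  union all select 1108220, 958, timestamp '2017-02-14 03:39:07' from dual
),
ranked as (
  -- rn = 1 marks the most recently updated row for each sm_id
  select sr_id, sm_id, last_updated_date,
    row_number() over (partition by sm_id order by last_updated_date desc) as rn
  from requests
)
select sm_id,
  max(case when rn = 1 then sr_id end) as sr_id,
  max(last_updated_date) as last_updated_date,
  -- LISTAGG skips nulls, so only the older sr_id values are concatenated
  listagg(case when rn > 1 then sr_id end, ',')
    within group (order by sr_id) as other_sr_id
from ranked
group by sm_id;
```

LISTAGG returns a VARCHAR2, so a very long list of sr_id values could exceed its size limit.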

Answer 1

A fairly simple option is to modify your current query to add an analytic function that finds the maximum date for each ID, e.g.:

..., max(mst.last_updated_date) over (partition by id) as max_updated_date

A quick demo of the general idea:

with cte (id, last_updated_date, sql_statement) as (
  select 1, date '2017-01-01', to_clob('stmt 1') from dual
  union all select 1, date '2017-01-02', to_clob('stmt 2') from dual
  union all select 1, date '2017-01-03', to_clob('stmt 3') from dual
  union all select 2, date '2017-01-02', to_clob('stmt 4') from dual
)
select id, last_updated_date, sql_statement
from (
  select id, last_updated_date, sql_statement,
    max(last_updated_date) over (partition by id) as max_updated_date
  from cte
)
where last_updated_date = max_updated_date;

        ID LAST_UPDAT SQL_STATEMENT                                                                   
---------- ---------- --------------------------------------------------------------------------------
         1 2017-01-03 stmt 3                                                                          
         2 2017-01-02 stmt 4

You could instead use row_number(), rank() or dense_rank() to identify the row(s) with the latest date and filter on that, but the overall idea is the same.
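As a minimal sketch of the row_number() variant, reusing the same made-up cte data: the rows are numbered within each id by date, newest first, and only the first row is kept. Unlike the max() comparison, this keeps exactly one row per id even if two rows tie on the latest date; rank() would keep the ties.

```sql
with cte (id, last_updated_date, sql_statement) as (
  select 1, date '2017-01-01', to_clob('stmt 1') from dual
  union all select 1, date '2017-01-02', to_clob('stmt 2') from dual
  union all select 1, date '2017-01-03', to_clob('stmt 3') from dual
  union all select 2, date '2017-01-02', to_clob('stmt 4') from dual
)
select id, last_updated_date, sql_statement
from (
  select id, last_updated_date, sql_statement,
    -- number the rows for each id, newest date first
    row_number() over (partition by id order by last_updated_date desc) as rn
  from cte
)
where rn = 1;
```

The CLOB column is only selected, never grouped or ordered by, which is why the analytic approach avoids the problem that GROUP BY runs into.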

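However, your current query isn't very clear to begin with (and isn't valid before 12c). Rather than trying to guess how to fit such a function and filter into it, it may be simpler to start again from your base tables. That makes a lot of assumptions about what you are doing, and may overlook some things, such as the left and right joins, which may or may not be needed.

Making up some data through CTEs: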

with test_1 (sr_id, last_updated_date) as (
  select 1108197, timestamp '2017-02-05 23:56:59' from dual
  union all select 1108217, timestamp '2017-02-14 00:37:39' from dual
  union all select 1108218, timestamp '2017-02-14 01:39:50' from dual
  union all select 1108220, timestamp '2017-02-14 03:39:07' from dual
),
test_2 (sm_id, segmentation_name, sql_statement) as (
  select 958, 'test_not_in', to_clob('select * from dual') from dual
),
test_3 (q_id, sr_id, answer) as (
  select 41, 1108197, 958 from dual
  union all select 42, 1108217, 958 from dual
  union all select 43, 1108218, 958 from dual
  union all select 44, 1108220, 958 from dual
),
test_4 (q_id, field_id) as (
  select 41, 'LM_LRE_Q6' from dual
  union all select 42, 'LM_LRE_Q6' from dual
  union all select 43, 'LM_LRE_Q6' from dual
  union all select 44, 'LM_LRE_Q6' from dual
)

Then this gets the same output you showed in the question:

select t1.sr_id,
  t2.sm_id,
  t2.segmentation_name,
  to_char(t1.last_updated_date, 'dd-mon-yyyy hh24:mi:ss') as last_updated_date,
  t2.sql_statement
from test_4 t4
join test_3 t3 on t3.q_id = t4.q_id
join test_2 t2 on t2.sm_id = t3.answer
join test_1 t1 on t1.sr_id = t3.sr_id;

     SR_ID SM_ID SEGMENTATIO LAST_UPDATED_DATE             SQL_STATEMENT                                                                   
---------- ----- ----------- ----------------------------- --------------------------------------------------------------------------------
   1108197   958 test_not_in 05-feb-2017 23:56:59          select * from dual                                                              
   1108217   958 test_not_in 14-feb-2017 00:37:39          select * from dual                                                              
   1108218   958 test_not_in 14-feb-2017 01:39:50          select * from dual                                                              
   1108220   958 test_not_in 14-feb-2017 03:39:07          select * from dual

On the wild assumption that this is close to right, you can find the row with the latest date for each sm_id like this:

select sr_id,
  sm_id,
  segmentation_name,
  to_char(last_updated_date, 'dd-mon-yyyy hh24:mi:ss') as last_updated_date,
  sql_statement
from (
  select t1.sr_id,
    t2.sm_id,
    t2.segmentation_name,
    t1.last_updated_date,
    t2.sql_statement,
    max(t1.last_updated_date) over (partition by t2.sm_id) as max_updated_date
  from test_4 t4
  join test_3 t3 on t3.q_id = t4.q_id
  join test_2 t2 on t2.sm_id = t3.answer
  join test_1 t1 on t1.sr_id = t3.sr_id
)
where last_updated_date = max_updated_date;

     SR_ID SM_ID SEGMENTATIO LAST_UPDATED_DATE             SQL_STATEMENT                                                                   
---------- ----- ----------- ----------------------------- --------------------------------------------------------------------------------
   1108220   958 test_not_in 14-feb-2017 03:39:07          select * from dual

You will need to adapt that to handle any other restrictions or requirements that aren't clear (e.g. including your left/right outer joins).
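If the outer joins are there to keep every segment, including ones with no requests, the filter needs a small adjustment, because a null date never equals anything. The following is only a sketch under that assumption. The segment with sm_id 959 is made up purely to show the effect, and the CTEs are cut-down versions of the made-up data above.

```sql
-- Sketch only: sm_id 959 is a hypothetical segment with no requests.
with test_1 (sr_id, last_updated_date) as (
  select 1108218, timestamp '2017-02-14 01:39:50' from dual
  union all select 1108220, timestamp '2017-02-14 03:39:07' from dual
),
test_2 (sm_id, segmentation_name) as (
  select 958, 'test_not_in' from dual
  union all select 959, 'unused_segment' from dual
),
test_3 (sr_id, answer) as (
  select 1108218, 958 from dual
  union all select 1108220, 958 from dual
)
select sr_id, sm_id, segmentation_name, last_updated_date
from (
  select t1.sr_id, t2.sm_id, t2.segmentation_name, t1.last_updated_date,
    max(t1.last_updated_date) over (partition by t2.sm_id) as max_updated_date
  from test_2 t2
  left join test_3 t3 on t3.answer = t2.sm_id
  left join test_1 t1 on t1.sr_id = t3.sr_id
)
where last_updated_date = max_updated_date
or max_updated_date is null; -- keeps segments that have no dated rows at all
```

Starting from test_2 with left joins plays the same role as the right join in the original query: the segment table drives the result.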

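I have deliberately ignored the subquery you are using to split 'answer' into multiple values. Possibly you have something horrible in there, like a delimited list of IDs, which is a data model problem. If that is the case then you still need to extract the individual sm_id values; something like: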

with answer_extraction as (
  select q_id, sr_id, regexp_substr(answer, '\d+', 1, level) as sm_id
  from test_3
  connect by q_id = prior q_id
  and sr_id = prior sr_id
  and prior dbms_random.value is not null
  and regexp_substr(answer, '\d+', 1, level) is not null
)
select sr_id,
  sm_id,
  segmentation_name,
  to_char(last_updated_date, 'dd-mon-yyyy hh24:mi:ss') as last_updated_date,
  sql_statement
from (
  select t1.sr_id,
    t2.sm_id,
    t2.segmentation_name,
    t1.last_updated_date,
    t2.sql_statement,
    max(t1.last_updated_date) over (partition by t2.sm_id) as max_updated_date
  from test_4 t4
  join answer_extraction t3 on t3.q_id = t4.q_id
  join test_2 t2 on t2.sm_id = t3.sm_id
  join test_1 t1 on t1.sr_id = t3.sr_id
)
where last_updated_date = max_updated_date;

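Based on the actual content of test_3 that you have added, your regular expression doesn't quite do what you need. With the pattern you are using it finds 14 numeric values, i.e. every run of digits: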

with test_3 (q_id, sr_id, answer) as (
  select 1009330, 1108246, '976~feb_24^941~Test_regionwithcountry' from dual
  union all select 1009330, 1108247, '941~Test_regionwithcountry_2016^787~Test_Request_28^976~feb_24' from dual
  union all select 1009330, 1108239, '972~test_emea' from dual
  union all select 1009330, 1108240, '972~test_emea^827~test_with_region_country' from dual
  union all select 1009330, 1108251, '981~MSE100579729 testing.' from dual
),
answer_extraction as (
  select q_id, sr_id, regexp_substr(answer, '\d+', 1, level) as sm_id
  from test_3
  connect by q_id = prior q_id
  and sr_id = prior sr_id
  and prior dbms_random.value is not null
  and regexp_substr(answer, '\d+', 1, level) is not null
)
select * from answer_extraction;

      Q_ID      SR_ID SM_ID     
---------- ---------- ----------
   1009330    1108239 972       
   1009330    1108240 972       
   1009330    1108240 827       
   1009330    1108246 976       
   1009330    1108246 24        
   1009330    1108246 941       
   1009330    1108247 941       
   1009330    1108247 2016      
   1009330    1108247 787       
   1009330    1108247 28        
   1009330    1108247 976       
   1009330    1108247 24        
   1009330    1108251 981       
   1009330    1108251 100579729

It looks like you only want the bit between each ^ delimiter and the ~ marker. The usual way to split a delimited string would be:

with test_3 (q_id, sr_id, answer) as (
  select 1009330, 1108246, '976~feb_24^941~Test_regionwithcountry' from dual
  union all select 1009330, 1108247, '941~Test_regionwithcountry_2016^787~Test_Request_28^976~feb_24' from dual
  union all select 1009330, 1108239, '972~test_emea' from dual
  union all select 1009330, 1108240, '972~test_emea^827~test_with_region_country' from dual
  union all select 1009330, 1108251, '981~MSE100579729 testing.' from dual
),
answer_extraction as (
  select q_id, sr_id, regexp_substr(answer, '[^^]+', 1, level) as sm_id
  from test_3
  connect by q_id = prior q_id
  and sr_id = prior sr_id
  and prior dbms_random.value is not null
  and regexp_substr(answer, '[^^]+', 1, level) is not null
)
select * from answer_extraction;

      Q_ID      SR_ID SM_ID                                   
---------- ---------- ----------------------------------------
   1009330    1108239 972~test_emea                           
   1009330    1108240 972~test_emea                           
   1009330    1108240 827~test_with_region_country            
   1009330    1108246 976~feb_24                              
   1009330    1108246 941~Test_regionwithcountry              
   1009330    1108247 941~Test_regionwithcountry_2016         
   1009330    1108247 787~Test_Request_28                     
   1009330    1108247 976~feb_24                              
   1009330    1108251 981~MSE100579729 testing.

But you then need to get just the first part of each element, e.g. borrowing your original pattern (others would work too!):

column sm_id format a10
with test_3 (q_id, sr_id, answer) as (
  select 1009330, 1108246, '976~feb_24^941~Test_regionwithcountry' from dual
  union all select 1009330, 1108247, '941~Test_regionwithcountry_2016^787~Test_Request_28^976~feb_24' from dual
  union all select 1009330, 1108239, '972~test_emea' from dual
  union all select 1009330, 1108240, '972~test_emea^827~test_with_region_country' from dual
  union all select 1009330, 1108251, '981~MSE100579729 testing.' from dual
),
answer_extraction as (
  select q_id, sr_id,
    regexp_substr(regexp_substr(answer, '[^^]+', 1, level), '\d+') as sm_id
  from test_3
  connect by q_id = prior q_id
  and sr_id = prior sr_id
  and prior dbms_random.value is not null
  and regexp_substr(answer, '[^^]+', 1, level) is not null
)
select * from answer_extraction;

      Q_ID      SR_ID SM_ID     
---------- ---------- ----------
   1009330    1108239 972       
   1009330    1108240 972       
   1009330    1108240 827       
   1009330    1108246 976       
   1009330    1108246 941       
   1009330    1108247 941       
   1009330    1108247 787       
   1009330    1108247 976       
   1009330    1108251 981

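Note that the extra regexp_substr() is only in the select list, not in the connect-by clause; and the extracted sm_id is still a string. If test_2.sm_id is a number then also add a to_number() call around the pair of nested substring calls in the select list.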
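A minimal sketch of that last adjustment, assuming test_2.sm_id really is a number and using two of the sample rows; only the select-list expression changes:

```sql
with test_3 (q_id, sr_id, answer) as (
  select 1009330, 1108246, '976~feb_24^941~Test_regionwithcountry' from dual
  union all select 1009330, 1108239, '972~test_emea' from dual
),
answer_extraction as (
  select q_id, sr_id,
    -- inner call: the level-th '^'-delimited element;
    -- outer call: the first run of digits in that element
    to_number(
      regexp_substr(regexp_substr(answer, '[^^]+', 1, level), '\d+')
    ) as sm_id
  from test_3
  connect by q_id = prior q_id
  and sr_id = prior sr_id
  and prior dbms_random.value is not null
  and regexp_substr(answer, '[^^]+', 1, level) is not null
)
select * from answer_extraction;
```

With an explicit to_number() the join to test_2 compares numbers to numbers, rather than relying on an implicit conversion.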
