Getting what's left after removing regex match

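The context is SQL on AS/400 (IBM i).

My goal is to end up with two values: a string determined by a regex I already have, and then everything else in the source string, with the regex result removed and the gap (if any) closed up.

Here's the SQL: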

select HAD1,                                                   
 regexp_substr(HAD1,'\b(GATE|LEVEL|DOOR|UNITS?)\s[\dA-Z]{1,}'),
 regexp_substr(HAD1,'**eventual_regex_goes_here**')                
from ECH                                                       
where regexp_like(HAD1,'\bGATE')                               

Desired result:

Ship To                              REGEXP_SUBSTR     REGEXP_SUBSTR       
Address                                                                 
D2 COMPOUND, GATE 11                 GATE 11           D2 COMPOUND,  
2/22 GATEWAY DRIVE                   -                 2/22 GATEWAY DRIVE  
ASHBURTON FITTINGS  GATE 2           GATE 2            ASHBURTON FITTINGS
BRIERLY RD, GATE A, RIVER SIDE       GATE A            BRIERLY RD, , RIVER SIDE  
GATE 16, 37 KENEPURU DRIVE           GATE 16           , 37 KENEPURU DRIVE  

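It would be nice if the second expression could also drop the commas, but that's not essential. The remaining string will go through other (non-regex) processing to remove extraneous elements (phone numbers, comments, punctuation, etc.).

The closest existing post suggested by the board software gave the following string: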

^.+?(?=\d{2})|(?<=\d{2}).+$

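So, first I tried substituting my whole expression for the two occurrences of \d{2} and found that this (unsurprisingly) would not process. Then I went back to a more basic test and tried to build up from there.

Let's try the word GATE as a constant, plus a couple of boundaries (because deep down I'm still just a child, and you know what they say: "Children need boundaries").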

select had1,                                                   
 regexp_substr(HAD1,'\b(GATE|LEVEL|DOOR|UNITS?)\s[\dA-Z]{1,}'),
 regexp_substr(HAD1,'^.+?(?=\bGATE\b)|(?<=\bGATE\b).+$')               
from ech                                                       
where regexp_like(HAD1,'\bGATE')      

Result:

Ship To                                   REGEXP_SUBSTR     REGEXP_SUBSTR                 
Address                                                                                   
GATE 3, CNR QUARRY ROAD                   GATE 3             3, CNR QUARRY ROAD           
ASHBURTON FITTINGS  GATE 2                GATE 2            ASHBURTON FITTINGS            
GATE 6, HELLABYS ROAD                     GATE 6             6, HELLABYS ROAD             
GATE 3, 548 PAKAKARIKI HILL               GATE 3             3, 548 PAKAKARIKI HILL       
GATE 5 - FLIGHTYS COMPOUND                GATE 5             5 - FLIGHTYS COMPOUND        
GATE 3 - 548 PAEKAKARIKI HILL ROAD        GATE 3             3 - 548 PAEKAKARIKI HILL ROAD
GATE 14 - TAKAPU COMPOUND                 GATE 14            14 - TAKAPU COMPOUND         
35 GATEWAY DRIVE                          -                 -                             
GATE 6                                    GATE 6             6                            
TAKAPU ROAD,GATE 20,SH1                   GATE 20            TAKAPU ROAD,

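This looks promising, bearing in mind that I'm not yet using the full expression for the second result column. But something is already a bit off.

The second and last rows should have more data, "2" and ",SH1" respectively. The string "35 GATEWAY DRIVE" should be in the last column. I want everything except what the expression finds (remember, at the moment that's just the whole word GATE).

It seems it can return the remaining text on one side or the other of the removed text, but not both sides at once, and it returns nothing at all when there is nothing to remove. So until I understand why I'm not getting all the text that isn't GATE, there's no point in adding further complexity to include the gate number. I'll pause here and ask for help.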
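The behavior described above is consistent with how a substring-style regex lookup works: it returns a single match (by default the first one), so an alternation of "text before GATE" and "text after GATE" can only ever yield one side, and a string with no match yields NULL. A replace-style call instead removes the match and keeps everything else. The following is a minimal sketch in Python, whose `re` module is used here only as a stand-in for the Db2 regex functions (an assumption, since the two engines can differ in edge cases); the helper names `first_match` and `remove_gate` are hypothetical.

```python
import re

# Same alternation as in the query above: "text before GATE" OR "text after GATE".
SIDES = re.compile(r'^.+?(?=\bGATE\b)|(?<=\bGATE\b).+$')

def first_match(text):
    """Substring-style lookup: return only the first match, or None if there is none."""
    m = SIDES.search(text)
    return m.group(0) if m else None

def remove_gate(text):
    """Replace-style call: delete the whole word GATE and keep everything else."""
    return re.sub(r'\bGATE\b', '', text)

if __name__ == '__main__':
    for address in ['TAKAPU ROAD,GATE 20,SH1', 'GATE 6', '35 GATEWAY DRIVE']:
        # Left: what a single-match lookup sees. Right: what removal leaves behind.
        print(repr(address), '->', repr(first_match(address)), '|', repr(remove_gate(address)))
```

For 'TAKAPU ROAD,GATE 20,SH1' the lookup stops at the first alternative that matches ('TAKAPU ROAD,'), while the removal keeps both sides; for '35 GATEWAY DRIVE' the lookup finds nothing, but the removal returns the string unchanged.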

You can try this:

with data (s) as (values
('D2 COMPOUND, GATE 11'), 
('2/22 GATEWAY DRIVE'),
('ASHBURTON FITTINGS  GATE 2'),
('BRIERLY RD, GATE A, RIVER SIDE'),
('GATE 16, 37 KENEPURU DRIVE')
) 
select s,
       regexp_substr(s,' ?(GATE|LEVEL|DOOR|UNITS) '),
       replace(regexp_replace(s,' ?(GATE|LEVEL|DOOR|UNITS) ',''),',',' ')
from   data

Result:

D2 COMPOUND, GATE 11             GATE   D2 COMPOUND 11
2/22 GATEWAY DRIVE                 -    2/22 GATEWAY DRIVE
ASHBURTON FITTINGS  GATE 2       GATE   ASHBURTON FITTINGS 2
BRIERLY RD, GATE A, RIVER SIDE   GATE   BRIERLY RD A  RIVER SIDE
GATE 16, 37 KENEPURU DRIVE       GATE   16  37 KENEPURU DRIVE

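I've accepted the answer kindly provided by user2398621 as the correct one.

However, for those playing along at home, here is the full-fat, nearly-application-ready answer in the context of the data it will be applied to. Comments are enclosed like /* this */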

select distinct HAD1 ,                                              
regexp_substr(HAD1 ,'\b(GATE|LEVEL|DOOR|UNITS?)\s[\dA-Z]{1,}'),     
trim(     /* remove leading/trailing blanks from REPLACE func */    
replace(   /* replace commas  */                                    
 replace(   /* replace slashes */                                   
  replace(   /* replace dashes  */                                  
   regexp_replace(HAD1 ,'\b(GATE|LEVEL|DOOR|UNITS?)\s[\dA-Z]{1,}',  
                  '')  /* replace extra address detail with null */ 
          ,'-',' ')                                                 
         ,'/',' ')                                                  
        ,',',' ')                                                   
     )                                                              
from ECH                                                            
where regexp_like(HAD1 ,'\b(GATE|LEVEL|DOOR|UNITS?)\b')             
  and length(trim(HAD1 )) > 12   /* show only longish addresses in sample */                                   

Sample GATE entries

Ship To                                   REGEXP_SUBSTR    REGEXP_REPLACE    
Address                                                                       
GATE 6 52 MAHIA ROAD                      GATE 6           52 MAHIA ROAD        
ASHBURTON FITTINGS  GATE 2                GATE 2           ASHBURTON FITTINGS   
FIRST GATE AFTER THE ROUNDABOUT           GATE AFTER       FIRST  THE ROUNDABOUT
GATE 2,  61-63 NORMANBY ROAD              GATE 2           61 63 NORMANBY ROAD  
GATE 7, OFF MORRING STREET                GATE 7           OFF MORRING STREET   
GATE 7 OFF MORRIN STREET                  GATE 7           OFF MORRIN STREET    
GATE 6 SUBSTATION ROAD                    GATE 6           SUBSTATION ROAD      
VIA GATE 4, BUILDING 108                  GATE 4           VIA   BUILDING 108   

Sample LEVEL entries (note that the blank REGEXP_REPLACE on the first row is correct, because both LEVEL and UNIT, along with their numbers, have been removed)

Ship To                                   REGEXP_SUBSTR    REGEXP_REPLACE
Address                                                          
LEVEL 2  UNIT 16                          LEVEL 2                                      
TRANSPOWER HOUSE - LEVEL 8                LEVEL 8          TRANSPOWER HOUSE             
LEVEL 3/27 NAPIER STREET                  LEVEL 3          27 NAPIER STREET             
LEVEL 2 GRAHAM STREET SERVICE CENTRE      LEVEL 2          GRAHAM STREET SERVICE CENTRE 
LEVEL 1 - MATT WILES                      LEVEL 1          MATT WILES                   
ANZ CENTRE, LEVEL 2                       LEVEL 2          ANZ CENTRE                   

Sample DOOR entries

Ship To                                   REGEXP_SUBSTR   REGEXP_REPLACE
Address                                                               
NEXT DOOR TO 201                          DOOR TO         NEXT  201    
WAREHOUSE DOOR A                          DOOR A          WAREHOUSE    
DOOR 11 ( WAREHOUSE)                      DOOR 11         ( WAREHOUSE) 
DOOR 11 (WAREHOUSE)                       DOOR 11         (WAREHOUSE)  

Sample UNIT entries

Ship To                                   REGEXP_SUBSTR   REGEXP_REPLACE
Address                                                               
UNIT B 11 LANGSTONE LANE                  UNIT B          11 LANGSTONE LANE   
26 BELFAST ROAD UNIT 1                    UNIT 1          26 BELFAST ROAD     
UNIT C 589 TERMAINE AVE                   UNIT C          589 TERMAINE AVE  
UNIT 1, 3 HENRY ROSE PLACE                UNIT 1          3 HENRY ROSE PLACE
UNIT 1/12 ANVIL ROAD                      UNIT 1          12 ANVIL ROAD     
UNIT D1, 269A MT SMART ROAD               UNIT D1         269A MT SMART ROAD

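You'll notice there are still some anomalies, even in this small sample: sometimes removing the text picked out by the expression leaves a nonsensical remainder, sometimes there are leftover dashes we need to remove, and so on. But I'd rather fix 5% of the cases by hand than 95% of them.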
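For anyone who wants to experiment with the same pipeline outside Db2, here is a minimal Python sketch of it: pull out the detail, delete every occurrence of it, turn dashes, slashes and commas into blanks, then trim. Python's `re` module is only a stand-in for the Db2 regex functions (an assumption), and the function name `split_address` is hypothetical.

```python
import re

# Same expression as in the query: keyword, one whitespace character, then digits/capitals.
DETAIL = re.compile(r'\b(?:GATE|LEVEL|DOOR|UNITS?)\s[\dA-Z]{1,}')

def split_address(address):
    """Return (detail, remainder): the first detail found, and the address without any details."""
    m = DETAIL.search(address)
    detail = m.group(0) if m else None
    remainder = DETAIL.sub('', address)      # remove every occurrence, like regexp_replace
    for ch in '-/,':                         # the three nested REPLACE calls
        remainder = remainder.replace(ch, ' ')
    return detail, remainder.strip()         # TRIM leading/trailing blanks

if __name__ == '__main__':
    for address in ['GATE 2,  61-63 NORMANBY ROAD',
                    'LEVEL 2  UNIT 16',
                    'FIRST GATE AFTER THE ROUNDABOUT']:
        print(repr(address), '->', split_address(address))
```

The last sample address shows the same anomaly as the tables above: any capitalised word after the keyword is taken as the gate identifier, so 'GATE AFTER' is extracted.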

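I know you've already marked an answer as correct, but here's one without all those replaces. The difference is that I pick up the spaces and commas on either side of the initial REGEX and replace them with a single space; then, if that space leads or trails the string, I TRIM it off, like so: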

CREATE TABLE strtest
  (string   varchar(255));

INSERT INTO strtest
VALUES ('D2 COMPOUND, GATE 11'),
       ('2/22 GATEWAY DRIVE'),
       ('ASHBURN FITTINGS  GATE 2'),
       ('BRIERLY RD, GATE A, RIVER SIDE'),
       ('GATE 16, 37 KENEPURU DRIVE');

select STRING,                                                   
       regexp_substr(STRING,'\b(GATE|LEVEL|DOOR|UNITS?)\s[\dA-Z]{1,}'),
       TRIM(regexp_REPLACE(STRING,'[ ,/-]*\b(GATE|LEVEL|DOOR|UNITS?)\s[\dA-Z]{1,}[ ,/-]*', ' '))
  from STRTEST                                                       
  where regexp_like(STRING,'\bGATE')

Result:

|STRING                                 |REGEXP_SUBSTR  |REGEXP_REPLACE          |
|---------------------------------------|---------------|------------------------|
|D2 COMPOUND, GATE 11                   |GATE 11        |D2 COMPOUND             |
|2/22 GATEWAY DRIVE                     |               |2/22 GATEWAY DRIVE      |
|ASHBURN FITTINGS  GATE 2               |GATE 2         |ASHBURN FITTINGS        |
|BRIERLY RD, GATE A, RIVER SIDE         |GATE A         |BRIERLY RD RIVER SIDE   |
|GATE 16, 37 KENEPURU DRIVE             |GATE 16        |37 KENEPURU DRIVE       |
|LEVEL 3/27 NAPIER STREET               |LEVEL 3        |27 NAPIER STREET        |
|LEVEL 1 - MATT WILES                   |LEVEL 1        |MATT WILES              |

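The magic is in the [ ,]* expression added to the beginning and end of the REGEXP. If you want to pick up dashes and slashes as well, as the query above does, make it [ ,/-]* and keep the dash last in the brackets so it is read as a literal dash; written as [ -,/]* it would instead define a range from space to comma.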
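To see the separator-absorbing idea in isolation, here is a minimal Python sketch. Python's `re` module stands in for the Db2 regex functions (an assumption), and the names `SEPS`, `DETAIL` and `strip_detail` are hypothetical.

```python
import re

DETAIL = r'\b(?:GATE|LEVEL|DOOR|UNITS?)\s[\dA-Z]{1,}'
# Dash is placed last so it is a literal; '[ -,/]' would be the range space..comma plus '/'.
SEPS = r'[ ,/-]*'
PATTERN = re.compile(SEPS + DETAIL + SEPS)

def strip_detail(address):
    """Swap the detail and its surrounding separators for one blank, then trim the ends."""
    return PATTERN.sub(' ', address).strip()

if __name__ == '__main__':
    for address in ['D2 COMPOUND, GATE 11',
                    'BRIERLY RD, GATE A, RIVER SIDE',
                    'GATE 16, 37 KENEPURU DRIVE',
                    'LEVEL 3/27 NAPIER STREET',
                    '2/22 GATEWAY DRIVE']:
        print(repr(address), '->', repr(strip_detail(address)))
```

Because the separators are consumed by the same match as the detail, a single replacement closes the gap, and an address with no detail (such as '2/22 GATEWAY DRIVE') passes through untouched, slash included.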

You still have those pesky DOOR TO and GATE AFTER entries, but they're few and you can probably correct them later.