Select rows where the first, last or both characters are special or punctuation, unless they only have a period at the end

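I need to retrieve rows from my table whose names begin or end with whitespace ([:space:]) or another special character ([:punct:]), except for names that merely have a single dot (.) at the end. The idea is to extract names that are possibly inconsistent.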

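Examples that must appear:

  1. 'GEORGE & SON ' - there is an extra space at the end.
  2. '-GEORGE & SON' - there is an extra - at the beginning.
  3. '&GEORGE & SON' - there is an extra & at the beginning.
  4. '-GEORGE & SON S.A.' - there is an extra - at the beginning. The dot at the end is not a problem.
  5. 'GEORGE & SON..' - there are two dots at the end rather than one. Strings ending in more than one . are an exception to the rule; they are bad names too.

Examples that must not appear:

  1. 'GEORGE & SON.' - there is only a single extra '.' at the end.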

I am using the expression:

REGEXP_LIKE(col, '(^[[:punct:]]|[[:punct:]]$)|(^[[:space:]]|[[:space:]]$)')

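It does retrieve the names that begin or end with a space or a special character, but it also extracts the names that have a dot '.' as their last character.

How can I change it to get the results I need?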

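Just add {2} after the second [[:punct:]]. The string must then end in two consecutive punctuation characters, so a single trailing dot no longer matches.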

with tab as(
  select 'GEORGE & SON ' as s from dual union all
  select '-GEORGE & SON'  as s from dual union all
  select '&GEORGE & SON'  as s from dual union all
  select 'GEORGE & SON..'  as s from dual union all
  select 'GEORGE & SON.'  as s from dual union all
  select '-GEORGE & SON S.A.' as s from dual  
)
select * from  tab 
where REGEXP_LIKE(s, '(^[[:punct:]]|[[:punct:]]{2}$)|(^[[:space:]]|[[:space:]]$)') 
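One consequence of {2} is worth checking before relying on it: a name that ends in a single punctuation character other than the dot is no longer reported either. The sketch below flags each row with Y or N instead of filtering; the row ending in ! is a hypothetical sample that is not part of the question.

```sql
WITH tab AS (
  SELECT 'GEORGE & SON.'  AS s FROM dual UNION ALL
  SELECT 'GEORGE & SON..' AS s FROM dual UNION ALL
  SELECT 'GEORGE & SON!'  AS s FROM dual   -- hypothetical extra sample
)
SELECT s,
       -- Same pattern as above; Y means the row would be selected
       CASE
         WHEN REGEXP_LIKE(s, '(^[[:punct:]]|[[:punct:]]{2}$)|(^[[:space:]]|[[:space:]]$)')
         THEN 'Y' ELSE 'N'
       END AS flagged
  FROM tab;
```

Only the row ending in two dots ends in two punctuation characters, so it is the only one expected to be flagged.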

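Since the predefined punctuation class cannot be used as is at the end of the string (it includes the dot), use a custom character class there instead and deliberately leave out the dot. Adding in the single quote separately (escaping it does not work, and it may be hard to find the right delimiter character for the q operator in this case). Add the closing square bracket as an alternative of its own, since Oracle does not seem to handle it correctly when it is escaped. Finally, add the trailing consecutive dots explicitly: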

WITH T (id, col) AS (
  SELECT 1, 'GEORGE & SON ' FROM DUAL UNION ALL
  SELECT 2, '-GEORGE & SON'  FROM DUAL UNION ALL
  SELECT 3, '&GEORGE & SON'  FROM DUAL UNION ALL
  SELECT 4, 'GEORGE & SON..'  FROM DUAL UNION ALL
  SELECT 5, 'GEORGE & SON.'  FROM DUAL UNION ALL
  SELECT 6, '-GEORGE & SON S.A.' FROM DUAL UNION ALL
  SELECT 7, 'GEORGE & SON!' FROM DUAL UNION ALL
  SELECT 8, 'GEORGE & SON"' FROM DUAL UNION ALL
  SELECT 9, 'GEORGE & SON#' FROM DUAL UNION ALL
  SELECT 10, 'GEORGE & SON$' FROM DUAL UNION ALL
  SELECT 11, 'GEORGE & SON%' FROM DUAL UNION ALL
  SELECT 12, 'GEORGE & SON&' FROM DUAL UNION ALL
  SELECT 13, 'GEORGE & SON(' FROM DUAL UNION ALL
  SELECT 14, 'GEORGE & SON)' FROM DUAL UNION ALL
  SELECT 15, 'GEORGE & SON*' FROM DUAL UNION ALL
  SELECT 16, 'GEORGE & SON+' FROM DUAL UNION ALL
  SELECT 17, 'GEORGE & SON,' FROM DUAL UNION ALL
  SELECT 18, 'GEORGE & SON\' FROM DUAL UNION ALL
  SELECT 19, 'GEORGE & SON-' FROM DUAL UNION ALL
  SELECT 20, 'GEORGE & SON\' FROM DUAL UNION ALL
  SELECT 21, 'GEORGE & SON/' FROM DUAL UNION ALL
  SELECT 22, 'GEORGE & SON:' FROM DUAL UNION ALL
  SELECT 23, 'GEORGE & SON;' FROM DUAL UNION ALL
  SELECT 24, 'GEORGE & SON<' FROM DUAL UNION ALL
  SELECT 25, 'GEORGE & SON=' FROM DUAL UNION ALL
  SELECT 26, 'GEORGE & SON>' FROM DUAL UNION ALL
  SELECT 27, 'GEORGE & SON?' FROM DUAL UNION ALL
  SELECT 28, 'GEORGE & SON@' FROM DUAL UNION ALL
  SELECT 29, 'GEORGE & SON[' FROM DUAL UNION ALL
  SELECT 30, 'GEORGE & SON^' FROM DUAL UNION ALL
  SELECT 31, 'GEORGE & SON_' FROM DUAL UNION ALL
  SELECT 32, 'GEORGE & SON`' FROM DUAL UNION ALL
  SELECT 33, 'GEORGE & SON{' FROM DUAL UNION ALL
  SELECT 34, 'GEORGE & SON|' FROM DUAL UNION ALL
  SELECT 35, 'GEORGE & SON}' FROM DUAL UNION ALL
  SELECT 36, 'GEORGE & SON~' FROM DUAL UNION ALL
  SELECT 37, 'GEORGE & SON''' FROM DUAL UNION ALL
  SELECT 38, 'GEORGE & SON]' FROM DUAL)
SELECT
  * FROM T
 WHERE REGEXP_LIKE(col, '(^[[:punct:]]|[-!"#$%&()*+,\/:;<=>?@[^_`{|}~' || '''' || ']$)|]$|\.\.$|(^[[:space:]]|[[:space:]]$)')
 ORDER BY id
;

Updated requirements

Punctuation followed by a dot

Add an optional dot after the special-character set, changing it from

'[-!"#$%&()*+,\/:;<=>?@[^_`{|}~' || '''' || ']$'

to

'[-!"#$%&()*+,\/:;<=>?@[^_`{|}~' || '''' || ']\.?$'

WITH T (id, col) AS (
  SELECT 40, 'GEORGE & SON^.' FROM DUAL UNION ALL
  SELECT 41, 'GEORGE & SON_.' FROM DUAL UNION ALL
  SELECT 42, 'GEORGE & SON`.' FROM DUAL UNION ALL
  SELECT 43, 'GEORGE & SON{.' FROM DUAL UNION ALL
  SELECT 44, 'GEORGE & SON|.' FROM DUAL UNION ALL
  SELECT 45, 'GEORGE & SON}.' FROM DUAL UNION ALL
  SELECT 46, 'GEORGE & SON~.' FROM DUAL UNION ALL
  SELECT 47, 'GEORGE & SON''.' FROM DUAL UNION ALL
  SELECT 48, 'GEORGE & SON].' FROM DUAL)
SELECT
  * FROM T
 WHERE REGEXP_LIKE(col, '([-!"#$%&()*+,\/:;<=>?@[^_`{|}~' || '''' || ']\.?$)|]\.?$')
 ORDER BY id
;

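Repetitions of (combinations of) spaces and special characters within the string

Originally, only leading and trailing occurrences were asked for... ;-)

Sequences of two or more space/punctuation characters are caught by

[[:space:][:punct:]]{2,}

If you want them strictly inside the string, just surround them with word characters:

\w[[:space:][:punct:]]{2,}\w

Leading/trailing runs of consecutive spaces are already matched as soon as a single one is found, so there is no need to handle them explicitly.
This gives: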

WITH T (id, col) AS (
  SELECT 50, 'GEORGE & SON  ' FROM DUAL UNION ALL
  SELECT 51, 'GEORGE & SON   '  FROM DUAL UNION ALL
  SELECT 52, '  GEORGE & SON'  FROM DUAL UNION ALL
  SELECT 53, '    GEORGE & SON'  FROM DUAL UNION ALL
  SELECT 54, 'GEORGE &  SON'  FROM DUAL UNION ALL
  SELECT 55, 'GEORGE  & SON S.A.' FROM DUAL UNION ALL
  SELECT 56, 'GEORGE & SON    S.A.' FROM DUAL UNION ALL
  SELECT 60, '  GEORGE and SON'  FROM DUAL UNION ALL
  SELECT 61, ' ,GEORGE and SON' FROM DUAL UNION ALL
  SELECT 62, ', GEORGE and SON'  FROM DUAL UNION ALL
  SELECT 63, 'GEORGE -- SON' FROM DUAL UNION ALL
  SELECT 64, 'GEORGE --SON' FROM DUAL UNION ALL
  SELECT 65, 'GEORGE & SON' FROM DUAL UNION ALL
  SELECT 66, 'GEORGE + SON' FROM DUAL UNION ALL
  SELECT 67, 'GEORGE and  , SON' FROM DUAL UNION ALL
  SELECT 68, 'GEORGE and , SON' FROM DUAL UNION ALL
  SELECT 69, 'GEORGE and SON ,'  FROM DUAL UNION ALL
  SELECT 70, 'GEORGE and SON. '  FROM DUAL UNION ALL
  SELECT 71, 'GEORGE and+-SON'  FROM DUAL)
SELECT
  * FROM T
--  WHERE REGEXP_LIKE(col, '(^[[:punct:]]|[-!"#$%&()*+,\/:;<=>?@[^_`{|}~' || '''' || ']\.?$)|]$|\.\.$|(^[[:space:]]|[[:space:]]$)|[[:space:][:punct:]]{2,}')
  WHERE REGEXP_LIKE(col, '(^[[:punct:]]|[-!"#$%&()*+,\/:;<=>?@[^_`{|}~' || '''' || ']\.?$)|]$|\.\.$|(^[[:space:]]|[[:space:]]$)|\w[[:space:][:punct:]]{2,}\w')
  ORDER BY id
;

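This produces false positives, however, most prominently GEORGE & SON. To some extent, this can be avoided by replacing [:punct:] with a less inclusive set. The (final) choice will depend on whether false negatives or false positives are of greater concern.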
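As a rough sketch of what a "less inclusive set" could look like, the query below swaps [:punct:] for a short, explicitly listed set. The particular characters chosen here (-, +, comma, semicolon) are only an assumption for illustration; the right set depends on the data. The rows are taken from the sample above.

```sql
WITH T (id, col) AS (
  SELECT 63, 'GEORGE -- SON'    FROM DUAL UNION ALL
  SELECT 65, 'GEORGE & SON'     FROM DUAL UNION ALL
  SELECT 68, 'GEORGE and , SON' FROM DUAL UNION ALL
  SELECT 71, 'GEORGE and+-SON'  FROM DUAL)
SELECT
  * FROM T
  -- Illustrative narrower class: whitespace plus - + , ; only (no &)
 WHERE REGEXP_LIKE(col, '\w[-+,;[:space:]]{2,}\w')
 ORDER BY id
;
```

With & left out of the class, GEORGE & SON should no longer be reported, while the other three rows still are. The price is a false negative elsewhere: a name such as 'GEORGE &  SON' (two spaces after the &) would slip through this particular pattern.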

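Catching arbitrary sequences of punctuation and space characters - but allowing a single letter followed by a dot and a space

As stated before, false positives have to be balanced against false negatives, one way or the other. However, this may be a good moment to consider breaking the overall problem down into smaller ones and handling them separately. Even if GEORGE and P. SON is perfectly acceptable, you will probably still want to review, e.g., -GEORGE and P. SON. So let's focus on stray character sequences in the middle of the string - still keeping the earlier ' & ' in mind, and allowing enumerations (i.e. commas) as well: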

WHERE
  REGEXP_LIKE(col, '\w[[:space:][:punct:]]{2,}\w')
  AND
  NOT REGEXP_LIKE(col, ' [[:upper:]]\. \w')
  AND
  NOT INSTR(col, ', ') > 0
  AND
  NOT INSTR(col, ' & ') > 0

This might be followed up by

  WHERE
  REGEXP_LIKE(col, '\w[[:space:][:punct:]]{2,}\w')
  AND
  (REGEXP_LIKE(col, ' [[:upper:]]\. \w')
   OR
   INSTR(col, ', ') > 0
   OR
   INSTR(col, ' & ') > 0
  )

in order to find, among the many valid names, e.g., GEORGE and , SON. INSTR might be faster than REGEXP - depending on the overall situation...

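A few more words on the mechanics

(i) [[:punct:][:space:]] essentially combines [[:punct:]] and [[:space:]] into a single character class. As far as picking a character from the class is concerned, the order does not matter.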
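A small check of point (i), as a sketch: both spellings of the combined class are evaluated side by side on a few rows. The first two rows come from the sample data above; the last one is a hypothetical clean name added for contrast.

```sql
WITH T (id, col) AS (
  SELECT 1, 'GEORGE -- SON'    FROM DUAL UNION ALL
  SELECT 2, 'GEORGE and , SON' FROM DUAL UNION ALL
  SELECT 3, 'GEORGE and SON'   FROM DUAL)   -- hypothetical clean name
SELECT id, col,
       -- The two columns differ only in the order of the classes inside the brackets
       CASE WHEN REGEXP_LIKE(col, '[[:punct:][:space:]]{2,}') THEN 'Y' ELSE 'N' END AS punct_first,
       CASE WHEN REGEXP_LIKE(col, '[[:space:][:punct:]]{2,}') THEN 'Y' ELSE 'N' END AS space_first
  FROM T
 ORDER BY id
;
```

The two flag columns should agree on every row.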

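(ii)

[-!"#$%&()*+,\/:;<=>?@[^_`{|}~' || '''' || ']

is

[-!"#$%&()*+,\/:;<=>?@[^_`{|}~]

with the single quote added. If you try to put the quote in directly, Oracle takes it as the end of the argument value. Escaping the single quote with a backslash does not work either... So basically, this is what was meant above by "Adding in the single quote separately".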
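For what it is worth, the '''' used in the concatenation is itself a string literal holding one single quote, written by doubling it. The same doubling should therefore also work inside one longer literal, which avoids the concatenation altogether. A minimal sketch to compare the two spellings (the column aliases are made up for illustration):

```sql
-- Both columns should show the same text: the character class including the single quote
SELECT '[-!"#$%&()*+,\/:;<=>?@[^_`{|}~' || '''' || ']' AS concatenated,
       '[-!"#$%&()*+,\/:;<=>?@[^_`{|}~'']'             AS doubled_quote
  FROM DUAL
;
```

Which form to use is a matter of taste; the concatenated one makes the lone quote easier to spot, the doubled one is shorter.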

Please comment if this needs adjustment or further detail.