How to identify the words in between a particular pattern using regexp: Oracle?

Question

I have a text field. I need to identify the words between the patterns <a href and a>.

The pattern can appear at the beginning, end, or middle of the text.

with t as (
select '<a href Part of the technical Network Group www.tech.com/sites/ hh a>' as text from dual
union select '<a href www.tech.technical Network a>' as text from dual union
select 'www.tech.tech///technical <a href Network Group a>' as text from dual)
select * from t
WHERE REGEXP_LIKE(text,'(^|\W)<a href\S*','i')

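This correctly returns the first two rows. But I also need to check for the word 'group' (case-insensitive). How can we check for the word 'group' and make sure it occurs inside the pattern? In this case, rows 1 and 3 should be returned.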

Answer 1

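Search for the complete pattern, then search for the word Group within the matched substring. If the text can contain more than one match, you can use a recursive subquery factoring clause to find them all:

Oracle Setup: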

CREATE TABLE table_name ( id, text ) AS
select 1, '<a href Part of the technical Network Group www.tech.com/sites/ hh a>' from dual union all
select 2, '<a href www.tech.technical Network a>' from dual union all
select 3, 'www.tech.tech///technical <a href Network Group a>' from dual union all
select 4, '<a hrefgroup a>' FROM DUAL UNION ALL
select 5, '<a href groupa>' FROM DUAL UNION ALL
select 6, '<a href workgroup a>' FROM DUAL UNION ALL
select 7, '<a href test1 a> Group <a href test2 a>' FROM DUAL;

Query:

WITH positions ( id, text, match, position ) AS (
  SELECT id,
         text,
         REGEXP_SUBSTR(
           text,
           '(^|\W)<a href\s+.*?\s+a>(\W|$)',
           1,
           1,
           'i'
         ),
         REGEXP_INSTR(
           text,
           '(^|\W)<a href\s+.*?\s+a>(\W|$)',
           1,
           1,
           0,
           'i'
         )
  FROM   table_name
UNION ALL
  SELECT id,
         text,
         REGEXP_SUBSTR(
           text,
           '(^|\W)<a href\s+.*?\s+a>(\W|$)',
           position + 1,
           1,
           'i'
         ),
         REGEXP_INSTR(
           text,
           '(^|\W)<a href\s+.*?\s+a>(\W|$)',
           position + 1,
           1,
           0,
           'i'
         )
  FROM   positions
  WHERE  position > 0
)
SELECT id,
       text
FROM   positions
WHERE  REGEXP_LIKE( match, '\sGroup\s', 'i' );

Output:

ID | TEXT                                                                 
-: | :--------------------------------------------------------------------
 1 | <a href Part of the technical Network Group www.tech.com/sites/ hh a>
 3 | www.tech.tech///technical <a href Network Group a>

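The two-step idea (extract the full pattern, then test only the extracted substring) can be seen in isolation on a single literal. The following is a minimal sketch rather than part of the original answer: the names input_text, first_match and has_group are made up for illustration, and it inspects only the first match, which is why the recursive query above is needed when a text can hold several patterns.

```sql
-- Illustrative only: input_text, first_match and has_group are hypothetical names.
WITH input_text ( text ) AS (
  SELECT 'www.tech.tech///technical <a href Network Group a>' FROM dual
),
first_match ( text, match ) AS (
  -- Step 1: pull out the first complete <a href ... a> pattern.
  SELECT text,
         REGEXP_SUBSTR( text, '(^|\W)<a href\s+.*?\s+a>(\W|$)', 1, 1, 'i' )
  FROM   input_text
)
-- Step 2: look for the whole word only inside the extracted substring.
SELECT text,
       match,
       CASE WHEN REGEXP_LIKE( match, '\sGroup\s', 'i' ) THEN 'Y' ELSE 'N' END AS has_group
FROM   first_match;
```

Selecting the match column alongside the flag makes it easy to see which part of the text was actually searched for the word.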

Answer 2

You can extend your regular expression like this: <a href.*group.*a>.

Demo on DB Fiddle:

with t as (
    select '<a href Part of the technical Network Group www.tech.com/sites/ hh a>' as text from dual
    union all select '<a href www.tech.technical Network a>' as text from dual
    union all select 'www.tech.tech///technical <a href Network Group a>' as text from dual)
select * from t
WHERE REGEXP_LIKE(text,'<a href.*group.*a>','i')

| TEXT                                                                  |
| :-------------------------------------------------------------------- |
| <a href Part of the technical Network Group www.tech.com/sites/ hh a> |
| www.tech.tech///technical <a href Network Group a>                    |

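Note: this works as long as your text contains only one <a href ... a> pattern, which is the case in your sample data. With several patterns in one text, the greedy .* can span from the first <a href to the last a>, so a 'group' lying between two patterns would also match.

You can refine the regex so that it matches only the word 'group' (and not other words that contain 'group', such as 'workgroup' or 'grouped'):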
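To make the single-pattern caveat concrete, here is a small sketch, not taken from the original answer. It runs the same regex against the text of row 7 from the setup in Answer 1, which holds two separate patterns with Group between them; the result alias is an arbitrary name.

```sql
-- Illustrative only: the word Group sits between the two patterns,
-- not inside either one.
SELECT CASE
         WHEN REGEXP_LIKE( '<a href test1 a> Group <a href test2 a>',
                           '<a href.*group.*a>', 'i' )
         THEN 'matched'
         ELSE 'not matched'
       END AS result
FROM   dual;
```

Because .* is greedy and the regex is not tied to a single pattern, this should report a match even though Group lies outside both patterns.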

<a href.*\sgroup\s.*a>

This works as long as <a href is always followed by a space and a> is always preceded by a space.

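As a rough side-by-side check of the two regexes in this answer, the sketch below evaluates both against a text containing the whole word and a text containing only 'workgroup'. The column names loose_match and strict_match are invented for this example, and the two sample texts are assumptions modelled on the data in this thread.

```sql
-- Illustrative only: loose_match and strict_match are hypothetical column names.
WITH t ( text ) AS (
  SELECT '<a href Network Group a>' FROM dual UNION ALL
  SELECT '<a href workgroup a>'     FROM dual
)
SELECT text,
       -- Any occurrence of the letters "group" between <a href and a>.
       CASE WHEN REGEXP_LIKE( text, '<a href.*group.*a>', 'i' )
            THEN 'Y' ELSE 'N' END AS loose_match,
       -- "group" must have whitespace on both sides.
       CASE WHEN REGEXP_LIKE( text, '<a href.*\sgroup\s.*a>', 'i' )
            THEN 'Y' ELSE 'N' END AS strict_match
FROM   t;
```

The looser regex should flag both rows, while the whitespace-delimited one should flag only the row that contains Group as a separate word.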
