Substring using Oracle SQL regex

Question

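I created a regular expression that captures the string I need. When I test the regex on sites such as rubular.com, everything works, but when I put the same regex into the REGEXP_SUBSTR function, it does not work.

Here are 2 SQL examples (one with English text, the other in Kristaps Porzingis's language):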

SELECT regexp_substr('<ul data-error-code="REOPENED" data-unique-error-code="REOPENED"><li class="b">This is the text I would like to substr! <p class="tutorial" href="#">Other random text that I do not need</li></ul>'
                    ,'<li class="b">([\wāēīšžģņļčķū:!,\b\s]+)<') 
  FROM dual;

SELECT regexp_substr('<ul data-error-code="REOPENED" data-unique-error-code="REOPENED"><li class="b">Šī ir valoda, ko lielākā daļa no jums nesaprot! <p class="tutorial" href="#">Other random text that I do not need</li></ul>'
                    ,'<li class="b">([\wāēīšžģņļčķū:!,\b\s]+)<') 
  FROM dual;

I am trying to select the text between <li class="b"> and the next HTML tag, which in this case is <p class="tutorial">.

Any suggestions as to what I am doing wrong?

Answer 1

You can simplify that regex.
Instead of looking for specific characters, look for characters that are not < or >.

For example:

SELECT regexp_substr('<ul><li class="b">Šī ir valoda, ko lielākā daļa no jums nesaprot! <p>Not needed</li></ul>'
                    ,'<li class="b">([^<>]+)',1,1,'i',1) as b_class
FROM dual

where [^<>] matches any character that is not < or >.

Or you can lazily match characters up to the first <:

SELECT regexp_substr('<ul><li class="b">Šī ir valoda, ko lielākā daļa no jums nesaprot! <p>Not needed</li></ul>'
                    ,'<li class="b">(.*?)<',1,1,'ni',1) as b_class
FROM dual

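.*? tries to consume as few characters as possible, stopping at the first <.
With the match parameter n added, it also matches when the text after the tag spans multiple lines: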

'n' allows the period (.), which is the match-any-character character, to match the newline character. If you omit this parameter, the period does not match the newline character.
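
To make the effect of 'n' concrete, here is a minimal sketch that runs the same lazy pattern with and without the parameter. The input string is a made-up example (not from the question) in which the wanted text contains a newline, built with CHR(10).

```sql
-- Hypothetical input: the text after the tag spans two lines (CHR(10) is a newline).
SELECT regexp_substr(s, '<li class="b">(.*?)<', 1, 1, 'n', 1)  AS with_n,
       regexp_substr(s, '<li class="b">(.*?)<', 1, 1, NULL, 1) AS without_n
  FROM (SELECT '<li class="b">line one' || CHR(10) || 'line two <p>rest</li>' AS s
          FROM dual);
```

With 'n', the period can cross the newline, so with_n should contain both lines up to the <p> tag. Without it, the period stops at the newline before any < is reached, so the pattern finds no match and without_n should be NULL.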

Answer 2

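Parsing HTML with regular expressions is not recommended; you are better off fetching the string and parsing it with a language that can parse HTML conveniently.

If all you have at hand is the Oracle DBMS, for a one-off job you could consider the following regexp_substr: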

SELECT regexp_substr('<ul><li class="b">Šī ir valoda, ko lielākā daļa no jums nesaprot! <p>Not needed</li></ul>',
      '<li\s+class="b">([^<]+)', 1, 1, NULL, 1) as RESULT from dual

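See the REXTESTER demo.

Here,

<li\s+class="b"> - matches <li, 1+ whitespace characters, and the literal substring class="b">
([^<]+) - captures into Group 1 one or more characters other than <

The last 1 argument lets you access the contents of Group 1.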
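
The role of that last argument is easier to see side by side. The sketch below, which uses a shortened made-up input string, calls regexp_substr once without the subexpression argument and once with it set to 1.

```sql
-- Hypothetical shortened input; the pattern is the one from this answer.
SELECT regexp_substr(s, '<li\s+class="b">([^<]+)')                AS whole_match,
       regexp_substr(s, '<li\s+class="b">([^<]+)', 1, 1, NULL, 1) AS group_1
  FROM (SELECT '<ul><li class="b">Some text <p>Not needed</li></ul>' AS s
          FROM dual);
```

When the subexpression argument is omitted, the function returns the entire match, so whole_match includes the <li class="b"> tag itself, while group_1 holds only the captured text. The queries in the question omit this argument, so even a pattern that matches would return the tag along with the text.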

Answer 3

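I would use INSTR to find the position of the first HTML tag, then take a substring after that position to get the tail of the text. The next step is to search this tail for "<" and take a substring again.

Something like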

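-- Oracle's function is SUBSTR (not SUBSTRING); my_table is a placeholder name
select substr(mytext, 1, instr(mytext, '<') - 1) as b_class  -- stop just before the next '<'
from
(
 -- tail of the text, starting right after the opening tag
 select substr(text, instr(text, '<li class="b">') +
 length('<li class="b">')) as mytext from my_table
)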
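
For a version that can be run without an existing table, the following sketch applies the same INSTR/SUBSTR idea to a literal string. The sample CTE and its text column are assumptions made for illustration, and Oracle spells the function SUBSTR.

```sql
-- Hypothetical one-row input standing in for a real table.
WITH sample AS (
  SELECT '<ul><li class="b">Some text <p>Not needed</li></ul>' AS text
    FROM dual
)
SELECT substr(mytext, 1, instr(mytext, '<') - 1) AS b_class  -- cut before the next '<'
  FROM (
        -- tail of the string, starting at the first character after the tag
        SELECT substr(text,
                      instr(text, '<li class="b">') + length('<li class="b">')) AS mytext
          FROM sample
       );
```

INSTR returns the 1-based position where the tag starts, so adding the tag's length lands on the first character after it, and subtracting 1 from the second INSTR result keeps the < out of the output. Note that if the tag is not present at all, INSTR returns 0 and this sketch would return unrelated text, so a real query would need to guard against that case.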
