How to use regexp_count with regexp_substr to output multiple matches per string in SQL (Redshift)?

Question

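I have a table containing a string column. I want to extract every piece of text in each string that comes immediately after a certain substring. For this minimal reproducible example, let's assume this substring is abc. So I want all the terms that follow abc.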

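I can achieve this when there is only one abc per row, but my logic fails when there are multiple abcs. I can also get the number of occurrences of the substring, but I can't work out how to use that count to retrieve all of those occurrences.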

My approach/attempt:

I created a temp table that contains the number of successful regex matches in my main string:

CREATE TEMP TABLE match_count AS (
SELECT DISTINCT id, main_txt, regexp_count(main_txt, 'abc (\S+)', 1) AS cnt
FROM my_data_source
WHERE regexp_count(main_txt, 'abc (\S+)', 1) > 0);

My output:

id   main_txt                         cnt
1    wpfwe abc weiofnew abc wieone    2
2    abc weoin                        1
3    abc weoifn abc we abc w          3

To get my final output, I have this query:

SELECT id, main_txt, regexp_substr(main_txt, 'abc (\S+)', 1, cnt, 'e') AS output
FROM match_count;

My actual final output:

id   main_txt                         output
1    wpfwe abc weiofnew abc wieone    wieone
2    abc weoin                        weoin
3    abc weoifn abc we abc w          w

My expected final output:

id   main_txt                         output
1    wpfwe abc weiofnew abc wieone    weiofnew
1    wpfwe abc weiofnew abc wieone    wieone
2    abc weoin                        weoin
3    abc weoifn abc we abc w          weoifn
3    abc weoifn abc we abc w          we
3    abc weoifn abc we abc w          w

So my code only returns the last match (the one where the occurrence number = cnt). How can I modify it to return every match?

Answer 1

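One way to solve this is to use a recursive CTE to generate a list of match numbers for each string (so if there are 2 matches, it generates rows containing 1 and 2), then join those rows back to the main table and pass the match number as the occurrence parameter to regexp_substr: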

WITH RECURSIVE match_counts(id, match_count) AS (
  SELECT DISTINCT id, regexp_count(main_txt, 'abc (\S+)', 1)
  FROM my_data_source
  WHERE regexp_count(main_txt, 'abc (\S+)', 1) > 0
),
match_nums(id, match_num, match_count) AS (
  SELECT id, 1, match_count
  FROM match_counts
  UNION ALL
  SELECT id, match_num + 1, match_count
  FROM match_nums
  WHERE match_num < match_count
)
SELECT m.id, main_txt, regexp_substr(main_txt, 'abc (\S+)', 1, match_num, 'e') AS output
FROM my_data_source m
JOIN match_nums n ON m.id = n.id
ORDER BY m.id, n.match_num

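Unfortunately I don't have access to Redshift to test this, but I have tested it on Oracle (which has similar regex functions) and it works there: Oracle demo on dbfiddle. Note that Oracle doesn't support the e parameter to regexp_substr, so there it returns the whole match rather than the group. (Edit: it has been confirmed to work on Redshift as well, thanks @HaleemurAli.)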
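As a minimal sketch of what the occurrence parameter does, the query below calls regexp_substr on a literal string taken from row 3 of the question, once per occurrence number. The column aliases are made up for illustration, and the backslash escaping follows the queries in this thread; it is an assumption that your client passes the backslash through unchanged.

```sql
-- Fourth argument = occurrence number; 'e' returns the capture group
-- instead of the whole match. Aliases are illustrative only.
SELECT regexp_substr('abc weoifn abc we abc w', 'abc (\S+)', 1, 1, 'e') AS first_match,
       regexp_substr('abc weoifn abc we abc w', 'abc (\S+)', 1, 2, 'e') AS second_match,
       regexp_substr('abc weoifn abc we abc w', 'abc (\S+)', 1, 3, 'e') AS third_match;
```

The question's query always passes cnt as the occurrence, which is why it only ever returns the last match. The recursive CTE supplies every number from 1 to cnt instead.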

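Note that if the delimiter abc could legitimately occur at the end of a word, you should add a word boundary to the beginning of the regex (i.e. \babc (\S+)) to prevent it from matching (for example) deabc.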
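A small sketch of the difference, using a made-up literal string in which abc appears both at the end of a word and as a word of its own. The string and aliases are hypothetical, and this assumes the regex engine accepts \b as a word boundary with the same backslash escaping as the queries above.

```sql
-- 'deabc xyz abc foo' is an invented test string, not data from the question.
-- Without \b, the "abc" at the end of "deabc" is also treated as a delimiter.
SELECT regexp_count('deabc xyz abc foo', 'abc (\S+)', 1)   AS without_boundary,
       regexp_count('deabc xyz abc foo', '\babc (\S+)', 1) AS with_boundary;
```

If the boundary works as intended, with_boundary should come out lower than without_boundary, because the match inside deabc is no longer counted.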

Answer 2

The solution below does not handle the case where main_text contains consecutive occurrences of abc, for example:

wpfwe abc abc abc weiofnew abc wieone

Setup

CREATE TABLE test_hal_unnest (id int, main_text varchar (500));
INSERT INTO test_hal_unnest VALUES 
(1, 'wpfwe abc weiofnew abc wieone'),
(2, 'abc weoin'),
(3, 'abc weoifn abc we abc w');

Possible solution by splitting the string into words

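Assuming you want to find every word that follows the word abc in the string, you don't necessarily have to use regex. Unfortunately, regex support in Redshift is not as full-featured as in Postgres or some other databases. For example, you can't extract all substrings that match a regex pattern into an array, or split a string into an array based on a regex pattern.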

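Steps:

split the text into an array using the delimiter ' '
unnest the array with ordinality
use LAG to look up the previous array element, ordered by word index
keep only the rows where the previous word is abc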

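The extra columns idx and prev_word are left in the final output to illustrate how the problem is solved. They can be dropped from the final query without any issue.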

WITH text_split AS (
SELECT Id
, main_text
, SPLIT_TO_ARRAY(main_text, ' ') text_arr
FROM test_hal_unnest
)
, text_unnested AS (
SELECT ts.id
, ts.main_text
, ts.text_arr
, CAST(ta as VARCHAR) text_word -- converts super >> text
, idx -- this is the word index
FROM text_split ts
JOIN ts.text_arr ta AT idx 
  ON TRUE
-- ^^ array unnesting happens via joins

)
, with_prevword AS (
SELECT id
, main_text
, idx
, text_word
, LAG(text_word) over (PARTITION BY id ORDER BY idx) prev_word
FROM text_unnested
ORDER BY id, idx
)
SELECT *
FROM with_prevword
WHERE prev_word = 'abc';

Output:

 id |           main_text           | idx | text_word | prev_word
----+-------------------------------+-----+-----------+-----------
  1 | wpfwe abc weiofnew abc wieone |   2 | weiofnew  | abc
  1 | wpfwe abc weiofnew abc wieone |   4 | wieone    | abc
  2 | abc weoin                     |   1 | weoin     | abc
  3 | abc weoifn abc we abc w       |   1 | weoifn    | abc
  3 | abc weoifn abc we abc w       |   3 | we        | abc
  3 | abc weoifn abc we abc w       |   5 | w         | abc
(6 rows)

Note on unnesting arrays with ordinality

Quoting the Redshift documentation on this topic, since it is a bit hidden:

Amazon Redshift also supports an array index when iterating over the array using the AT keyword. The clause x AS y AT z iterates over array x and generates the field z, which is the array index.
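As a minimal sketch of the AT clause on its own, the query below unnests the word array from the setup table and returns each element alongside its index. It reuses the JOIN ... ON TRUE form from the query above; the CTE name words and the alias word are made up for illustration.

```sql
-- Unnest the array produced by SPLIT_TO_ARRAY and expose the element index.
-- "words" and "word" are illustrative names, not part of the original answer.
WITH words AS (
  SELECT id, SPLIT_TO_ARRAY(main_text, ' ') AS text_arr
  FROM test_hal_unnest
)
SELECT w.id
     , idx                      AS word_index  -- position within the array
     , CAST(ta AS VARCHAR)      AS word        -- SUPER element cast to text
FROM words w
JOIN w.text_arr AS ta AT idx ON TRUE
ORDER BY w.id, idx;
```

Judging by the output shown earlier, where the word after a leading abc has idx 1, the index appears to start at 0.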

Alternative, shorter solution by splitting on `abc`

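This problem would be easier to solve with the regex functionality available in Redshift if, instead of

1, wpfwe abc weiofnew abc wieone

the source data were already split into multiple rows on abc: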

1, wpfwe
1, abc weiofnew
1, abc wieone

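This solution first expands the source data by splitting on abc. However, since split_to_array does not accept a regex pattern, we first inject a delimiter ; before each abc, and then split on ;.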

Any delimiter can be used, as long as it is guaranteed not to appear in the main_text column.

WITH text_array AS (
SELECT
  id
, main_text
  -- inject ';' before each 'abc ', then split on ';'
, SPLIT_TO_ARRAY(REGEXP_REPLACE(main_text, 'abc ', ';abc '), ';') AS parts
FROM test_hal_unnest
)
SELECT
  ta.id
, ta.main_text
, REGEXP_SUBSTR(CAST(st AS VARCHAR), 'abc (\S+)', 1, 1, 'e') AS output
FROM text_array ta
JOIN ta.parts st ON TRUE
-- keep only the pieces that start with the delimiter word
WHERE st LIKE 'abc%';
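To make the intermediate step of the query above visible, this sketch runs only the delimiter injection on a literal string from the setup data. The alias injected is made up for illustration.

```sql
-- REGEXP_REPLACE rewrites every occurrence of 'abc ' as ';abc '.
SELECT REGEXP_REPLACE('wpfwe abc weiofnew abc wieone', 'abc ', ';abc ') AS injected;
-- injected: 'wpfwe ;abc weiofnew ;abc wieone'
```

Splitting that result on ; gives three pieces: 'wpfwe ', 'abc weiofnew ' and 'abc wieone'. The LIKE 'abc%' filter then discards the first piece, and regexp_substr pulls the word after abc out of each remaining one.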
