How to use regexp_matches() in an UPDATE statement?

Question

I'm trying to clean up a table that has a very messy varchar column, with entries such as:

<u><font color="#0000FF"><a href="http://virginialidar.com/index-3.html#.VgLbFPm6e73" target="_blank">VA Lidar</a></font></u> OR <u><font color="#0000FF"><a href="https://inport.nmfs.noaa.gov/inport/item/50122" target="_blank">InPort Metadata</a></font></u>

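I would like to update the column so that it keeps only the HTML links, separated by commas if there is more than one. Ideally I would do something like this: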

UPDATE mytable
SET column = array_to_string(regexp_matches(column,'(?<=href=").+?(?=\")','g') , ',');

Unfortunately, this returns an error in Postgres 10:

ERROR: set-returning functions are not allowed in UPDATE

I assume regexp_matches() is the set-returning function in question. Any ideas on how I can achieve this?

Answer 1

You can use a correlated subquery to deal with the troublesome set-returning function (regexp_matches). Something like this:

-- t is the table, c the messy column; id is assumed to be a unique key
update t
set c = (
    select array_to_string(array_agg(x), ',')
    from (
        select regexp_matches(t2.c, '(?<=href=").+?(?=\")', 'g')
        from t t2
        where t2.id = t.id
    ) dt(x)
)

You're still stuck with the "CSV in a column" nastiness, but that's a separate issue and presumably not a problem for you.

Answer 2

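Building on the approach by mu is too short, with a slightly different regular expression and a COALESCE function to keep values that contain no href links: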

UPDATE a 
SET    bad_data = COALESCE(
  (SELECT Array_to_string(Array_agg(x), ',') 
   FROM   (SELECT Regexp_matches(a.bad_data, 
                                 '(?<=href=")[^"]+', 'g' 
                                ) AS x 
           FROM   a a2 
           WHERE  a2.id = a.id) AS sub), bad_data
);

SQL Fiddle
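To see what the COALESCE buys you, here is a minimal sketch on a scratch table. The temporary table and its two sample rows are made up for illustration; only the UPDATE follows the statement above, with the inner reference written as a2.bad_data for readability. It assumes Postgres 9.6 or later, since the pattern uses a lookbehind constraint.

```sql
-- Hypothetical scratch table; names mirror the answer above.
CREATE TEMP TABLE a (id int PRIMARY KEY, bad_data varchar);

INSERT INTO a VALUES
  (1, '<a href="http://one.example/x">One</a> OR <a href="https://two.example/y">Two</a>'),
  (2, 'plain text without any link');

UPDATE a
SET    bad_data = COALESCE(
  (SELECT array_to_string(array_agg(x), ',')
   FROM   (SELECT regexp_matches(a2.bad_data, '(?<=href=")[^"]+', 'g') AS x
           FROM   a a2
           WHERE  a2.id = a.id) AS sub), bad_data);

-- Row 1 should now hold the two URLs joined by a comma.
-- Row 2 has no match, so the subquery yields NULL and COALESCE keeps the old value.
SELECT id, bad_data FROM a ORDER BY id;
```

Without the COALESCE, row 2 would be overwritten with NULL.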

Answer 3

Notes

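1.
You don't need to base the correlated subquery on a separate instance of the underlying table (as both answers so far suggest). That does more work for nothing.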

2.
For simple cases, an ARRAY constructor is cheaper than array_agg(). See:

Why is array_agg() slower than the non-aggregate ARRAY() constructor?

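3.
I use a regular expression without lookahead and lookbehind constraints, with plain parentheses instead: href="([^"]+)

See Query 1.

This works because parenthesized subexpressions are captured by regexp_matches() (and several other Postgres regular expression functions). So we can replace the more sophisticated constraints with plain parentheses. The manual on regexp_match():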

If a match is found, and the pattern contains no parenthesized subexpressions, then the result is a single-element text array containing the substring matching the whole pattern. If a match is found, and the pattern contains parenthesized subexpressions, then the result is a text array whose n'th element is the substring matching the n'th parenthesized subexpression of the pattern

And for regexp_matches():

This function returns no rows if there is no match, one row if there is a match and the g flag is not given, or N rows if there are N matches and the g flag is given. Each returned row is a text array containing the whole matched substring or the substrings matching parenthesized subexpressions of the pattern, just as described above for regexp_match.

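4.
regexp_matches() returns a set of arrays (setof text[]) for a reason: not only can a regular expression match several times in a single string (hence the set), it can also produce several strings per match when the pattern has multiple capturing parentheses (hence the array). That does not happen with this regular expression; every array in the result holds a single element. But future readers should not be led into a trap:

Feeding the resulting 1-D arrays to array_agg() (or an ARRAY constructor) produces a 2-D array, which is only possible at all since Postgres 9.5 added a variant of array_agg() that accepts array input. See: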

Is there something like a zip() function in PostgreSQL that combines two arrays?

However, quoting the manual:

inputs must all have same dimensionality, and cannot be empty or NULL

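I think this can never fail here, since the same regular expression always yields the same number of array elements. Ours always produces one element. But that may differ with other regular expressions. If so, there are several options:

Take only the first element with (regexp_matches(...))[1]. See Query 2.
Unnest the arrays and use string_agg() on the base elements. See Query 3.

Each approach works here, too.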
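The difference between the set and the array is easier to see with a pattern that has two capturing groups. The following is only a sketch: the input string 'a=1 b=2' and the pattern are made up for illustration and have nothing to do with the href data. It assumes Postgres 9.5 or later for the 2-D array step.

```sql
-- One row per match (because of 'g'), one array element per capturing group:
-- this should return two rows, {a,1} and {b,2}.
SELECT regexp_matches('a=1 b=2', '([a-z])=([0-9])', 'g');

-- No match at all: zero rows, not a row containing NULL.
SELECT count(*) FROM regexp_matches('nothing here', '([a-z])=([0-9])', 'g');

-- An ARRAY constructor over those 1-D arrays builds a 2-D array: {{a,1},{b,2}}.
SELECT ARRAY(SELECT regexp_matches('a=1 b=2', '([a-z])=([0-9])', 'g'));

-- Option 1: keep only the first captured group of every match.
SELECT string_agg(t.arr[1], ',')
FROM   regexp_matches('a=1 b=2', '([a-z])=([0-9])', 'g') t(arr);

-- Option 2: unnest every array and aggregate all base elements.
SELECT string_agg(elem, ',')
FROM   regexp_matches('a=1 b=2', '([a-z])=([0-9])', 'g') t(arr)
     , unnest(t.arr) elem;
```

Option 1 drops the second group, option 2 keeps every captured string. Neither string_agg() call has an ORDER BY, so the element order is whatever the function returns; in practice that is match order, but it is not formally guaranteed.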

Query 1

UPDATE tbl t
SET    col = (
   SELECT array_to_string(ARRAY(SELECT regexp_matches(col, 'href="([^"]+)', 'g')), ',')
   );

Columns without a match are set to '' (empty string).

Query 2

UPDATE tbl
SET    col = (
   SELECT string_agg(t.arr[1], ',')
   FROM   regexp_matches(col, 'href="([^"]+)', 'g') t(arr)
   );

Columns without a match are set to NULL.

Query 3

UPDATE tbl
SET    col = (
   SELECT string_agg(elem, ',')
   FROM   regexp_matches(col, 'href="([^"]+)', 'g') t(arr)
        , unnest(t.arr) elem
   );

Columns without a match are set to NULL.
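Before running any of the three UPDATE statements, it can help to preview their expressions side by side in a plain SELECT. This is a sketch under assumptions: the two sample strings in the VALUES list are invented, and the column aliases q1, q2, q3 are just labels for the three queries above.

```sql
SELECT v.col
     , array_to_string(
          ARRAY(SELECT regexp_matches(v.col, 'href="([^"]+)', 'g')), ',') AS q1
     , (SELECT string_agg(t.arr[1], ',')
        FROM   regexp_matches(v.col, 'href="([^"]+)', 'g') t(arr))        AS q2
     , (SELECT string_agg(elem, ',')
        FROM   regexp_matches(v.col, 'href="([^"]+)', 'g') t(arr)
             , unnest(t.arr) elem)                                         AS q3
FROM  (VALUES
         ('<a href="http://one.example/x">One</a> OR <a href="https://two.example/y">Two</a>')
       , ('plain text without any link')
      ) v(col);
```

For the first row all three columns should show the two URLs joined by a comma. For the second row q1 is the empty string while q2 and q3 are NULL. If rows without links should keep their original value instead, wrapping the expression in COALESCE(..., col), as Answer 2 does, is one option for the NULL-producing variants.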

db<>fiddle here (with extended test cases)
