ERROR: argument of OR must not return a set

Question

I'm running PostgreSQL 9.2.8.

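I've been trying to get a list of all rows in a table whose address contains any non-ASCII characters: anything outside the range <space> to ~, plus the backtick character `. If a row contains any invalid character, the row should be shown with all of its address values. But for some reason I get the following error: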

ERROR:  argument of OR must not return a set (10586)
LINE 9: (CAST(regexp_matches(a.address_line_1,'([^ !-~]|`)') AS VARCHAR)...
         ^

********** Error **********

ERROR: argument of OR must not return a set (10586)
SQL state: 42804
Character: 252

The query I've been trying to use is as follows:

select a.address_id, a.address_line_1, 
    a.address_line_2, 
    a.address_line_3, 
regexp_matches(a.address_line_1,'([^ !-~]|`)'),
regexp_matches(a.address_line_2,'([^ !-~]|`)'),
regexp_matches(a.address_line_3,'([^ !-~]|`)')
    FROM public.address a 
WHERE 
(CAST(regexp_matches(a.address_line_1,'([^ !-~]|`)') AS VARCHAR) <> '') OR
(CAST(regexp_matches(a.address_line_2,'([^ !-~]|`)') AS VARCHAR) <> '') OR
(CAST(regexp_matches(a.address_line_3,'([^ !-~]|`)') AS VARCHAR) <> '')
LIMIT 1000

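I'm not sure what I might be missing, as this looks like a valid query.

I'm trying to get the rows where an invalid character is present in any one of the three address fields, not only rows where all three address fields contain an invalid character.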

Answer 1

One approach uses exists:

where exists (select 1 from regexp_matches(a.address_line_1, '[^ !-~]')) or
      exists (select 1 from regexp_matches(a.address_line_2, '[^ !-~]')) or
      exists (select 1 from regexp_matches(a.address_line_3, '[^ !-~]'))

Or, more simply:

where a.address_line_1 ~ '[^ !-~]' or
      a.address_line_2 ~ '[^ !-~]' or
      a.address_line_3 ~ '[^ !-~]'
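
Put together with the rest of the question's query, the simpler form might look like the sketch below. It assumes the table and column names from the question, and it adds the backtick alternative back into the pattern, since the question wants that character flagged as well.

```sql
-- Sketch only: table and column names are taken from the question.
-- The pattern matches any character outside <space>..~, or a backtick.
SELECT a.address_id, a.address_line_1, a.address_line_2, a.address_line_3
FROM   public.address a
WHERE  a.address_line_1 ~ '[^ !-~]|`'
   OR  a.address_line_2 ~ '[^ !-~]|`'
   OR  a.address_line_3 ~ '[^ !-~]|`'
LIMIT  1000;
```

A NULL column makes its own ~ test NULL rather than true, so a row is still returned whenever one of the other columns matches.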

Answer 2

regexp_matches() returns SETOF text[] and cannot be used the way you tried (as the error message tells you). You could use the regular expression operator ~ instead.
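
To see why a set-returning function does not fit inside an OR, it may help to run it on a few literals. This is a small sketch with made-up sample strings, reusing the pattern from the question:

```sql
-- One match: regexp_matches() yields one row holding an array of the captured groups.
SELECT regexp_matches('12 Main St`', '([^ !-~]|`)');

-- No match: zero rows come back (not NULL, not an empty string).
SELECT regexp_matches('12 Main St', '([^ !-~]|`)');

-- With the 'g' flag, every match becomes its own row (two rows here).
SELECT regexp_matches('a`b`c', '([^ !-~]|`)', 'g');

-- The ~ operator returns a single boolean per input, so it combines cleanly with OR.
SELECT '12 Main St`' ~ '([^ !-~]|`)' AS has_invalid_char;
```

Because the number of rows varies with the input, the result cannot be cast to a single VARCHAR value and compared with '' inside a WHERE clause.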

However, your regular expression does not seem to cover what you describe:

non-ASCII characters in the address

In addition, the range !-~ inside the bracket expression [^ !-~] depends on your COLLATION setting. The manual warns:

Ranges are very collating-sequence-dependent, so portable programs should avoid relying on them.

Consider:

SELECT g, chr(g), chr(g) ~ '([^ !-~]|`)'
FROM   generate_series (1,300) g;  -- ASCII range plus some

Assuming the server encoding is UTF8, to find rows with any non-ASCII character in the 3 columns:

...
WHERE octet_length(concat(a.address_line_1, a.address_line_2, a.address_line_3))
         <> length(concat(a.address_line_1, a.address_line_2, a.address_line_3))

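This works because every non-ASCII character is encoded with more than 1 byte in UTF8, so octet_length() reports a larger number than length() (alias: char_length()). Concatenating with concat() guards against possible NULL values.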
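
Both points can be checked on literals. The sketch below uses made-up sample values and assumes a UTF8 database with a matching client encoding:

```sql
-- 'Main St' is pure ASCII, so characters and bytes agree;
-- 'Zürich' contains ü, which takes 2 bytes in UTF8, so bytes exceed characters.
SELECT s,
       length(s)                    AS chars,
       octet_length(s)              AS bytes,
       octet_length(s) <> length(s) AS has_non_ascii
FROM  (VALUES ('Main St'), ('Zürich')) AS t(s);

-- concat() ignores NULL arguments, while || returns NULL if any operand is NULL.
SELECT concat('a', NULL, 'b') AS with_concat,
       'a' || NULL || 'b'     AS with_pipes;
```

Had the three columns been joined with ||, a single NULL address line would turn the whole comparison into NULL and the row would be silently dropped.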

To also test for the backtick, add:

...
OR  concat(a.address_line_1, a.address_line_2, a.address_line_3) LIKE '%`%'
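
Filled in with the SELECT list and table from the question, the two fragments above could be combined roughly as follows. This is a sketch under the same UTF8 assumption. Note that it flags non-ASCII characters and backticks only; ASCII control characters below <space>, which the question's regular expression would also flag, are not detected by the length comparison.

```sql
-- Sketch only: table and column names are taken from the question.
SELECT a.address_id, a.address_line_1, a.address_line_2, a.address_line_3
FROM   public.address a
WHERE  octet_length(concat(a.address_line_1, a.address_line_2, a.address_line_3))
         <> length(concat(a.address_line_1, a.address_line_2, a.address_line_3))  -- any multi-byte character
   OR  concat(a.address_line_1, a.address_line_2, a.address_line_3) LIKE '%`%'   -- any backtick
LIMIT  1000;
```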

regex

sql

postgresql

non-ascii-characters