Remove all Unicode space separators in PostgreSQL?

Question

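I want to trim() a column and replace any run of whitespace and Unicode space separators with a single space. The idea is to sanitize usernames, so that two users cannot pick deceptively similar names such as foo bar (SPACE, U+0020) vs. foo bar (NO-BREAK SPACE, U+00A0).

Until now I have been using SELECT regexp_replace(TRIM('some string'), '[\s\v]+', ' ', 'g');, which removes spaces, tabs, and carriage returns, but it does not cover Unicode space separators.

I would add \h to the regular expression, but PostgreSQL does not support it (nor \p{Zs}):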

SELECT regexp_replace(TRIM('some string'), '[\s\v\h]+', ' ', 'g');

Error in query (7): ERROR: invalid regular expression: invalid escape \ sequence

We are running PostgreSQL 12 (12.2-2.pgdg100+1) in a Debian 10 Docker container with UTF-8 encoding, and we support emoji in usernames.

Is there a way to achieve something similar?

Answer 1

You can construct a bracket expression that contains the whitespace characters from the \p{Zs} Unicode category plus the tab character:

REGEXP_REPLACE(col, '[\u0009\u0020\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+', ' ', 'g')

It replaces every run of one or more horizontal whitespace characters (what \h matches in regex flavors that support it) with a single regular space character.
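As a quick check of the bracket expression above, here is a minimal sketch on a made-up sample value. The sample string is an assumption for illustration, built with chr() so the invisible characters are explicit. It also assumes a UTF8 database and standard_conforming_strings = on (the default), so the backslashes reach the regex engine unchanged.

```sql
-- Hypothetical sample: 'foo', NO-BREAK SPACE (160), EM SPACE (8195), TAB (9), 'bar'.
-- Assumes a UTF8 database and standard_conforming_strings = on (the default).
SELECT regexp_replace(
          'foo' || chr(160) || chr(8195) || chr(9) || 'bar'
        , '[\u0009\u0020\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+'
        , ' ', 'g') AS cleaned;
```

All three separator characters are in the bracket expression and sit next to each other, so they should be collapsed into one regular space between foo and bar.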

Answer 2

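Based on the Posix "space" character class (class shorthand \s in Postgres regular expressions), UNICODE "Spaces", some space-like "Format characters", and some additional non-printing characters (finally adding two more from Wiktor's post), I condensed this custom character class: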

'[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]'

So use:

SELECT trim(regexp_replace('some string', '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+', ' ', 'g'));

Note: trim() comes after regexp_replace(), so it also covers the converted spaces.
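The effect of that ordering can be sketched with a made-up value that starts and ends with a NO-BREAK SPACE. The sample string and the column aliases are assumptions for illustration. The square brackets are only there to make leading and trailing spaces visible.

```sql
-- Hypothetical sample: NBSP, 'foo', ZERO WIDTH SPACE (8203), 'bar', NBSP.
-- Assumes a UTF8 database and standard_conforming_strings = on (the default).
SELECT '[' || trim(regexp_replace(s, cls, ' ', 'g')) || ']' AS trim_last
     , '[' || regexp_replace(trim(s), cls, ' ', 'g') || ']' AS trim_first
FROM  (
   SELECT chr(160) || 'foo' || chr(8203) || 'bar' || chr(160) AS s
        , '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+' AS cls
   ) t;
```

By default trim() only strips regular spaces (U+0020), so in trim_first the outer NO-BREAK SPACE characters survive the trim and are then turned into leading and trailing spaces. In trim_last they are converted first and then trimmed away.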

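It's important to include the basic space class \s (short for [[:space:]]) to cover all current (and future) basic space characters.

We might include more characters. Or start by stripping all characters encoded with 4 bytes. Because UNICODE is dark and full of terrors.

Consider this demo: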

SELECT d AS decimal, to_hex(d) AS hex, chr(d) AS glyph
     , '\u' || lpad(to_hex(d), 4, '0') AS unicode
     , chr(d) ~ '\s' AS in_posix_space_class
     , chr(d) ~ '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]' AS in_custom_class
FROM  (
   -- TAB, SPACE, NO-BREAK SPACE, OGHAM SPACE MARK, MONGOLIAN VOWEL, NARROW NO-BREAK SPACE
   -- MEDIUM MATHEMATICAL SPACE, WORD JOINER, IDEOGRAPHIC SPACE, ZERO WIDTH NON-BREAKING SPACE
   SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
   UNION ALL
   SELECT generate_series (8192, 8202) AS dec  -- UNICODE "Spaces"
   UNION ALL
   SELECT generate_series (8203, 8207) AS dec  -- First 5 space-like UNICODE "Format characters"
   ) t(d)
ORDER  BY d;

 decimal | hex  | glyph | unicode | in_posix_space_class | in_custom_class
---------+------+-------+---------+----------------------+-----------------
       9 | 9    |       | \u0009  | t                    | t
      32 | 20   |       | \u0020  | t                    | t
     160 | a0   |       | \u00a0  | f                    | t
    5760 | 1680 |       | \u1680  | t                    | t
    6158 | 180e |       | \u180e  | f                    | t
    8192 | 2000 |       | \u2000  | t                    | t
    8193 | 2001 |       | \u2001  | t                    | t
    8194 | 2002 |       | \u2002  | t                    | t
    8195 | 2003 |       | \u2003  | t                    | t
    8196 | 2004 |       | \u2004  | t                    | t
    8197 | 2005 |       | \u2005  | t                    | t
    8198 | 2006 |       | \u2006  | t                    | t
    8199 | 2007 |       | \u2007  | f                    | t
    8200 | 2008 |       | \u2008  | t                    | t
    8201 | 2009 |       | \u2009  | t                    | t
    8202 | 200a |       | \u200a  | t                    | t
    8203 | 200b |       | \u200b  | f                    | t
    8204 | 200c |       | \u200c  | f                    | t
    8205 | 200d |       | \u200d  | f                    | t
    8206 | 200e |       | \u200e  | f                    | t
    8207 | 200f |       | \u200f  | f                    | t
    8239 | 202f |       | \u202f  | f                    | t
    8287 | 205f |       | \u205f  | t                    | t
    8288 | 2060 |       | \u2060  | f                    | t
   12288 | 3000 |       | \u3000  | t                    | t
   65279 | feff |       | \ufeff  | f                    | t
(26 rows)

Tool to generate the character class:

SELECT '[\s' || string_agg('\u' || lpad(to_hex(d), 4, '0'), '' ORDER BY d) || ']'
FROM  (
   SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
   UNION ALL
   SELECT generate_series (8192, 8202)
   UNION ALL
   SELECT generate_series (8203, 8207)
   ) t(d)
WHERE  chr(d) !~ '\s';  -- not covered by \s

[\s\u00a0\u180e\u2007\u200b\u200c\u200d\u200e\u200f\u202f\u2060\ufeff]
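Once a class has been settled on, one possible way to apply it to the username use case from the question is to wrap the expression in a function and put a unique index on it. This is only a sketch under assumptions: the function name normalize_spaces, the table users_demo, and the index name are hypothetical and not part of the original posts. The function is declared IMMUTABLE on the assumption that the set of characters matched by \s does not change for the database.

```sql
-- Hypothetical helper; the name normalize_spaces is an assumption.
CREATE OR REPLACE FUNCTION normalize_spaces(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE
AS $func$
SELECT trim(regexp_replace($1, '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+', ' ', 'g'));
$func$;

-- Hypothetical table, for illustration only.
CREATE TABLE users_demo (username text NOT NULL);

-- Treat names that differ only in their whitespace as duplicates.
CREATE UNIQUE INDEX users_demo_username_norm_uni
    ON users_demo (normalize_spaces(username));

INSERT INTO users_demo VALUES ('foo bar');                   -- plain SPACE
INSERT INTO users_demo VALUES ('foo' || chr(160) || 'bar');  -- NO-BREAK SPACE, expected to raise a unique violation
```

The index leaves the stored name untouched and only compares the cleaned form. Storing the cleaned value itself would be an alternative if the original spelling does not need to be preserved.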

db<>fiddle here

Related, with more explanation:

Trim trailing spaces with PostgreSQL

