Filter strings with regex before casting to numeric

Question

I have this code (it already existed, it is not mine):

SELECT
    a.id_original_contrato AS contrato,
    ( CASE WHEN d.value~'^\\d+$' THEN d.value::integer ELSE 0 END ) AS monto,
    EXTRACT(YEAR FROM b.value)::integer AS anoinicio,
    EXTRACT(YEAR FROM c.value)::integer AS anofin

... etc. (some JOINs and a WHERE)

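Let me explain: d.value comes from a table where the values are character varying (200). The code later inserts d.value (now called 'monto') into another table as integer. Someone wrote that regex to pick out the usable values and, in every other case (ELSE), set the result to 0. This only works when the values are integers. If d.value is something like 76.44, it does not work, because the regex never matches and the result is always 0.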

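Well, I have to change that code, because:

I need to store d.value as numeric in the new table, not as integer (the column in my new table is now numeric).
But first I need to correct that regex, because it ruins my numbers such as 76.44 or 66,56 (dot or comma).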

I am not sure what the regex is doing. How can I meet this requirement with a better or new regex?

Answer 1

Choose a variant:

with v(value) as (
    values
    ('12,3'),
    ('12.3'),
    ('123'),
    ('123.'),
    ('.123'),
    ('1.2.3')
    )

select 
    value, 
    value ~ '^(\d+[,\.]\d+|\d+)$' as variant_a,
    value ~ '^(\d*[,\.]\d*|\d+)$' as variant_b,
    value ~ '^\d+[,\.]\d+$' as variant_c
from v;

 value | variant_a | variant_b | variant_c 
-------+-----------+-----------+-----------
 12,3  | t         | t         | t
 12.3  | t         | t         | t
 123   | t         | t         | f
 123.  | f         | t         | f
 .123  | f         | t         | f
 1.2.3 | f         | f         | f
(6 rows)

To convert a string with a dot or a comma to numeric, use replace():

select replace(value, ',', '.')::numeric;
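Applied to the expression from the question, variant_a combined with replace() might look like the following sketch. The inline v(value) list is a hypothetical stand-in for the real d.value column, the 0 fallback simply mirrors the original ELSE branch, and the pattern is written for standard_conforming_strings = on.

```sql
-- Sketch: validate first, then normalize the decimal comma and cast.
-- v(value) is illustrative sample data, not the real table.
WITH v(value) AS (
    VALUES ('76.44'), ('66,56'), ('123'), ('1.2.3'), ('abc')
)
SELECT value,
       CASE WHEN value ~ '^(\d+[,\.]\d+|\d+)$'      -- variant_a from above
            THEN replace(value, ',', '.')::numeric   -- comma becomes dot, then cast
            ELSE 0                                   -- same fallback as the original code
       END AS monto
FROM   v;
```

Here '1.2.3' and 'abc' fail the pattern and fall through to 0, so the cast is never attempted on them.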

Answer 2

You should state your Postgres version and the version the code was written for (if you know it). The double backslash in \\d suggests an old version with standard_conforming_strings = off. The manual:

Beginning in PostgreSQL 9.1, the default is on (prior releases defaulted to off).

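In modern versions with standard_conforming_strings = on, this string makes little sense as a regular expression: '^\\d+$'. To detect strings consisting of one or more digits, use either E'^\\d+$' (prefixed with E) or '^\d+$'. Details: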

Insert text with single quotes in PostgreSQL
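The effect of the backslash count can be checked directly. The sketch below assumes standard_conforming_strings = on; the column aliases are made up for illustration.

```sql
-- Sketch: how the number of backslashes changes the pattern.
-- Assumes standard_conforming_strings = on (verify with SHOW).
SHOW standard_conforming_strings;

SELECT '123' ~ '^\d+$'    AS single_backslash,  -- \d is the digit class: matches
       '123' ~ E'^\\d+$'  AS escape_string,     -- E'' turns \\ into \, so the pattern is the same
       '123' ~ '^\\d+$'   AS double_backslash;  -- literal backslash followed by d's: no match
```

With the setting on, the first two columns come back true and the third false, which is why the original pattern sends every value to the ELSE branch on a modern server.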

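Integer literals also allow an optional leading sign for negative / positive numbers. And leading / trailing white space is allowed (trimmed automatically) in Postgres.
So this is the complete regular expression for integer: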

CASE WHEN d.value ~ '^\s*[-+]?\d+\s*$' THEN d.value::int ELSE 0 END

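The regular expression explained:

^ .. start of string
\s .. class shorthand for [[:space:]] (white space)
* .. quantifier for 0 or more times
[-+] .. character class consisting of + and -
? .. quantifier for 0 or 1 times
\d .. class shorthand for [[:digit:]] (digits)
+ .. quantifier for 1 or more times
\s* .. same as above
$ .. end of string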
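A quick way to see the integer pattern at work is to run it over a few hand-picked strings. This is only a sketch: t(txt) is illustrative sample data, and standard_conforming_strings = on is assumed.

```sql
-- Sketch: exercise the integer pattern before casting.
-- t(txt) is hypothetical sample data, not from the original tables.
WITH t(txt) AS (
    VALUES ('42'), ('  -7 '), ('+3'), ('4.2'), ('1e3'), ('abc'), ('')
)
SELECT '|' || txt || '|'          AS text_with_delim,
       txt ~ '^\s*[-+]?\d+\s*$'   AS is_integer,
       CASE WHEN txt ~ '^\s*[-+]?\d+\s*$'
            THEN txt::int          -- only reached for strings that passed the check
            ELSE 0
       END                        AS monto
FROM   t;
```

Note that the pattern only checks the shape of the string: a run of digits too long for the integer range would still pass it and then fail in the cast.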

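Now that we know the basics, read more in the manual I linked to. Consider the syntax rules for numeric string literals. And while the manual says this about legal numeric constants: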

There cannot be any spaces or other characters embedded in the constant

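That is because a numeric constant is not quoted, so white space is not possible. It does not apply when casting strings. White space is tolerated: leading, trailing, and right after the exponent character.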

So these are all legal strings for the cast to numeric:

'^\s*[-+]?\d*\.?\d+(?:[eE]\s*[-+]?\d+)?\s*$'

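The only new element is the parentheses (()) to denote the contained regular expression as an atom. Since we are not interested in back references, use the "non-capturing" form (?:...) and append a question mark, (?:[eE]\s*[-+]?\d+)?, to mean: the "exponential" part may or may not be present, as a whole.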

This assumes the dot (.) as decimal separator. You might use a comma (,) instead, or [,\.] to allow either. But only the dot is legal for the cast.
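Since only the dot survives the cast, a pattern that also admits a comma has to be paired with replace(). The following is a minimal sketch under stated assumptions: t(txt) is hypothetical sample data, standard_conforming_strings = on is assumed, and the pattern deliberately omits the optional white space after the exponent character to stay conservative.

```sql
-- Sketch: accept dot or comma as separator, normalize, then cast to numeric.
-- t(txt) is illustrative sample data standing in for d.value.
WITH t(txt) AS (
    VALUES ('76.44'), ('66,56'), ('  -1 '), ('.34'), ('5e6'), ('1.2.3'), ('abc')
)
SELECT '|' || txt || '|' AS text_with_delim,
       CASE WHEN txt ~ '^\s*[-+]?\d*[,.]?\d+(?:[eE][-+]?\d+)?\s*$'
            THEN replace(txt, ',', '.')::numeric   -- comma becomes dot before the cast
            ELSE 0                                 -- fallback kept from the original code
       END AS monto
FROM   t;
```

Whether 0 is a sensible fallback for a numeric column, as opposed to NULL, depends on how the target table is used; the sketch keeps it only to stay close to the existing query.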

Test:

SELECT '|' || txt || '|' As text_with_delim
     , txt ~ '^\s*[-+]?\d*\.?\d+([eE]\s*[-+]?\d+)?\s*$' As test
     , txt::numeric AS number
FROM   unnest ('{1, 123, 000, "  -1     ", +2, 1.2, .34, 5e6, " .5e   -6  "}'::text[]) txt;

Result:

 text_with_delim | test |  number
-----------------+------+-----------
 |1|             | t    |         1
 |123|           | t    |       123
 |000|           | t    |         0
 |  -1     |     | t    |        -1
 |+2|            | t    |         2
 |1.2|           | t    |       1.2
 |.34|           | t    |      0.34
 |5e6|           | t    |   5000000
 | .5e   -6  |   | t    | 0.0000005

Or you might use to_number() to convert strings of an arbitrary given format.
