Regular expression to extract integers

Question

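I need help extracting numbers from a column that stores text. The text can also contain prices, which I do not want to extract. For example, given the following text: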

text = "I have the following products 4526 and 4. The first one I paid  while the second one 30€. 
Here the link for the discount of 3.99: https://www.xysyffd.coom/7574@5757"

My expected result is

[4526, 4]

The regular expression I am currently using is

'(?<![\d.])[0-9]+(?![\d.])'

It discards 3.99, but it still picks up the price and the numbers in the link. Any suggestions on how to update the regex?
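
To make the problem concrete, here is a minimal sketch that runs the pattern above on the sample text. The names `text`, `current_pattern` and `current_matches` are only illustrative, and the price that is missing from the sample sentence is left out as in the original.

```python
import re

# Sample text from the question (the first price is absent in the original).
text = (
    "I have the following products 4526 and 4. The first one I paid  while "
    "the second one 30€. \n"
    "Here the link for the discount of 3.99: https://www.xysyffd.coom/7574@5757"
)

# Pattern currently in use: digits not adjacent to another digit or a dot.
current_pattern = r"(?<![\d.])[0-9]+(?![\d.])"

current_matches = re.findall(current_pattern, text)
print(current_matches)  # ['4526', '30', '7574', '5757']
```

Besides keeping 30 and the two numbers from the link, this pattern also drops the 4 at the end of the first sentence, because the trailing lookahead rejects any number followed by a period.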

Answer 1

You can assert a whitespace boundary on the left, and exclude matches that are followed by the euro sign or by a dot and a digit.

(?<!\S)\d+\b(?!€|\.\d)

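(?<!\S) Negative lookbehind, asserts that there is no non-whitespace character directly to the left (a whitespace boundary)
\d+ Matches 1+ digits
\b A word boundary to prevent a partial match
(?!€|\.\d) Negative lookahead, asserts that what is directly to the right is not € or a . followed by a digit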

Example

import re
 
pattern = r"(?<!\S)\d+\b(?!€|\.\d)"
s = ("I have the following products 4526 and 4. The first one I paid  while the second one 30€. \n"
    "Here the link for the discount of 3.99: https://www.xysyffd.coom/7574@5757\n")
 
print(re.findall(pattern, s))

Output

['4526', '4']
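
Note that re.findall returns strings, while the expected result in the question is a list of integers taken from a column of text. A minimal sketch of one way to bridge that gap follows; the helper `extract_integers` and the list `rows` are hypothetical names introduced here for illustration, and the second row is made-up data.

```python
import re

# Pattern from the answer above, compiled once so it can be reused per row.
INTEGER_PATTERN = re.compile(r"(?<!\S)\d+\b(?!€|\.\d)")


def extract_integers(text):
    """Return the standalone integers found in text as a list of int."""
    return [int(match) for match in INTEGER_PATTERN.findall(text)]


# Hypothetical column values; the first is the sample text from the question.
rows = [
    (
        "I have the following products 4526 and 4. The first one I paid  while "
        "the second one 30€. \n"
        "Here the link for the discount of 3.99: https://www.xysyffd.coom/7574@5757"
    ),
    "No product numbers here, only a price of 12.50",
]

extracted = [extract_integers(row) for row in rows]
print(extracted)  # [[4526, 4], []]
```

If the column lives in a DataFrame, the same helper could presumably be applied to each cell, but that part is not covered by the original answer.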

Answer 2

Use

(?<!\S)[0-9]+(?!\.\d|[^\s!?.])

See the proof.

Explanation

--------------------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
    \S                       non-whitespace (all but \n, \r, \t, \f,
                             and " ")
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  [0-9]+                   any character of: '0' to '9' (1 or more
                           times (matching the most amount possible))
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    [^\s!?.]                 any character except: whitespace (\n,
                             \r, \t, \f, and " "), '!', '?', '.'
--------------------------------------------------------------------------------
  )                        end of look-ahead

Python code:

import re
regex = r"(?<!\S)[0-9]+(?!\.\d|[^\s!?.])"
test_str = "I have the following products 4526 and 4. The first one I paid  while the second one 30€. \nHere the link for the discount of 3.99: https://www.xysyffd.coom/7574@5757"
matches = re.findall(regex, test_str)
print(matches)

Result: ['4526', '4']
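
Both answers return the same list for the sample text, but they differ in what they tolerate to the right of a number. The sketch below compares them on a made-up sentence; `sample`, `pattern_1`, `pattern_2` and the result names are assumptions for illustration only.

```python
import re

pattern_1 = r"(?<!\S)\d+\b(?!€|\.\d)"          # from Answer 1
pattern_2 = r"(?<!\S)[0-9]+(?!\.\d|[^\s!?.])"  # from Answer 2

# Hypothetical sentence with numbers followed by different characters.
sample = "Codes 12, 7! plus 5$ and 30€"

result_1 = re.findall(pattern_1, sample)
result_2 = re.findall(pattern_2, sample)

print(result_1)  # ['12', '7', '5']
print(result_2)  # ['7']
```

The first pattern only rejects a trailing € or a decimal part, so it keeps numbers followed by a comma or another currency symbol. The second pattern accepts a number only when it is followed by whitespace, the end of the string, or one of ! ? . so it is stricter. Which behaviour is preferable depends on the data in the column.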
