Regular expression to extract integers
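I need help extracting numbers from a column that stores text. The text can also contain prices, which I do not want to extract. For example, given the following text: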
text = ("I have the following products 4526 and 4. The first one I paid while the second one 30€.\n"
        "Here the link for the discount of 3.99: https://www.xysyffd.coom/7574@5757")
My expected result is
[4526, 4]
The regular expression I am using right now is
'(?<![\d.])[0-9]+(?![\d.])'
It discards 3.99, but it still picks up the price and the numbers in the link.
Any suggestions on how to update the regex?
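For reference, here is a small sketch of what the current pattern returns on the sample text. The variable names `current` and `current_matches` are only illustrative. Note that besides keeping 30 and the numbers in the link, the pattern also drops the 4, because that 4 is directly followed by a sentence-ending period and the lookahead rejects any trailing dot.

import re

text = ("I have the following products 4526 and 4. The first one I paid while the second one 30€.\n"
        "Here the link for the discount of 3.99: https://www.xysyffd.coom/7574@5757")

# The pattern from the question: digits not preceded or followed by a digit or a dot.
current = r"(?<![\d.])[0-9]+(?![\d.])"

current_matches = re.findall(current, text)
print(current_matches)  # ['4526', '30', '7574', '5757']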
You can assert a whitespace boundary to the left, and exclude matches that are followed by a euro sign or by a dot and a digit.
(?<!\S)\d+\b(?!€|\.\d)
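(?<!\S)
Negative lookbehind: assert that what is directly to the left is not a non-whitespace character (a whitespace boundary)
\d+
Match 1+ digits
\b
A word boundary, to prevent partial matches
(?!€|\.\d)
Negative lookahead: assert that what is directly to the right is neither € nor a . followed by a digit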
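The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is only a sketch: the name `verbose_pattern` and the sample string are made up for illustration.

import re

verbose_pattern = re.compile(r"""
    (?<!\S)      # left side: start of string or a whitespace character
    \d+          # one or more digits
    \b           # word boundary, so "12abc" is not partially matched
    (?!€|\.\d)   # not followed by a euro sign, nor by a dot and a digit
""", re.VERBOSE)

# Hypothetical sample: 12 and the trailing 7 are kept, 3.5 and 30€ are rejected.
verbose_matches = verbose_pattern.findall("a 12 b 3.5 c 30€ d 7.")
print(verbose_matches)  # ['12', '7']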
Example
import re
pattern = r"(?<!\S)\d+\b(?!€|\.\d)"
s = ("I have the following products 4526 and 4. The first one I paid while the second one 30€. \n"
"Here the link for the discount of 3.99: https://www.xysyffd.coom/7574@5757\n")
print(re.findall(pattern, s))
Output
['4526', '4']
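re.findall returns strings, while the question asks for integers taken from a column of text. A minimal sketch of that last step, assuming the column can be treated as a plain list of strings; `extract_integers`, `INTEGER_RE`, and `rows` are hypothetical names and not part of the answer:

import re

# Pattern from the answer above, compiled once for reuse.
INTEGER_RE = re.compile(r"(?<!\S)\d+\b(?!€|\.\d)")

def extract_integers(text):
    """Return the standalone integers found in text as a list of int."""
    return [int(m) for m in INTEGER_RE.findall(text)]

# Hypothetical rows standing in for the text column.
rows = [
    "I have the following products 4526 and 4. The first one I paid while the second one 30€.",
    "Here the link for the discount of 3.99: https://www.xysyffd.coom/7574@5757",
    "No numbers here",
]

results = [extract_integers(row) for row in rows]
print(results)  # [[4526, 4], [], []]

If the data lives in a pandas DataFrame, the same function could presumably be passed to the column's apply method, but that is outside what the answer shows.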
Use
(?<!\S)[0-9]+(?!\.\d|[^\s!?.])
Explanation
--------------------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
    \S                       non-whitespace (all but \n, \r, \t, \f,
                             and " ")
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  [0-9]+                   any character of: '0' to '9' (1 or more
                           times (matching the most amount possible))
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    [^\s!?.]                 any character except: whitespace (\n,
                             \r, \t, \f, and " "), '!', '?', '.'
--------------------------------------------------------------------------------
  )                        end of look-ahead
import re
regex = r"(?<!\S)[0-9]+(?!\.\d|[^\s!?.])"
test_str = "I have the following products 4526 and 4. The first one I paid while the second one 30€. \nHere the link for the discount of 3.99: https://www.xysyffd.coom/7574@5757"
matches = re.findall(regex, test_str)
print(matches)
Result: ['4526', '4']
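Both patterns give the same result on the sample text, but they are not equivalent. The first one only rules out a euro sign or a decimal part after the number, while the second one requires the number to be followed by whitespace, '!', '?', '.', or the end of the string. A rough comparison on a made-up sentence (the sample string and variable names are assumptions for illustration):

import re

first = re.compile(r"(?<!\S)\d+\b(?!€|\.\d)")
second = re.compile(r"(?<!\S)[0-9]+(?!\.\d|[^\s!?.])")

# Hypothetical sample with a comma, another currency sign, and a percent sign.
sample = "Order 12, then 7! Cost 30$ or 5% off, ref 8."

first_matches = first.findall(sample)
second_matches = second.findall(sample)
print(first_matches)   # ['12', '7', '30', '5', '8']
print(second_matches)  # ['7', '8']

Which behavior is preferable depends on the data: the first pattern keeps numbers followed by a comma but lets prices in other currencies through, while the second is stricter about what may follow a number.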