Why does this regular expression sometimes get stuck and freeze my program? What alternative could I use?

Question

import re

input_text_to_check = input()  # Input (input() already returns a str)

regex_patron_m1 = r"\s*((?:\w+\s*)+) \s*\¿?(?:would not be what |would not be that |would not be that |would not be the |would not be this |would not be the |would not be some)\s*((?:\w+\s*)+)\s*\??"
m1 = re.search(regex_patron_m1, input_text_to_check, re.IGNORECASE)  # Check whether the regex matches, to decide whether to enter the block below

#Validation
if m1:
    word, association = m1.groups()
    word = word.strip()
    association = association.strip()

    print(repr(word))
    print(repr(association))

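Although the regex is a bit long, I think checking 10 or 20 alternatives inside (?: | | | | ) should not be much work for a modern PC. That is why I suspect the problem is in the first \s*((?:\w+\s*)+) \s* and/or the last \s*((?:\w+\s*)+)\s*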

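Here is an example of an input that makes the regex get stuck:

"the blue skate would not be that product that you want buy now"

And here is an example that does not get stuck: "the blue skate would not be that product"

It gives me the words I want to extract: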

'the blue skate'
'product'

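Is there another way to extract what comes before and after those alternatives, one that does not sometimes freeze? What could be causing the problem with the regex I wrote?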

Answer 1

Based on the linked explanation of 'Catastrophic Backtracking', I think the problem with your regex is the following:

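The text you are trying to match with ((?:\w+\s*)+) can be matched in many different ways. Suppose you apply ((?:\w+\s*)+) to the input string abc. It can be matched as any of:

(a and 0 spaces)(b and 0 spaces)(c and 0 spaces)
(a and 0 spaces)(bc and 0 spaces)
(ab and 0 spaces)(c and 0 spaces)
(abc and 0 spaces)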

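That is fine as long as ((?:\w+\s*)+) is all you need to match. But once something else has to follow it (such as the 10 or so alternatives in your case), the regex engine has to do a great deal of backtracking: every time the rest of the pattern fails, it goes back and tries another way of splitting the text. See the linked explanation for more detail.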
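To get a feel for how quickly the number of possible splits grows, here is a small sketch that enumerates them for a run of word characters. It does not use the regex engine at all; the helper name `segmentations` is made up for this illustration, and each chunk stands for one iteration of `\w+` followed by zero whitespace.

```python
def segmentations(word):
    """Return every way to cut `word` into consecutive non-empty chunks."""
    if not word:
        return [[]]  # exactly one way to split the empty string
    result = []
    for cut in range(1, len(word) + 1):
        head = word[:cut]
        for rest in segmentations(word[cut:]):
            result.append([head] + rest)
    return result


for text in ("abc", "abcd", "product"):
    print(text, len(segmentations(text)))
# abc 4
# abcd 8
# product 64
```

The count doubles with every extra character (2 to the power n-1 for n characters), which is why a slightly longer input can go from instant to apparently frozen when the engine has to try these splits one after another.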

Removing the + after \w gives a regex that works for both of the inputs provided:


"\s*((?:\w\s*)+) \s*\¿?(?:would not be what |would not be that |would not be that |would not be the |would not be this |would not be the |would not be some)\s*((?:\w\s*)+)\s*\??"gm
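As a rough check in Python, the pattern above can be tried on the two inputs from the question. This is only a sketch: the names `FIXED_PATTERN` and `extract` are invented here, the repeated alternatives are listed once, and the trailing `gm` is assumed to be regex-tester flag notation, so it is dropped (Python's `re` takes flags as arguments).

```python
import re

# The answer's pattern: \w instead of \w+ inside both repeated groups.
FIXED_PATTERN = re.compile(
    r"\s*((?:\w\s*)+) \s*¿?"
    r"(?:would not be what |would not be that |would not be the "
    r"|would not be this |would not be some)"
    r"\s*((?:\w\s*)+)\s*\??",
    re.IGNORECASE,
)


def extract(text):
    """Return (word, association), or None when the text does not match."""
    match = FIXED_PATTERN.search(text)
    if match is None:
        return None
    word, association = match.groups()
    return word.strip(), association.strip()


print(extract("the blue skate would not be that product"))
# ('the blue skate', 'product')
print(extract("the blue skate would not be that product that you want buy now"))
# ('the blue skate', 'product that you want buy now')
```

Because each repetition now consumes exactly one word character, there is only one way to split a run of letters, so the engine no longer has an exponential number of splits to retry.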

Does this work on your machine and for all of your test cases?
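On the request for an alternative, one possible sketch (not part of the answer above) is to avoid nesting one quantifier inside another altogether and capture lazily on both sides of the fixed phrase. The names `ALT_PATTERN` and `split_on_phrase` are hypothetical, and the behaviour differs slightly from the original: `.` accepts any character rather than only `\w` and whitespace, and the pattern is anchored to the end of the input.

```python
import re

# Lazy captures around the fixed phrase; no quantifier is nested in another.
ALT_PATTERN = re.compile(
    r"(.+?)\s+¿?would not be (?:what|that|the|this|some)\b\s*(.+?)\s*\??\s*$",
    re.IGNORECASE,
)


def split_on_phrase(text):
    """Return (before, after) around the phrase, or None if it is absent."""
    match = ALT_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


print(split_on_phrase("the blue skate would not be that product that you want buy now"))
# ('the blue skate', 'product that you want buy now')
print(split_on_phrase("the blue skate would not be that product?"))
# ('the blue skate', 'product')
```

Whether everything after the phrase should be captured, or only the first word, depends on what the output for the longer input is meant to be, which the question does not state.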


Tags: python, regex, string, validation, python-re