Regex match (\w+) to capture single words delimited by ||| - Python

Question

I am trying to match the case where a single word is followed by \s|||\s, and then another single word is followed by \s|||\s, so I am using this regex:

single_word_regex = r'(\w+)+\s\|\|\|\s(\w+)\s\|\|\|\s.*'

When I try to match this string, the regex match hangs or takes minutes (possibly getting stuck in some sort of "deep loop"):

>>> import re
>>> import time
>>> single_word_regex = r'(\w+)+\s\|\|\|\s(\w+)\s\|\|\|\s.*'        
>>> x = u'amornratchatchawansawangwong ||| amornratchatchawansawangwong . ||| 0.594819 0.5 0.594819 0.25 ||| 0-0 0-1 ||| 1 1 1 ||| |||'
>>> z = u'amor 我 ||| amor . i ||| 0.594819 0.0585231 0.594819 0.0489472 ||| 0-0 0-1 1-2 ||| 2 2 2 ||| |||'
>>> y = u'amor ||| amor ||| 0.396546 0.0833347 0.29741 0.08 ||| 0-0 0-1 ||| 3 4 2 ||| |||'
>>> re.match(single_word_regex, z, re.U)                                              
>>> re.match(single_word_regex, y, re.U)                                          
<_sre.SRE_Match object at 0x105b879c0>
>>> start = time.time(); re.match(single_word_regex, y, re.U); print time.time() - start
9.60826873779e-05
>>> start = time.time(); re.match(single_word_regex, x, re.U); print time.time() - start # It hangs...

Why is it taking so long?

Is there a better/simpler regex to capture this condition: len(x.split(' ||| ')[0].split()) == 1 == len(x.split(' ||| ')[1].split())?
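For reference, the split-based condition can be sketched as a small helper. This is only an illustration of the intent: the function name `has_single_word_fields` and the `sep` parameter are made up here, and the second field is assumed to be index 1. Note that `str.split()` counts whitespace-separated tokens, so a token such as `amor.` would count as one word here even though `\w+` alone would not cover it.

```python
def has_single_word_fields(line, sep=' ||| '):
    """Return True if the first two sep-delimited fields each hold one token.

    Hypothetical helper, not part of any library.
    """
    fields = line.split(sep)
    if len(fields) < 2:
        # Fewer than two fields: the condition cannot hold.
        return False
    return len(fields[0].split()) == 1 == len(fields[1].split())


x = 'amornratchatchawansawangwong ||| amornratchatchawansawangwong . ||| 0.594819 0.5 0.594819 0.25 ||| 0-0 0-1 ||| 1 1 1 ||| |||'
y = 'amor ||| amor ||| 0.396546 0.0833347 0.29741 0.08 ||| 0-0 0-1 ||| 3 4 2 ||| |||'

print(has_single_word_fields(x))  # second field is "amornratchatchawansawangwong ." (two tokens)
print(has_single_word_fields(y))
```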

Answer 1

Note that the r'(\w+)+' pattern by itself does not cause catastrophic backtracking. It only becomes "evil" inside a longer expression, especially when it sits near the start of the pattern: if the subsequent subpatterns fail, the engine backtracks into it, and since the 1+ quantifier inside the group is itself quantified with +, there is a huge number of ways to split the same word that the engine has to try before it can report a failure. You may have a look at your regex demo and click the regex debugger on the left to see an example of the regex engine's behavior.
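The growth described above can be observed with a rough timing sketch. The subject strings are made up for illustration (a first word of n letters, followed by a second field that contains " ." so the overall match fails), and the helper name `time_failed_match` is hypothetical. On CPython's backtracking engine the time is expected to grow roughly exponentially with n, though exact numbers will vary by machine and version.

```python
import re
import time

# Pattern from the question: the nested quantifier (\w+)+ is the culprit.
NESTED = re.compile(r'(\w+)+\s\|\|\|\s(\w+)\s\|\|\|\s.*')


def time_failed_match(n):
    """Time a failing match whose first word is n characters long."""
    # The " ." after the second word makes the overall match fail,
    # which forces the engine to backtrack into (\w+)+.
    subject = 'a' * n + ' ||| b . ||| rest'
    start = time.perf_counter()
    result = NESTED.match(subject)
    return result, time.perf_counter() - start


# Keep n small: each extra character roughly doubles the work.
for n in (8, 12, 16):
    result, seconds = time_failed_match(n)
    print(n, result, round(seconds, 4))
```

The first word in the question's string x is 28 characters long, which is why the same experiment appears to hang there.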

The current regex can be written as

r'^(\w+)\s\|{3}\s(\w+)\s\|{3}\s(.*)'

See the regex demo: there will be a match if you remove the space and the . in the second field.

Details:

^ - start of the string (not needed with re.match)
(\w+) - (Group 1) 1+ letters/digits/underscores
\s - a whitespace character
\|{3} - 3 pipe symbols
\s(\w+)\s\|{3}\s - see above ((\w+) creates Group 2)
(.*) - (Group 3) any 0+ characters other than a newline.
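As a quick check, here is a minimal Python 3 sketch that applies the rewritten pattern to the three sample strings from the question. The helper name `parse` is made up for illustration. In Python 3, str patterns are Unicode-aware by default, so the re.U flag is not needed.

```python
import re

# Rewritten pattern: a single + on \w, no nested quantifier.
FIXED = re.compile(r'^(\w+)\s\|{3}\s(\w+)\s\|{3}\s(.*)')

x = 'amornratchatchawansawangwong ||| amornratchatchawansawangwong . ||| 0.594819 0.5 0.594819 0.25 ||| 0-0 0-1 ||| 1 1 1 ||| |||'
y = 'amor ||| amor ||| 0.396546 0.0833347 0.29741 0.08 ||| 0-0 0-1 ||| 3 4 2 ||| |||'
z = 'amor 我 ||| amor . i ||| 0.594819 0.0585231 0.594819 0.0489472 ||| 0-0 0-1 1-2 ||| 2 2 2 ||| |||'


def parse(line):
    """Return (word1, word2, rest) if both leading fields are single words, else None."""
    m = FIXED.match(line)
    return m.groups() if m else None


print(parse(x))  # None: the second field contains " ."
print(parse(y))  # ('amor', 'amor', '0.396546 0.0833347 0.29741 0.08 ||| 0-0 0-1 ||| 3 4 2 ||| |||')
print(parse(z))  # None: the first field has two tokens
```

The failing cases now return immediately, since there is only one way for (\w+) to end right before the whitespace that follows the word.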

python

regex

loops

freeze