Regex match (\w+) to capture single words delimited by ||| - Python

I am trying to match lines where a single word is followed by \s|||\s and then another single word is followed by \s|||\s, so I am using this regex:

single_word_regex = r'(\w+)+\s\|\|\|\s(\w+)\s\|\|\|\s.*'

When I try to match this string, the regex match hangs or takes minutes (possibly going into some sort of "deep loop"):

>>> import re
>>> import time
>>> single_word_regex = r'(\w+)+\s\|\|\|\s(\w+)\s\|\|\|\s.*'        
>>> x = u'amornratchatchawansawangwong ||| amornratchatchawansawangwong . ||| 0.594819 0.5 0.594819 0.25 ||| 0-0 0-1 ||| 1 1 1 ||| |||'
>>> z = u'amor 我 ||| amor . i ||| 0.594819 0.0585231 0.594819 0.0489472 ||| 0-0 0-1 1-2 ||| 2 2 2 ||| |||'
>>> y = u'amor ||| amor ||| 0.396546 0.0833347 0.29741 0.08 ||| 0-0 0-1 ||| 3 4 2 ||| |||'
>>> re.match(single_word_regex, z, re.U)                                              
>>> re.match(single_word_regex, y, re.U)                                          
<_sre.SRE_Match object at 0x105b879c0>
>>> start = time.time(); re.match(single_word_regex, y, re.U); print time.time() - start
9.60826873779e-05
>>> start = time.time(); re.match(single_word_regex, x, re.U); print time.time() - start # It hangs...

Why does it take so long?

Is there a better/simpler regex to capture this condition: len(x.split(' ||| ')[0].split()) == 1 == len(x.split(' ||| ')[1].split())?

Note that the r'(\w+)+' pattern by itself does not cause catastrophic backtracking. It is only "evil" inside a longer expression, especially when it sits next to the start of the pattern: if a subsequent subpattern fails, the engine backtracks into this group, and because the + quantifier inside the group is itself quantified with +, there is a huge number of possible ways to split the word that the engine must try before it can report a failure. You may have a look at your regex demo and click the regex debugger on the left to see an example of the regex engine's behavior.
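The growth described above can be observed with a small timing sketch. It is only an illustration: the helper name time_failed_match, the test line and the word lengths are made up for this example, and the lengths are kept short so the script finishes quickly. Absolute timings depend on the machine and Python version.

```python
import re
import time

# The pattern from the question, with the nested quantifier (\w+)+ at the start.
nested_regex = re.compile(r'(\w+)+\s\|\|\|\s(\w+)\s\|\|\|\s.*')

def time_failed_match(word_length):
    """Hypothetical helper: time a match attempt that is guaranteed to fail.

    The second field "a ." holds two tokens, so the part of the pattern after
    the first word can never match and the engine has to backtrack into (\\w+)+.
    """
    line = 'a' * word_length + ' ||| a . ||| rest'
    start = time.perf_counter()
    result = nested_regex.match(line)
    return result, time.perf_counter() - start

results = []
for n in (8, 12, 16, 18):
    result, elapsed = time_failed_match(n)
    results.append((n, result, elapsed))
    # The elapsed time should grow steeply as the first word gets longer.
    print(n, result, '%.6f s' % elapsed)
```

Every attempt returns None, but the time to reach that answer should rise sharply with each extra character in the first word, which is why the 28-character word in x appears to hang.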

The current regex can be written as

r'^(\w+)\s\|{3}\s(\w+)\s\|{3}\s(.*)'

See the regex demo: if you remove the space and the . in the second field, there will be a match.

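Details:

  • ^ - start of the string (not needed with re.match)
  • (\w+) - (Group 1) one or more letters/digits/underscores
  • \s - a whitespace character
  • \|{3} - 3 pipe symbols
  • \s(\w+)\s\|{3}\s - see above ((\w+) creates Group 2)
  • (.*) - (Group 3) any 0+ characters other than a newline.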
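
As a quick check, the rewritten pattern can be run against the three sample strings from the question. This is a minimal sketch written for Python 3, where str patterns are Unicode-aware by default and re.U is not needed. The helper single_words_by_split is a hypothetical name for the split-based condition from the question, included only for comparison. The two checks are not strictly equivalent, because \w+ rejects tokens such as "a-b" that split() would count as one word.

```python
import re

# Rewritten pattern: no nested quantifier, so a non-matching line is rejected quickly.
fixed_regex = re.compile(r'^(\w+)\s\|{3}\s(\w+)\s\|{3}\s(.*)')

x = 'amornratchatchawansawangwong ||| amornratchatchawansawangwong . ||| 0.594819 0.5 0.594819 0.25 ||| 0-0 0-1 ||| 1 1 1 ||| |||'
z = 'amor 我 ||| amor . i ||| 0.594819 0.0585231 0.594819 0.0489472 ||| 0-0 0-1 1-2 ||| 2 2 2 ||| |||'
y = 'amor ||| amor ||| 0.396546 0.0833347 0.29741 0.08 ||| 0-0 0-1 ||| 3 4 2 ||| |||'

def single_words_by_split(line):
    """Hypothetical helper: the split-based condition from the question."""
    fields = line.split(' ||| ')
    if len(fields) < 2:
        return False
    return len(fields[0].split()) == 1 == len(fields[1].split())

for name, line in (('x', x), ('y', y), ('z', z)):
    match = fixed_regex.match(line)
    # Print whether the regex matched next to the split-based answer.
    print(name, match is not None, single_words_by_split(line))

# Only y has a single word in each of the first two fields.
print(fixed_regex.match(y).group(1, 2))
```

Only y matches, and the attempt on x returns immediately instead of hanging.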