Avoid catastrophic backtracking for my current requirement

Question

I have to match certain conditions in a phrase, in the form (group of words)(anything in between)(group of words). For example:

(mirror|reflect|serve|adapt)(\s*\w+\s*\W*\s*)*?(population|client|customer|stakeholder|market|society|culture|consumer|end-user)

So whenever a phrase contains something like "mirror bananas banannas population", I want to match it. Is this the best solution? Is it prone to catastrophic backtracking?

Answer 1

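The (\s*\w+\s*\W*\s*)*? part can cause catastrophic backtracking, because the only mandatory pattern inside the *? quantified group is \w+, and it is surrounded by optional patterns. \s* and \W* can both match an empty string, and adjacent *-quantified patterns such as \s*\W*\s* can match the same characters (whitespace is matched by both \s and \W). Since one stretch of text can be split between the iterations of the group in many different ways, the engine has to try all of them before it can report a failure, which is what leads to catastrophic backtracking.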

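If you test your regex against "mirror banana banannas populatio" (note the missing final "n", so no trailing word can match), you will get a catastrophic backtracking error.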
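The growth can be observed with a small timing sketch. This is only an illustration: the helper name `time_failed_search` and the input lengths are assumptions chosen for the demo, and the actual timings depend on the machine and the Python version.

```python
import re
import time

# The pattern from the question, unchanged.
RISKY = re.compile(
    r"(mirror|reflect|serve|adapt)(\s*\w+\s*\W*\s*)*?"
    r"(population|client|customer|stakeholder|market|society|culture|consumer|end-user)"
)


def time_failed_search(pattern, text):
    """Run one search and return (match, elapsed_seconds).

    Hypothetical helper for this demo only.
    """
    start = time.perf_counter()
    match = pattern.search(text)
    return match, time.perf_counter() - start


if __name__ == "__main__":
    # A leading word followed by one long "word" and no trailing word.
    # Each extra character roughly doubles the number of ways the
    # quantified group can split the word, so keep n small.
    for n in (8, 12, 16):
        text = "mirror " + "a" * n
        match, seconds = time_failed_search(RISKY, text)
        print(f"n={n:2d}  matched={match is not None}  {seconds:.4f}s")
```

When a trailing word is present, the same pattern finds it quickly; the cost only explodes on inputs where the overall match has to fail.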

In your case, since you read the lists of leading/trailing words from a JSON file, the best regex approach is to use a pattern like:

(?:leading_word1|leading_word2|...|leading_wordN)(.*?)(?:trailing_word1|trailing_word2|...|trailing_wordN)

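If you use re.findall (you said you are using Python), the values you need will be in Group 1. Since the pattern has a single capturing group, re.findall returns a list of exactly those Group 1 values.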
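A minimal sketch of building that pattern from two word lists and calling re.findall on it. The function name `build_pattern` is hypothetical, and the lists are hard-coded here as a stand-in for whatever is loaded from the JSON file. re.escape is used so that entries such as "end-user" are treated literally.

```python
import re


def build_pattern(leading_words, trailing_words):
    """Compile (?:lead1|lead2|...)(.*?)(?:trail1|trail2|...).

    Hypothetical helper; the word lists are assumed to be plain strings.
    """
    lead = "|".join(re.escape(word) for word in leading_words)
    trail = "|".join(re.escape(word) for word in trailing_words)
    return re.compile(rf"(?:{lead})(.*?)(?:{trail})")


# Stand-ins for the lists that would be read from the JSON file.
leading = ["mirror", "reflect", "serve", "adapt"]
trailing = [
    "population", "client", "customer", "stakeholder", "market",
    "society", "culture", "consumer", "end-user",
]

pattern = build_pattern(leading, trailing)

# One capturing group, so findall returns the Group 1 values.
found = pattern.findall("mirror bananas banannas population")
print(found)  # [' bananas banannas ']

# The truncated input from above simply yields no match.
not_found = pattern.findall("mirror banana banannas populatio")
print(not_found)  # []
```

Note that . does not match a newline by default, so pass re.DOTALL to re.compile if the text between the two words may span several lines. The pattern as written also matches the words inside longer words (for example "serve" inside "observed"); wrapping the alternations in \b word boundaries would be one way to prevent that, if it matters for your data.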

regex

regex-greedy