How can I make Python re work like grep for repeating groups?

Question

I have the following string:

seq = 'MNRYLNRQRLYNMYRNKYRGVMEPMSRMTMDFQGRYMDSQGRMVDPRYYDHYGRMHDYDRYYGRSMFNQGHSMDSQRYGGWMDNPERYMDMSGYQMDMQGRWMDAQGRYNNPFSQMWHSRQGH'

It is also saved in a file called seq.dat. If I use the following grep command

grep '\([MF]D.\{4,6\}\)\{3,10\}' seq.dat

I get the following matched string:

MDNPERYMDMSGYQMDMQGRWMDAQGRYN

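This is exactly what I want. In other words, I want to match as many consecutive repeats of the string [MF]D.{4,6} as possible. I don't want to match cases with fewer than 3 consecutive repeats, but I want it to be able to capture up to 10.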

Now I'm trying to do the same thing in Python. I have

p = re.compile("(?:[MF]D.{4,6}){3,10}")

Trying p.search(seq) returns

MDNPERYMDMSGYQMDMQGRWM

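This is close to the answer I want, but the final MDAQGRYN is missing. My guess is that .{4,6} consumes the M, which in turn prevents {3,10} from capturing the fourth occurrence of ([MF]D.{4,6}); since I only asked for at least 3 repetitions, the engine is satisfied and stops.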
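For reference, the behaviour described above can be reproduced with a short script. This is only a minimal sketch using the string and pattern from the question; the variable names p and m are just for illustration.

```python
import re

seq = 'MNRYLNRQRLYNMYRNKYRGVMEPMSRMTMDFQGRYMDSQGRMVDPRYYDHYGRMHDYDRYYGRSMFNQGHSMDSQRYGGWMDNPERYMDMSGYQMDMQGRWMDAQGRYNNPFSQMWHSRQGH'

p = re.compile(r"(?:[MF]D.{4,6}){3,10}")
m = p.search(seq)

print(m.group())                      # MDNPERYMDMSGYQMDMQGRWM

# The last character consumed and the character right after the match
# together form the "MD" that would have started the fourth repetition.
print(seq[m.end() - 1:m.end() + 1])   # MD
```

The second print supports the guess: the greedy .{4,6} in the third repetition swallowed the M that the fourth repetition needed.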

How can I make a Python regex behave like grep?

Answer 1

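There is a fundamental difference between POSIX ("text-directed") and NFA ("regex-directed") engines. A POSIX engine (grep here uses the POSIX BRE regex flavor, which is its default) parses the input text, applying the regex to it, and returns the longest match possible. An NFA engine (the Python re engine is one) does not re-consume text (backtrack) as long as the subsequent parts of the pattern match. Here, once {3,10} is satisfied by three repetitions and the rest of the pattern succeeds, the engine returns that match without looking for a longer one.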

See the reference on regex-directed and text-directed engines:

A regex-directed engine walks through the regex, attempting to match the next token in the regex to the next character. If a match is found, the engine advances through the regex and the subject string. If a token fails to match, the engine backtracks to a previous position in the regex and the subject string where it can try a different path through the regex... Modern regex flavors using regex-directed engines have lots of features such as atomic grouping and possessive quantifiers that allow you to control this backtracking.

A text-directed engine walks through the subject string, attempting all permutations of the regex before advancing to the next character in the string. A text-directed engine never backtracks. Thus, there isn’t much to discuss about the matching process of a text-directed engine. In most cases, a text-directed engine finds the same matches as a regex-directed engine.

The last sentence says "in most cases", not in all cases, and your example is a good illustration of where the two can differ.
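To make the difference concrete, longest-match semantics can be emulated with the standard re module by brute force. This is only a sketch under my own assumptions, not how grep is implemented: the helper name leftmost_longest is hypothetical, and it works by trying every candidate end position (longest first) with Pattern.fullmatch, which forces the backtracking engine to account for the whole slice.

```python
import re

seq = 'MNRYLNRQRLYNMYRNKYRGVMEPMSRMTMDFQGRYMDSQGRMVDPRYYDHYGRMHDYDRYYGRSMFNQGHSMDSQRYGGWMDNPERYMDMSGYQMDMQGRWMDAQGRYNNPFSQMWHSRQGH'


def leftmost_longest(pattern, text):
    """Hypothetical helper: leftmost match, extended to the longest end."""
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):
        first = compiled.match(text, start)
        if first is None:
            continue  # nothing matches at this start position
        # Try candidate ends from longest to shortest. fullmatch succeeds
        # only if the pattern can consume exactly text[start:end].
        for end in range(len(text), first.end() - 1, -1):
            if compiled.fullmatch(text, start, end):
                return text[start:end]
    return None


pattern = r"(?:[MF]D.{4,6}){3,10}"
print(re.search(pattern, seq).group())   # MDNPERYMDMSGYQMDMQGRWM
print(leftmost_longest(pattern, seq))    # MDNPERYMDMSGYQMDMQGRWMDAQGRYN
```

The unchanged pattern gives the grep result once the longest match is requested, which suggests the difference lies in the engine semantics rather than in the pattern itself. The helper makes on the order of len(text) squared regex calls, so it is meant as an illustration rather than something to use on large inputs.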

To avoid consuming an M or F that is immediately followed by D, I suggest using

(?:[MF]D(?:(?![MF]D).){4,6}){3,10}

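See the regex demo. Details:

(?: - start of the outer non-capturing container group:
- [MF]D - M or F, then D
- (?:(?![MF]D).){4,6} - any character (other than a line break), repeated four to six times, that does not start an MD or FD character sequence
){3,10} - end of the outer group, repeated 3 to 10 times.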
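As a quick check, here is a minimal sketch that applies the suggested pattern to the string from the question (the names fixed and match are just for illustration):

```python
import re

seq = 'MNRYLNRQRLYNMYRNKYRGVMEPMSRMTMDFQGRYMDSQGRMVDPRYYDHYGRMHDYDRYYGRSMFNQGHSMDSQRYGGWMDNPERYMDMSGYQMDMQGRWMDAQGRYNNPFSQMWHSRQGH'

# The negative lookahead stops the filler from running into the next MD/FD,
# so each repetition ends right before the next one begins.
fixed = re.compile(r"(?:[MF]D(?:(?![MF]D).){4,6}){3,10}")
match = fixed.search(seq)

print(match.group())   # MDNPERYMDMSGYQMDMQGRWMDAQGRYN
```

This is the same string that grep reported, including the final MDAQGRYN.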

By the way, if you only want to match uppercase ASCII letters, replace . with [A-Z].
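A small sketch of how the two variants differ; the sample string below is made up for illustration and is not from the question:

```python
import re

seq = 'MNRYLNRQRLYNMYRNKYRGVMEPMSRMTMDFQGRYMDSQGRMVDPRYYDHYGRMHDYDRYYGRSMFNQGHSMDSQRYGGWMDNPERYMDMSGYQMDMQGRWMDAQGRYNNPFSQMWHSRQGH'

any_char = re.compile(r"(?:[MF]D(?:(?![MF]D).){4,6}){3,10}")
letters_only = re.compile(r"(?:[MF]D(?:(?![MF]D)[A-Z]){4,6}){3,10}")

# On the all-uppercase sequence both variants find the same match.
print(any_char.search(seq).group() == letters_only.search(seq).group())  # True

# Hypothetical input whose filler contains lowercase letters and digits:
# "." accepts it, while [A-Z] does not.
sample = "MDab12MDcdefMDghij"
print(any_char.search(sample).group())   # MDab12MDcdefMDghij
print(letters_only.search(sample))       # None
```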


python

regex

grep