Replace block of consecutive lines starting with same pattern

Question

I want to match (and replace, using a custom replacement function) each block of consecutive lines that all start with foo. This nearly works:

import re

s = """bar6387
bar63287
foo1234
foohelloworld
fooloremipsum
baz
bar
foo236
foo5382
bar
foo879"""

def f(m):
    print(m)

s = re.sub('(foo.*\n)+', f, s)
print(s)
# <re.Match object; span=(17, 53), match='foo1234\nfoohelloworld\nfooloremipsum\n'>
# <re.Match object; span=(61, 76), match='foo236\nfoo5382\n'>

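But it fails to recognize the last block, apparently because it is the last line and has no \n at the end.

Is there a cleaner way to match a block of one or more lines that all start with the same pattern foo?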

Answer 1

Here is a re.findall approach:

import re

s = """bar6387
bar63287
foo1234
foohelloworld
fooloremipsum
baz
bar
foo236
foo5382
bar
foo879"""

lines = re.findall(r'^foo.*(?:\nfoo.*(?=\n|$))*', s, flags=re.M)
print(lines)
# ['foo1234\nfoohelloworld\nfooloremipsum',
#  'foo236\nfoo5382',
#  'foo879']

The regex above runs in multiline mode and says to match:

^                     from the start of a line
foo                   "foo"
.*                    consume the rest of the line
(?:\nfoo.*(?=\n|$))*  match newline and another "foo" line, 0 or more times
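To make the breakdown above easier to check, here is a small sketch of the same pattern written with re.VERBOSE so that each part carries its own comment. The name BLOCK is only illustrative. The last line also compares it with a variant that drops the lookahead; on this sample the two give the same result, since the greedy .* already stops at the end of the line.

```python
import re

# Same sample text as in the question.
s = """bar6387
bar63287
foo1234
foohelloworld
fooloremipsum
baz
bar
foo236
foo5382
bar
foo879"""

# BLOCK is a hypothetical name; the pattern is the one from this answer,
# spelled out in verbose mode (whitespace and comments are ignored).
BLOCK = re.compile(r"""
    ^foo.*          # a line starting with foo, up to but not including its newline
    (?:             # then, zero or more times:
        \nfoo.*     #   a newline followed by another foo line
        (?=\n|$)    #   that ends at a newline or at the end of a line
    )*
""", flags=re.M | re.X)

blocks = BLOCK.findall(s)
print(blocks)

# Variant without the lookahead, for comparison on this input.
no_lookahead = re.findall(r'^foo.*(?:\nfoo.*)*', s, flags=re.M)
print(no_lookahead == blocks)
```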

Edit:

If you need to replace or remove these blocks, use the same pattern with re.sub and a lambda callback:

output = re.sub(r'^foo.*(?:\nfoo.*(?=\n|$))*', lambda m: "BLAH", s, flags=re.M)
print(output)

This prints:

bar6387
bar63287
BLAH
baz
bar
BLAH
bar
BLAH
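Since the question asks for a custom replacement function, here is a minimal sketch of a callback that actually inspects the matched block instead of returning a constant. The function name summarize and the short sample string are made up for illustration; the pattern is the one used above.

```python
import re

# Short hypothetical sample, not the string from the question.
s = "bar\nfoo1\nfoo2\nbaz\nfoo3"

def summarize(m):
    # m.group() is the whole matched block; count its lines
    # and return the text that should replace the block.
    n = len(m.group().split("\n"))
    return f"<{n} foo line(s)>"

output = re.sub(r'^foo.*(?:\nfoo.*(?=\n|$))*', summarize, s, flags=re.M)
print(output)
# bar
# <2 foo line(s)>
# baz
# <1 foo line(s)>
```

The callback receives a match object and has to return a string, so any per-block logic (counting, reformatting, lookups) can go inside it.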

Answer 2

Do you really need a regex? Here is an approach based on itertools.groupby:

from itertools import groupby
import re

# s is the sample string from the question

# dummy example function
f = lambda x: '>>'+x.upper()+'<<'

# the assignment expression (:=) requires Python 3.8+
out = '\n'.join(f(G) if (G:='\n'.join(g)) and k else G
                for k,g in groupby(s.split('\n'), lambda l: l.startswith('foo')))

print(out)

NB. You do not need a regex, but if you want you can also use one to define the matching lines in groupby:

# using a regex to match the blocks
out = '\n'.join(f(G) if (G:='\n'.join(g)) and k else G
                for k,g in groupby(s.split('\n'), lambda l: bool(re.match('foo', l))))

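Output:

bar6387
bar63287
>>FOO1234
FOOHELLOWORLD
FOOLOREMIPSUM<<
baz
bar
>>FOO236
FOO5382<<
barfoo
bar
>>FOO879<<

(The barfoo line is not part of the question's sample string; this output presumably comes from an input with that extra line, which shows that a line merely containing foo is left alone. With the question's s that line would be absent.)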
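For readers who find the one-liner dense, the same groupby idea can be sketched as an ordinary function without the assignment expression. The name replace_blocks and its parameters are hypothetical, and the sample string is a shortened stand-in for the one in the question.

```python
from itertools import groupby

def replace_blocks(text, is_block_line, func):
    """Apply func to each run of consecutive lines for which is_block_line is true."""
    parts = []
    # groupby yields (key, run) pairs for consecutive lines sharing the same key.
    for matched, group in groupby(text.split('\n'), key=is_block_line):
        chunk = '\n'.join(group)
        parts.append(func(chunk) if matched else chunk)
    return '\n'.join(parts)

# Short hypothetical sample.
s = "bar\nfoo1\nfoo2\nbaz\nfoo3"
out = replace_blocks(
    s,
    lambda line: line.startswith('foo'),    # which lines belong to a block
    lambda block: '>>' + block.upper() + '<<',  # what to do with each block
)
print(out)
# bar
# >>FOO1
# FOO2<<
# baz
# >>FOO3<<
```

Passing the line test and the block transformation as arguments keeps the grouping logic reusable for other prefixes or predicates.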

Answer 3

You can use either of

re.sub(r'(?m)^foo.*(?:\nfoo.*)*', f, s)
re.sub(r'^foo.*(?:\nfoo.*)*', f, s, flags=re.M)

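where

^ - matches the start of the string (here, the start of any line because of the (?m) or re.M option)
foo - matches foo
.* - any zero or more characters other than line break characters, as many as possible
(?:\nfoo.*)* - zero or more sequences of a newline, foo, and then the rest of that line.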

See the Python demo:

import re

s = "bar6387\nbar63287\nfoo1234\nfoohelloworld\nfooloremipsum\nbaz\nbar\nfoo236\nfoo5382\nbar\nfoo879"
def f(m):
    print(m.group().replace('\n', r'\n'))

re.sub(r'(?m)^foo.*(?:\nfoo.*)*', f, s)

Output:

foo1234\nfoohelloworld\nfooloremipsum
foo236\nfoo5382
foo879
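The demo's f only prints each block and returns nothing, so it is useful for inspecting matches rather than for producing new text. As a hedged sketch of an actual replacement, the callback below returns the string that should stand in for each block. FOO_BLOCK and shout are illustrative names, and the input deliberately ends with a newline to show that the final block is still found and the trailing newline is left in place.

```python
import re

# FOO_BLOCK is a hypothetical name for the pattern from this answer.
FOO_BLOCK = re.compile(r'^foo.*(?:\nfoo.*)*', flags=re.M)

def shout(m):
    # Return the replacement text for the matched block.
    return m.group().upper()

text = "foo1\nfoo2\nbar\nfoo3\n"   # note the trailing newline
result = FOO_BLOCK.sub(shout, text)
print(repr(result))
# 'FOO1\nFOO2\nbar\nFOO3\n'
```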
