Multi-line regex - issue capturing repeated groups

I am working with the following text:

This is some header 1

nonsense text 1


Repeated item 1
Repeated item 1 Data

nonsense text 1


Repeated item 2
Repeated item 2 Data

This is some header 2

nonsense text 1

Repeated item 1
Repeated item 1 Data

nonsense text 1

Repeated item 2
Repeated item 2 Data

I am trying to capture the repeated items together with the number from the header that precedes them, like this:

This is some header 1
Repeated item 1
Repeated item 1 Data
Repeated item 2
Repeated item 2 Data

This is some header 2
Repeated item 1
Repeated item 1 Data
Repeated item 2
Repeated item 2 Data

I have no problem capturing the repeated items with this:

Repeated Item ([0-9]+)\sSome item data: (.*)

However, for each repeated item I also want to capture the header that precedes it, like this (but this regex does not work):

This is some header ([0-9]+).*Repeated Item ([0-9]+)\sSome item data: (.*)

I also tried the following regex, a variation of the one above:

(?sm)This is some header ([0-9]+).*Repeated Item ([0-9]+)\sSome item data: (.*)

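However, the regex above only captures the first header and the last repeated item, because the greedy `.*` (which also matches newlines under `(?s)`) runs all the way to the final item. Is there a way to achieve what I want using only a regex? I can obviously parse the text manually line by line, but I was hoping to do this with a regex.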

Updated for your example:

/^(This is some header \d+)[\s\S]+?^(Repeated item \d+)\s*^(Repeated item.*)[\s\S]+?(Repeated item \d+)\s*^(Repeated item.*)/m

Perl example:

$ txt='This is some header 1
> 
> nonsense text 1
> 
> 
> Repeated item 1
> Repeated item 1 Data
> 
> nonsense text 1
> 
> 
> Repeated item 2
> Repeated item 2 Data
> 
> This is some header 2
> 
> nonsense text 1
> 
> Repeated item 1
> Repeated item 1 Data
> 
> nonsense text 1
> 
> Repeated item 2
> Repeated item 2 Data'

$ echo "$txt" | perl -0777 -lne 'while (/^(This is some header \d+)[\s\S]+?^(Repeated item \d+)\s*^(Repeated item.*)[\s\S]+?(Repeated item \d+)\s*^(Repeated item.*)/gm) {print "$1\n$2\n$3\n$4\n$5\n\n" }'
This is some header 1
Repeated item 1
Repeated item 1 Data
Repeated item 2
Repeated item 2 Data

This is some header 2
Repeated item 1
Repeated item 1 Data
Repeated item 2
Repeated item 2 Data

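Note that the regex above hard-codes exactly two repeated items per header. A more robust approach is to first break the text into blocks, one per header, and then pull the repeated items out of each block.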
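
Here is a minimal sketch of that two-step idea, written in Python rather than Perl so it is easy to run and test. The names `HEADER`, `ITEM`, and `parse` are my own for illustration, and I am assuming each item is a `Repeated item N` line followed immediately by its data line, as in your sample:

```python
import re

text = """\
This is some header 1

nonsense text 1


Repeated item 1
Repeated item 1 Data

nonsense text 1


Repeated item 2
Repeated item 2 Data

This is some header 2

nonsense text 1

Repeated item 1
Repeated item 1 Data

nonsense text 1

Repeated item 2
Repeated item 2 Data
"""

# Assumed patterns, based only on the sample text in the question.
HEADER = re.compile(r"^This is some header (\d+)[ \t]*$", re.MULTILINE)
# An item is its label line plus the line right after it (the data).
ITEM = re.compile(r"^Repeated item (\d+)[ \t]*\n(.*)$", re.MULTILINE)


def parse(text):
    """Return [(header_number, [(item_number, data_line), ...]), ...]."""
    results = []
    headers = list(HEADER.finditer(text))
    for i, header in enumerate(headers):
        # Step 1: a block runs from this header to the next one (or to the end).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        block = text[header.end():end]
        # Step 2: collect every repeated item inside the block.
        items = [(int(m.group(1)), m.group(2)) for m in ITEM.finditer(block)]
        results.append((int(header.group(1)), items))
    return results


for number, items in parse(text):
    print(f"This is some header {number}")
    for item_number, data in items:
        print(f"Repeated item {item_number}")
        print(data)
    print()
```

Because the items are found with `finditer` inside each block, this works for any number of repeated items per header, not just two. If your real data lines look different (for example `Some item data: ...`), only the `ITEM` pattern should need to change.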