Multi-line regex - issue capturing repeated groups
I'm working with the following text:
This is some header 1
nonsense text 1
Repeated item 1
Repeated item 1 Data
nonsense text 1
Repeated item 2
Repeated item 2 Data
This is some header 2
nonsense text 1
Repeated item 1
Repeated item 1 Data
nonsense text 1
Repeated item 2
Repeated item 2 Data
I'm trying to capture the repeated items along with the number in the header that precedes them, like this:
This is some header 1
Repeated item 1
Repeated item 1 Data
Repeated item 2
Repeated item 2 Data
This is some header 2
Repeated item 1
Repeated item 1 Data
Repeated item 2
Repeated item 2 Data
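I have no trouble capturing the repeated items with this:
Repeated Item ([0-9]+)\sSome item data: (.*)
However, for each repeated item I also want to capture the header that precedes it, like this (but this regex doesn't work):
This is some header ([0-9]+).*Repeated Item ([0-9]+)\sSome item data: (.*)
I also tried the following regex, which is a variation on the one above:
(?sm)This is some header ([0-9]+).*Repeated Item ([0-9]+)\sSome item data: (.*)
However, that regex only captures the first header and the last repeated item. Is there a way to do what I want with a regex alone? I could obviously parse the text manually, line by line, but I was hoping a regex could do it.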
Updated for your example:
/^(This is some header \d+)[\s\S]+?^(Repeated item \d+)\s*^(Repeated item.*)[\s\S]+?(Repeated item \d+)\s*^(Repeated item.*)/m
Perl example:
$ txt='This is some header 1
>
> nonsense text 1
>
>
> Repeated item 1
> Repeated item 1 Data
>
> nonsense text 1
>
>
> Repeated item 2
> Repeated item 2 Data
>
> This is some header 2
>
> nonsense text 1
>
> Repeated item 1
> Repeated item 1 Data
>
> nonsense text 1
>
> Repeated item 2
> Repeated item 2 Data'
$ echo "$txt" | perl -0777 -lne 'while (/^(This is some header \d+)[\s\S]+?^(Repeated item \d+)\s*^(Repeated item.*)[\s\S]+?(Repeated item \d+)\s*^(Repeated item.*)/gm) {print "$1\n$2\n$3\n$4\n$5\n" }'
This is some header 1
Repeated item 1
Repeated item 1 Data
Repeated item 2
Repeated item 2 Data
This is some header 2
Repeated item 1
Repeated item 1 Data
Repeated item 2
Repeated item 2 Data
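A more robust approach is to first split the text into blocks, one per header, and then pick out the repeated items within each block.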
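As a rough sketch of that two-step idea, here it is in Python rather than Perl (the question doesn't name a language, so that choice is an assumption). The function name `parse_blocks` and both patterns are illustrative only, and they assume every header line reads exactly `This is some header N` and every item line is followed by its `Repeated item N Data` line, as in the sample text.

```python
import re

text = """This is some header 1
nonsense text 1
Repeated item 1
Repeated item 1 Data
nonsense text 1
Repeated item 2
Repeated item 2 Data
This is some header 2
nonsense text 1
Repeated item 1
Repeated item 1 Data
nonsense text 1
Repeated item 2
Repeated item 2 Data"""

# Step 1: one match per block. A block starts at a header line and runs
# lazily up to the next header line or the end of the text.
BLOCK_RE = re.compile(
    r"^This is some header (\d+)$(.*?)(?=^This is some header \d+$|\Z)",
    re.MULTILINE | re.DOTALL,
)

# Step 2: inside a block, an item line followed by its data line.
# \s+ tolerates blank lines between the two.
ITEM_RE = re.compile(
    r"^Repeated item (\d+)\s+(Repeated item \d+ Data)$",
    re.MULTILINE,
)

def parse_blocks(text):
    """Return (header_number, item_number, data_line) for every item."""
    results = []
    for block in BLOCK_RE.finditer(text):
        header, body = block.group(1), block.group(2)
        for item in ITEM_RE.finditer(body):
            results.append((header, item.group(1), item.group(2)))
    return results

for header, item, data in parse_blocks(text):
    print(f"header {header}, item {item}: {data}")
```

Unlike the single regex above, which spells out exactly two items per header, the inner loop picks up however many items a block contains.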