Java/Groovy regex: parse key-value pairs without delimiters

Question

I am having trouble extracting key-value pairs with a regular expression.

My code so far:

String raw = '''
MA1
 D. Mueller Gießer 
MA2 Peter 
Mustermann 2. Mann  
MA3 Ulrike Mastorius Schmelzer 
MA4 Heiner Becker s 3.Mann
 MA5 Rudolf Peters 
Gießer 
'''

Map map = [:] 
ArrayList<String> split = raw.findAll("(MA\\d)+(.*)") { full, name, value -> map[name] = value }
println map

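The output is: [MA1:, MA2: Peter, MA3: Ulrike Mastorius Schmelzer, MA4: Heiner Becker, MA5: Rudolf Peters]

In my example the keys are MA1, MA2, MA3, and so on, i.e. MA\d (MA followed by a single digit).

The value is everything up to the next key (including line breaks, tabs, spaces, etc.).

Does anyone have an idea how to do this?

Thanks in advance, Sebastian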

Answer 1

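You can capture the key in group 1 and, in group 2, everything after the key together with all following lines that do not start with a key: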

^(MA\d+)(.*(?:\R(?!MA\d).*)*)

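The pattern matches:

^ Start of the string (or of each line when multiline mode is enabled, which is needed here to find every key)
(MA\d+) Capture group 1, match MA and 1+ digits
( Capture group 2
- .* Match the rest of the line
- (?:\R(?!MA\d).*)* Match all following lines that do not start with MA followed by a digit, where \R matches any Unicode newline sequence
) Close group 2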

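In Java, with double-escaped backslashes (enable multiline mode, for example with Pattern.MULTILINE, so that ^ matches at the start of every line):

final String regex = "^(MA\\d+)(.*(?:\\R(?!MA\\d).*)*)";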
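As a rough illustration of how this pattern could be used from plain Java, here is a minimal sketch. The class name ContinuationLineParser, the parse method, and the shortened sample input are assumptions made for this example and are not part of the original answer. Values are stored exactly as captured, including leading spaces and line breaks; call trim() on them if that is not wanted.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContinuationLineParser {
    // Pattern from Answer 1. MULTILINE lets ^ match at the start of every line,
    // not only at the start of the whole input.
    private static final Pattern PAIR =
            Pattern.compile("^(MA\\d+)(.*(?:\\R(?!MA\\d).*)*)", Pattern.MULTILINE);

    // Hypothetical helper: collects key -> raw value in order of appearance.
    static Map<String, String> parse(String raw) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(raw);
        while (m.find()) {
            // group(1) is the key, group(2) is everything up to the next key line
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened, made-up sample in the spirit of the question's data
        String raw = "MA1\n D. Mueller\nMA2 Peter\nMustermann\nMA3 Ulrike\n";
        parse(raw).forEach((key, value) ->
                System.out.println(key + " -> [" + value.trim() + "]"));
    }
}
```

Note that a key is only recognised when it sits at the very start of a line; a key preceded by whitespace would be treated as part of the previous value.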

Answer 2

Use

(?ms)^(MA\d+)(.*?)(?=\nMA\d|\z)

Explanation

                         EXPLANATION
--------------------------------------------------------------------------------
  (?ms)                    set flags for this block (with ^ and $
                           matching start and end of line) (with .
                           matching \n) (case-sensitive) (matching
                           whitespace and # normally)
--------------------------------------------------------------------------------
  ^                        the beginning of a "line"
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    MA                       'MA'
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    .*?                      any character (0 or more times (matching
                             the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    \n                       '\n' (newline)
--------------------------------------------------------------------------------
    MA                       'MA'
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    \z                       the end of the string
--------------------------------------------------------------------------------
  )                        end of look-ahead
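
The lazy-match-plus-lookahead approach explained above might be applied in Java roughly as follows. This is only a sketch: the class name LazyLookaheadParser, the parse method, and the shortened sample input are invented for illustration and do not come from the answer. The inline (?ms) flags take the place of Pattern.MULTILINE and Pattern.DOTALL.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyLookaheadParser {
    // Pattern from Answer 2: (?m) makes ^ match at each line start,
    // (?s) lets . match line breaks, and the lookahead stops the lazy
    // group right before "\nMA<digit>" or at the very end of the input.
    private static final Pattern PAIR =
            Pattern.compile("(?ms)^(MA\\d+)(.*?)(?=\\nMA\\d|\\z)");

    // Hypothetical helper: collects key -> raw value in order of appearance.
    static Map<String, String> parse(String raw) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(raw);
        while (m.find()) {
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened, made-up sample in the spirit of the question's data
        String raw = "MA1\n D. Mueller\nMA2 Peter\nMustermann\nMA3 Ulrike\n";
        parse(raw).forEach((key, value) ->
                System.out.println(key + " -> [" + value.trim() + "]"));
    }
}
```

Because the lookahead names \n explicitly, input with Windows line endings (\r\n) would leave a trailing \r at the end of each value; trimming the values, or using \R in the lookahead, is one way to handle that.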

java

regex

groovy

regex-greedy