Java/Groovy regex: parse key-value pairs without delimiters
I'm having trouble extracting key-value pairs with a regular expression.
My code so far:
String raw = '''
MA1
D. Mueller Gießer
MA2 Peter
Mustermann 2. Mann
MA3 Ulrike Mastorius Schmelzer
MA4 Heiner Becker
s 3.Mann
MA5 Rudolf Peters
Gießer
'''
Map map = [:]
ArrayList<String> split = raw.findAll("(MA\\d)+(.*)"){ full, name, value -> map[name] = value }
println map
The output is:
[MA1:, MA2: Peter, MA3: Ulrike Mastorius Schmelzer, MA4: Heiner Becker, MA5: Rudolf Peters]
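In my case the keys are:
MA1, MA2, MA3, i.e. MA\d (MA followed by any single digit)
The value is everything up to the next key (including line breaks, tabs, spaces, etc.)
Does anyone have an idea how to do this?
Thanks in advance,
Sebastian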
You can capture in the second group everything after the key, plus all following lines that do not start with a key:
^(MA\d+)(.*(?:\R(?!MA\d).*)*)
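The pattern matches:
  ^                    start of a line (needs multiline mode, e.g. Pattern.MULTILINE or an inline (?m); without it ^ only matches at the start of the input)
  (MA\d+)              capture group 1: MA followed by 1+ digits
  (                    start of capture group 2
  .*                   match the rest of the line
  (?:\R(?!MA\d).*)*    match every following line that does not start with MA and a digit; \R matches any Unicode newline sequence
  )                    close group 2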
In Java, double the backslashes:
final String regex = "^(MA\\d+)(.*(?:\\R(?!MA\\d).*)*)";
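To see that pattern at work, here is a minimal Java sketch, assuming input laid out like the sample in the question. The class name KeyValueParser and the method parse are made up for illustration. Pattern.MULTILINE is added so that ^ matches at every line start, and the values are trimmed, which is an assumption about what the asker wants rather than something the question states.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
final class KeyValueParser {
    // MULTILINE lets ^ match at the start of each line, not only at the start of the input.
    private static final Pattern ENTRY =
            Pattern.compile("^(MA\\d+)(.*(?:\\R(?!MA\\d).*)*)", Pattern.MULTILINE);

    static Map<String, String> parse(String raw) {
        // LinkedHashMap keeps the keys in the order they appear in the input.
        Map<String, String> map = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(raw);
        while (m.find()) {
            // Group 1 is the key, group 2 the rest of the key line plus all continuation lines.
            // trim() drops the leading space or newline and any trailing newline (an assumption).
            map.put(m.group(1), m.group(2).trim());
        }
        return map;
    }

    public static void main(String[] args) {
        // The input from the question; \u00df is the letter sharp s.
        String raw = "\nMA1\nD. Mueller Gie\u00dfer\nMA2 Peter\nMustermann 2. Mann\n"
                + "MA3 Ulrike Mastorius Schmelzer\nMA4 Heiner Becker\ns 3.Mann\n"
                + "MA5 Rudolf Peters\nGie\u00dfer\n";
        parse(raw).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Each value keeps its inner line breaks, so MA2 maps to two lines of text. Drop the trim() call if the surrounding whitespace matters.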
Use
(?ms)^(MA\d+)(.*?)(?=\nMA\d|\z)
See proof.
EXPLANATION
--------------------------------------------------------------------------------
  (?ms)                    set flags for this block (with ^ and $
                           matching start and end of line) (with .
                           matching \n) (case-sensitive) (matching
                           whitespace and # normally)
--------------------------------------------------------------------------------
  ^                        the beginning of a "line"
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    MA                       'MA'
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    .*?                      any character (0 or more times (matching
                             the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    \n                       '\n' (newline)
--------------------------------------------------------------------------------
    MA                       'MA'
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    \z                       the end of the string
--------------------------------------------------------------------------------
  )                        end of look-ahead
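The lazy pattern explained above can be tried out with a short Java sketch. The class name LazyKeyValueParser and the method parse are hypothetical, and trimming the captured value is an assumption, not part of the answer. Because the lookahead is written with \n, the sketch also assumes \n or \r\n line endings.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class illustrating the (?ms) pattern.
final class LazyKeyValueParser {
    // (?m): ^ matches at each line start; (?s): . also matches line terminators.
    // The lazy (.*?) stops right before a newline followed by the next key, or at the end of input.
    private static final Pattern ENTRY =
            Pattern.compile("(?ms)^(MA\\d+)(.*?)(?=\\nMA\\d|\\z)");

    static Map<String, String> parse(String raw) {
        Map<String, String> map = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(raw);
        while (m.find()) {
            // trim() removes the whitespace around the value (an assumption about the desired result).
            map.put(m.group(1), m.group(2).trim());
        }
        return map;
    }

    public static void main(String[] args) {
        // A shortened version of the input from the question.
        String raw = "\nMA2 Peter\nMustermann 2. Mann\nMA3 Ulrike Mastorius Schmelzer\n"
                + "MA4 Heiner Becker\ns 3.Mann\n";
        parse(raw).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Here the end of a value is found by a lookahead for the next key instead of matching continuation lines one by one. For input shaped like the question's sample, both approaches should yield the same map.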