Capturing string parts in RegEx

Question

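I want to map different parts of a string, some of which are optional and some of which are always present. I'm using Calibre's built-in function (based on Python regex), but this is a general question: how can I do this in regex?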

Example strings:

!!Mixed Fortunes - An Economic History of China Russia and the West 0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf
!Mixed Fortunes - An Economic History of China Russia and the West 0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf
Mixed Fortunes - An Economic History of China Russia and the West 0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf
!!Mixed Fortunes - An Economic History of China Russia and the West by Vladimir Popov (Jun 17, 2014 4_1).pdf
!!Mixed Fortunes - An Economic History of China Russia and the West by 1 Vladimir Popov (Jun 17, 2014 4_1).pdf

The strings have the following structure:

[importance markings if any, it can be '!' or '!!'][title][ISBN-10 if available]by[author]([publication date and other metadata]).[file type]

In the end I came up with this regex, but it isn't perfect: when an ISBN is present, the title group also includes the ISBN...

(?P<title>[A-Za-z0-9].+(?P<isbn>[0-9]{10})|([A-Za-z0-9].*))\sby\s.*?(?P<author>[A-Z0-9].*)(?=\s\()

Here is my sandbox: https://regex101.com/r/K2FzpH/1

Any help is much appreciated!

Answer 1

Instead of using an alternation, you can use:

^!*(?P<title>[A-Za-z0-9].+?)(?:\s+(?P<isbn>[0-9]{10}))?\s+by\s+(?P<author>[A-Z0-9][^(]+)(?=\s\()

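^ Start of string
!* Match optional exclamation marks
(?P<title>[A-Za-z0-9].+?) Named group title: match one character from the character class, then 1 or more characters, as few as possible
(?:\s+(?P<isbn>[0-9]{10}))? Optionally match 1+ whitespace characters followed by the named group isbn, which matches 10 digits
\s+by\s+ Match by between 1 or more whitespace characters
(?P<author>[A-Z0-9][^(]+) Named group author: match A-Z or 0-9, followed by 1+ characters other than (
(?=\s\() Positive lookahead: assert a whitespace character followed by ( directly to the right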
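
To check the pattern outside of Calibre, here is a minimal sketch in plain Python using the standard re module. The helper name parse_filename is made up for this illustration and is not part of Calibre; the sample strings are taken from the question.

```python
import re

# Pattern from the answer, split over two raw strings for readability.
PATTERN = re.compile(
    r"^!*(?P<title>[A-Za-z0-9].+?)(?:\s+(?P<isbn>[0-9]{10}))?"
    r"\s+by\s+(?P<author>[A-Z0-9][^(]+)(?=\s\()"
)


def parse_filename(name):
    """Hypothetical helper: return the named groups as a dict, or None."""
    match = PATTERN.search(name)
    return match.groupdict() if match else None


samples = [
    "!!Mixed Fortunes - An Economic History of China Russia and the West "
    "0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf",
    "Mixed Fortunes - An Economic History of China Russia and the West "
    "0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf",
    "!!Mixed Fortunes - An Economic History of China Russia and the West "
    "by Vladimir Popov (Jun 17, 2014 4_1).pdf",
    "!!Mixed Fortunes - An Economic History of China Russia and the West "
    "by 1 Vladimir Popov (Jun 17, 2014 4_1).pdf",
]

for sample in samples:
    # isbn is None when the optional group did not take part in the match.
    print(parse_filename(sample))
```

Because the title group is lazy and the ISBN group is optional but tried first, the ISBN ends up in its own group when it is present, and isbn is None when it is absent.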


regex

calibre