Regex to match only book ('knjiga') with specific name ('naslov')

Question

I have a simple XML file:

<?xml version="1.0" encoding="utf-8" ?>
<book_list>
    <book rbr="1" >
        <title> Yacc </title>
        <author> Filip Maric </author>
        <year> 2004 </year>
        <publisher> Matematicki fakultet </publisher>
        <price currency="din"> 100 </price>
    </book>
    <book rbr="2" >
        <author> Fredrik Lundh </author>
        <price currency="eur"> 50 </price>
        <publisher> O’Reilly & Associates </publisher>
        <year> 2001 </year>
        <title> Python Standard Library </title>
    </book>
</book_list>

I need to match a book with a specific title using regular expressions in Python. I can easily match any book with:

r'<book\s*rbr="\d+"\s*>.*?</book>'

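(with single-line mode turned on) and then check whether it is the right one. But if I want to match a specific book directly with a regex, for example Python Standard Library, I can't get it right. If I try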

r'<book\s*rbr="\d+"\s*>(?P<book>.*?<title> Python Standard Library </title>.*?)</book>'

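with single-line mode turned on, it matches everything from the first <book> onward. I understand why, but I can't find a way to match only the one book. I tried all the lookarounds and all the different modes, without success.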

What is the correct approach that works for any number of books in book_list?

Answer 1

The problem is considerably complicated by the fact that the <title> tag is not always the first child tag under <book>. If it were, you could use:

m = re.search(r'<book\s*rbr="\d+"\s*>\s*(?P<book><title> Python Standard Library </title>).*?</book>', xml, flags=re.DOTALL)

That is, replace .*? with \s*.

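The trick is to make sure that, after matching a <book> tag, the <title> tag you are looking for does not appear after a later </book> tag. This can be done with a negative lookahead (it isn't pretty):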

import re

xml = """<?xml version="1.0" encoding="utf-8" ?>
<book_list>
    <book rbr="1" >
        <title> Yacc </title>
        <author> Filip Maric </author>
        <year> 2004 </year>
        <publisher> Matematicki fakultet </publisher>
        <price currency="din"> 100 </price>
    </book>
    <book rbr="2" >
        <author> Fredrik Lundh </author>
        <price currency="eur"> 50 </price>
        <publisher> O’Reilly & Associates </publisher>
        <year> 2001 </year>
        <title> Python Standard Library </title>
    </book>
</book_list>"""

m = re.search(r'<book\s*rbr="\d+"\s*>(?!.*</book>.*<title> Python Standard Library </title>).*(?P<book><title> Python Standard Library </title>).*?</book>', xml, flags=re.DOTALL)
print(m.group('book'))
m = re.search(r'<book\s*rbr="\d+"\s*>(?!.*</book>.*<title> Yacc </title>).*(?P<book><title> Yacc </title>).*?</book>', xml, flags=re.DOTALL)
print(m.group('book'))

Prints:

<title> Python Standard Library </title>
<title> Yacc </title>
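To make the negative lookahead easier to follow, here is the same pattern written with re.VERBOSE so each piece can carry a comment. This is only an annotated sketch: the XML is a trimmed copy of the sample, and wrapping the title in re.escape is an addition (verbose mode ignores unescaped spaces, so the literal title has to be escaped).

```python
import re

# Trimmed copy of the sample document (assumption: same structure as above).
xml = """<book_list>
    <book rbr="1" >
        <title> Yacc </title>
        <year> 2004 </year>
    </book>
    <book rbr="2" >
        <year> 2001 </year>
        <title> Python Standard Library </title>
    </book>
</book_list>"""

# re.escape keeps the spaces in the title significant under re.VERBOSE.
title = re.escape('<title> Python Standard Library </title>')

pattern = re.compile(rf'''
    <book\s*rbr="\d+"\s*>     # opening tag of a candidate book
    (?!.*</book>.*{title})    # reject it if the title occurs after a later closing tag
    .*                        # greedy: run to the end, then backtrack to the title
    (?P<book>{title})         # the title itself, captured as group "book"
    .*?</book>                # lazily continue to the closing tag of that book
''', flags=re.DOTALL | re.VERBOSE)

m = pattern.search(xml)
print(m.group('book'))   # the matched title
print(m.group(0))        # the whole <book> ... </book> element
```

Note that m.group(0) holds the entire book element, while the named group holds only the title.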


If your Python supports them, you can use formatted string literals (f-strings, Python 3.6+) to reduce the redundancy (if not, use the str.format method):

title = '<title> Python Standard Library </title>'
m = re.search(rf'<book\s*rbr="\d+"\s*>(?!.*</book>.*{title}).*(?P<book>{title}).*?</book>', xml, flags=re.DOTALL)
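For the str.format route, a minimal sketch might look like the following. The helper name find_book and the trimmed XML are hypothetical, and the re.escape call is an extra precaution (not part of the pattern above) so that titles containing regex metacharacters are matched literally.

```python
import re

# Trimmed copy of the sample document (assumption: same structure as above).
xml = """<book_list>
    <book rbr="1" >
        <title> Yacc </title>
        <year> 2004 </year>
    </book>
    <book rbr="2" >
        <year> 2001 </year>
        <title> Python Standard Library </title>
    </book>
</book_list>"""


def find_book(xml_text, title_text):
    """Return the match for the book with the given title, or None (hypothetical helper)."""
    # Escape the title so characters such as '+' or '(' are taken literally.
    title = re.escape('<title> {} </title>'.format(title_text))
    # Same lookahead pattern as above, with the title filled in by str.format.
    pattern = (r'<book\s*rbr="\d+"\s*>(?!.*</book>.*{t})'
               r'.*(?P<book>{t}).*?</book>').format(t=title)
    return re.search(pattern, xml_text, flags=re.DOTALL)


m = find_book(xml, 'Yacc')
print(m.group(0))                          # the whole first <book> element
print(find_book(xml, 'No Such Book'))      # None when no book has that title
```

Because re.search returns None when nothing matches, callers should check the result before calling group on it.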

Another approach

This approach builds a list of all the individual <book> elements and then searches each one in turn for the title of interest:

# create list of <book> ... </book> strings:
books = re.findall(r'<book\s*rbr="\d+"\s*>.*?</book>', xml, flags=re.DOTALL)
title = '<title> Python Standard Library </title>'
# now search each <book>...</book> string looking for the title string:
for book in books:
    if title in book:  # plain substring test; the title is literal text, not a pattern
        print(title)
        print(book)

Prints:

<title> Python Standard Library </title>
<book rbr="2" >
        <author> Fredrik Lundh </author>
        <price currency="eur"> 50 </price>
        <publisher> O’Reilly & Associates </publisher>
        <year> 2001 </year>
        <title> Python Standard Library </title>
    </book>
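A third option, not used in the answer above, is a pattern often called a "tempered dot": (?:(?!</book>).)* consumes one character at a time but refuses to step onto a closing </book> tag, so a match can never spill from one book into the next. The sketch below assumes the same document structure; the variable names are illustrative only.

```python
import re

# Trimmed copy of the sample document (assumption: same structure as above).
xml = """<book_list>
    <book rbr="1" >
        <title> Yacc </title>
        <year> 2004 </year>
    </book>
    <book rbr="2" >
        <year> 2001 </year>
        <title> Python Standard Library </title>
    </book>
</book_list>"""

title = re.escape('<title> Python Standard Library </title>')

# Any run of characters that does not cross a closing </book> tag.
inside_book = r'(?:(?!</book>).)*'

pattern = rf'<book\s*rbr="\d+"\s*>{inside_book}{title}{inside_book}</book>'

# finditer returns every book carrying that title, each as its own match.
matches = [m.group(0) for m in re.finditer(pattern, xml, flags=re.DOTALL)]
for book in matches:
    print(book)
```

Since each match is confined to a single <book> element, this variant should also cope with several books sharing the same title, at the cost of a lookahead check on every character.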
