Is there a way that I can just get specified data that I want from similar tags inside an XML file?

Question

I have this XML file that contains a lot of data. It is in a very bad format, with multiple values packed into a single attribute.

<Person> 
    <GenericItem html="Name:John&lt;br/&gt;ID: ID-001&lt;br/&gt;Position: Manager&lt;a href=&quot;mailto: john@person.com&quot;&gt;john@person.com&lt;/a&gt;&lt;br/&gt;Division: chicken-01">
Employee:
   </GenericItem>
    <GenericItem string="Hardworking and leader of the chicken division">
Summary
    </GenericItem>
    <GenericItem link ="person.com/john01">
Profile
    </GenericItem>
 </Person>
<Person> 
    <GenericItem html="Name:Anna&lt;br/&gt;ID: ID-002&lt;br/&gt;Position: Fryer&lt;a href=&quot;mailto: anna@person.com&quot;&gt;anna@person.com&lt;/a&gt;&lt;br/&gt;Division: chicken-01">
Employee:
   </GenericItem>
    <GenericItem string="Chicken fryer of the month">
Summary
    </GenericItem>
    <GenericItem link ="person.com/anna02">
Profile
    </GenericItem>
 </Person>
<Person> 
    <GenericItem html="Name:Kent&lt;br/&gt;ID: ID-003&lt;br/&gt;Position: Cleaner&lt;a href=&quot;mailto: kent@person.com&quot;&gt;kent@person.com&lt;/a&gt;&lt;br/&gt;Division: chicken-02">
Employee:
   </GenericItem>
    <GenericItem string="chicken and office cleaner">
Summary
    </GenericItem>
    <GenericItem link ="person.com/kent03">
Profile
    </GenericItem>
 </Person>

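This is not all of the data, because the full file would be too much to post. All I want to get is "Name", "ID" and "Position". That means everything else inside that GenericItem is unneeded and should be removed, and the GenericItem elements that carry the attributes "string" and "link" are useless, so I want to delete them as well. I tried using the ElementTree del method, but it did not delete them.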

import xml.etree.ElementTree as ET
tree = ET.parse('NewestReport.xml')
root = tree.getroot()
for GenericItem in tree.findall('GenericItem'):
    del(GenericItem.attrib['string'])
for neighbor in root.iter('GenericItem'):
    print(neighbor.attrib)

Is there any other approach I can try?

Answer 1

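You need to HTML-parse the attribute value.

Your best bet is to switch from the built-in ElementTree to lxml, because it includes both XML and HTML parsers, as well as proper XPath support.

Here I parse your test input as XML and then parse each @html attribute separately as HTML. After that, selecting the text nodes that contain ':' seems to be a good first approximation. Of course, you can dissect the HTML tree in a different way.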

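from lxml import etree as ET

# Separate parser for the HTML stored inside the @html attributes
html_parser = ET.HTMLParser()

tree = ET.parse('test.xml')

# Loop over the <Person> elements
for person in tree.xpath('./Person'):
    print('-' * 40)
    for html in person.xpath('./GenericItem/@html'):
        # The attribute value is already XML-unescaped; parse it as HTML
        data = ET.fromstring(html, html_parser)
        # Keep only the text nodes that look like "Key: value"
        for text in data.xpath('.//text()[contains(., ":")]'):
            print(text.strip())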

This prints:

----------------------------------------
Name:John
ID: ID-001
Position: Manager
Division: chicken-01
----------------------------------------
Name:Anna
ID: ID-002
Position: Fryer
Division: chicken-01
----------------------------------------
Name:Kent
ID: ID-003
Position: Cleaner
Division: chicken-02
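
If installing lxml is not an option, something similar can probably be done with the standard library alone. The sketch below is not the lxml approach shown above: it swaps in xml.etree.ElementTree for the XML and html.parser.HTMLParser for the attribute value. It also touches on the deletion attempt in the question: findall('GenericItem') called on the tree only matches direct children of the root element, and del item.attrib['string'] removes an attribute rather than the element, so elements have to be removed through their parent. The names WANTED, TextCollector, extract_fields and prune_and_extract are made up for illustration, and the <Report> wrapper is an assumption, since the snippet in the question shows no single root element.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Fields the question asks for (hypothetical constant name)
WANTED = ("Name", "ID", "Position")

# One <Person> from the question, wrapped in an assumed <Report> root
SAMPLE = """<Report>
<Person>
    <GenericItem html="Name:John&lt;br/&gt;ID: ID-001&lt;br/&gt;Position: Manager&lt;a href=&quot;mailto: john@person.com&quot;&gt;john@person.com&lt;/a&gt;&lt;br/&gt;Division: chicken-01">
Employee:
    </GenericItem>
    <GenericItem string="Hardworking and leader of the chicken division">
Summary
    </GenericItem>
    <GenericItem link ="person.com/john01">
Profile
    </GenericItem>
</Person>
</Report>"""


class TextCollector(HTMLParser):
    """Collects the text chunks found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_fields(html_value):
    """Return the WANTED 'Key: value' pairs from one html attribute value."""
    collector = TextCollector()
    collector.feed(html_value)
    collector.close()  # flush text that follows the last tag
    fields = {}
    for chunk in collector.chunks:
        key, sep, value = chunk.partition(":")
        if sep and key.strip() in WANTED:
            fields[key.strip()] = value.strip()
    return fields


def prune_and_extract(root):
    """Drop GenericItem elements without @html and collect the wanted fields."""
    people = []
    for person in root.iter("Person"):
        # Iterate over a copy: removing children while iterating skips items
        for item in list(person):
            if item.tag == "GenericItem" and "html" not in item.attrib:
                person.remove(item)
        for item in person.findall("GenericItem"):
            people.append(extract_fields(item.get("html")))
    return people


root = ET.fromstring(SAMPLE)
people = prune_and_extract(root)
print(people)
```

After this runs, each Person in root keeps only the GenericItem that has an html attribute, and people holds one dict per remaining item. If the pruned document is needed on disk, ET.ElementTree(root).write(...) can save it; note that the html attribute itself is left unchanged here, only the extracted values are filtered.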


xml

elementtree

xml-parsing

xml.etree