How to find which XML file contains the missing element?

Question

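I'm new to Python and to deep learning!

I have 10,000 XML files containing information about patent documents (obtained from WIPO). I want to extract the title, abstract, and classification of each document. I have managed to do that with ElementTree and saved the results in 3 lists, but I realized that one document is missing the classification element. How can I find out which one it is?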

Here is my code so far:

import os
import xml.etree.ElementTree as ET

abstracts=[]
titles=[]
tags=[]

for filename in os.listdir(path):
    if not filename.endswith('.xml'): continue
    file = os.path.join(path, filename)
    tree = ET.parse(file)
    root = tree.getroot()

    for title in root.iter('invention-title'):
        titles.append(title.text)

    for abs in root.iter('abstract'):
        abstracts.append(abs.text)

    for tag in root.findall('ipc-postreform'):
        tags.append(tag.find('classification-ipc').text)

len(abstracts)
10000

len(titles)
10000

len(tags)
9999

Thanks!!

Answer 1

If I understand your code and use case correctly, you can check the title, abstract, and classification of each file as you go.

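import os
import xml.etree.ElementTree as ET

abstracts=[]
titles=[]
tags=[]

for filename in os.listdir(path):
    if not filename.endswith('.xml'): continue
    file = os.path.join(path, filename)
    tree = ET.parse(file)
    root = tree.getroot()

    # iter() returns an iterator, so next() gives the first match
    title = next(root.iter('invention-title'))
    # title = root.find('invention-title')

    abstract = next(root.iter('abstract'))
    # abstract = root.find('abstract')

    # findall() returns a list, which next() cannot consume;
    # find() returns the first match, or None if there is none
    tag = root.find('ipc-postreform')

    # Compare with None: an Element that has no children is falsy
    if tag is None:
        raise Exception('Tag not found for {}'.format(filename))

    # findtext() returns None if the child element is missing
    classification = tag.findtext('classification-ipc')

    if not classification:
        raise Exception('Classification not found for {}'.format(filename))

    # Store the text, as in the question, rather than the Element objects
    titles.append(title.text)
    tags.append(classification)
    abstracts.append(abstract.text)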

As @Mihail Burduja mentioned, the for loops are not needed, so I replaced each of them with a single next() call. You can use find() instead.
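
If you would rather get a list of every incomplete file than stop at the first one, something along these lines might work. This is only a sketch: the tag names are taken from the question, while the function name find_missing and the REQUIRED tuple are made up for illustration. It also assumes that an element with no text anywhere inside it should count as missing, which may or may not be what you want.

```python
import os
import xml.etree.ElementTree as ET

# Tag names from the question; adjust them to your actual schema.
REQUIRED = ('invention-title', 'abstract', 'classification-ipc')


def find_missing(path, required=REQUIRED):
    """Return {filename: [names of missing elements]} for the XML files in path."""
    missing = {}
    for filename in sorted(os.listdir(path)):
        if not filename.endswith('.xml'):
            continue
        root = ET.parse(os.path.join(path, filename)).getroot()

        absent = []
        for name in required:
            # First matching element anywhere in the tree, or None.
            element = next(root.iter(name), None)
            # Assumption: an element without any text (including text in
            # nested children such as <p>) is treated as missing too.
            if element is None or not ''.join(element.itertext()).strip():
                absent.append(name)

        if absent:
            missing[filename] = absent
    return missing
```

Calling find_missing(path) with the same path as above returns a dictionary keyed by filename, so the one document without a classification should show up as a single entry listing 'classification-ipc'.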

python

xml

nlp

elementtree

deep-learning