Element Tree's iter() is skipping over random elements

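I am trying to parse an XML file using the iterparse() and iter() functions of Python's ElementTree. Here is a link to the file on Google Drive: https://drive.google.com/file/d/0B_S2Z7quow3TMl9yUk51ZzZ5UW8/view?usp=sharing

The XML file is a compilation of court case data. It is broken up into a series of elements with the tag "n-document", each of which contains child elements holding the data for one particular court case. I am trying to extract all the docket descriptions. A simplified version of the code follows: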

import numpy as np
import pandas as pd
import xml.etree.ElementTree as etree
import re
import csv

for event, elem in etree.iterparse("***fileName***", events=("start", "end")):
    if event == "start":
        if elem.tag == "docket.entry":
            for element in elem.iter():
                print element.tag
                if element.text != None:
                    print element.text
                if element.tail != None:
                    print element.tail
                    print "from tail"
    elem.clear()

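The problem is that in the first case (1613 HARVARD LIMITED PARTNERSHIP V. DISTRICT OF COLUMBIA ET AL), the docket description numbered 25 (they are numbered in descending order) is missing the text and tail of the element with the tag "gateway.image.link". Specifically, this is the output I get. I cancelled the build after about a second and scrolled up to the very top of the console: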

docket.entry
number.block
number
28
image.block
image.gateway.link
gateway.image.link
date
07/19/2007
docket.description
ORDER GRANTING DEFENDANTS' MOTION TO DISMISS AND DENYING PLAINTIFF'S MOTION FOR LEAVE TO FILE A SECOND AMENDED COMPLAINT. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1, ) (ENTERED: 07/19/2007)
docket.entry
number.block
number
27
image.block
image.gateway.link
gateway.image.link
date
07/19/2007
docket.description
MEMORANDUM OPINION. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1) MODIFIED ON 7/19/2007 (LCRWR1, ). (ENTERED: 07/19/2007)
docket.entry
number.block
number
26
image.block
image.gateway.link
gateway.image.link
date
03/31/2007
docket.description
MEMORANDUM ORDER GRANTING DEFENDANTS' MOTION
image.gateway.link
21
gateway.image.link
21
 TO STAY DISCOVERY PENDING RESOLUTION OF DEFENDANTS' DISPOSITIVE MOTION FILED BY PATRICK J. CANAVAN, PAUL E. WATERS. SIGNED BY JUDGE RICHARD W. ROBERTS ON 3/31/07. (LCRWR1) ADDITIONAL ATTACHMENT(S) ADDED ON 4/3/2007 (LCRWR1, ). (ENTERED: 04/02/2007)
from tail
docket.entry
number.block
number
25
image.block
image.gateway.link
gateway.image.link
date
11/15/2005
docket.description
RESPONSE TO DEFENDANTS' NOTICE OF COURT RULING IN RELATED CASE FILED BY 1613 HARVARD LIMITED PARTNERSHIP. (ATTACHMENTS: #
image.gateway.link
docket.entry
number.block
number
24
image.block
image.gateway.link
gateway.image.link
date
11/14/2005
docket.description
NOTIFICATION OF SUPPLEMENTAL AUTHORITY BY DISTRICT OF COLUMBIA, PATRICK J. CANAVAN, PAUL E. WATERS (ATTACHMENTS: #
image.gateway.link
1
gateway.image.link
1
)(MULLEN, MARTHA) (ENTERED: 11/14/2005)
from tail

Under entry number 25 (second from the bottom of the output shown above), it says:

25
image.block
image.gateway.link
gateway.image.link
date
11/15/2005
docket.description
RESPONSE TO DEFENDANTS' NOTICE OF COURT RULING IN RELATED CASE FILED BY 1613 HARVARD LIMITED PARTNERSHIP. (ATTACHMENTS: #
image.gateway.link

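If you look at the XML file itself, there is another element with the tag "gateway.image.link" immediately after "image.gateway.link", with both text and tail content, but for some reason iter() does not pick it up. Oddly, most of the other docket descriptions also contain an element tagged "image.gateway.link" immediately followed by one tagged "gateway.image.link", as you can see in entry number 24 (and the remaining entries), and iter() recognizes those but not this one. Here is the XML excerpted from the Google Drive document linked above: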

<?xml version="1.0" encoding="UTF-8" ?><n-extract-response>
<docket.entries.block><label>Entry #:</label><label>Date:</label><label>Description:</label><docket.entry><number.block><number>28</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|0450912204;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A1-280450912204" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|0450912204;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>07/19/2007</date><docket.description>ORDER GRANTING DEFENDANTS&apos; MOTION TO DISMISS AND DENYING PLAINTIFF&apos;S MOTION FOR LEAVE TO FILE A SECOND AMENDED COMPLAINT. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1, ) (ENTERED: 07/19/2007)</docket.description></docket.entry><docket.entry><number.block><number>27</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501909813;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A2-2704501909813" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501909813;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>07/19/2007</date><docket.description>MEMORANDUM OPINION. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1) MODIFIED ON 7/19/2007 (LCRWR1, ). 
(ENTERED: 07/19/2007)</docket.description></docket.entry><docket.entry><number.block><number>26</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501672579;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A4-2604501672579" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501672579;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>03/31/2007</date><docket.description>MEMORANDUM ORDER GRANTING DEFENDANTS&apos; MOTION<image.gateway.link casenumber="1:05CV00726" court="DCDCT-DW" image.id="godls|0450561212;court=DCDCT-DW;casenumber=1:05CV00726" item.type="ATTACHMENT" platform="ECF">21</image.gateway.link><gateway.image.link ID="B3-21-0450561212" casenumber="1:05CV00726" court="DCDCT-DW" item.type="ATTACHMENT" key="godls|0450561212;court=DCDCT-DW;casenumber=1:05CV00726" tlr-class="gateway-image-link" ttype="ECF">21</gateway.image.link> TO STAY DISCOVERY PENDING RESOLUTION OF DEFENDANTS&apos; DISPOSITIVE MOTION FILED BY PATRICK J. CANAVAN, PAUL E. WATERS. SIGNED BY JUDGE RICHARD W. ROBERTS ON 3/31/07. (LCRWR1) ADDITIONAL ATTACHMENT(S) ADDED ON 4/3/2007 (LCRWR1, ). 
(ENTERED: 04/02/2007)</docket.description></docket.entry><docket.entry><number.block><number>25</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501577842;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A6-2504501577842" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501577842;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>11/15/2005</date><docket.description>RESPONSE TO DEFENDANTS&apos; NOTICE OF COURT RULING IN RELATED CASE FILED BY 1613 HARVARD LIMITED PARTNERSHIP. (ATTACHMENTS: #<image.gateway.link casenumber="1:05CV00726" court="DCDCT-DW" image.id="godls|04511581037;court=DCDCT-DW;casenumber=1:05CV00726" item.type="ATTACHMENT" platform="ECF">1</image.gateway.link><gateway.image.link ID="B5-1-04511581037" casenumber="1:05CV00726" court="DCDCT-DW" item.type="ATTACHMENT" key="godls|04511581037;court=DCDCT-DW;casenumber=1:05CV00726" tlr-class="gateway-image-link" ttype="ECF">1</gateway.image.link> EXHIBIT 1 - NOTICE OF APPEAL)(WISE, RICHARD) (ENTERED: 11/15/2005)</docket.description></docket.entry><docket.entry><number.block><number>24</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501579104;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A8-2404501579104" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501579104;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>11/14/2005</date><docket.description>NOTIFICATION OF SUPPLEMENTAL AUTHORITY BY DISTRICT OF COLUMBIA, PATRICK J. CANAVAN, PAUL E. 
WATERS (ATTACHMENTS: #<image.gateway.link casenumber="1:05CV00726" court="DCDCT-DW" image.id="godls|04511577643;court=DCDCT-DW;casenumber=1:05CV00726" item.type="ATTACHMENT" platform="ECF">1</image.gateway.link><gateway.image.link ID="B7-1-04511577643" casenumber="1:05CV00726" court="DCDCT-DW" item.type="ATTACHMENT" key="godls|04511577643;court=DCDCT-DW;casenumber=1:05CV00726" tlr-class="gateway-image-link" ttype="ECF">1</gateway.image.link>)(MULLEN, MARTHA) (ENTERED: 11/14/2005)</docket.description></docket.entry></docket.entries.block>
</n-extract-response>

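When I run my Python script, exactly as pasted above, on this particular excerpt, it does pick up the missing elements. But when I run the script on the whole XML file, it does not, as shown earlier. Obviously the excerpt is missing many of the elements above and below it, but I don't see how that would affect iter(), since I am not splitting up the "docket.entry" element or its sub-elements, and that is what the for loop in my code should go through each time (I think).

The problem is not limited to entry number 25: a few other extracted docket descriptions here and there are also missing child elements, but I cannot discern any pattern, and I cannot even tell what difference between entry 25 and entry 24 causes the problem. Can anyone help?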

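You are trying to process an element's children in the start event, but the way iterparse works does not guarantee that they have been read at that point.

The documentation has a note about this: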

Note:

iterparse() only guarantees that it has seen the “>” character of a starting tag when it emits a “start” event, so the attributes are defined, but the contents of the text and tail attributes are undefined at that point. The same applies to the element children; they may or may not be present.

If you need a fully populated element, look for “end” events instead.

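If you want to process child elements, you need to do so on the end event; otherwise there is no guarantee that the element's content is available.

Why you get any content at all is described here: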
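As a minimal sketch of that advice (not the asker's exact script), the loop can listen only for "end" events and walk each docket.entry once it is complete. The SAMPLE string and the iter_entries helper are hypothetical stand-ins for the real file and code. Note one assumption about the original script: its elem.clear() call ran on every event, which in an end-event loop would empty the children before their parent is reached, so here only the finished docket.entry is cleared.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the real file; it mirrors entry 25's shape.
SAMPLE = b"""<n-extract-response><docket.entries.block>
<docket.entry><number.block><number>25</number></number.block>
<docket.description>RESPONSE (ATTACHMENTS: #<image.gateway.link>1</image.gateway.link><gateway.image.link>1</gateway.image.link> EXHIBIT 1)</docket.description></docket.entry>
</docket.entries.block></n-extract-response>"""


def iter_entries(source):
    """Yield a list of (tag, text, tail) triples for each complete docket.entry."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "docket.entry":
            # Leave children alone: clearing them here would empty the
            # entry before its own "end" event arrives.
            continue
        yield [(e.tag, e.text, e.tail) for e in elem.iter()]
        # The entry has been fully processed, so its content can be released.
        elem.clear()


entries = list(iter_entries(io.BytesIO(SAMPLE)))
for tag, text, tail in entries[0]:
    print(tag, repr(text), repr(tail))
```

Because the "end" event fires only after the closing tag has been parsed, the nested gateway.image.link element and its tail are present regardless of where the parser's read buffer happens to end.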

Note:

The tree builder and the event generator are not necessarily synchronized; the latter usually lags behind a bit. This means that when you get a “start” event for an element, the builder may already have filled that element with content. You cannot rely on this, though — a “start” event can only be used to inspect the attributes, not the element content. For more details, see this message.
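One way to observe this lag is to record how many children an element has at its "start" event and again at its "end" event. The sketch below is only a diagnostic; child_counts, the SAMPLE string, and the "id" attribute are hypothetical names introduced for illustration. On a small input like this one, the parser has probably consumed the whole document before the first event is delivered, so the two counts will most likely match, which would be consistent with the short excerpt working. On a large file, an element that straddles the end of the data read so far can show fewer children at "start" than at "end". The buffering behaviour is an implementation detail, so the start-time counts should not be relied on.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the "id" attribute exists only for this demo.
SAMPLE = (
    b"<root>"
    b"<entry id='1'><a>x</a><b>y</b></entry>"
    b"<entry id='2'><a>z</a></entry>"
    b"</root>"
)


def child_counts(source, tag="entry"):
    """Return (id, children seen at 'start', children seen at 'end') per element."""
    seen_at_start = {}
    results = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if elem.tag != tag:
            continue
        key = elem.get("id")  # attributes are safe to read at "start"
        if event == "start":
            seen_at_start[key] = len(elem)  # may be incomplete
        else:
            results.append((key, seen_at_start[key], len(elem)))  # complete
    return results


counts = child_counts(io.BytesIO(SAMPLE))
for key, at_start, at_end in counts:
    print(key, at_start, at_end)
```

Only the last number in each row is guaranteed by the documentation quoted above.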

Alternatively, you could parse the XML file following its logical structure, which gives you precise control over each element. For example:

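import xml.etree.ElementTree as ET

# ET.parse() reads the whole document, so every element is fully populated.
tree = ET.parse(r'<xml file name>')
root = tree.getroot()
for entry in root.findall('.//docket.entry'):
    number = entry.find('.//number')
    print(number.text)
    description = entry.find('docket.description')
    print(description.text)
    # Iterate over the element directly instead of calling getchildren().
    for child in description:
        print(child.tag)
        print(child.text)
        print(child.tail)

getchildren() has been deprecated since version 2.7 (and was removed in Python 3.9); use list(elem) or iterate over the element directly, as the loop above does.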
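Since the goal stated in the question is to extract the docket descriptions, it may help to rebuild each description as a single string from the element's text, its children's text, and their tails. The following is a sketch under stated assumptions: description_text and its skip_tags parameter are hypothetical, and skipping "image.gateway.link" assumes that the two link elements always carry the same attachment number, as they do in the excerpt.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring entry 24 from the excerpt (attributes omitted).
SAMPLE = (
    "<docket.entry><number.block><number>24</number></number.block>"
    "<docket.description>NOTIFICATION (ATTACHMENTS: #"
    "<image.gateway.link>1</image.gateway.link>"
    "<gateway.image.link>1</gateway.image.link>"
    ")(MULLEN, MARTHA)</docket.description></docket.entry>"
)


def description_text(entry, skip_tags=("image.gateway.link",)):
    """Join a docket.description's text, child text, and tails into one string."""
    desc = entry.find("docket.description")
    if desc is None:
        return ""
    parts = [desc.text or ""]
    for child in desc:
        if child.tag not in skip_tags:
            # itertext() yields the child's own text and any nested text.
            parts.append("".join(child.itertext()))
        # The tail belongs to the description, so it is kept even for
        # skipped children.
        parts.append(child.tail or "")
    return "".join(parts)


entry = ET.fromstring(SAMPLE)
full = description_text(entry)
print(full)
```

If the duplicated number is not a concern, "".join(description.itertext()) gives a similar result in one call, with both link texts included.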