Parsing an XML file using parseString from xml.dom.minidom has poor efficiency?
I am trying to parse an XML file using Python 2.7. The XML file is 370+ MB and contains 6,541,000 lines.
The file is made up of ~300K blocks like the following:
<Tag:Member>
    <fileID id = '123456789'>
    <miscTag> 123 </miscTag>
    <miscTag2> 456 </miscTag2>
    <DateTag> 2008-02-02 </DateTag>
    <Tag2:descriptiveTerm>Keyword_1</Tag2:descriptiveTerm>
    <miscTag3>6.330016</miscTag3>
    <historyTag>
        <DateTag>2001-04-16</DateTag>
        <reasonTag>Refresh</reasonTag>
    </historyTag>
    <Tag3:make>Keyword_2</Tag3:make>
    <miscTag4>
        <miscTag5>
            <Tag4:coordinates>6.090,6.000 5.490,4.300 6.090,6.000 </Tag4:coordinates>
        </miscTag5>
    </miscTag4>
</Tag:Member>
I used the following code:
from xml.dom.minidom import parseString

def XMLParser(filePath):
    """ ===== Load XML File into Memory ===== """
    datafile = open(filePath)
    data = datafile.read()
    datafile.close()
    dom = parseString(data)
    length = len(dom.getElementsByTagName("Tag:Member"))
    counter = 0
    while counter < length:
        """ ===== Extract Descriptive Term ===== """
        contentString = dom.getElementsByTagName("Tag2:descriptiveTerm")[counter].toxml()
        laterpart = contentString.split("Tag2:descriptiveTerm>", 1)[1]
        descriptiveTerm = laterpart.split("</Tag2:descriptiveTerm>", 1)[0]
        if descriptiveTerm == "Keyword_1":
            """ ===== Extract Make ===== """
            contentString = dom.getElementsByTagName("Tag3:make")[counter].toxml()
            laterpart = contentString.split("<Tag3:make>", 1)[1]
            make = laterpart.split("</Tag3:make>", 1)[0]
            if descriptiveTerm == "Keyword_1" and make == "Keyword_2":
                """ ===== Extract ID ===== """
                contentString = dom.getElementsByTagName("Tag:Member")[counter].toxml()
                laterpart = contentString.split("id=\"", 1)[1]
                laterpart = laterpart.split("Tag", 1)[1]
                IDString = laterpart.split("\">", 1)[0]
                """ ===== Extract Coordinates ===== """
                contentString = dom.getElementsByTagName("Tag:Member")[counter].toxml()
                laterpart = contentString.split("coordinates>", 1)[1]
                coordString = laterpart.split(" </Tag4:coordinates>", 1)[0]
        counter += 1
When I ran this, I found that it needed about 27 GB of memory, and each of the blocks above took over 20 seconds to parse. At that rate, parsing the whole file would take two months!
I suspect I have written some very inefficient code. Can anyone help me improve it?
Many thanks.
For a file of this size, the correct approach is a streaming parser (SAX-style, not DOM-style, so minidom is entirely the wrong tool for the job). See this answer for notes on using lxml.iterparse
(a modern streaming parser backed by libxml2, a fast and efficient XML-parsing library written in C) in a memory-efficient way, or the article on which that answer is based.
In general: when you see the element that opens a member, build that member up in memory; when you see the event corresponding to its end tag, emit or process the in-memory content, discard it, and start on a fresh one.
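To make that pattern concrete, here is a minimal sketch using the standard library's `xml.etree.ElementTree.iterparse` (`lxml.etree.iterparse` has the same interface and is faster). The tag names are simplified stand-ins for the prefixed tags above, since the snippet in the question omits the namespace declarations those prefixes would need:

```python
import xml.etree.ElementTree as ET
from io import BytesIO

def stream_members(source):
    """Yield one dict per matching <Member> block, freeing memory as we go."""
    # "end" events fire when a closing tag is seen, i.e. the element is complete.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Member":
            term = elem.findtext("descriptiveTerm")
            make = elem.findtext("make")
            if term == "Keyword_1" and make == "Keyword_2":
                yield {
                    "id": elem.find("fileID").get("id"),
                    "coords": elem.findtext(".//coordinates").strip(),
                }
            # Discard the processed subtree so the in-memory tree never grows.
            elem.clear()

# Small in-memory sample standing in for the 370 MB file.
sample = b"""<root>
  <Member>
    <fileID id="123456789"/>
    <descriptiveTerm>Keyword_1</descriptiveTerm>
    <make>Keyword_2</make>
    <misc><coordinates>6.090,6.000 5.490,4.300 </coordinates></misc>
  </Member>
  <Member>
    <fileID id="987654321"/>
    <descriptiveTerm>Other</descriptiveTerm>
    <make>Keyword_2</make>
    <misc><coordinates>1,2</coordinates></misc>
  </Member>
</root>"""

for record in stream_members(BytesIO(sample)):
    print(record)
```

Because only one member's subtree is alive at a time, memory stays flat regardless of file size, and each `getElementsByTagName` rescan of the whole document (the main cost in the original code) disappears entirely.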