Iterparse big XML,内存占用少,并获取所有甚至嵌套的序列元素
Iterparse big XML, with low memory footprint, and get all, even nested, Sequence Elements
我写了一个小 python 脚本来根据 Python 中的 Liza Daly's blog 解析 XML 数据。但是,我的代码不解析 all 个节点。因此,例如,当一个人有多个地址时,它只需要第一个可用地址。 XML 树看起来像这样:
- lgs
- entities
- entity
- id
- name
- addressess
- address
- address1
- address
- address1
- entity
- id
(...)
这就是 python 脚本:
import os
import time
from datetime import datetime
import lxml.etree as ET
import pandas as pd
xml_file = '.\FILE.XML'
file_name, file_extension = os.path.splitext(os.path.basename(xml_file))
def fast_iter(context, *args, **kwargs):
# iterate through the XML file and create a dictionary
xml_dict = { "Id":[]
, "name":[]
, "address":[]
, "type":[] }
def try_to_append(xml_column, node):
# find and append XML nodes to the empty dictionary
try:
xml_dict[xml_column].append(elem.find(node).text)
except:
xml_dict[xml_column].append('')
for event, elem in context:
try_to_append('Id', 'id')
try_to_append('name', 'name')
try_to_append('address', 'addresses/address/address1'
elem.clear()
for ancestor in elem.xpath("ancestor-or-self::*"):
while ancestor.getprevious() is not None:
del ancestor.getparent()[0]
del context
return xml_dict
context = ET.iterparse(xml_file, tag='entity')
xml_dict = fast_iter(context)
df = pd.DataFrame(xml_dict)
所以我得到的是:
ID | name | address | Type
1 | John Doe | Pythonstreet 4 | A
但是 John Doe 的名字上有 3 个住址。所以我希望:
ID | name | address | Type
1 | John Doe | Pythonstreet 4 | A
1 | John Doe | SQL street 33 | A
1 | John Doe | C++ street 99 | A
正如你所看到的,因为唯一额外的节点是地址,其余的都没有改变,只有地址被额外添加。 SQL 当您将 OPENROWSET() 与 BULK 一起使用时,服务器会自动执行此操作。但是,此文件大于 3GB,因此无法正常工作。
更新:XML 示例(假地址)
<?xml version='1.0' encoding='UTF-8'?>
<ffl>
<version>12345</version>
<entities>
<entity id="1124353" version="12345">
<name>DAVID, Beckham</name>
<footId>1021</footId>
<footCode>FIF</footCode>
<createdDate>09/02/1991</createdDate>
<source>FIF</source>
<OriginalSource>FIFA</OriginalSource>
<pobs>
<pob>Leytonstone, London, United Kingdom</pob>
</pobs>
<dobs>
<dob>May 02, 1975</dob>
</dobs>
<titles>
<title>Football player</title>
</titles>
<addresses>
<address>
<address></address>
<city>London</city>
<country>UK</country>
<countryName>UNITED KINGDOM</countryName>
<postalCode>SE10 0JF</postalCode>
</address>
<address>
<address1>35-37 Parkgate Road</address1>
<city>London</city>
<country>UK</country>
<countryName>UNITED KINGDOM</countryName>
<postalCode>SW11 4NP</postalCode>
</address>
</addresses>
</entity>
<entity id="1184359" version="12345">
<name>CRISTIANO, Ronaldo</name>
<footId>1022</footId>
<footCode>FIF</footCode>
<createdDate>20/03/2003</createdDate>
<source>FIF</source>
<OriginalSource>FIFA</OriginalSource>
<pobs>
<pob>Funchal, Madeira, Portugal</pob>
</pobs>
<dobs>
<dob>February 05, 1985</dob>
</dobs>
<titles>
<title>Football player</title>
</titles>
<addresses>
<address>
<address>Avenida da Boavista 1837</address>
<city>Porto</city>
<country>PT</country>
<countryName>PORTUGAL</countryName>
<postalCode>4100-133</postalCode>
</address>
<address>
<address1>Extramuros 74</address1>
<city>Madrid</city>
<country>ES</country>
<countryName>Spain</countryName>
<postalCode>28400</postalCode>
</address>
<address>
<address1>Viale Certosa 29</address1>
<city>Turin</city>
<country>IT</country>
<countryName>Italy</countryName>
<postalCode>10093</postalCode>
</address>
</addresses>
</entity>
<entity id="1984359" version="12345">
<name>LIONEL, Messi</name>
<footId>1023</footId>
<footCode>FIF</footCode>
<createdDate>09/02/2008</createdDate>
<source>FIF</source>
<OriginalSource>FIFA</OriginalSource>
<pobs>
<pob>Rosario, Argentina</pob>
</pobs>
<dobs>
<dob>June 24, 1987</dob>
</dobs>
<titles>
<title>Football player</title>
</titles>
<addresses>
<address>
<address>Almeyra 2588</address>
<city>San Martin</city>
<state>Buenos Aires</state>
<country>AR</country>
<countryName>ARGENTINA</countryName>
<postalCode>N/A</postalCode>
</address>
<address>
<address1>Comandante Izarduy 67</address1>
<city>Barcelona</city>
<country>ES</country>
<countryName>SPAIN</countryName>
<postalCode>08940</postalCode>
</address>
<address>
<address1>Humahuaca 4425</address1>
<city>Buenos Aires</city>
<country>AR</country>
<countryName>ARGENTINA</countryName>
<postalCode>N/A</postalCode>
</address>
</addresses>
</entity>
</entities>
</ffl>
Comment: As it now only outputs results
输出结果仅用于演示、跟踪和调试。
要将 record
和 addresses
写入 SQL
数据库,例如使用 sqlite3
,执行:
c.execute("INSERT INTO entity(id, name) VALUES(:id, :name)", record)
addresses = []
for addr in record['addresses']:
addr[1].update({'id': record['id']})
addresses.append(addr[1])
c.executemany("INSERT INTO adresses(id, address, city) VALUES(:id, :address, :city)", addresses)
为 pandas
展平
循环外前置条件: df = pd.DataFrame()
from copy import copy
addresses = copy(record['addresses'])
del record['addresses']
df_records = []
for addr in addresses:
record.update(addr[1])
df_records.append(record)
df = df.append(df_records, ignore_index=True)
Question: Use etree.iterparse
to include all nodes in XML file
以下 class Entity
做:
- 使用
lxml.etree.iterparse
解析 XML
文件。
- 没有文件大小限制,因为
<entity>...</entity>
元素树在处理后 被删除 。
- 从每个
<entity>...</entity>
树构建一个 dict {tag, value, ...}
。
- 使用
generator objects
到 yield
dict
。
- 序列元素,例如
<addresses>/<address>
是元组列表 [(address, {tag, text})...
。
ToDo:
- To flatten into many Records, loop
record['addresses']
- To equal different tag names:
address
and address1
- To flatten, Sequence tags, e.g.
<titels>
, <probs>
and <dobs>
from lxml import etree
class Entity:
def __init__(self, fh):
"""
Initialize 'iterparse' to only generate 'end' events on tag '<entity>'
:param fh: File Handle from the XML File to parse
"""
self.context = etree.iterparse(fh, events=("end",), tag=['entity'])
def _parse(self):
"""
Parse the XML File for all '<entity>...</entity>' Elements
Clear/Delete the Element Tree after processing
:return: Yield the current '<entity>...</entity>' Element Tree
"""
for event, elem in self.context:
yield elem
elem.clear()
while elem.getprevious() is not None:
del elem.getparent()[0]
def sequence(self, elements):
"""
Expand a Sequence Element, e.g. <titels> to a Tuple ('titel', text).
If found a nested Sequence Element, e.g. <address>,
to a Tuple ('address', {tag, text})
:param elements: The Sequence Element
:return: List of Tuple [(tag1, value), (tag2, value), ... ,(tagn, value))
"""
_elements = []
for elem in elements:
if len(elem):
_elements.append((elem.tag, dict(self.sequence(elem))))
else:
_elements.append((elem.tag, elem.text))
return _elements
def __iter__(self):
"""
Iterate all '<entity>...</entity>' Element Trees yielded from self._parse()
:return: Dict var 'entity' {tag1, value, tag2, value, ... ,tagn, value}}
"""
for xml_entity in self._parse():
entity = {'id': xml_entity.attrib['id']}
for elem in xml_entity:
# if elem is Sequence
if len(elem):
# Append tuple(tag, value)
entity[elem.tag] = self.sequence(elem)
else:
entity[elem.tag] = elem.text
yield entity
if __name__ == "__main__":
with open('.\FILE.XML', 'rb') as in_xml_
for record in Entity(in_xml):
print("record:{}".format(record))
for key, value in record.items():
if isinstance(value, (list)):
#print_list(key, value)
print("{}:{}".format(key, value))
else:
print("{}:{}".format(key, value))
Output: Shows only the first Record and only 4 fields.
Note: There is a pitfall with unique tag names: address
and address1
record:{'id': '1124353', 'titles': {'title': 'Foot... (omitted for brevity)
id:1124353
name:DAVID, Beckham
titles:[('title', 'Football player')]
addresses:
address:{'city': 'London', 'address': None, 'post... (omitted for brevity)
address:{'city': 'London', 'address1': '35-37 Par... (omitted for brevity)
测试 Python:3.5 - lxml.etree:3.7.1
我写了一个小 python 脚本来根据 Python 中的 Liza Daly's blog 解析 XML 数据。但是,我的代码不解析 all 个节点。因此,例如,当一个人有多个地址时,它只需要第一个可用地址。 XML 树看起来像这样:
- lgs
- entities
- entity
- id
- name
- addressess
- address
- address1
- address
- address1
- entity
- id
(...)
这就是 python 脚本:
import os
import time
from datetime import datetime
import lxml.etree as ET
import pandas as pd
xml_file = '.\FILE.XML'
file_name, file_extension = os.path.splitext(os.path.basename(xml_file))
def fast_iter(context, *args, **kwargs):
# iterate through the XML file and create a dictionary
xml_dict = { "Id":[]
, "name":[]
, "address":[]
, "type":[] }
def try_to_append(xml_column, node):
# find and append XML nodes to the empty dictionary
try:
xml_dict[xml_column].append(elem.find(node).text)
except:
xml_dict[xml_column].append('')
for event, elem in context:
try_to_append('Id', 'id')
try_to_append('name', 'name')
try_to_append('address', 'addresses/address/address1'
elem.clear()
for ancestor in elem.xpath("ancestor-or-self::*"):
while ancestor.getprevious() is not None:
del ancestor.getparent()[0]
del context
return xml_dict
context = ET.iterparse(xml_file, tag='entity')
xml_dict = fast_iter(context)
df = pd.DataFrame(xml_dict)
所以我得到的是:
ID | name | address | Type
1 | John Doe | Pythonstreet 4 | A
但是 John Doe 的名字上有 3 个住址。所以我希望:
ID | name | address | Type
1 | John Doe | Pythonstreet 4 | A
1 | John Doe | SQL street 33 | A
1 | John Doe | C++ street 99 | A
正如你所看到的,因为唯一额外的节点是地址,其余的都没有改变,只有地址被额外添加。 SQL 当您将 OPENROWSET() 与 BULK 一起使用时,服务器会自动执行此操作。但是,此文件大于 3GB,因此无法正常工作。
更新:XML 示例(假地址)
<?xml version='1.0' encoding='UTF-8'?>
<ffl>
<version>12345</version>
<entities>
<entity id="1124353" version="12345">
<name>DAVID, Beckham</name>
<footId>1021</footId>
<footCode>FIF</footCode>
<createdDate>09/02/1991</createdDate>
<source>FIF</source>
<OriginalSource>FIFA</OriginalSource>
<pobs>
<pob>Leytonstone, London, United Kingdom</pob>
</pobs>
<dobs>
<dob>May 02, 1975</dob>
</dobs>
<titles>
<title>Football player</title>
</titles>
<addresses>
<address>
<address></address>
<city>London</city>
<country>UK</country>
<countryName>UNITED KINGDOM</countryName>
<postalCode>SE10 0JF</postalCode>
</address>
<address>
<address1>35-37 Parkgate Road</address1>
<city>London</city>
<country>UK</country>
<countryName>UNITED KINGDOM</countryName>
<postalCode>SW11 4NP</postalCode>
</address>
</addresses>
</entity>
<entity id="1184359" version="12345">
<name>CRISTIANO, Ronaldo</name>
<footId>1022</footId>
<footCode>FIF</footCode>
<createdDate>20/03/2003</createdDate>
<source>FIF</source>
<OriginalSource>FIFA</OriginalSource>
<pobs>
<pob>Funchal, Madeira, Portugal</pob>
</pobs>
<dobs>
<dob>February 05, 1985</dob>
</dobs>
<titles>
<title>Football player</title>
</titles>
<addresses>
<address>
<address>Avenida da Boavista 1837</address>
<city>Porto</city>
<country>PT</country>
<countryName>PORTUGAL</countryName>
<postalCode>4100-133</postalCode>
</address>
<address>
<address1>Extramuros 74</address1>
<city>Madrid</city>
<country>ES</country>
<countryName>Spain</countryName>
<postalCode>28400</postalCode>
</address>
<address>
<address1>Viale Certosa 29</address1>
<city>Turin</city>
<country>IT</country>
<countryName>Italy</countryName>
<postalCode>10093</postalCode>
</address>
</addresses>
</entity>
<entity id="1984359" version="12345">
<name>LIONEL, Messi</name>
<footId>1023</footId>
<footCode>FIF</footCode>
<createdDate>09/02/2008</createdDate>
<source>FIF</source>
<OriginalSource>FIFA</OriginalSource>
<pobs>
<pob>Rosario, Argentina</pob>
</pobs>
<dobs>
<dob>June 24, 1987</dob>
</dobs>
<titles>
<title>Football player</title>
</titles>
<addresses>
<address>
<address>Almeyra 2588</address>
<city>San Martin</city>
<state>Buenos Aires</state>
<country>AR</country>
<countryName>ARGENTINA</countryName>
<postalCode>N/A</postalCode>
</address>
<address>
<address1>Comandante Izarduy 67</address1>
<city>Barcelona</city>
<country>ES</country>
<countryName>SPAIN</countryName>
<postalCode>08940</postalCode>
</address>
<address>
<address1>Humahuaca 4425</address1>
<city>Buenos Aires</city>
<country>AR</country>
<countryName>ARGENTINA</countryName>
<postalCode>N/A</postalCode>
</address>
</addresses>
</entity>
</entities>
</ffl>
Comment: As it now only outputs results
输出结果仅用于演示、跟踪和调试。
要将 record
和 addresses
写入 SQL
数据库,例如使用 sqlite3
,执行:
c.execute("INSERT INTO entity(id, name) VALUES(:id, :name)", record)
addresses = []
for addr in record['addresses']:
addr[1].update({'id': record['id']})
addresses.append(addr[1])
c.executemany("INSERT INTO adresses(id, address, city) VALUES(:id, :address, :city)", addresses)
为 pandas
展平
循环外前置条件: df = pd.DataFrame()
from copy import copy
addresses = copy(record['addresses'])
del record['addresses']
df_records = []
for addr in addresses:
record.update(addr[1])
df_records.append(record)
df = df.append(df_records, ignore_index=True)
Question: Use
etree.iterparse
to include all nodes in XML file
以下 class Entity
做:
- 使用
lxml.etree.iterparse
解析XML
文件。 - 没有文件大小限制,因为
<entity>...</entity>
元素树在处理后 被删除 。 - 从每个
<entity>...</entity>
树构建一个dict {tag, value, ...}
。 - 使用
generator objects
到yield
dict
。 - 序列元素,例如
<addresses>/<address>
是元组列表[(address, {tag, text})...
。
ToDo:
- To flatten into many Records, loop
record['addresses']
- To equal different tag names:
address
andaddress1
- To flatten, Sequence tags, e.g.
<titels>
,<probs>
and<dobs>
from lxml import etree
class Entity:
def __init__(self, fh):
"""
Initialize 'iterparse' to only generate 'end' events on tag '<entity>'
:param fh: File Handle from the XML File to parse
"""
self.context = etree.iterparse(fh, events=("end",), tag=['entity'])
def _parse(self):
"""
Parse the XML File for all '<entity>...</entity>' Elements
Clear/Delete the Element Tree after processing
:return: Yield the current '<entity>...</entity>' Element Tree
"""
for event, elem in self.context:
yield elem
elem.clear()
while elem.getprevious() is not None:
del elem.getparent()[0]
def sequence(self, elements):
"""
Expand a Sequence Element, e.g. <titels> to a Tuple ('titel', text).
If found a nested Sequence Element, e.g. <address>,
to a Tuple ('address', {tag, text})
:param elements: The Sequence Element
:return: List of Tuple [(tag1, value), (tag2, value), ... ,(tagn, value))
"""
_elements = []
for elem in elements:
if len(elem):
_elements.append((elem.tag, dict(self.sequence(elem))))
else:
_elements.append((elem.tag, elem.text))
return _elements
def __iter__(self):
"""
Iterate all '<entity>...</entity>' Element Trees yielded from self._parse()
:return: Dict var 'entity' {tag1, value, tag2, value, ... ,tagn, value}}
"""
for xml_entity in self._parse():
entity = {'id': xml_entity.attrib['id']}
for elem in xml_entity:
# if elem is Sequence
if len(elem):
# Append tuple(tag, value)
entity[elem.tag] = self.sequence(elem)
else:
entity[elem.tag] = elem.text
yield entity
if __name__ == "__main__":
with open('.\FILE.XML', 'rb') as in_xml_
for record in Entity(in_xml):
print("record:{}".format(record))
for key, value in record.items():
if isinstance(value, (list)):
#print_list(key, value)
print("{}:{}".format(key, value))
else:
print("{}:{}".format(key, value))
Output: Shows only the first Record and only 4 fields.
Note: There is a pitfall with unique tag names:address
andaddress1
record:{'id': '1124353', 'titles': {'title': 'Foot... (omitted for brevity) id:1124353 name:DAVID, Beckham titles:[('title', 'Football player')] addresses: address:{'city': 'London', 'address': None, 'post... (omitted for brevity) address:{'city': 'London', 'address1': '35-37 Par... (omitted for brevity)
测试 Python:3.5 - lxml.etree:3.7.1