Large XML File Parsing in Python
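I have an XML file that is 4 GB in size. I want to parse it and convert it to a DataFrame so I can work with it. Because the file is so large, the following code fails to convert it to a pandas DataFrame: it just keeps running and never produces any output. When I run it on a smaller file of the same kind, I get the correct output.
Can anyone suggest a solution? Perhaps code that speeds up the conversion from XML to a DataFrame, or a way to split the XML file into smaller subsets.
I would also appreciate advice on whether I should process such a large XML file on my personal system (2 GB RAM) or use Google Colab. If Google Colab, is there a way to upload such a large file to Drive, and from there to Colab, more quickly?
Below is the code I used: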
import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse("Badges.xml")
root = tree.getroot()

# Column names for DataFrame
columns = ['row Id', "UserId", 'Name', 'Date', 'Class', 'TagBased']

# Creating DataFrame
df = pd.DataFrame(columns=columns)

# Converting XML Tree to a Pandas DataFrame
for node in root:
    row_Id = node.attrib.get("Id")
    UserId = node.attrib.get("UserId")
    Name = node.attrib.get("Name")
    Date = node.attrib.get("Date")
    Class = node.attrib.get("Class")
    TagBased = node.attrib.get("TagBased")

    df = df.append(pd.Series([row_Id, UserId, Name, Date, Class, TagBased], index=columns), ignore_index=True)
Below is my XML file:
<badges>
<row Id="82946" UserId="3718" Name="Teacher" Date="2008-09-15T08:55:03.923" Class="3" TagBased="False" />
<row Id="82947" UserId="994" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82949" UserId="3893" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82950" UserId="4591" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82951" UserId="5196" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82952" UserId="2635" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82953" UserId="1113" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
Consider using cElementTree instead of ElementTree (relevant on older Python versions):
https://effbot.org/zone/celementtree.htm
The cElementTree module is a C implementation of the ElementTree API, optimized for fast parsing and low memory use. On typical documents, cElementTree is 15-20 times faster than the Python version of ElementTree, and uses 2-5 times less memory.
The cElementTree module is designed to replace the ElementTree module from the standard elementtree package. In theory, you should be able to simply change:
from elementtree import ElementTree
to
import cElementTree as ElementTree
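On current Python versions the swap should not be needed: since Python 3.3, xml.etree.ElementTree uses the C accelerator automatically when it is available, and xml.etree.cElementTree was removed in Python 3.9. If the same script also has to run on an old interpreter, a guarded import is one way to cover both cases. The snippet below is only a sketch, and the inline XML string is a shortened stand-in for the real file:

```python
# Prefer the C implementation on old interpreters; fall back to the standard
# module, which already uses the C accelerator on Python 3.3 and later.
try:
    import xml.etree.cElementTree as ET  # removed in Python 3.9
except ImportError:
    import xml.etree.ElementTree as ET

# Shortened sample in the same shape as the question's file (illustrative only).
root = ET.fromstring('<badges><row Id="82946" UserId="3718" /></badges>')
first_row_id = root[0].attrib["Id"]
```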
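Consider iterparse
for fast, streaming processing that builds the tree incrementally. Inside the loop, build a list of dictionaries, then pass it to the pandas.DataFrame constructor once, outside the loop. The code below is adapted to the repeated node name of the root's children (row):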
from xml.etree.ElementTree import iterparse
#from cElementTree import iterparse
import pandas as pd

file_path = r"/path/to/Input.xml"
dict_list = []

for _, elem in iterparse(file_path, events=("end",)):
    if elem.tag == "row":
        dict_list.append({'rowId': elem.attrib['Id'],
                          'UserId': elem.attrib['UserId'],
                          'Name': elem.attrib['Name'],
                          'Date': elem.attrib['Date'],
                          'Class': elem.attrib['Class'],
                          'TagBased': elem.attrib['TagBased']})

        # dict_list.append(dict(elem.attrib))   # ALTERNATIVELY, PARSE ALL ATTRIBUTES
        #                                       # (copy first: clear() may empty attrib)

        elem.clear()    # release the element's contents once it has been read

df = pd.DataFrame(dict_list)
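With only 2 GB of RAM, even the list of dictionaries for a 4 GB file may not fit in memory. One possible variation, sketched below, is to flush rows to a CSV file in batches and to clear the root as well, since elem.clear() alone leaves the emptied row elements attached to the root. The helper name xml_rows_to_csv, the batch size, and the output path are illustrative assumptions and not part of the answer above.

```python
import csv
from xml.etree.ElementTree import iterparse

# Attribute names taken from the sample <row> elements in the question.
FIELDS = ["Id", "UserId", "Name", "Date", "Class", "TagBased"]


def xml_rows_to_csv(xml_path, csv_path, batch_size=100_000):
    """Stream <row> elements from xml_path into csv_path; return the row count."""
    count = 0
    batch = []
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=FIELDS)
        writer.writeheader()

        context = iterparse(xml_path, events=("start", "end"))
        _, root = next(context)  # the first event is the start of the root element

        for event, elem in context:
            if event == "end" and elem.tag == "row":
                # Missing attributes become None, which csv writes as an empty field.
                batch.append({name: elem.attrib.get(name) for name in FIELDS})
                count += 1
                if len(batch) >= batch_size:
                    writer.writerows(batch)
                    batch.clear()
                root.clear()  # detach processed rows so memory use stays flat

        writer.writerows(batch)  # write whatever is left in the last batch
    return count
```

The resulting CSV can then be read back piece by piece, for example with pandas.read_csv(csv_path, chunksize=...), so the whole dataset never has to sit in memory at once.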