Large XML File Parsing in Python

Question

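I have an XML file that is 4 GB in size. I want to parse it and convert it to a DataFrame so that I can work on it. But because the file is so large, the following code is unable to convert it to a pandas DataFrame. The code just keeps running and produces no output. When I use it on a smaller file with the same structure, I get the correct output.

Can anyone suggest a solution? Perhaps code that speeds up the conversion from XML to a DataFrame, or that splits the XML file into smaller subsets.

I would also welcome suggestions on whether I should process such a large XML file on my personal system (2 GB RAM) or use Google Colab. If Google Colab, is there a way to upload such a large file to Drive, and from there to Colab, more quickly?

Below is the code I used: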

import xml.etree.ElementTree as ET
import pandas as pd

# ET.parse loads the entire document into memory at once
tree = ET.parse("Badges.xml")
root = tree.getroot()

#Column names for DataFrame
columns = ['row Id',"UserId",'Name','Date','Class','TagBased']

#Creating DataFrame
df = pd.DataFrame(columns = columns)

#Converting XML Tree to a Pandas DataFrame

for node in root: 
    
    row_Id = node.attrib.get("Id")
    UserId = node.attrib.get("UserId")
    Name = node.attrib.get("Name")
    Date = node.attrib.get("Date")
    Class = node.attrib.get("Class")
    TagBased = node.attrib.get("TagBased")
    
    df = df.append(pd.Series([row_Id,UserId,Name,Date,Class,TagBased], index = columns), ignore_index = True)

Below is a sample of my XML file:

<badges>
  <row Id="82946" UserId="3718" Name="Teacher" Date="2008-09-15T08:55:03.923" Class="3" TagBased="False" />
  <row Id="82947" UserId="994" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82949" UserId="3893" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82950" UserId="4591" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82951" UserId="5196" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82952" UserId="2635" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82953" UserId="1113" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />

Answer 1

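Consider using cElementTree instead of ElementTree. (This mainly applies to older Python versions: since Python 3.3, xml.etree.ElementTree uses the C implementation automatically when it is available, and xml.etree.cElementTree was removed in Python 3.9.)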

https://effbot.org/zone/celementtree.htm

The cElementTree module is a C implementation of the ElementTree API, optimized for fast parsing and low memory use. On typical documents, cElementTree is 15-20 times faster than the Python version of ElementTree, and uses 2-5 times less memory.

The cElementTree module is designed to replace the ElementTree module from the standard elementtree package. In theory, you should be able to simply change:

from elementtree import ElementTree

to

import cElementTree as ElementTree
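
A minimal sketch of an import that prefers cElementTree where it still exists and falls back to the standard module elsewhere. The `count_rows` helper and the `sample` string are hypothetical, added only to show that the rest of the code is unchanged by the choice of import:

```python
try:
    # Older interpreters (Python 2, and Python 3 before 3.9) still ship this alias
    import xml.etree.cElementTree as ET
except ImportError:
    # Python 3.9+: the alias is gone; the plain module uses the C accelerator when available
    import xml.etree.ElementTree as ET


def count_rows(xml_text):
    """Hypothetical helper: parse a small XML string and count the root's children."""
    root = ET.fromstring(xml_text)
    return len(root)


sample = '<badges><row Id="82946" UserId="3718" /><row Id="82947" UserId="994" /></badges>'
print(count_rows(sample))  # prints 2
```

Either way, a faster parser alone is unlikely to be enough for a 4 GB file on 2 GB of RAM, because ET.parse still builds the whole tree in memory.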

Answer 2

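Consider iterparse for fast streaming processing that builds the tree incrementally. In each iteration, append to a list of dictionaries, which you can then pass to the pandas.DataFrame constructor once, outside the loop. Adjust the code below to the name of the repeating nodes under the root: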

from xml.etree.ElementTree import iterparse
#from cElementTree import iterparse
import pandas as pd

file_path = r"/path/to/Input.xml"
dict_list = []

for _, elem in iterparse(file_path, events=("end",)):
    if elem.tag == "row":
        dict_list.append({'rowId': elem.attrib['Id'],
                          'UserId': elem.attrib['UserId'],
                          'Name': elem.attrib['Name'],
                          'Date': elem.attrib['Date'],
                          'Class': elem.attrib['Class'],
                          'TagBased': elem.attrib['TagBased']})

        # dict_list.append(elem.attrib)      # ALTERNATIVELY, PARSE ALL ATTRIBUTES

        elem.clear()

df = pd.DataFrame(dict_list)
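
One caveat with the loop above: elem.clear() empties each row, but the root element still keeps a reference to every emptied child, and dict_list grows with the whole file. A possible variation, sketched below under a few assumptions, also clears the root after each row and hands the rows back in batches. The `iter_row_batches` name, its `tag` and `batch_size` parameters, and the default batch size are hypothetical, and the sketch assumes the root element's own attributes are not needed, since root.clear() discards them as well:

```python
from xml.etree.ElementTree import iterparse


def iter_row_batches(source, tag="row", batch_size=100_000):
    """Yield lists of attribute dicts, at most batch_size per list.

    Hypothetical helper: `source` is a file path or a binary file object.
    """
    context = iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    batch = []
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            batch.append(dict(elem.attrib))  # copy the attributes before clearing
            root.clear()  # drop finished children so the tree stays small
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # remaining rows that did not fill a whole batch
```

Each batch can then be turned into a frame with `pd.DataFrame(batch)` and, for example, appended to a CSV file with `to_csv(path, mode="a")`, so that only one batch is held in memory at a time. With 2 GB of RAM, the full 4 GB file is unlikely to fit in a single DataFrame, so working batch by batch is probably the more realistic route.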
