How do I optimize this XML parsing loop for speed?

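I wrote some code that parses about a hundred XML files and builds a dataframe. The code works, but it can take quite a long time to run, a little under an hour. I'm sure there is a way to improve this loop, perhaps by only using dataframe objects at the end of the loop, or perhaps the triple nested loop isn't needed to parse all the information into the dataframe, but this is the only way I was able to do it as a beginner.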

My code looks like this:

from bs4 import BeautifulSoup
import pandas as pd
import lxml
import json
import os

os.chdir(r"path_to_output_file/output_file")
f_list = os.listdir()

df_list = []

output_files = []
# checking we only iterate through XML files containing "calc_output"
for calc_output in f_list:
    if "calc_output" in calc_output and calc_output.endswith(".xml"):
        output_files.append(calc_output)
        
for calc_output in output_files:
    with open(calc_output, "r") as datas:
        print(f"reading file {calc_output} ...")

        doc = BeautifulSoup(datas.read(), "lxml")

        rows = []
        timestamps = doc.time.find_all("timestamp")
        for timestamp in timestamps: # parsing through every timestamp element
            row= {}
            time = timestamp.get("time") # reading timestamp attributes
            temperature = timestamp.get("temperature")
            zone_id = doc.zone.get("zone_id")
            time_id = timestamp.get("time_id")
            row.update({"time":time, "temperature":temperature, "time_id":time_id, "zone_id":zone_id})
            row_copy = row.copy()
            rows.append(row_copy)

        # creating temporary dataframe to combine with other info
        df1 = pd.DataFrame(rows)

        rows= []
        surfacedatas = doc.surfacehistory.find_all("surfacedata")
        for surfacedata in surfacedatas:
            row= {}
            #parsing through every surfacedata element
            time_begin = surfacedata.get("time-begin")
            time_end = surfacedata.get("time-end")
            row={"time-begin":time_begin, "time-end":time_end}

            things = surfacedata.find_all("thing", recursive=False)
            #parsing through every thing in each surfacedata
            for thing in things:
                identity = id2name(thing.get("identity"))
                row.update({"identity":identity})

                locations = thing.find_all("location", recursive=False)
                for location in locations:
                    #parsing through every location for every thing for each surfacedata
                    l_identity = location.get("l_identity")
                    surface = location.getText()
                    row.update({"l_identity":l_identity, "surface":surface})
                    row_copy = row.copy()
                    rows.append(row_copy)
        df2 = pd.DataFrame(rows) # second dataframe containing the information needed

    #merging each dataframe on every loop
    df =pd.merge(df1,df2, left_on="time_id", right_on="time-begin") 
    # then appending it to a list
    df_list.append(df)

# final dataframe created by concatenating each dataframe from each output file
df = pd.concat(df_list)
df

Examples of the XML files:

File 1

<file filename="stack_example_1" created="today">
    <unit time="day" volume="cm3" surface="cm2"/>
    <zone zone_id="10">
        <time>
            <timestamp time_id="1" time="0" temperature="100"/>
            <timestamp time_id="2" time="10.00" temperature="200"/>
        </time>
        <surfacehistory type="calculation">
            <surfacedata time-begin="1" time-end="2">
                <thing identity="1">
                    <location l_identity="2"> 1.256</location>
                    <location l_identity="45"> 2.3</location>
                </thing>
                <thing identity="3">
                    <location l_identity="2"> 1.6</location>
                    <location l_identity="5"> 2.5</location> 
                    <location l_identity="78"> 3.2</location>
                </thing>
            </surfacedata>
            <surfacedata time-begin="2" time-end="3">
                <thing identity="1">
                    <location l_identity="17"> 2.4</location>
                </thing>
            </surfacedata>
        </surfacehistory>
    </zone>
</file>

File 2

<file filename="stack_example_2" created="today">
    <unit time="day" volume="cm3" surface="cm2"/>
    <zone zone_id="11">
        <time>
            <timestamp time_id="1" time="0" temperature="100"/>
            <timestamp time_id="2" time="10.00" temperature="200"/>
        </time>
        <surfacehistory type="calculation">
            <surfacedata time-begin="1" time-end="2">
                <thing identity="1">
                    <location l_identity="2"> 1.6</location>
                    <location l_identity="45"> 2.6</location>
                </thing>
                <thing identity="3">
                    <location l_identity="2"> 1.4</location>
                    <location l_identity="8"> 2.7</location>
                </thing>
            </surfacedata>
            <surfacedata time-begin="2" time-end="3">
                <thing identity="1">
                    <location l_identity="9"> 2.8</location>
                    <location l_identity="17"> 1.2</location>
                </thing>
            </surfacedata>
        </surfacehistory>
    </zone>
</file>

The output of this code for File 1 and File 2 is:

zone_id     time       time_id  temperature tid-begin   tid-end    identity  location   surface
10           0          1       100         1           2          1        2           1,256
10           0          1       100         1           2          1        45          2,3
10           0          1       100         1           2          3        2           1,6
10           0          1       100         1           2          3        5           2,5
10           0          1       100         1           2          3        78          3,2
10           10         2       200         2           3          1        17          2,4
11           0          1       100         1           2          1        2           1,6
11           0          1       100         1           2          1        45          2,6
11           0          1       100         1           2          3        2           1,4
11           0          1       100         1           2          3        8           2,7
11           10         2       200         2           3          1        9           2,8
11           10         2       200         2           3          1        17          1,2

Here is the output I get after running cProfile:

      Ordered by: internal time
   List reduced from 6281 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   214204   95.337    0.000   95.340    0.000 C:\Users\anon\Anaconda3\lib\json\decoder.py:343(raw_decode)
   214389   20.685    0.000   21.386    0.000 {built-in method io.open}
   214288   17.945    0.000   17.945    0.000 {built-in method _codecs.charmap_decode}
        1   16.745   16.745  336.360  336.360 .\anon_programm.py:7(<module>)
       10   15.378    1.538  132.814   13.281 C:\Users\anon\Anaconda3\lib\site-packages\bs4\builder\_lxml.py:330(feed)
 10277616   12.975    0.000   44.266    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\__init__.py:555(endData)
   214228   12.504    0.000   30.575    0.000 {method 'read' of '_io.TextIOWrapper' objects}
  3425862   11.257    0.000   75.608    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\builder\_lxml.py:223(start)
  6851244   10.806    0.000   19.427    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\__init__.py:589(object_was_parsed)
 17128360    8.580    0.000    8.580    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\element.py:158(setup)
  3425862    8.389    0.000    8.694    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\__init__.py:527(popTag)
  5961888    7.170    0.000    7.170    0.000 {method 'keys' of 'dict' objects}
  3425872    7.072    0.000   23.054    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\element.py:1152(__init__)
   214200    5.978    0.000  146.468    0.001 .\anon_programm.py:18(id2name)
  3425862    5.913    0.000   61.118    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\__init__.py:691(handle_starttag)
  3425002    4.482    0.000   12.571    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\builder\__init__.py:285(_replace_cdata_list_attribute_values)
  3425862    4.326    0.000   37.251    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\builder\_lxml.py:278(end)
  3425862    4.244    0.000   13.552    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\__init__.py:657(_popToTag)
  2751774    4.240    0.000    6.154    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\element.py:808(<genexpr>)
  6851244    3.869    0.000    8.629    0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\element.py:932(__new__)

This is the function that gets called many times inside the loop:

import functools

@functools.lru_cache(maxsize=1000)
def id2name(id):
    name_Dict = json.loads( open(r"path_to_JSON_file\file.json","r").read() )
    name = ""
    if id.isnumeric():
        partial_id = id[:-1]  
        if partial_id not in name_Dict.keys():
            return id
        if id[-1] == "0":
            return  name_Dict[partial_id]
        else:
            return  name_Dict[partial_id]+"x"+id[-1]
    else:
        return ""

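As pointed out in the comments on your question, most of the time is spent decoding JSON in your id2name function. Although the function's result is cached, the parsed JSON object is not, which means you load the JSON file from disk and parse it again every time a new ID is looked up.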

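Assuming you load the same JSON file every time, you should get an immediate speed-up by caching the parsed JSON object. You can do that by restructuring your id2name function as follows.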

import functools
import json

@functools.lru_cache()
def load_name_dict():
    with open(r"path_to_JSON_file\file.json", "r", encoding="utf-8") as f: 
        return json.load(f)

@functools.lru_cache(maxsize=1000)
def id2name(thing_id):
    if not thing_id.isnumeric():
        return ""
    name_dict = load_name_dict()
    name = name_dict.get(thing_id[:-1])
    if name is None:
        return thing_id
    last_char = thing_id[-1]
    if last_char == "0":
        return name
    else:
        return name + "x" + last_char

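Note that I have refactored id2name so that it does not load the JSON object at all when the ID is non-numeric. I also changed it to use the .get method instead of in, which avoids a second, unnecessary dictionary lookup. In addition, I renamed id to thing_id, because id is a built-in function in Python.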
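To check the caching behaviour for yourself, here is a minimal, self-contained sketch. The temporary JSON file and its two entries are made up for illustration and stand in for your real file; cache_info() is the standard way to inspect an lru_cache-decorated function.

```python
import functools
import json
import os
import tempfile

# Hypothetical stand-in for the real JSON file: a tiny ID -> name mapping.
tmp_dir = tempfile.mkdtemp()
json_path = os.path.join(tmp_dir, "file.json")
with open(json_path, "w", encoding="utf-8") as f:
    json.dump({"1": "alpha", "2": "beta"}, f)


@functools.lru_cache()
def load_name_dict():
    # The body runs only on the first call; later calls return the cached dict.
    with open(json_path, "r", encoding="utf-8") as f:
        return json.load(f)


@functools.lru_cache(maxsize=1000)
def id2name(thing_id):
    if not thing_id.isnumeric():
        return ""
    name = load_name_dict().get(thing_id[:-1])
    if name is None:
        return thing_id
    last_char = thing_id[-1]
    return name if last_char == "0" else name + "x" + last_char


results = [id2name(i) for i in ["10", "13", "20", "990", "abc", "10"]]
print(results)

# misses == 1 here means the JSON file was opened and parsed exactly once.
print(load_name_dict.cache_info())
print(id2name.cache_info())

# Clean up the illustrative file.
os.remove(json_path)
os.rmdir(tmp_dir)
```

In your original version, each of the four distinct numeric IDs above would have triggered its own open-and-parse of the JSON file, which matches the roughly 214,000 calls to raw_decode in your profile.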

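Also, since your input files appear to be valid XML, you could save more time by using lxml directly rather than going through BeautifulSoup. Or better yet, you could use pandas.read_xml to load the XML straight into a dataframe. One caveat, though: you should profile the resulting code to check that it really does run faster, rather than taking my word for it. Intuition about performance is notoriously unreliable; you should always measure.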
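As a rough illustration of the "use lxml directly" idea, here is a sketch based on your File 1 sample. It is not a drop-in replacement, and it makes some assumptions. The parse_rows helper and its id2name parameter are names I made up. It assumes time_id is unique within a file, so a dict lookup can stand in for the inner merge. It falls back to the standard library's ElementTree, which offers the same calls used here, when lxml is not installed.

```python
try:
    from lxml import etree  # C-based parser, if installed
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same calls used below

SAMPLE = """<file filename="stack_example_1" created="today">
    <unit time="day" volume="cm3" surface="cm2"/>
    <zone zone_id="10">
        <time>
            <timestamp time_id="1" time="0" temperature="100"/>
            <timestamp time_id="2" time="10.00" temperature="200"/>
        </time>
        <surfacehistory type="calculation">
            <surfacedata time-begin="1" time-end="2">
                <thing identity="1">
                    <location l_identity="2"> 1.256</location>
                    <location l_identity="45"> 2.3</location>
                </thing>
                <thing identity="3">
                    <location l_identity="2"> 1.6</location>
                    <location l_identity="5"> 2.5</location>
                    <location l_identity="78"> 3.2</location>
                </thing>
            </surfacedata>
            <surfacedata time-begin="2" time-end="3">
                <thing identity="1">
                    <location l_identity="17"> 2.4</location>
                </thing>
            </surfacedata>
        </surfacehistory>
    </zone>
</file>"""


def parse_rows(xml_text, id2name=lambda s: s):
    """Hypothetical helper: return one dict per <location> element.

    id2name defaults to the identity function so the sketch runs on its own;
    pass your real id2name function instead.
    """
    # For files on disk, etree.parse(path).getroot() would replace fromstring.
    root = etree.fromstring(xml_text)
    zone = root.find("zone")
    zone_id = zone.get("zone_id")

    # Index timestamps by time_id (assumed unique per file) so that each
    # surfacedata finds its timestamp with a dict lookup instead of a merge.
    stamps = {ts.get("time_id"): ts for ts in zone.iterfind("time/timestamp")}

    rows = []
    for sd in zone.iterfind("surfacehistory/surfacedata"):
        ts = stamps.get(sd.get("time-begin"))
        if ts is None:
            continue  # no matching timestamp: dropped, like an inner merge
        for thing in sd.iterfind("thing"):
            identity = id2name(thing.get("identity"))
            for loc in thing.iterfind("location"):
                rows.append({
                    "zone_id": zone_id,
                    "time": ts.get("time"),
                    "time_id": ts.get("time_id"),
                    "temperature": ts.get("temperature"),
                    "time-begin": sd.get("time-begin"),
                    "time-end": sd.get("time-end"),
                    "identity": identity,
                    "l_identity": loc.get("l_identity"),
                    "surface": (loc.text or "").strip(),
                })
    return rows


rows = parse_rows(SAMPLE)
print(len(rows))
print(rows[0])
```

With this shape you can extend a single list with the rows from every file and call pd.DataFrame on it once at the end, instead of building and merging two dataframes per file. Whether that is actually faster on your data is something to confirm with cProfile, as above.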