How do I optimize this XML parsing loop for speed?
I wrote some code that parses about a hundred XML files and builds a dataframe. The code works, but it takes quite a long time to run, a little under an hour. I'm sure there is a way to improve this loop, for instance by only touching the dataframe object at the end of the loop, and a triple-nested loop probably isn't needed to parse all of the information into a dataframe, but as a beginner this was the only way I managed to do it.
My code looks like this:
from bs4 import BeautifulSoup
import pandas as pd
import lxml
import json
import os

os.chdir(r"path_to_output_file/output_file")
f_list = os.listdir()

df_list = []
output_files = []

# checking we only iterate through XML files containing "calc_output"
for calc_output in f_list:
    if "calc_output" in calc_output and calc_output.endswith(".xml"):
        output_files.append(calc_output)

for calc_output in output_files:
    with open(calc_output, "r") as datas:
        print(f"reading file {calc_output} ...")
        doc = BeautifulSoup(datas.read(), "lxml")

    rows = []
    timestamps = doc.time.find_all("timestamp")
    for timestamp in timestamps:  # parsing through every timestamp element
        row = {}
        time = timestamp.get("time")  # reading timestamp attributes
        temperature = timestamp.get("temperature")
        zone_id = doc.zone.get("zone_id")
        time_id = timestamp.get("time_id")
        row.update({"time": time, "temperature": temperature, "time_id": time_id, "zone_id": zone_id})
        row_copy = row.copy()
        rows.append(row_copy)

    # creating temporary dataframe to combine with other info
    df1 = pd.DataFrame(rows)

    rows = []
    surfacedatas = doc.surfacehistory.find_all("surfacedata")
    for surfacedata in surfacedatas:
        # parsing through every surfacedata element
        time_begin = surfacedata.get("time-begin")
        time_end = surfacedata.get("time-end")
        row = {"time-begin": time_begin, "time-end": time_end}
        things = surfacedata.find_all("thing", recursive=False)
        # parsing through every thing in each surfacedata
        for thing in things:
            identity = id2name(thing.get("identity"))
            row.update({"identity": identity})
            locations = thing.find_all("location", recursive=False)
            for location in locations:
                # parsing through every location for every thing for each surfacedata
                l_identity = location.get("l_identity")
                surface = location.getText()
                row.update({"l_identity": l_identity, "surface": surface})
                row_copy = row.copy()
                rows.append(row_copy)

    df2 = pd.DataFrame(rows)  # second dataframe containing the information needed
    # merging each dataframe on every loop
    df = pd.merge(df1, df2, left_on="time_id", right_on="time-begin")
    # then appending it to a list
    df_list.append(df)

# final dataframe created by concatenating each dataframe from each output file
df = pd.concat(df_list)
df
An example of the XML files is:
File 1
<file filename="stack_example_1" created="today">
    <unit time="day" volume="cm3" surface="cm2"/>
    <zone zone_id="10">
        <time>
            <timestamp time_id="1" time="0" temperature="100"/>
            <timestamp time_id="2" time="10.00" temperature="200"/>
        </time>
        <surfacehistory type="calculation">
            <surfacedata time-begin="1" time-end="2">
                <thing identity="1">
                    <location l_identity="2"> 1.256</location>
                    <location l_identity="45"> 2.3</location>
                </thing>
                <thing identity="3">
                    <location l_identity="2"> 1.6</location>
                    <location l_identity="5"> 2.5</location>
                    <location l_identity="78"> 3.2</location>
                </thing>
            </surfacedata>
            <surfacedata time-begin="2" time-end="3">
                <thing identity="1">
                    <location l_identity="17"> 2.4</location>
                </thing>
            </surfacedata>
        </surfacehistory>
    </zone>
</file>
File 2
<file filename="stack_example_2" created="today">
    <unit time="day" volume="cm3" surface="cm2"/>
    <zone zone_id="11">
        <time>
            <timestamp time_id="1" time="0" temperature="100"/>
            <timestamp time_id="2" time="10.00" temperature="200"/>
        </time>
        <surfacehistory type="calculation">
            <surfacedata time-begin="1" time-end="2">
                <thing identity="1">
                    <location l_identity="2"> 1.6</location>
                    <location l_identity="45"> 2.6</location>
                </thing>
                <thing identity="3">
                    <location l_identity="2"> 1.4</location>
                    <location l_identity="8"> 2.7</location>
                </thing>
            </surfacedata>
            <surfacedata time-begin="2" time-end="3">
                <thing identity="1">
                    <location l_identity="9"> 2.8</location>
                    <location l_identity="17"> 1.2</location>
                </thing>
            </surfacedata>
        </surfacehistory>
    </zone>
</file>
The output of this code using File 1 and File 2 is:
zone_id  time  time_id  temperature  time-begin  time-end  identity  l_identity  surface
10       0     1        100          1           2         1         2           1.256
10       0     1        100          1           2         1         45          2.3
10       0     1        100          1           2         3         2           1.6
10       0     1        100          1           2         3         5           2.5
10       0     1        100          1           2         3         78          3.2
10       10    2        200          2           3         1         17          2.4
11       0     1        100          1           2         1         2           1.6
11       0     1        100          1           2         1         45          2.6
11       0     1        100          1           2         3         2           1.4
11       0     1        100          1           2         3         8           2.7
11       10    2        200          2           3         1         9           2.8
11       10    2        200          2           3         1         17          1.2
Here is the output obtained after running cProfile:
Ordered by: internal time
List reduced from 6281 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
214204 95.337 0.000 95.340 0.000 C:\Users\anon\Anaconda3\lib\json\decoder.py:343(raw_decode)
214389 20.685 0.000 21.386 0.000 {built-in method io.open}
214288 17.945 0.000 17.945 0.000 {built-in method _codecs.charmap_decode}
1 16.745 16.745 336.360 336.360 .\anon_programm.py:7(<module>)
10 15.378 1.538 132.814 13.281 C:\Users\anon\Anaconda3\lib\site-packages\bs4\builder\_lxml.py:330(feed)
10277616 12.975 0.000 44.266 0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\__init__.py:555(endData)
214228 12.504 0.000 30.575 0.000 {method 'read' of '_io.TextIOWrapper' objects}
3425862 11.257 0.000 75.608 0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\builder\_lxml.py:223(start)
6851244 10.806 0.000 19.427 0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\__init__.py:589(object_was_parsed)
17128360 8.580 0.000 8.580 0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\element.py:158(setup)
3425862 8.389 0.000 8.694 0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\__init__.py:527(popTag)
5961888 7.170 0.000 7.170 0.000 {method 'keys' of 'dict' objects}
3425872 7.072 0.000 23.054 0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\element.py:1152(__init__)
214200 5.978 0.000 146.468 0.001 .\anon_programm.py:18(id2name)
3425862 5.913 0.000 61.118 0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\__init__.py:691(handle_starttag)
3425002 4.482 0.000 12.571 0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\builder\__init__.py:285(_replace_cdata_list_attribute_values)
3425862 4.326 0.000 37.251 0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\builder\_lxml.py:278(end)
3425862 4.244 0.000 13.552 0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\__init__.py:657(_popToTag)
2751774 4.240 0.000 6.154 0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\element.py:808(<genexpr>)
6851244 3.869 0.000 8.629 0.000 C:\Users\anon\Anaconda3\lib\site-packages\bs4\element.py:932(__new__)
Here is the function that is called many times in the loop:
import functools

@functools.lru_cache(maxsize=1000)
def id2name(id):
    name_Dict = json.loads(open(r"path_to_JSON_file\file.json", "r").read())
    name = ""
    if id.isnumeric():
        partial_id = id[:-1]
        if partial_id not in name_Dict.keys():
            return id
        if id[-1] == "0":
            return name_Dict[partial_id]
        else:
            return name_Dict[partial_id] + "x" + id[-1]
    else:
        return ""
As pointed out in the comments on your question, most of the time is spent decoding the JSON in your id2name function. While the function's results are cached, the parsed JSON object is not, which means you load the JSON file from disk and parse it again every time a new ID is looked up.
Assuming you load the same JSON file every time, this means you should get an immediate speed-up by caching the parsed JSON object. You can do that by restructuring your id2name function as follows.
import functools

@functools.lru_cache()
def load_name_dict():
    with open(r"path_to_JSON_file\file.json", "r", encoding="utf-8") as f:
        return json.load(f)

@functools.lru_cache(maxsize=1000)
def id2name(thing_id):
    if not thing_id.isnumeric():
        return ""
    name_dict = load_name_dict()
    name = name_dict.get(thing_id[:-1])
    if name is None:
        return thing_id
    last_char = thing_id[-1]
    if last_char == "0":
        return name
    else:
        return name + "x" + last_char
Note that I refactored id2name so that the JSON object is not loaded at all when the ID is non-numeric. I also changed it to use the .get method instead of in, which avoids an unnecessary dictionary lookup. Furthermore, I renamed id to thing_id, since id is a built-in function in Python.
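A quick illustration of the effect (the IDs here are invented for the example):

id2name("10")   # first call: load_name_dict() opens and parses file.json, once
id2name("31")   # new ID, but the parsed dict is reused from load_name_dict's cache
id2name("10")   # repeated ID: answered straight from id2name's own lru_cache

With both caches in place, the file is read and parsed at most once per process, no matter how many distinct IDs the loop encounters.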
Furthermore, since your input files appear to be valid XML, you can save more time by using lxml directly rather than going through BeautifulSoup. Or better yet, you may be able to load the XML directly into a dataframe with pandas.read_xml. One caveat, though: you should profile the resulting code to check that it actually runs faster, rather than taking my word for it. Intuition about performance is notoriously unreliable; always measure.
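To make the lxml suggestion concrete, here is a minimal sketch of the direct approach. It is a sketch under assumptions, not a drop-in replacement: the tag and attribute names come from the File 1/File 2 examples, the glob pattern is guessed from your filename filter, and id2name is the cached version above. It builds plain dicts and creates a single dataframe once at the very end, replacing the per-file merge with a dictionary lookup on time_id.

import glob
import pandas as pd
from lxml import etree

rows = []
for path in glob.glob("*calc_output*.xml"):
    root = etree.parse(path).getroot()
    zone_id = root.find("zone").get("zone_id")
    # index timestamp elements by time_id so no dataframe merge is needed
    timestamps = {ts.get("time_id"): ts for ts in root.iter("timestamp")}
    for surfacedata in root.iter("surfacedata"):
        ts = timestamps[surfacedata.get("time-begin")]
        for thing in surfacedata.iterfind("thing"):
            identity = id2name(thing.get("identity"))
            for location in thing.iterfind("location"):
                rows.append({
                    "zone_id": zone_id,
                    "time": ts.get("time"),
                    "time_id": ts.get("time_id"),
                    "temperature": ts.get("temperature"),
                    "time-begin": surfacedata.get("time-begin"),
                    "time-end": surfacedata.get("time-end"),
                    "identity": identity,
                    "l_identity": location.get("l_identity"),
                    "surface": location.text.strip(),
                })

df = pd.DataFrame(rows)  # one dataframe, built once at the end

As far as I know, pandas.read_xml (available since pandas 1.3) is convenient for flat, repeated elements such as timestamp, but it does not carry parent attributes down to child rows, so the nested surfacedata/thing/location hierarchy would still need a manual walk like the one above.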