Extracting subset of data from multilevel XML when variables have the same name
I have a large amount of XML data that looks like this (only a small part of the data is shown):
<weatherdata xmlns:xsi="http://www.website.com" xsi:noNamespaceSchemaLocation="www.website.com" created="2020-07-06T14:53:48Z">
<meta>
<model name="xxxxxx" termin="2020-07-06T06:00:00Z" runended="2020-07-06T09:48:31Z" nextrun="2020-07-06T16:00:00Z" from="2020-07-06T15:00:00Z" to="2020-07-08T12:00:00Z"/>
<model name="xxxxxx" termin="2020-07-06T00:00:00Z" runended="2020-07-06T09:48:31Z" nextrun="2020-07-06T18:00:00Z" from="2020-07-08T13:00:00Z" to="2020-07-09T18:00:00Z"/>
<model name="xxxxxx" termin="2020-07-06T00:00:00Z" runended="2020-07-06T09:48:31Z" nextrun="2020-07-06T18:00:00Z" from="2020-07-09T21:00:00Z" to="2020-07-12T00:00:00Z"/>
<model name="xxxxxx" termin="2020-07-06T00:00:00Z" runended="2020-07-06T09:48:31Z" nextrun="2020-07-06T18:00:00Z" from="2020-07-12T06:00:00Z" to="2020-07-16T00:00:00Z"/>
</meta>
<product class="pointData">
<time datatype="forecast" from="2020-07-06T15:00:00Z" to="2020-07-06T15:00:00Z">
<location altitude="10" latitude="123" longitude="123">
<temperature id="TTT" unit="celsius" value="18.8"/>
<windDirection id="dd" deg="296.5" name="NW"/>
<windSpeed id="ff" mps="5.8" beaufort="4" name="Laber bris"/>
<globalRadiation value="524.2" unit="W/m^2"/>
<humidity value="59.0" unit="percent"/>
<pressure id="pr" unit="hPa" value="1022.9"/>
<cloudiness id="NN" percent="22.7"/>
<lowClouds id="LOW" percent="22.7"/>
<mediumClouds id="MEDIUM" percent="0.0"/>
<highClouds id="HIGH" percent="0.0"/>
<dewpointTemperature id="TD" unit="celsius" value="10.6"/>
</location>
</time>
<time datatype="forecast" from="2020-07-06T14:00:00Z" to="2020-07-06T15:00:00Z">
<location altitude="10" latitude="123" longitude="123">
<precipitation unit="mm" value="0.0" minvalue="0.0" maxvalue="0.0" probability="2.0"/>
<symbol id="LightCloud" number="2"/>
</location>
</time>
<time datatype="forecast" from="2020-07-06T16:00:00Z" to="2020-07-06T16:00:00Z">
<location altitude="10" latitude="123" longitude="123">
<temperature id="TTT" unit="celsius" value="19.4"/>
<windDirection id="dd" deg="291.6" name="W"/>
<windSpeed id="ff" mps="6.3" beaufort="4" name="Laber bris"/>
<globalRadiation value="645.3" unit="W/m^2"/>
<humidity value="55.7" unit="percent"/>
<pressure id="pr" unit="hPa" value="1022.8"/>
<cloudiness id="NN" percent="47.5"/>
<lowClouds id="LOW" percent="47.5"/>
<mediumClouds id="MEDIUM" percent="0.0"/>
<highClouds id="HIGH" percent="0.1"/>
<dewpointTemperature id="TD" unit="celsius" value="10.3"/>
</location>
</time>
<time datatype="forecast" from="2020-07-06T15:00:00Z" to="2020-07-06T16:00:00Z">
<location altitude="10" latitude="123" longitude="123">
<precipitation unit="mm" value="0.0" minvalue="0.0" maxvalue="0.0" probability="2.2"/>
<symbol id="PartlyCloud" number="3"/>
</location>
</time>
I want to extract the environmental data and put it into a pandas data frame. I can do this as follows:
    import xml.etree.ElementTree as et
    import pandas as pd

    tree = et.parse('data.xml')  # load in the data (file name must be a string)
    root = tree.getroot()        # get the element tree root

    # collect the 'value' attribute of every <temperature> element
    celsius = []
    for x in root.iter('temperature'):
        value = x.attrib.get('value')
        celsius.append(value)

    tempdf = pd.DataFrame(celsius, columns=['Temperature (C)'])
    tempdf
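This gives me a data frame with 114 rows.
I can then repeat this for all the other variables of interest and join them together with pd.concat. The problem is that each of the 114 data blocks has two 'time' elements, because 'precipitation' has its own timestamp. When I try to parse the time data like this: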
    time = []
    for x in root.iter('time'):
        value = x.attrib.get('to')
        time.append(value)

    timedf = pd.DataFrame(time, columns=['Date & Time'])
    timedf
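This doubles the number of rows (228 instead of 114).
I cannot concatenate the time data frame with the others, because it has twice as many rows as the other variables. I only want to select the first time element from each of the 114 instances, i.e. I want to keep
    time datatype="forecast" from="2020-07-06T15:00:00Z" to="2020-07-06T15:00:00Z"
and skip the second one, which is used for precipitation:
    time datatype="forecast" from="2020-07-06T14:00:00Z" to="2020-07-06T15:00:00Z"
I tried: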
    time = []
    for x in root.iter('time')[0]:
        value = x.attrib.get('to')
        time.append(value)
But this does not work, and I am not sure how to do this when the variable names are the same in every hour of data. Any help would be much appreciated.
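Consider building the temperature data frame and the precipitation data frame separately with concat, then merge them on the time and location nodes. Also consider using list/dict comprehensions to bind all attribute values together.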
    import xml.etree.ElementTree as et
    import pandas as pd

    tree = et.parse('Input.xml')   # load in the data
    root = tree.getroot()          # get the element tree root

    temp_list = []; precip_list = []

    for n, x in enumerate(root.iter('time')):
        # GET LIST OF DICTIONARIES OF ALL ATTRIBUTES
        x_list = [{i.tag+'_'+k:v for k,v in i.attrib.items()} for i in x.iter('*')]

        # COMBINE INTO SINGLE DICTIONARY
        x_dict = {k:v for d in x_list for k,v in d.items()}

        # BUILD DATA FRAME
        df = pd.DataFrame(x_dict, index=[0])

        # SEPARATELY SAVE TO LIST OF DATA FRAMES
        if 'temperature_unit' in df.columns: temp_list.append(df)
        if 'precipitation_unit' in df.columns: precip_list.append(df)

    # MERGE CONCATENATED SETS BY COMMON VARS
    df = pd.merge(pd.concat(temp_list),
                  pd.concat(precip_list),
                  on=['time_to', 'time_datatype',
                      'location_altitude', 'location_latitude',
                      'location_longitude'],
                  suffixes=['_t','_p'])
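If only the timestamps of the temperature blocks are needed, a more direct route may be to skip every time element that has no temperature child, rather than indexing the iterator. The following is a minimal sketch, not a definitive implementation. It assumes each temperature sits at time/location/temperature as in the sample, and the inline xml_text is a shortened, hypothetical stand-in for the real file (with closing tags added so that it parses).

```python
import xml.etree.ElementTree as et
import pandas as pd

# Hypothetical, shortened stand-in for the real file; same shape as the sample.
xml_text = """<weatherdata>
  <product class="pointData">
    <time datatype="forecast" from="2020-07-06T15:00:00Z" to="2020-07-06T15:00:00Z">
      <location altitude="10" latitude="123" longitude="123">
        <temperature id="TTT" unit="celsius" value="18.8"/>
      </location>
    </time>
    <time datatype="forecast" from="2020-07-06T14:00:00Z" to="2020-07-06T15:00:00Z">
      <location altitude="10" latitude="123" longitude="123">
        <precipitation unit="mm" value="0.0"/>
      </location>
    </time>
    <time datatype="forecast" from="2020-07-06T16:00:00Z" to="2020-07-06T16:00:00Z">
      <location altitude="10" latitude="123" longitude="123">
        <temperature id="TTT" unit="celsius" value="19.4"/>
      </location>
    </time>
    <time datatype="forecast" from="2020-07-06T15:00:00Z" to="2020-07-06T16:00:00Z">
      <location altitude="10" latitude="123" longitude="123">
        <precipitation unit="mm" value="0.0"/>
      </location>
    </time>
  </product>
</weatherdata>"""

root = et.fromstring(xml_text)

rows = []
for t in root.iter('time'):
    # Look for a <temperature> under this <time>'s <location>.
    temp = t.find('location/temperature')
    if temp is None:
        continue  # precipitation-only block: skip its timestamp
    rows.append({'Date & Time': t.get('to'),
                 'Temperature (C)': float(temp.get('value'))})

# One row per temperature block, so the time column lines up with the values.
timedf = pd.DataFrame(rows)
```

Because the timestamp and the value are read from the same time element, the two columns cannot drift out of step, and no later concat on row position is needed. For the real data, replace et.fromstring(xml_text) with et.parse(...).getroot().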