XML to Dataframe, parsing issues
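Using the XML2 package, I'm having trouble parsing the data so that I can organize it into a dataframe. The "header" data at the beginning is confusing xmlParse. I just want to get the information in all the REPORT_DATA elements into a dataframe.
I have code that gets the file below into a temp file just fine, but manipulating it from there is the issue for me. I'm new to XML data wrangling.
Here is a sample of said XML file: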
<?xml version="1.0" encoding="UTF-8"?><OASISReport xmlns="http://www.caiso.com/soa/OASISReport_v1.xsd">
<MessageHeader>
<TimeDate>2020-07-18T19:13:48-00:00</TimeDate>
<Source>OASIS</Source>
<Version>v20140401</Version>
</MessageHeader>
<MessagePayload>
<RTO>
<name>CAISO</name>
<REPORT_ITEM>
<REPORT_HEADER>
<SYSTEM>OASIS</SYSTEM>
<TZ>PPT</TZ>
<REPORT>SLD_REN_FCST</REPORT>
<MKT_TYPE>RTPD</MKT_TYPE>
<UOM>MW</UOM>
<INTERVAL>ENDING</INTERVAL>
<SEC_PER_INTERVAL>900</SEC_PER_INTERVAL>
</REPORT_HEADER>
<REPORT_DATA>
<DATA_ITEM>RENEW_FCST_15MIN_MW</DATA_ITEM>
<OPR_DATE>2020-07-10</OPR_DATE>
<INTERVAL_NUM>81</INTERVAL_NUM>
<INTERVAL_START_GMT>2020-07-11T03:00:00-00:00</INTERVAL_START_GMT>
<INTERVAL_END_GMT>2020-07-11T03:15:00-00:00</INTERVAL_END_GMT>
<VALUE>11.38</VALUE>
<TRADING_HUB>NP15</TRADING_HUB>
<RENEWABLE_TYPE>Solar</RENEWABLE_TYPE>
</REPORT_DATA>
<REPORT_DATA>
<DATA_ITEM>RENEW_FCST_15MIN_MW</DATA_ITEM>
<OPR_DATE>2020-07-10</OPR_DATE>
<INTERVAL_NUM>83</INTERVAL_NUM>
<INTERVAL_START_GMT>2020-07-11T03:30:00-00:00</INTERVAL_START_GMT>
<INTERVAL_END_GMT>2020-07-11T03:45:00-00:00</INTERVAL_END_GMT>
<VALUE>0</VALUE>
<TRADING_HUB>NP15</TRADING_HUB>
<RENEWABLE_TYPE>Solar</RENEWABLE_TYPE>
</REPORT_DATA>
<REPORT_DATA>
<DATA_ITEM>RENEW_FCST_15MIN_MW</DATA_ITEM>
<OPR_DATE>2020-07-10</OPR_DATE>
<INTERVAL_NUM>80</INTERVAL_NUM>
<INTERVAL_START_GMT>2020-07-11T02:45:00-00:00</INTERVAL_START_GMT>
<INTERVAL_END_GMT>2020-07-11T03:00:00-00:00</INTERVAL_END_GMT>
<VALUE>56.89</VALUE>
<TRADING_HUB>NP15</TRADING_HUB>
<RENEWABLE_TYPE>Solar</RENEWABLE_TYPE>
</REPORT_DATA>
</REPORT_ITEM>
<DISCLAIMER_ITEM>
<DISCLAIMER>The contents of these pages are subject to change without notice. Decisions based on information contained within the California ISO's web site are the visitor's sole responsibility.</DISCLAIMER>
</DISCLAIMER_ITEM>
</RTO>
</MessagePayload>
</OASISReport>
The desired df would look like this:
data_item opr_date interval_num value trading_hub renewable_type
RENEW_FCST_15MIN_MW 2020-07-10 81 11.38 NP15 Solar
RENEW_FCST_15MIN_MW 2020-07-10 83 0 NP15 Solar
. . . . . .
. . . . . .
So far I've done this:
library(XML)  # xmlParse() and xmlToDataFrame() come from the XML package
test <- xmlParse(file = "/tmp/datafile.xml")
data <- xmlToDataFrame(test)
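This doesn't give me what I want. It crams all the actual data into a single cell labeled RTO. I've also looked at the xml2 documentation and tinkered with some of its functions, but couldn't get them to extract just the REPORT_DATA attributes and data.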
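A likely explanation for the single RTO cell, offered as a sketch rather than a confirmed diagnosis: when xmlToDataFrame() is given a whole document, it appears to treat each child of the root as a row and each grandchild as a column, so everything under MessagePayload collapses into one RTO value. The root element also declares a default namespace, which means an XPath query has to name that namespace before it can match REPORT_DATA. The code below uses a trimmed copy of the sample file; the variable names and the prefix `o` are arbitrary choices for illustration.

```r
library(XML)

# Trimmed copy of the sample document, inlined so the sketch is self-contained.
sample_xml <- '<OASISReport xmlns="http://www.caiso.com/soa/OASISReport_v1.xsd">
<MessageHeader><Source>OASIS</Source></MessageHeader>
<MessagePayload><RTO><name>CAISO</name><REPORT_ITEM>
<REPORT_DATA><INTERVAL_NUM>81</INTERVAL_NUM><VALUE>11.38</VALUE></REPORT_DATA>
<REPORT_DATA><INTERVAL_NUM>83</INTERVAL_NUM><VALUE>0</VALUE></REPORT_DATA>
</REPORT_ITEM></RTO></MessagePayload>
</OASISReport>'

doc <- xmlParse(sample_xml, asText = TRUE)

# Element children of the root: these are what a plain xmlToDataFrame(doc)
# would try to turn into rows.
row_nodes <- xpathSApply(doc, "/*/*", xmlName)

# Bind an arbitrary prefix ("o") to the document's default namespace URI
# and use it in the XPath expression.
ns <- c(o = "http://www.caiso.com/soa/OASISReport_v1.xsd")
report_nodes <- getNodeSet(doc, "//o:REPORT_DATA", namespaces = ns)

# One row per REPORT_DATA node, one column per child element.
df <- xmlToDataFrame(nodes = report_nodes, stringsAsFactors = FALSE)
print(row_nodes)
print(df)
```

All columns come back as character, so numeric fields such as VALUE would still need as.numeric() afterwards.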
This worked for me:
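library(XML)
# local-name() matches on the element name alone, ignoring the document's default namespace
test <- xmlParse(file = "data/20200710_20200711_SLD_REN_FCST_RTPD_20200718_14_49_12_v1.xml")
data <- xmlToDataFrame(xpathApply(test, '//*[local-name() = "REPORT_DATA"]'))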
Which produces this:
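Since the question mentions xml2, here is a rough equivalent with that package, as a sketch under a few assumptions. xml2 exposes a document's default namespace under the generated prefix `d1` (via xml_ns()), so the XPath uses `d1:` instead of local-name(). The inlined `sample_xml` is a trimmed copy of the sample file, and the `fields` vector is simply the set of columns shown in the desired df, lower-cased to match it.

```r
library(xml2)

# Trimmed copy of the sample document, inlined so the sketch is self-contained.
sample_xml <- '<OASISReport xmlns="http://www.caiso.com/soa/OASISReport_v1.xsd">
<MessagePayload><RTO><name>CAISO</name><REPORT_ITEM>
<REPORT_DATA>
<DATA_ITEM>RENEW_FCST_15MIN_MW</DATA_ITEM>
<OPR_DATE>2020-07-10</OPR_DATE>
<INTERVAL_NUM>81</INTERVAL_NUM>
<VALUE>11.38</VALUE>
<TRADING_HUB>NP15</TRADING_HUB>
<RENEWABLE_TYPE>Solar</RENEWABLE_TYPE>
</REPORT_DATA>
<REPORT_DATA>
<DATA_ITEM>RENEW_FCST_15MIN_MW</DATA_ITEM>
<OPR_DATE>2020-07-10</OPR_DATE>
<INTERVAL_NUM>83</INTERVAL_NUM>
<VALUE>0</VALUE>
<TRADING_HUB>NP15</TRADING_HUB>
<RENEWABLE_TYPE>Solar</RENEWABLE_TYPE>
</REPORT_DATA>
</REPORT_ITEM></RTO></MessagePayload>
</OASISReport>'

doc <- read_xml(sample_xml)
ns <- xml_ns(doc)  # the default namespace is listed under the prefix "d1"

# All REPORT_DATA elements, wherever they sit in the tree.
records <- xml_find_all(doc, ".//d1:REPORT_DATA", ns)

# Columns wanted in the final data frame (taken from the desired df).
fields <- c("DATA_ITEM", "OPR_DATE", "INTERVAL_NUM",
            "VALUE", "TRADING_HUB", "RENEWABLE_TYPE")

# Pull each field out of every record; xml_find_first() returns one
# result per record, so each column has one entry per row.
cols <- lapply(fields, function(f) {
  xml_text(xml_find_first(records, paste0("./d1:", f), ns))
})
names(cols) <- tolower(fields)
df <- as.data.frame(cols, stringsAsFactors = FALSE)

# Everything is read as text, so convert the typed columns explicitly.
df$opr_date <- as.Date(df$opr_date)
df$interval_num <- as.integer(df$interval_num)
df$value <- as.numeric(df$value)
print(df)
```

Selecting fields by name keeps the result stable if a record is missing a field, at the cost of listing the columns by hand.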