XML to Dataframe, parsing issues

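Using the XML2 package, I am having problems parsing data to organize it into a data frame. The "header" data at the beginning is confusing xmlParse. I just want the information from all the REPORT_DATA elements in a data frame.

I have code that gets the file below into a temp file just fine, but manipulating it from there is the problem for me. I am new to XML data-wrangling.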

URL: http://oasis.caiso.com/oasisapi/SingleZip?queryname=SLD_REN_FCST&market_run_id=RTPD&startdatetime=20200711T00:00-0000&enddatetime=20200712T00:00-0000&version=1

Here is a sample of the above XML file:

<?xml version="1.0" encoding="UTF-8"?><OASISReport xmlns="http://www.caiso.com/soa/OASISReport_v1.xsd">
<MessageHeader>
<TimeDate>2020-07-18T19:13:48-00:00</TimeDate>
<Source>OASIS</Source>
<Version>v20140401</Version>
</MessageHeader>
<MessagePayload>
<RTO>
<name>CAISO</name>
<REPORT_ITEM>
<REPORT_HEADER>
<SYSTEM>OASIS</SYSTEM>
<TZ>PPT</TZ>
<REPORT>SLD_REN_FCST</REPORT>
<MKT_TYPE>RTPD</MKT_TYPE>
<UOM>MW</UOM>
<INTERVAL>ENDING</INTERVAL>
<SEC_PER_INTERVAL>900</SEC_PER_INTERVAL>
</REPORT_HEADER>
<REPORT_DATA>
<DATA_ITEM>RENEW_FCST_15MIN_MW</DATA_ITEM>
<OPR_DATE>2020-07-10</OPR_DATE>
<INTERVAL_NUM>81</INTERVAL_NUM>
<INTERVAL_START_GMT>2020-07-11T03:00:00-00:00</INTERVAL_START_GMT>
<INTERVAL_END_GMT>2020-07-11T03:15:00-00:00</INTERVAL_END_GMT>
<VALUE>11.38</VALUE>
<TRADING_HUB>NP15</TRADING_HUB>
<RENEWABLE_TYPE>Solar</RENEWABLE_TYPE>
</REPORT_DATA>
<REPORT_DATA>
<DATA_ITEM>RENEW_FCST_15MIN_MW</DATA_ITEM>
<OPR_DATE>2020-07-10</OPR_DATE>
<INTERVAL_NUM>83</INTERVAL_NUM>
<INTERVAL_START_GMT>2020-07-11T03:30:00-00:00</INTERVAL_START_GMT>
<INTERVAL_END_GMT>2020-07-11T03:45:00-00:00</INTERVAL_END_GMT>
<VALUE>0</VALUE>
<TRADING_HUB>NP15</TRADING_HUB>
<RENEWABLE_TYPE>Solar</RENEWABLE_TYPE>
</REPORT_DATA>
<REPORT_DATA>
<DATA_ITEM>RENEW_FCST_15MIN_MW</DATA_ITEM>
<OPR_DATE>2020-07-10</OPR_DATE>
<INTERVAL_NUM>80</INTERVAL_NUM>
<INTERVAL_START_GMT>2020-07-11T02:45:00-00:00</INTERVAL_START_GMT>
<INTERVAL_END_GMT>2020-07-11T03:00:00-00:00</INTERVAL_END_GMT>
<VALUE>56.89</VALUE>
<TRADING_HUB>NP15</TRADING_HUB>
<RENEWABLE_TYPE>Solar</RENEWABLE_TYPE>
</REPORT_DATA>
</REPORT_ITEM>
<DISCLAIMER_ITEM>
<DISCLAIMER>The contents of these pages are subject to change without notice.  Decisions based on information contained within the California ISO's web site are the visitor's sole responsibility.</DISCLAIMER>
</DISCLAIMER_ITEM>
</RTO>
</MessagePayload>
</OASISReport>

The df would look like this:

      data_item             opr_date      interval_num      value     trading_hub     renewable_type
   RENEW_FCST_15MIN_MW     2020-07-10         81             11.38      NP15             Solar
   RENEW_FCST_15MIN_MW     2020-07-10         83             0          NP15             Solar
          .                    .              .               .          .                 . 
          .                    .              .               .          .                 .

So far I have done this:

library(XML)  # xmlParse() and xmlToDataFrame() come from the XML package
test <- xmlParse(file = "/tmp/datafile.xml")
data <- xmlToDataFrame(test)

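This doesn't give me what I want. It crams all the actual data into a single cell labeled RTO. I have also looked through the xml2 documentation and tinkered with some of its functions, but I could not get them to extract only the REPORT_DATA elements and their data.

This worked for me: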

test <- xmlParse(file = "data/20200710_20200711_SLD_REN_FCST_RTPD_20200718_14_49_12_v1.xml")
data <- xmlToDataFrame(xpathApply(test, '//*[local-name() = "REPORT_DATA"]'))
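
The local-name() test matters because the root element declares a default namespace (xmlns="http://www.caiso.com/soa/OASISReport_v1.xsd"), so a plain '//REPORT_DATA' XPath matches nothing. Since the question mentions xml2, here is a rough sketch of the same idea with that package. It assumes xml2 is installed, inlines a trimmed copy of the sample so it runs on its own, and uses names of my own choosing (xml_txt, records, rows, df); treat it as one possible approach rather than the definitive one.

```r
library(xml2)

# Trimmed copy of the sample from the question, inlined for a self-contained run.
xml_txt <- '<OASISReport xmlns="http://www.caiso.com/soa/OASISReport_v1.xsd">
<MessagePayload><RTO><name>CAISO</name><REPORT_ITEM>
<REPORT_DATA>
<DATA_ITEM>RENEW_FCST_15MIN_MW</DATA_ITEM>
<OPR_DATE>2020-07-10</OPR_DATE>
<INTERVAL_NUM>81</INTERVAL_NUM>
<VALUE>11.38</VALUE>
<TRADING_HUB>NP15</TRADING_HUB>
<RENEWABLE_TYPE>Solar</RENEWABLE_TYPE>
</REPORT_DATA>
<REPORT_DATA>
<DATA_ITEM>RENEW_FCST_15MIN_MW</DATA_ITEM>
<OPR_DATE>2020-07-10</OPR_DATE>
<INTERVAL_NUM>83</INTERVAL_NUM>
<VALUE>0</VALUE>
<TRADING_HUB>NP15</TRADING_HUB>
<RENEWABLE_TYPE>Solar</RENEWABLE_TYPE>
</REPORT_DATA>
</REPORT_ITEM></RTO></MessagePayload>
</OASISReport>'

doc <- read_xml(xml_txt)

# Match on local-name() so the default namespace does not get in the way.
records <- xml_find_all(doc, '//*[local-name() = "REPORT_DATA"]')

# One row per REPORT_DATA node: child tag names become lower-case column
# names, and the text of each child becomes the cell value.
rows <- lapply(records, function(node) {
  fields <- xml_children(node)
  as.list(setNames(xml_text(fields), tolower(xml_name(fields))))
})
df <- do.call(rbind, lapply(rows, as.data.frame, stringsAsFactors = FALSE))

# Everything is read as text, so convert the numeric columns explicitly.
df$interval_num <- as.integer(df$interval_num)
df$value <- as.numeric(df$value)
```

For the real file, read_xml() can be pointed at the downloaded path instead of xml_txt. Binding rows this way assumes every REPORT_DATA node carries the same child elements in the same order, which holds for the sample shown but is not guaranteed for other reports.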

Which produces this: