R: parsing an XML tree with hierarchical data into a data frame
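I am trying to parse some XML documents into data frames in R with the XML package. What I want to do is flatten the XML tree so that I get one row in the data frame for each child. I also want each row to contain the data from its parent.
Example: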
<xml>
<eventlist>
<event>
<ProcessIndex>1063</ProcessIndex>
<Time_of_Day>2:54:20.2959537 PM</Time_of_Day>
<Process_Name>chrome.exe</Process_Name>
<PID>12164</PID>
<Operation>ReadFile</Operation>
<Result>SUCCESS</Result>
<Detail>Offset: 1,684,224, Length: 256</Detail>
<stack>
<frame>
<depth>0</depth>
<address>0xfffff8038683667c</address>
<path>C:\WINDOWS\System32\drivers\FLTMGR.SYS</path>
<location>FltDecodeParameters + 0x1a6c</location>
</frame>
<frame>
<depth>1</depth>
<address>0xfffff80386834e13</address>
<path>C:\WINDOWS\System32\drivers\FLTMGR.SYS</path>
<location>FltDecodeParameters + 0x203</location>
</frame>
<frame>
<depth>3</depth>
<address>0x7ffea54ffac1</address>
<path>C:\WINDOWS\SYSTEM32\ntdll.dll</path>
<location>RtlUserThreadStart + 0x21</location>
</frame>
</stack>
</event>
<event>
<ProcessIndex>1063</ProcessIndex>
<Time_of_Day>2:54:20.2960270 PM</Time_of_Day>
<Process_Name>chrome.exe</Process_Name>
<PID>12164</PID>
<Operation>WriteFile</Operation>
<Result>SUCCESS</Result>
<Detail>Offset: 103,016, Length: 36</Detail>
<stack>
<frame>
<depth>0</depth>
<address>0xfffff8038683667c</address>
<path>C:\WINDOWS\System32\drivers\FLTMGR.SYS</path>
<location>FltDecodeParameters + 0x1a6c</location>
</frame>
<frame>
<depth>1</depth>
<address>0xfffff80386834e13</address>
<path>C:\WINDOWS\System32\drivers\FLTMGR.SYS</path>
<location>FltDecodeParameters + 0x203</location>
</frame>
<frame>
<depth>26</depth>
<address>0x7ffea54ffac1</address>
<path>C:\WINDOWS\SYSTEM32\ntdll.dll</path>
<location>RtlUserThreadStart + 0x21</location>
</frame>
</stack>
</event>
</eventlist>
</xml>
The result I want is:
ProcessIndex Time_of_Day Process_Name PID Operation Result depth address path location
1063 2:54:20 chrome.exe 12164 ReadFile SUCCESS 0 0xfffff.. C:\WINDOWS\System32\driv... FltDecodeParameters + 0x1a6c
1063 2:54:20 chrome.exe 12164 ReadFile SUCCESS 1 0xfffff.. C:\WINDOWS\System32\driv... FltDecodeParameters + 0x203
1063 2:54:20 chrome.exe 12164 ReadFile SUCCESS 2 0xfffff.. C:\WINDOWS\System32\driv... RtlUserThreadStart + 0x21
1063 2:54:20 chrome.exe 12164 WriteFile SUCCESS 0 0xfffff.. C:\WINDOWS\System32\driv... FltDecodeParameters + 0x1a6c
1063 2:54:20 chrome.exe 12164 WriteFile SUCCESS 1 0xfffff.. C:\WINDOWS\System32\driv... FltDecodeParameters + 0x203
1063 2:54:20 chrome.exe 12164 WriteFile SUCCESS 2 0xfffff.. C:\WINDOWS\System32\driv... RtlUserThreadStart + 0x21
I tried using the XML package and xmlToDataFrame:
xmldf_events_stack <- xmlToDataFrame(nodes=getNodeSet(data_xml_2,"//eventlist/event/stack/frame"))
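But this only gives me the flattened frames, without the parent data. Also, if I parse the event data into a data frame, all the XML tags are stripped from the frame fields, so I cannot parse them afterwards.
Any help or a pointer in the right direction would be much appreciated.
I solved the problem. I am sure there are more elegant ways to do this, but here is what I did. I hope it helps someone in the future.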
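library(XML)
library(plyr)  # provides rbind.fill

df <- do.call(rbind.fill, lapply(data_xml_2['//eventlist/event'], function(x) {
  # Element names, taken as the node that precedes each text node
  names <- xpathSApply(x, './/.', xmlName)
  names <- names[which(names == "text") - 1]
  values <- xpathSApply(x, ".//text()", xmlValue)
  # The first 7 values belong to the event; the rest are frames of 4 fields each
  framevalues <- values[8:length(values)]
  framevalues <- matrix(framevalues, ncol = 4, byrow = TRUE)
  retvalues <- framevalues
  # Prepend the 7 event values to every frame row
  for (i in 7:1) {
    retvalues <- cbind(values[i], retvalues)
  }
  # 7 event fields + 4 frame fields = 11 columns
  colnames(retvalues) <- names[1:11]
  return(as.data.frame(retvalues))
}))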
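Consider parsing by node index, [##], and merging each parent with its children inside lapply, so that the resulting list of data frames can be row-bound at the end: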
library(XML)

doc <- xmlParse("/path/to/XML/file.xml")
xml_len <- length(getNodeSet(doc, "//eventlist/event"))

# seq_len() yields an empty sequence when there are no events
dflist <- lapply(seq_len(xml_len), function(i) {
  # PARENT NODES
  d1 <- transform(xmlToDataFrame(nodes = getNodeSet(doc, paste0("//eventlist/event[", i, "]"))), key = 1)
  # CHILD NODES
  d2 <- transform(xmlToDataFrame(nodes = getNodeSet(doc, paste0("//eventlist/event[", i, "]/stack/frame"))), key = 1)
  # MERGE ON KEY, THEN DROP KEY
  merge(d1, d2, by = "key")[-1]
})

xmldf_events_stack <- do.call(rbind, dflist)
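Another way to get one row per frame with the parent data attached, as a rough sketch rather than a tested solution for the real file: loop over the frame nodes and walk up to the enclosing event with xmlParent. The helper leaf_values and the trimmed inline document xml_text are made up for this illustration; only the element names come from the question. The XPath "./*[not(*)]" keeps child elements that have no element children of their own, so the stack element is left out of the event fields.

```r
library(XML)

# Trimmed stand-in for the question's document (assumed shape, fewer fields).
xml_text <- '<xml><eventlist>
  <event>
    <PID>12164</PID><Operation>ReadFile</Operation>
    <stack>
      <frame><depth>0</depth><location>FltDecodeParameters + 0x1a6c</location></frame>
      <frame><depth>1</depth><location>FltDecodeParameters + 0x203</location></frame>
    </stack>
  </event>
  <event>
    <PID>12164</PID><Operation>WriteFile</Operation>
    <stack>
      <frame><depth>0</depth><location>FltDecodeParameters + 0x1a6c</location></frame>
    </stack>
  </event>
</eventlist></xml>'
doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: named list with the text of every leaf child element.
leaf_values <- function(node) {
  leaves <- getNodeSet(node, "./*[not(*)]")
  setNames(lapply(leaves, xmlValue), vapply(leaves, xmlName, character(1)))
}

frames <- getNodeSet(doc, "//eventlist/event/stack/frame")
rows <- lapply(frames, function(frame) {
  event <- xmlParent(xmlParent(frame))  # frame -> stack -> event
  as.data.frame(c(leaf_values(event), leaf_values(frame)),
                stringsAsFactors = FALSE)
})

# One row per frame; event columns first, then frame columns.
flat <- do.call(rbind, rows)
print(flat)
```

Every column comes back as character, so depth and PID would still need as.integer() if numeric values are wanted. Plain rbind assumes every event and every frame carries the same fields; with ragged data something like plyr::rbind.fill would be needed instead.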