Parse an XML file into a data frame with R

Question

XML data

<HealthData locale="en_US">
 <ExportDate value="2016-06-02 14:05:23 -0400"/>
 <Me HKCharacteristicTypeIdentifierDateOfBirth="" HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexNotSet" HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet" HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet"/>
 <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:07:06 -0400" endDate="2014-09-24 15:07:11 -0400" value="7"/>
 <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:12:13 -0400" endDate="2014-09-24 15:12:18 -0400" value="15"/>
 <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:17:16 -0400" endDate="2014-09-24 15:17:21 -0400" value="20"/>
</HealthData>

R code

> library(XML)
> doc="\pathtoXMLfile"
> list <-xpathApply(doc, "//HealthData/Record", xmlAttrs)
> df <- do.call(rbind.data.frame, list)
> str(df)

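I am trying to take the XML data sample shown above and load it into a data frame in R, with each Record's attribute names (type, sourceName, unit, endDate, value) as the column headers and each Record's attribute values (for example count, 2014-09-24 15:07:11 -0400, 7) as the values in each row.

The call df <- do.call(rbind.data.frame, list) gets close, but it looks like it is also binding the values into the column headers. If you run View(df) or str(df) you will see what I mean. How do I use the Record attribute names as the column header names?

Thanks, Ryan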

Answer 1

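Consider xpathSApply() to retrieve the attributes, then use t() to transpose the result into a data frame. When every Record has the same attributes, xpathSApply() simplifies the result to a character matrix with one column per record, so transposing puts one record on each row: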

library(XML)

xmlstr <- '<?xml version="1.0" encoding="UTF-8"?>
            <HealthData locale="en_US">
              <ExportDate value="2016-06-02 14:05:23 -0400"/>
              <Me HKCharacteristicTypeIdentifierDateOfBirth="" HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexNotSet" HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet" HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet"/>
              <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:07:06 -0400" endDate="2014-09-24 15:07:11 -0400" value="7"/>
              <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:12:13 -0400" endDate="2014-09-24 15:12:18 -0400" value="15"/>
              <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:17:16 -0400" endDate="2014-09-24 15:17:21 -0400" value="20"/>
            </HealthData>'

xml <- xmlParse(xmlstr)

recordAttribs <- xpathSApply(doc=xml, path="//HealthData/Record",  xmlAttrs)
df <- data.frame(t(recordAttribs))
df

#                                type              sourceName  unit
# 1 HKQuantityTypeIdentifierStepCount Ryan Praskievicz iPhone count
# 2 HKQuantityTypeIdentifierStepCount Ryan Praskievicz iPhone count
# 3 HKQuantityTypeIdentifierStepCount Ryan Praskievicz iPhone count
#                creationDate                 startDate                   endDate
# 1 2014-10-02 08:30:17 -0400 2014-09-24 15:07:06 -0400 2014-09-24 15:07:11 -0400
# 2 2014-10-02 08:30:17 -0400 2014-09-24 15:12:13 -0400 2014-09-24 15:12:18 -0400
# 3 2014-10-02 08:30:17 -0400 2014-09-24 15:17:16 -0400 2014-09-24 15:17:21 -0400
#   value
# 1     7
# 2    15
# 3    20
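
To see why the transpose is needed, here is a minimal base-R sketch that does not require the XML package. The two named character vectors are hypothetical stand-ins for what xmlAttrs() returns for each Record, and the shortened values are made up for illustration. sapply(), like xpathSApply(), binds equal-length vectors as columns, so the attribute names sit on the rows until t() flips the matrix.

```r
# Hypothetical stand-ins for the named character vectors xmlAttrs() returns
attrs <- list(
  c(type = "StepCount", unit = "count", value = "7"),
  c(type = "StepCount", unit = "count", value = "15")
)

# sapply() simplifies equal-length vectors to a matrix, one column per record
m <- sapply(attrs, identity)
dim(m)        # attributes on the rows, records on the columns
rownames(m)   # the attribute names

# Transposing puts one record per row and the attribute names on the columns
df <- data.frame(t(m), stringsAsFactors = FALSE)
names(df)

# Binding the vectors as rows directly gives the same layout without t()
df2 <- as.data.frame(do.call(rbind, attrs), stringsAsFactors = FALSE)
names(df2)
```

The second form, do.call(rbind, ...) followed by as.data.frame(), is close to the attempt in the question: rbind on the raw vectors builds a matrix whose column names come from the vector names, which rbind.data.frame does not do here.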

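If some attributes appear in some records but not in others, consider matching against a predetermined list of names and filling in NAs iteratively. Below are two versions, one using sapply() with a for loop and one passing the name list as a second argument: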

recordnames <- c("type", "unit", "sourceName", "device", "sourceVersion",
                 "creationDate", "startDate", "endDate", "value")

# Use xpathApply() so the result is always a list, even when every
# Record happens to carry the same attributes
recordList <- xpathApply(xml, "//HealthData/Record", xmlAttrs)

# FOR LOOP VERSION
recordAttribs <- sapply(recordList, function(i) {
  for (r in recordnames) {
    # is.null() never fires on a vector element, so test the names instead
    if (!(r %in% names(i))) i[r] <- NA
  }
  i[recordnames]  # REORDER INNER VECTORS
})

# TWO ARGUMENT SAPPLY VERSION (an alternative to the loop above)
recordAttribs <- sapply(recordList, function(i, r) {
  i[setdiff(r, names(i))] <- NA  # ADD MISSING NAMES AS NA
  i[r]                           # REORDER INNER VECTORS
}, recordnames)

df <- data.frame(t(recordAttribs))
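
The NA-filling idea above can also be tried without the XML package. The sketch below is only an illustration: the two attribute vectors and the short name list are hypothetical, with the second record deliberately lacking a device attribute. It relies on the fact that indexing a named vector by a name it does not have returns NA.

```r
# Hypothetical attribute vectors; the second record lacks "device"
recordList <- list(
  c(type = "StepCount", device = "iPhone", value = "7"),
  c(type = "StepCount", value = "15")
)
recordnames <- c("type", "device", "value")

# Indexing by name yields NA for absent names; setNames() restores the labels
aligned <- lapply(recordList, function(i) setNames(i[recordnames], recordnames))

# Bind the aligned vectors as rows and convert to a data frame
df <- as.data.frame(do.call(rbind, aligned), stringsAsFactors = FALSE)
df$device   # the second entry is NA
```

Because every vector is forced to the same names in the same order, the rows line up by column even when records carry different attributes.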

Answer 2

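Another option is xmlAttrsToDataFrame, an unexported function in the XML package (hence the ::: below), which should handle missing attributes. You can also select only the tags that have a specific attribute, such as device: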

XML:::xmlAttrsToDataFrame(xml["//Record"])
XML:::xmlAttrsToDataFrame(xml["//Record[@device]"])
