How to test the presence of a node (string) in an xml file to use in a loop to extract data from multiple files?

Question

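As a new R user, I have been struggling with this problem for a while and cannot solve it on my own. Maybe the answer is simple and someone can help me. My challenge is that I have thousands of XML files in a folder, and I want to extract the content of a specific node from each file and save it in a data frame. However, the files reuse the name of the node I am interested in, so I extract the data I want by position (number) rather than by name.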

for (i in (1 : length(file_list))) {
  
test.file<- file_list[i]
datax<-xmlParse(test.file)  #enter the xml file name you want to analyze
data<-xmlToList(datax) #convert xml as a list
serial<-as.vector(unlist(data$.attrs[2]))
print(serial)

# Check if the xml file contains the node AuditRules

n <- ifelse(xml_find_all(test.file, "//AuditRules") == TRUE, 6, 5)

#Extract waveform values for the Current ECG strip

waveform <- as.vector(data[[n]][[3]][[1]][[1]][[2]])
waveform <- as.character(waveform)
waveform<-strsplit(waveform, split = " ")
waveform<-as.numeric(unlist(waveform))
waveform<-as.data.frame(waveform)

#Extract the serial number to be used as ID for the animal and create a column on the dataframe

serial<-as.vector(unlist(data$.attrs[2]))
serial<-as.factor(serial)
waveform$serial<-serial

#Extract date and time of Current ECG and save it as a column date

date<-as.vector(unlist(data$.attrs[n]))
date <- gsub("T", " ", date)
waveform$date <- as.POSIXct(date, format = "%Y-%m-%d %H:%M:%S", tz = 'Etc/GMT+5')

#Extract time offset [the first R-R interval from the Current ECG ]

offset <-as.vector(unlist(data[[n]][[3]][[1]][[1]][[1]][[1]]))
offset <- gsub("[a-zA-Z]+", "", offset)
waveform$offset <- offset

#Create a column for voltage in mV using the amplitudeScaleFactor="0.000815"
#waveform$mv <- waveform$waveform*0.000815

#Create a column for time (sec) using the sampleInterval="PT0.0078125S"

#waveform$time <- as.numeric(waveform$offset)
# add a new column to old data.frame. Set value "offset"  as the starting value for row 1.

# populate newcol with values starting from row 2.

#for (i in 4:nrow(waveform)){
#  waveform[i,6] <- waveform[i-1,6] +0.0078125

# Write data to CSV  
 
 write.csv(waveform, paste0(data_export_dir,"/Savannah_", file_names[i],"_ECG.csv"))
}

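My problem: some files have an extra node before the node of interest [5]. For those files I need to change the node of interest to [6]. My question: how can I change the code above to include a condition (extra node present or absent) and switch between [5] and [6] accordingly? I tried adding something like this to my loop, but it does not work: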

for (i in (1 : length(file_list))) {
  
test.file<- file_list[i]
datax<-xmlParse(test.file)  
data<-xmlToList(datax) #convert xml as a list
serial<-as.vector(unlist(data$.attrs[2]))
print(serial)

# Check if the xml file contains the node AuditRules

n <- ifelse(xml_find_all(test.file, "//AuditRules") == TRUE, 6, 5)

#Extract waveform values for the Current ECG strip

waveform <- as.vector(data[[n]][[3]][[1]][[1]][[2]])
waveform <- as.character(waveform)
waveform<-strsplit(waveform, split = " ")
waveform<-as.numeric(unlist(waveform))
waveform<-as.data.frame(waveform)

Any help would be greatly appreciated! Thanks in advance.

An example of the structure of the XML files I want to extract data from:

xml.file.a <- c(<SessionInfo><BatteryData><AuditRules><Counters><Trend><CardiacOccurrenceRecord><OccurrenceDateTime><DateTime>2019-07-02T06:05:00</DateTime></OccurrenceDateTime><OccurrenceType><Discrete>CurrentECG</Discrete></OccurrenceType><EpisodeRecord episodeRecordLength="PT10.842S"><Strip><WaveformChannel amplitudeResolution="0.000815" amplitudeScaleFactor="0.000815" amplitudeUnit="mV" sampleInterval="PT0.0078125S"><WaveformSegment length="PT0.607S" offset="PT0S" state="EgmNotStored"/><WaveformSegment length="PT10.235S" offset="PT0.607S" samples="-456 -454 -454 -458 -457 -457 -459 -457 -459 -460 -459 -458 -460 -463  " state="Stored"/><WaveformSegment length="PT0S" offset="PT10.842S" state="EndRecording"/></CardiacOccurrenceRecord><CardiacOccurrenceRecord><CardiacOccurrenceRecord>

xml.file.b <- c(<SessionInfo><BatteryData><Counters><Trend><CardiacOccurrenceRecord><OccurrenceDateTime><DateTime>2019-07-02T06:05:00</DateTime></OccurrenceDateTime><OccurrenceType><Discrete>CurrentECG</Discrete></OccurrenceType><EpisodeRecord episodeRecordLength="PT10.842S"><Strip><WaveformChannel amplitudeResolution="0.000815" amplitudeScaleFactor="0.000815" amplitudeUnit="mV" sampleInterval="PT0.0078125S"><WaveformSegment length="PT0.607S" offset="PT0S" state="EgmNotStored"/><WaveformSegment length="PT10.235S" offset="PT0.607S" samples="-456 -454 -454 -458 -457 -457 -459 -457 -459 -460 -459 -458 -460 -463  " state="Stored"/><WaveformSegment length="PT0S" offset="PT10.842S" state="EndRecording"/></CardiacOccurrenceRecord><CardiacOccurrenceRecord><CardiacOccurrenceRecord>

I need to extract the following data segment from the first node:

<WaveformSegment length="PT10.235S" offset="PT0.607S" samples="-456 -454 -454 -458 -457 -457 -459 -457 -459 -460 -459 -458 -460 -463  " state="Stored"/><WaveformSegment length="PT0S" offset="PT10.842S"

When I tried the following code, I extracted all segments from all nodes named CardiacOccurrenceRecord, but I cannot figure out how to get only the first one.

xml_1 <- xmlParse("20200116_020831_RLA496828S.xml")
xmltop <- xmlRoot(xml_1) 
xpathSApply(xmltop, '//WaveformSegment[2]')

Answer 1

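The xpathSApply() function is returning a list of all the matching WaveformSegment nodes. Once you have that list, just use [[ ]] to access the individual list elements.

Also note that in the example below I used a "." at the start of the XPath. In XPath this means search only below this node rather than the whole document. In this case it does not matter, because the variable "xmltop" is the whole document, but if we started from an "EpisodeRecord" node we would probably want only the "WaveformSegment" nodes belonging to that node.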

library(XML)
#find top root
xmltop <- xmlRoot(xml_1) 
#Find the Second WaveformSegment in all nodes - returns a list
CardiacOccurrence<-xpathSApply(xmltop, './/WaveformSegment[2]')
#access the first element in the list
CardiacOccurrence[[1]]

#obtain vector of attributes
attrs<-xmlAttrs(CardiacOccurrence[[1]])

attrs[["samples"]]
#[1] "-456 -454 -454 -458 -457 -457 -459 -457 -459 -460 -459 -458 -460 -463  "
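
The difference between "//" and ".//" can be checked on a small document. The following is only a sketch: the XML string is a hypothetical, well-formed miniature of the files in the question, and the variable names are made up for illustration.

```r
library(XML)

# Hypothetical miniature of the real files (assumption, not the actual data)
xml_text <- '<SessionInfo>
  <CardiacOccurrenceRecord>
    <EpisodeRecord>
      <WaveformSegment state="EgmNotStored"/>
      <WaveformSegment state="Stored" samples="-456 -454"/>
    </EpisodeRecord>
  </CardiacOccurrenceRecord>
  <CardiacOccurrenceRecord>
    <EpisodeRecord>
      <WaveformSegment state="EgmNotStored"/>
      <WaveformSegment state="Stored" samples="-100 -101"/>
    </EpisodeRecord>
  </CardiacOccurrenceRecord>
</SessionInfo>'
doc <- xmlParse(xml_text, asText = TRUE)

# Start from the first EpisodeRecord instead of the document root
first_episode <- getNodeSet(doc, "//EpisodeRecord")[[1]]

# ".//" searches only below first_episode
below_node <- getNodeSet(first_episode, ".//WaveformSegment")

# "//" searches from the document root, so segments from other records can appear too
whole_doc <- getNodeSet(first_episode, "//WaveformSegment")

length(below_node)
length(whole_doc)
```

Comparing the two lengths shows whether the query stayed inside the chosen record.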

Hope this answers your question.
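
On the question in the title: the attempt in the loop probably fails because xml_find_all() comes from the xml2 package, expects a parsed document rather than a file name, and returns a node set rather than TRUE/FALSE. With the XML package, one possible approach is to count the matches returned by getNodeSet(). The sketch below assumes a hypothetical, well-formed miniature file. It also shows that selecting the record by name and the segment by its state attribute may remove the need for the positional index n altogether.

```r
library(XML)

# Hypothetical miniature of a file that contains AuditRules (assumption)
xml_text <- '<SessionInfo>
  <AuditRules/>
  <CardiacOccurrenceRecord>
    <WaveformSegment offset="PT0S" state="EgmNotStored"/>
    <WaveformSegment offset="PT0.607S" state="Stored" samples="-456 -454 -454  "/>
  </CardiacOccurrenceRecord>
  <CardiacOccurrenceRecord>
    <WaveformSegment offset="PT0S" state="EgmNotStored"/>
    <WaveformSegment offset="PT0.5S" state="Stored" samples="-100 -101"/>
  </CardiacOccurrenceRecord>
</SessionInfo>'
doc <- xmlParse(xml_text, asText = TRUE)

# A node is present when the XPath query returns at least one match
has_audit <- length(getNodeSet(doc, "//AuditRules")) > 0

# Scalar condition, so plain if/else is enough (no ifelse needed)
n <- if (has_audit) 6 else 5

# Alternative that avoids n: first CardiacOccurrenceRecord, stored segment only
stored <- getNodeSet(
  doc,
  "(//CardiacOccurrenceRecord)[1]//WaveformSegment[@state='Stored']"
)
attrs <- xmlAttrs(stored[[1]])

# Split the samples string on whitespace and convert to numbers
waveform <- as.numeric(strsplit(trimws(attrs[["samples"]]), "\\s+")[[1]])
waveform
#[1] -456 -454 -454
```

Inside the loop, doc would be the result of xmlParse(file_list[i]), and the same check can be run once per file.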
