How can I test for the presence of a node (string) in an XML file, for use in a loop that extracts data from multiple files?
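As a new R user, I have been struggling with this problem for a while and have not been able to solve it on my own. The answer may be simple, and I hope someone can help. I have thousands of XML files in a folder, and I want to extract the content of a specific node from each file and save it in a data frame. However, the name of the node I am interested in is repeated within each file, so I extract the data I want by position (index) rather than by name.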
for (i in seq_along(file_list)) {
  test.file <- file_list[i]
  datax <- xmlParse(test.file)   # parse the XML file to analyze
  data <- xmlToList(datax)       # convert the XML document to a list
  serial <- as.vector(unlist(data$.attrs[2]))
  print(serial)

  # Check if the XML file contains the node AuditRules
  n <- ifelse(xml_find_all(test.file, "//AuditRules") == TRUE, 6, 5)

  # Extract waveform values for the Current ECG strip
  waveform <- as.vector(data[[n]][[3]][[1]][[1]][[2]])
  waveform <- as.character(waveform)
  waveform <- strsplit(waveform, split = " ")
  waveform <- as.numeric(unlist(waveform))
  waveform <- as.data.frame(waveform)

  # Extract the serial number, used as the animal ID, and add it as a column
  serial <- as.vector(unlist(data$.attrs[2]))
  serial <- as.factor(serial)
  waveform$serial <- serial

  # Extract the date and time of the Current ECG and save it as a column
  date <- as.vector(unlist(data$.attrs[n]))
  date <- gsub("T", " ", date)
  waveform$date <- as.POSIXct(date, format = "%Y-%m-%d %H:%M:%S", tz = "Etc/GMT+5")

  # Extract the time offset (the first R-R interval from the Current ECG)
  offset <- as.vector(unlist(data[[n]][[3]][[1]][[1]][[1]][[1]]))
  offset <- gsub("[a-zA-Z]+", "", offset)
  waveform$offset <- offset

  # Create a column for voltage in mV using amplitudeScaleFactor="0.000815"
  # waveform$mv <- waveform$waveform * 0.000815
  # Create a column for time (sec) using sampleInterval="PT0.0078125S"
  # waveform$time <- as.numeric(waveform$offset)
  # Add a new column to the old data.frame. Set value "offset" as the starting value for row 1.
  # Populate the new column with values starting from row 2.
  # for (i in 4:nrow(waveform)) {
  #   waveform[i, 6] <- waveform[i - 1, 6] + 0.0078125

  # Write data to CSV
  write.csv(waveform, paste0(data_export_dir, "/Savannah_", file_names[i], "_ECG.csv"))
}
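My problem: some files have an extra node at position [5], before the node of interest. For those files I need to change the index of the node of interest to [6].
My question: how can I change the code above to include a condition (extra node present or absent) and use [5] or [6] accordingly?
I tried adding something like this to my loop, but it does not work: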
for (i in seq_along(file_list)) {
  test.file <- file_list[i]
  datax <- xmlParse(test.file)
  data <- xmlToList(datax)       # convert the XML document to a list
  serial <- as.vector(unlist(data$.attrs[2]))
  print(serial)

  # Check if the XML file contains the node AuditRules
  n <- ifelse(xml_find_all(test.file, "//AuditRules") == TRUE, 6, 5)

  # Extract waveform values for the Current ECG strip
  waveform <- as.vector(data[[n]][[3]][[1]][[1]][[2]])
  waveform <- as.character(waveform)
  waveform <- strsplit(waveform, split = " ")
  waveform <- as.numeric(unlist(waveform))
  waveform <- as.data.frame(waveform)
  # ... (rest of the loop as in the code above)
}
Any help would be greatly appreciated! Thanks in advance.
An example of the structure of the XML files I want to extract data from:
xml.file.a <- c(<SessionInfo><BatteryData><AuditRules><Counters><Trend><CardiacOccurrenceRecord><OccurrenceDateTime><DateTime>2019-07-02T06:05:00</DateTime></OccurrenceDateTime><OccurrenceType><Discrete>CurrentECG</Discrete></OccurrenceType><EpisodeRecord episodeRecordLength="PT10.842S"><Strip><WaveformChannel amplitudeResolution="0.000815" amplitudeScaleFactor="0.000815" amplitudeUnit="mV" sampleInterval="PT0.0078125S"><WaveformSegment length="PT0.607S" offset="PT0S" state="EgmNotStored"/><WaveformSegment length="PT10.235S" offset="PT0.607S" samples="-456 -454 -454 -458 -457 -457 -459 -457 -459 -460 -459 -458 -460 -463 " state="Stored"/><WaveformSegment length="PT0S" offset="PT10.842S" state="EndRecording"/></CardiacOccurrenceRecord><CardiacOccurrenceRecord><CardiacOccurrenceRecord>
xml.file.b <- c(<SessionInfo><BatteryData><Counters><Trend><CardiacOccurrenceRecord><OccurrenceDateTime><DateTime>2019-07-02T06:05:00</DateTime></OccurrenceDateTime><OccurrenceType><Discrete>CurrentECG</Discrete></OccurrenceType><EpisodeRecord episodeRecordLength="PT10.842S"><Strip><WaveformChannel amplitudeResolution="0.000815" amplitudeScaleFactor="0.000815" amplitudeUnit="mV" sampleInterval="PT0.0078125S"><WaveformSegment length="PT0.607S" offset="PT0S" state="EgmNotStored"/><WaveformSegment length="PT10.235S" offset="PT0.607S" samples="-456 -454 -454 -458 -457 -457 -459 -457 -459 -460 -459 -458 -460 -463 " state="Stored"/><WaveformSegment length="PT0S" offset="PT10.842S" state="EndRecording"/></CardiacOccurrenceRecord><CardiacOccurrenceRecord><CardiacOccurrenceRecord>
I need to extract the following segment from the first node:
<WaveformSegment length="PT10.235S" offset="PT0.607S" samples="-456 -454 -454 -458 -457 -457 -459 -457 -459 -460 -459 -458 -460 -463 " state="Stored"/><WaveformSegment length="PT0S" offset="PT10.842S"
When I try the following code, I get all the segments from every node named CardiacOccurrenceRecord, but I cannot figure out how to get only the first one.
xml_1 <- xmlParse("20200116_020831_RLA496828S.xml")
xmltop <- xmlRoot(xml_1)
xpathSApply(xmltop, '//WaveformSegment[2]')
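The xpathSApply() function returns a list of all the matching WaveformSegment nodes. Once you have that list, just use [[ ]] to access the individual list elements.
Also note that in the example below I used a leading "." in the XPath expression. This means the search covers only what is below the current node rather than the whole document. Here it makes no difference, because the variable "xmltop" is the root of the whole document, but if we started from an "EpisodeRecord" node we would probably want only the "WaveformSegment" nodes under that node.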
library(XML)
# find the top-level (root) node
xmltop <- xmlRoot(xml_1)
# find the second WaveformSegment under every parent node - returns a list
CardiacOccurrence <- xpathSApply(xmltop, './/WaveformSegment[2]')
# access the first element in the list
CardiacOccurrence[[1]]
# obtain the vector of attributes
attrs <- xmlAttrs(CardiacOccurrence[[1]])
attrs[["samples"]]
#[1] "-456 -454 -454 -458 -457 -457 -459 -457 -459 -460 -459 -458 -460 -463 "
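If only the first CardiacOccurrenceRecord is wanted, the selection can also be pushed into the XPath expression itself instead of indexing the returned list. The following is a minimal sketch under stated assumptions, not a definitive implementation: `xml_text` is a made-up, well-formed miniature of the structure shown in the question (the real files will contain more nodes), and selecting the segment by `state='Stored'` rather than by position is my assumption about what identifies the segment of interest.

```r
library(XML)

# Hypothetical, well-formed miniature of the structure shown in the question
xml_text <- '<SessionInfo>
  <CardiacOccurrenceRecord>
    <EpisodeRecord><Strip><WaveformChannel sampleInterval="PT0.0078125S">
      <WaveformSegment offset="PT0S" state="EgmNotStored"/>
      <WaveformSegment offset="PT0.607S" samples="-456 -454 -458 " state="Stored"/>
    </WaveformChannel></Strip></EpisodeRecord>
  </CardiacOccurrenceRecord>
  <CardiacOccurrenceRecord>
    <EpisodeRecord><Strip><WaveformChannel sampleInterval="PT0.0078125S">
      <WaveformSegment offset="PT0S" state="EgmNotStored"/>
      <WaveformSegment offset="PT1S" samples="1 2 3 " state="Stored"/>
    </WaveformChannel></Strip></EpisodeRecord>
  </CardiacOccurrenceRecord>
</SessionInfo>'
doc <- xmlParse(xml_text, asText = TRUE)

# (expr)[1] keeps only the first match in document order, so this looks for
# stored segments inside the first CardiacOccurrenceRecord only
segs <- getNodeSet(doc, "(//CardiacOccurrenceRecord)[1]//WaveformSegment[@state='Stored']")
first_seg <- segs[[1]]

# Read the attributes by name and convert the sample string to numbers
offset  <- xmlGetAttr(first_seg, "offset")
samples <- as.numeric(strsplit(trimws(xmlGetAttr(first_seg, "samples")), " +")[[1]])
```

Because the nodes are addressed by name and attribute, this does not depend on how many sibling nodes precede the record in a given file.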
Hope this answers your question.
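On the presence test from the title: as far as I can tell, the attempted `ifelse()` line fails because `xml_find_all()` belongs to the xml2 package and expects a parsed xml2 document or node rather than a file path, and it returns a node set rather than TRUE or FALSE. With the XML package already used in the loop, one option is to count the matching nodes. This is a sketch only: `has_node` is a hypothetical helper name, and the two XML strings are made-up stand-ins for files with and without AuditRules.

```r
library(XML)

# Hypothetical helper: TRUE if the parsed document contains at least one
# node with the given name anywhere in the tree
has_node <- function(doc, name) {
  length(getNodeSet(doc, paste0("//", name))) > 0
}

# Made-up stand-ins for two files, one with and one without <AuditRules>
xml_strings <- c(
  a = "<SessionInfo><BatteryData/><AuditRules/><Counters/></SessionInfo>",
  b = "<SessionInfo><BatteryData/><Counters/></SessionInfo>"
)

n <- integer(length(xml_strings))
for (i in seq_along(xml_strings)) {
  # with real files: doc <- xmlParse(file_list[i])
  doc <- xmlParse(xml_strings[[i]], asText = TRUE)
  # plain if/else is enough here because the condition is a single TRUE/FALSE
  n[i] <- if (has_node(doc, "AuditRules")) 6L else 5L
}
n
```

Inside the original loop, the same test could set `n` before the positional lookups, although extracting by node name as in the answer above avoids needing the index at all.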