How to access values of child nodes with different names in an XML file?
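I am trying to parse the xmlValue of certain child nodes from an NCBI XML file. However, for some PM.IDs the root node <PubmedArticleSet> holds different kinds of published records, PubmedBookArticle and PubmedArticle. I would like to apply a condition: if (xmlName(fetch.pubmed) == PubmedBookArticle) extract certain values, else if (xmlName(fetch.pubmed) == PubmedArticle) extract other values. Finally, I want to create a data frame with both sets of values matched to their PMIDs. It seems straightforward, but xmlName(fetch.pubmed) throws the error: no applicable method for 'xmlName' applied to an object of class "c('XMLInternalDocument', 'XMLAbstractDocument')"
Any help is appreciated, thanks.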
<?xml version="1.0"?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2015//EN" "http://www.ncbi.nlm.nih.gov/corehtml/query/DTD/pubmed_150101.dtd">
<PubmedArticleSet>
<PubmedBookArticle>
<BookDocument>
<PMID Version="1">25506969</PMID>
<ArticleIdList>
<ArticleId IdType="bookaccession">NBK259188</ArticleId>
</ArticleIdList> ....
...... </BookDocument>
</PubmedBookArticle>
<PubmedArticle>
<MedlineCitation Status="Publisher" Owner="NLM">
<PMID Version="1">25013473</PMID>
<DateCreated>
<Year>2014</Year>
<Month>7</Month>
<Day>11</Day>
</DateCreated>....
....</MedlineCitation>
</PubmedArticle>
</PubmedArticleSet>
My code is below:
library(XML)
library(rentrez)
PM.ID <- c("25506969","25032371","24983039","24983034","24983032","24983031",
           "26386083","26273372","26066373","25837167",
           "25466451","25013473")
# rentrez function to retrieve the XML file for the PMIDs above
fetch.pubmed <- entrez_fetch(db = "pubmed", id = PM.ID,
                             rettype = "xml", parsed = TRUE)
# If empty records, return NA
FindNull <- function(x, x1child) {
  res <- xpathSApply(x, x1child, xmlValue)
  if (length(res) == 0) {
    out <- NA
  } else {
    out <- res
  }
  out
}
# extract contents from xml file
xpathSApply(fetch.pubmed,"//PubmedArticle",FindNull,x1child = './/ArticleTitle')
xpathSApply(fetch.pubmed,"//PubmedBookArticle",FindNull,x1child = './/BookTitle')
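How can I put the above code in a loop so that values from PubmedArticle and PubmedBookArticle are retrieved whenever the condition is met for a record?
There are a few ways to do this, but I would probably get separate node sets for books and articles.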
table( xpathSApply(fetch.pubmed, "/PubmedArticleSet/*", xmlName) )
PubmedArticle PubmedBookArticle
6 6
books <- getNodeSet(fetch.pubmed, "/PubmedArticleSet/PubmedBookArticle")
data.frame( pmid = sapply(books, function(x) xpathSApply(x, ".//PMID", xmlValue)),
title = sapply(books, function(x) xpathSApply(x, ".//BookTitle", xmlValue))
)
pmid title
1 25506969 Probe Reports from the NIH Molecular Libraries Program
2 25032371 Understanding Climate’s Influence on Human Evolution
3 24983039 Assessing the Effects of the Gulf of Mexico Oil Spill on Human Health: A Summary of the June 2010 Workshop
4 24983034 In the Light of Evolution: Volume IV: The Human Condition
5 24983032 The Role of Human Factors in Home Health Care: Workshop Summary
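The error in the question arises because xmlName works on a node, not on the parsed document; xmlRoot(doc) returns the root node, and each record under it can be tested by name. Below is a minimal sketch of that per-record condition. It assumes a small hand-written XML string in place of the entrez_fetch result (the titles and PMIDs are made up), and the helpers first_or_na and record_row are hypothetical names introduced for illustration.

```r
library(XML)

# Toy document standing in for the parsed entrez_fetch() result (contents are made up)
xml_text <- '<PubmedArticleSet>
  <PubmedBookArticle><BookDocument><PMID Version="1">1</PMID>
    <Book><BookTitle>A book</BookTitle></Book></BookDocument></PubmedBookArticle>
  <PubmedArticle><MedlineCitation><PMID Version="1">2</PMID>
    <Article><ArticleTitle>An article</ArticleTitle></Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID Version="1">3</PMID>
    <Article></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>'
doc <- xmlParse(xml_text, asText = TRUE)

# xmlName() needs a node: use the root node, not the document object
root_name <- xmlName(xmlRoot(doc))

# First match of an XPath relative to a node, or NA when nothing matches
first_or_na <- function(node, path) {
  res <- xpathSApply(node, path, xmlValue)
  if (length(res) == 0) NA_character_ else res[[1]]
}

# Choose the title path from the record's node name, then build one row
record_row <- function(node) {
  type <- xmlName(node)
  title_path <- if (type == "PubmedBookArticle") ".//BookTitle" else ".//ArticleTitle"
  data.frame(pmid  = first_or_na(node, ".//PMID"),
             type  = type,
             title = first_or_na(node, title_path),
             stringsAsFactors = FALSE)
}

# Every direct child of the root is one record, book or article
records <- getNodeSet(doc, "/PubmedArticleSet/*")
result  <- do.call(rbind, lapply(records, record_row))
```

Taking only the first match keeps one value per record even when a record contains more than one PMID element, and a record with no title still gets a row with NA.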
The NCBI XML paths below help extract abstracts from both PubmedArticle and PubmedBookArticle records, returning NA for records without abstracts.
abstracts <- xpathSApply(fetch.pubmed, c('//PubmedArticle//Article',
'//PubmedBookArticle//Abstract'), function(x) {
xmlValue(xmlChildren(x)$Abstract) })
abstracts <- data.frame(abstracts,stringsAsFactors = F)
dim(abstracts)
rownames(abstracts) <- PM.ID
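As an alternative sketch, the PMID can be read from each record itself, so each abstract is paired with its own identifier instead of relying on the order of PM.ID. This assumes a toy XML string with made-up contents rather than real NCBI output, and get_first is a hypothetical helper.

```r
library(XML)

# Toy records (made-up contents): a book with an abstract, an article with one, an article without
xml_text <- '<PubmedArticleSet>
  <PubmedBookArticle><BookDocument><PMID Version="1">1</PMID>
    <Abstract><AbstractText>Book abstract</AbstractText></Abstract></BookDocument></PubmedBookArticle>
  <PubmedArticle><MedlineCitation><PMID Version="1">2</PMID>
    <Article><Abstract><AbstractText>Article abstract</AbstractText></Abstract></Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID Version="1">3</PMID>
    <Article></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>'
doc <- xmlParse(xml_text, asText = TRUE)

# First value matching a relative XPath, or NA when the record has no such node
get_first <- function(node, path) {
  res <- xpathSApply(node, path, xmlValue)
  if (length(res) == 0) NA_character_ else res[[1]]
}

records <- getNodeSet(doc, "/PubmedArticleSet/*")

# One row per record: the PMID and abstract both come from the same node
abstracts_by_record <- data.frame(
  pmid     = vapply(records, get_first, character(1), path = ".//PMID"),
  abstract = vapply(records, get_first, character(1), path = ".//Abstract"),
  stringsAsFactors = FALSE
)
```

Because vapply requires exactly one string per record, a record that unexpectedly yields something else fails loudly instead of silently misaligning rows.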