xpathApply:如何传递多个路径或节点?
xpathApply: How to pass multiple paths or nodes?
# parse PubMed data
library(XML) # xpath
library(rentrez) # entrez_fetch
pmids <- c("25506969","25032371","24983039","24983034","24983032","24983031","26386083",
"26273372","26066373","25837167","25466451","25013473","23733758")
# Above IDs are mix of Books and journal articles
# ID# 23733758 is an journal article and has No abstract
data.pubmed <- entrez_fetch(db = "pubmed", id = pmids, rettype = "xml",
parsed = TRUE)
abstracts <- xpathApply(data.pubmed, "//Abstract", xmlValue)
names(abstracts) <- pmids
如果每条记录都有摘要,效果会很好。但是,当 PMID (#23733758) 没有 pubmed 摘要(或书籍文章或其他内容)时,它会跳过导致错误 'names' attribute [5] must be the same length as the vector [4]
问:如何传递多个 paths/nodes 以便我可以提取期刊文章、书籍或评论?
更新:hrbrmstr 解决方案有助于解决 NA。但是,xpathApply
可以像 c(//Abstract, //ReviewArticle , etc etc )
一样带多个节点吗?
你必须向上攻击一个标记元素:
abstracts <- xpathApply(data.pubmed, "//PubmedArticle//Article", function(x) {
val <- xpathSApply(x, "./Abstract", xmlValue)
if (length(val)==0) val <- NA_character_
val
})
names(abstracts) <- pmids
str(abstracts)
List of 5
## $ 24019382: chr "Adenocarcinoma of the lung, a leading cause of cancer death, frequently displays mutational activation of the KRAS proto-oncoge"| __truncated__
## $ 23927882: chr "Mutations in components of the mitogen-activated protein kinase (MAPK) cascade may be a new candidate for target for lung cance"| __truncated__
## $ 23825589: chr "Aberrant activation of MAP kinase signaling pathway and loss of tumor suppressor LKB1 have been implicated in lung cancer devel"| __truncated__
## $ 23792568: chr "Sorafenib, the first agent developed to target BRAF mutant melanoma, is a multi-kinase inhibitor that was approved by the FDA f"| __truncated__
## $ 23733758: chr NA
根据您的评论,另一种方法是:
str(xpathApply(data.pubmed, '//PubmedArticle//Article', function(x) {
xmlValue(xmlChildren(x)$Abstract)
}))
## List of 5
## $ : chr "Adenocarcinoma of the lung, a leading cause of cancer death, frequently displays mutational activation of the KRAS proto-oncoge"| __truncated__
## $ : chr "Mutations in components of the mitogen-activated protein kinase (MAPK) cascade may be a new candidate for target for lung cance"| __truncated__
## $ : chr "Aberrant activation of MAP kinase signaling pathway and loss of tumor suppressor LKB1 have been implicated in lung cancer devel"| __truncated__
## $ : chr "Sorafenib, the first agent developed to target BRAF mutant melanoma, is a multi-kinase inhibitor that was approved by the FDA f"| __truncated__
## $ : chr NA
# parse PubMed data
library(XML) # xpath
library(rentrez) # entrez_fetch
pmids <- c("25506969","25032371","24983039","24983034","24983032","24983031","26386083",
"26273372","26066373","25837167","25466451","25013473","23733758")
# Above IDs are mix of Books and journal articles
# ID# 23733758 is an journal article and has No abstract
data.pubmed <- entrez_fetch(db = "pubmed", id = pmids, rettype = "xml",
parsed = TRUE)
abstracts <- xpathApply(data.pubmed, "//Abstract", xmlValue)
names(abstracts) <- pmids
如果每条记录都有摘要,效果会很好。但是,当 PMID (#23733758) 没有 pubmed 摘要(或书籍文章或其他内容)时,它会跳过导致错误 'names' attribute [5] must be the same length as the vector [4]
问:如何传递多个 paths/nodes 以便我可以提取期刊文章、书籍或评论?
更新:hrbrmstr 解决方案有助于解决 NA。但是,xpathApply
可以像 c(//Abstract, //ReviewArticle , etc etc )
一样带多个节点吗?
你必须向上攻击一个标记元素:
abstracts <- xpathApply(data.pubmed, "//PubmedArticle//Article", function(x) {
val <- xpathSApply(x, "./Abstract", xmlValue)
if (length(val)==0) val <- NA_character_
val
})
names(abstracts) <- pmids
str(abstracts)
List of 5
## $ 24019382: chr "Adenocarcinoma of the lung, a leading cause of cancer death, frequently displays mutational activation of the KRAS proto-oncoge"| __truncated__
## $ 23927882: chr "Mutations in components of the mitogen-activated protein kinase (MAPK) cascade may be a new candidate for target for lung cance"| __truncated__
## $ 23825589: chr "Aberrant activation of MAP kinase signaling pathway and loss of tumor suppressor LKB1 have been implicated in lung cancer devel"| __truncated__
## $ 23792568: chr "Sorafenib, the first agent developed to target BRAF mutant melanoma, is a multi-kinase inhibitor that was approved by the FDA f"| __truncated__
## $ 23733758: chr NA
根据您的评论,另一种方法是:
str(xpathApply(data.pubmed, '//PubmedArticle//Article', function(x) {
xmlValue(xmlChildren(x)$Abstract)
}))
## List of 5
## $ : chr "Adenocarcinoma of the lung, a leading cause of cancer death, frequently displays mutational activation of the KRAS proto-oncoge"| __truncated__
## $ : chr "Mutations in components of the mitogen-activated protein kinase (MAPK) cascade may be a new candidate for target for lung cance"| __truncated__
## $ : chr "Aberrant activation of MAP kinase signaling pathway and loss of tumor suppressor LKB1 have been implicated in lung cancer devel"| __truncated__
## $ : chr "Sorafenib, the first agent developed to target BRAF mutant melanoma, is a multi-kinase inhibitor that was approved by the FDA f"| __truncated__
## $ : chr NA