Sapply Stopping after the First Instance when Parsing XML
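<- Updated for completeness (thanks to hrbrmstr for pointing this out) ->
I am trying to extract some data from Pubmed, and I have been reading through the examples here (relevant diagram here).
A redacted version of my data looks like this: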
<PubmedArticleSet>
<PubmedArticle>
<MedlineCitation Owner="NLM" Status="MEDLINE">
<PMID Version="1">11841882</PMID>
<Article PubModel="Print">
<PublicationTypeList>
<PublicationType UI="D002363">Case Reports</PublicationType>
<PublicationType UI="D016428">Journal Article</PublicationType>
</PublicationTypeList>
</Article>
<MeshHeadingList>
<MeshHeading>
<DescriptorName MajorTopicYN="N" UI="D016887">Cardiopulmonary Resuscitation</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N" UI="D006323">Heart Arrest</DescriptorName>
<QualifierName MajorTopicYN="Y" UI="Q000188">drug therapy</QualifierName>
<QualifierName MajorTopicYN="N" UI="Q000401">mortality</QualifierName>
<QualifierName MajorTopicYN="N" UI="Q000628">therapy</QualifierName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
</PubmedArticle>
<PubmedArticle>
<MedlineCitation Owner="NLM" Status="MEDLINE">
<PMID Version="1">11841881</PMID>
<Article PubModel="Print">
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
</PublicationTypeList>
</Article>
<MeshHeadingList>
<MeshHeading>
<DescriptorName MajorTopicYN="N" UI="D000368">Aged</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N" UI="D016887">Cardiopulmonary Resuscitation</DescriptorName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
</PubmedArticle>
</PubmedArticleSet>
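So far I have been able to extract the PublicationTypes without trouble using the following code (please first run the top part of the code at the end of this post):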
utilAtype <- function(x){
PMID <- xmlValue(x[[1]][[1]])
PublicationType <- sapply(xmlChildren(x[["Article"]][["PublicationTypeList"]], omitNodeTypes = "XMLInternalTextNode"), xmlValue)
data.frame(PMID = PMID, PublicationType=PublicationType, stringsAsFactors = FALSE)
}
PMIDAType <- xpathApply(hdisease, '//MedlineCitation', utilAtype)
PMIDAType <-do.call(rbind, PMIDAType)
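PMID PublicationType
11841882 Case Reports
11841882 Journal Article
11841881 Journal Article
However, using a similar approach on the MeshHeadings causes sapply to skip the remaining child nodes, as shown below:
PMID LName
11841882 Cardiopulmonary Resuscitation
-Other entries for 11841882 missing-
11841881 Aged
I would be grateful if someone could enlighten me. The way it is done in the examples suggests that this approach should work without problems.
Please see the code below for reference.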
require("XML")
xmlfile=xmlParse("file.xml", useInternalNodes = TRUE)
hdisease = xmlRoot(xmlfile)
utilMesh <- function(x){
PMID <- xmlValue(x[[1]][[1]])
MHead <- ifelse(is.null(x[["MeshHeadingList"]]), NA,
sapply(xmlChildren(x[["MeshHeadingList"]], omitNodeTypes = "XMLInternalTextNode"), function(z) xmlValue(z[["DescriptorName"]])))
data.frame(PMID = PMID, MHead=MHead, stringsAsFactors = FALSE)
}
PMIDMesh <- xpathApply(hdisease, '//MedlineCitation', utilMesh)
PMIDMesh<-do.call(rbind, PMIDMesh)
c<-nrow(PMIDMesh)
row.names(PMIDMesh) <- 1:c
nrow(table(PMIDMesh))
write.csv(PMIDMesh,"Mesh1.csv")
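I would use XPath instead, maybe...
library(XML)      # xmlParse, getNodeSet, xpathSApply, xmlValue
library(rentrez)  # entrez_fetch
# Fetch both records from PubMed as XML
x <- entrez_fetch(db = "pubmed", id = c(11841882, 11841881), rettype = "xml")
doc <- xmlParse(x)
# One node per article; the XPath expressions below are relative to each node
pubs <- getNodeSet(doc, "//PubmedArticle")
y <- lapply(pubs, function(x) data.frame(
pmid = xpathSApply(x, ".//MedlineCitation/PMID", xmlValue),
mesh = xpathSApply(x, ".//MeshHeading/DescriptorName", xmlValue)) )
do.call("rbind", y)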
pmid mesh
1 11841882 Cardiopulmonary Resuscitation
2 11841882 Child, Preschool
3 11841882 Female
4 11841882 Heart Arrest
5 11841882 Humans
6 11841882 Infant
7 11841882 Male
8 11841882 Retrospective Studies
9 11841882 Time Factors
10 11841882 Vasoconstrictor Agents
11 11841882 Vasopressins
12 11841881 Aged
13 11841881 Cardiopulmonary Resuscitation
14 11841881 Electric Countershock
15 11841881 Family Practice
...
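As for why the original utilMesh stops after the first heading: the likely culprit is ifelse() rather than sapply(). ifelse() returns a result shaped like its test, and is.null(...) is always a single TRUE or FALSE, so only the first element of the sapply() result survives. The sketch below reproduces this in base R without any XML; mesh_terms is a hypothetical stand-in for the vector that sapply() would return, with values taken from the sample data in the question.

```r
# Stand-in for the character vector that sapply() returns inside utilMesh
# (hypothetical name; values taken from the sample XML in the question)
mesh_terms <- c("Cardiopulmonary Resuscitation", "Heart Arrest")

# ifelse() is vectorised over its test: a length-1 test gives a length-1 result,
# so everything after the first element is silently dropped
truncated <- ifelse(is.null(mesh_terms), NA, mesh_terms)

# if/else is ordinary control flow and returns the whole vector
complete <- if (is.null(mesh_terms)) NA else mesh_terms

print(length(truncated))
print(length(complete))
```

Under that reading, replacing ifelse(test, NA, ...) with if (test) NA else ... in utilMesh should be enough to keep every MeshHeading, although the XPath version above is shorter.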