Sapply Stopping after the First Instance when Parsing XML

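<- Updated for completeness (thanks to hrbrmstr for pointing it out) ->

I am trying to extract some data from PubMed, and I have been working through the example here (relevant diagram here). An edited version of my data looks like this: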

<PubmedArticleSet>
   <PubmedArticle>
      <MedlineCitation Owner="NLM" Status="MEDLINE">
         <PMID Version="1">11841882</PMID>
         <Article PubModel="Print">
            <PublicationTypeList>
               <PublicationType UI="D002363">Case Reports</PublicationType>
               <PublicationType UI="D016428">Journal Article</PublicationType>
            </PublicationTypeList>
         </Article>
         <MeshHeadingList>
            <MeshHeading>
               <DescriptorName MajorTopicYN="N" UI="D016887">Cardiopulmonary Resuscitation</DescriptorName>
            </MeshHeading>
            <MeshHeading>
               <DescriptorName MajorTopicYN="N" UI="D006323">Heart Arrest</DescriptorName>
               <QualifierName MajorTopicYN="Y" UI="Q000188">drug therapy</QualifierName>
               <QualifierName MajorTopicYN="N" UI="Q000401">mortality</QualifierName>
               <QualifierName MajorTopicYN="N" UI="Q000628">therapy</QualifierName>
            </MeshHeading>
         </MeshHeadingList>
      </MedlineCitation>       
   </PubmedArticle>

   <PubmedArticle>
      <MedlineCitation Owner="NLM" Status="MEDLINE">
         <PMID Version="1">11841881</PMID>
         <Article PubModel="Print">
            <PublicationTypeList>
               <PublicationType UI="D016428">Journal Article</PublicationType>
            </PublicationTypeList>
         </Article>
         <MeshHeadingList>
            <MeshHeading>
               <DescriptorName MajorTopicYN="N" UI="D000368">Aged</DescriptorName>
            </MeshHeading>
            <MeshHeading>
               <DescriptorName MajorTopicYN="N" UI="D016887">Cardiopulmonary Resuscitation</DescriptorName>
            </MeshHeading>
         </MeshHeadingList>
      </MedlineCitation>
   </PubmedArticle>
</PubmedArticleSet>

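So far I have been able to extract the PublicationTypes nicely using the following code (please first run the setup lines at the top of the code block at the end of this post):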

utilAtype <- function(x){
        PMID <- xmlValue(x[[1]][[1]])
        PublicationType <- sapply(xmlChildren(x[["Article"]][["PublicationTypeList"]], omitNodeTypes = "XMLInternalTextNode"), xmlValue)
        data.frame(PMID = PMID, PublicationType=PublicationType, stringsAsFactors = FALSE)
}

PMIDAType <- xpathApply(hdisease, '//MedlineCitation', utilAtype)
PMIDAType <-do.call(rbind, PMIDAType)

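PMID      PublicationType
11841882  Case Reports
11841882  Journal Article
11841881  Journal Article

However, using a similar approach on the MeshHeadings causes sapply to skip the remaining child nodes, as shown below: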

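PMID      LName
11841882  Cardiopulmonary Resuscitation
-Other entries for 11841882 missing-
11841881  Aged

I would appreciate it if someone could enlighten me. The way it is done in the sample suggests there should be no problem with this approach. See the code below for reference.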

require("XML")
xmlfile=xmlParse("file.xml", useInternalNodes = TRUE)
hdisease = xmlRoot(xmlfile)

utilMesh <- function(x){
        PMID <- xmlValue(x[[1]][[1]])
        MHead <- ifelse(is.null(x[["MeshHeadingList"]]), NA, 
                sapply(xmlChildren(x[["MeshHeadingList"]], omitNodeTypes = "XMLInternalTextNode"), function(z) xmlValue(z[["DescriptorName"]])))
        data.frame(PMID = PMID, MHead=MHead, stringsAsFactors = FALSE)
    }

PMIDMesh <- xpathApply(hdisease, '//MedlineCitation', utilMesh)
PMIDMesh<-do.call(rbind, PMIDMesh)

c<-nrow(PMIDMesh)
row.names(PMIDMesh) <- 1:c
nrow(table(PMIDMesh))

write.csv(PMIDMesh,"Mesh1.csv")

I would use XPath instead, maybe...

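library(XML)      # xmlParse, getNodeSet, xpathSApply, xmlValue
library(rentrez)  # entrez_fetch

# Fetch both records as XML; rettype is named so the call does not depend on argument order
x <- entrez_fetch("pubmed", id = c(11841882, 11841881), rettype = "xml")
doc <- xmlParse(x)
pubs <- getNodeSet(doc, "//PubmedArticle")

# One data frame per article; the single PMID is recycled across all of its MeSH descriptors
y <- lapply(pubs, function(pub) data.frame(
     pmid = xpathSApply(pub, ".//MedlineCitation/PMID", xmlValue),
     mesh = xpathSApply(pub, ".//MeshHeading/DescriptorName", xmlValue)) )

do.call("rbind", y)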

       pmid                          mesh
1  11841882 Cardiopulmonary Resuscitation
2  11841882              Child, Preschool
3  11841882                        Female
4  11841882                  Heart Arrest
5  11841882                        Humans
6  11841882                        Infant
7  11841882                          Male
8  11841882         Retrospective Studies
9  11841882                  Time Factors
10 11841882        Vasoconstrictor Agents
11 11841882                  Vasopressins
12 11841881                          Aged
13 11841881 Cardiopulmonary Resuscitation
14 11841881         Electric Countershock
15 11841881               Family Practice
...
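
As for why utilMesh in the question stops after the first heading, the likely culprit is ifelse() rather than sapply(): ifelse() returns a result shaped like its test, and is.null(...) is a single TRUE or FALSE, so only the first element of the sapply() result survives. The sketch below shows this in base R only; `descriptors` and `mesh_list` are hypothetical stand-ins for the sapply() output and the MeshHeadingList node, not objects from the question.

```r
# Hypothetical stand-in for the descriptor names that sapply() would return
descriptors <- c("Cardiopulmonary Resuscitation", "Heart Arrest")

# Hypothetical stand-in for a MeshHeadingList node that is present (not NULL)
mesh_list <- list()

# ifelse() is vectorised over its test; the test has length 1,
# so the result has length 1 and only the first descriptor is kept
truncated <- ifelse(is.null(mesh_list), NA, descriptors)

# A plain if/else returns the chosen branch unchanged, so every descriptor is kept
full <- if (is.null(mesh_list)) NA else descriptors

print(truncated)
print(full)
```

If that is indeed the cause, replacing ifelse(test, NA, sapply(...)) in utilMesh with if (test) NA else sapply(...) should keep every MeshHeading.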