Sapply Stopping after the First Instance when Parsing XML

Question

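<- Updated for completeness (thanks to hrbrmstr for pointing this out) ->

I am trying to extract some data from PubMed, and I have been working through the example here (relevant diagram here). An edited version of my data looks like this: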

<PubmedArticleSet>
   <PubmedArticle>
      <MedlineCitation Owner="NLM" Status="MEDLINE">
         <PMID Version="1">11841882</PMID>
         <Article PubModel="Print">
            <PublicationTypeList>
               <PublicationType UI="D002363">Case Reports</PublicationType>
               <PublicationType UI="D016428">Journal Article</PublicationType>
            </PublicationTypeList>
         </Article>
         <MeshHeadingList>
            <MeshHeading>
               <DescriptorName MajorTopicYN="N" UI="D016887">Cardiopulmonary Resuscitation</DescriptorName>
            </MeshHeading>
            <MeshHeading>
               <DescriptorName MajorTopicYN="N" UI="D006323">Heart Arrest</DescriptorName>
               <QualifierName MajorTopicYN="Y" UI="Q000188">drug therapy</QualifierName>
               <QualifierName MajorTopicYN="N" UI="Q000401">mortality</QualifierName>
               <QualifierName MajorTopicYN="N" UI="Q000628">therapy</QualifierName>
            </MeshHeading>
         </MeshHeadingList>
      </MedlineCitation>       
   </PubmedArticle>

   <PubmedArticle>
      <MedlineCitation Owner="NLM" Status="MEDLINE">
         <PMID Version="1">11841881</PMID>
         <Article PubModel="Print">
            <PublicationTypeList>
               <PublicationType UI="D016428">Journal Article</PublicationType>
            </PublicationTypeList>
         </Article>
         <MeshHeadingList>
            <MeshHeading>
               <DescriptorName MajorTopicYN="N" UI="D000368">Aged</DescriptorName>
            </MeshHeading>
            <MeshHeading>
               <DescriptorName MajorTopicYN="N" UI="D016887">Cardiopulmonary Resuscitation</DescriptorName>
            </MeshHeading>
         </MeshHeadingList>
      </MedlineCitation>
   </PubmedArticle>
</PubmedArticleSet>

So far, I have been able to extract the PublicationTypes nicely using the following code (please run the top part of the code at the end of this post first):

utilAtype <- function(x){
        PMID <- xmlValue(x[[1]][[1]])
        PublicationType <- sapply(xmlChildren(x[["Article"]][["PublicationTypeList"]], omitNodeTypes = "XMLInternalTextNode"), xmlValue)
        data.frame(PMID = PMID, PublicationType=PublicationType, stringsAsFactors = FALSE)
}

PMIDAType <- xpathApply(hdisease, '//MedlineCitation', utilAtype)
PMIDAType <-do.call(rbind, PMIDAType)

PMID      PublicationType
11841882  Case Reports
11841882  Journal Article
11841881  Journal Article

However, using a similar approach on the MeshHeadings causes sapply to skip the remaining child nodes, as shown below:

PMID      LName
11841882  Cardiopulmonary Resuscitation
-other entries for 11841882 are missing-
11841881  Aged

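I would be grateful if someone could enlighten me. The way it is done in the example suggests that this approach should work without problems. Please see the code below for reference.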

require("XML")
xmlfile=xmlParse("file.xml", useInternalNodes = TRUE)
hdisease = xmlRoot(xmlfile)

utilMesh <- function(x){
        PMID <- xmlValue(x[[1]][[1]])
        MHead <- ifelse(is.null(x[["MeshHeadingList"]]), NA, 
                sapply(xmlChildren(x[["MeshHeadingList"]], omitNodeTypes = "XMLInternalTextNode"), function(z) xmlValue(z[["DescriptorName"]])))
        data.frame(PMID = PMID, MHead=MHead, stringsAsFactors = FALSE)
    }

PMIDMesh <- xpathApply(hdisease, '//MedlineCitation', utilMesh)
PMIDMesh<-do.call(rbind, PMIDMesh)

c<-nrow(PMIDMesh)
row.names(PMIDMesh) <- 1:c
nrow(table(PMIDMesh))

write.csv(PMIDMesh,"Mesh1.csv")

Answer 1

I would switch to XPath instead, maybe...

library(XML)      # provides xmlParse, getNodeSet, xpathSApply, xmlValue
library(rentrez)
x <- entrez_fetch("pubmed", "xml", id=c(11841882,11841881))
doc <- xmlParse(x)
pubs <- getNodeSet(doc, "//PubmedArticle")

y <- lapply(pubs, function(x) data.frame(
     pmid = xpathSApply(x, ".//MedlineCitation/PMID", xmlValue),
     mesh =  xpathSApply(x, ".//MeshHeading/DescriptorName", xmlValue)) )

do.call("rbind", y)

       pmid                          mesh
1  11841882 Cardiopulmonary Resuscitation
2  11841882              Child, Preschool
3  11841882                        Female
4  11841882                  Heart Arrest
5  11841882                        Humans
6  11841882                        Infant
7  11841882                          Male
8  11841882         Retrospective Studies
9  11841882                  Time Factors
10 11841882        Vasoconstrictor Agents
11 11841882                  Vasopressins
12 11841881                          Aged
13 11841881 Cardiopulmonary Resuscitation
14 11841881         Electric Countershock
15 11841881               Family Practice
...
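
As for why the original utilMesh keeps only one heading per PMID: a likely cause is the ifelse() call rather than sapply itself. ifelse() returns a result with the same length as its test argument, and is.null(...) always yields a single TRUE or FALSE, so only the first element of the sapply result survives. A minimal base R sketch of that behaviour follows; the vector headings is a made-up stand-in for what sapply would return for one article, and truncated and complete are illustrative names only.

```r
# Hypothetical stand-in for the sapply() result for a single article.
headings <- c("Cardiopulmonary Resuscitation", "Heart Arrest", "Humans")

# ifelse() is vectorised over its test; a length-one test gives a length-one result.
truncated <- ifelse(is.null(headings), NA, headings)

# A plain if/else evaluates the condition once and returns the whole vector.
complete <- if (is.null(headings)) NA else headings

print(truncated)
print(length(complete))
```

If that is indeed what is happening, replacing ifelse(test, NA, sapply(...)) in utilMesh with an ordinary if (test) NA else sapply(...) should keep every MeshHeading while still returning NA for articles without a MeshHeadingList.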


xml

r

sapply