Parse XML in Python 3.x

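I have some XML that I want to parse. I would like to use ElementTree rather than BeautifulSoup, because I am having some issues with the way the latter handles XML.

What functions of ElementTree do I use to get the job done?

I have been trying to use .attrib, attrib.get(), .iter and .attrib[key] to get the text, but I have not succeeded in accessing the actual text.

I would like to extract the text from the following: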

<PubmedArticleSet>
   <PubmedArticle>
       <PMID Version="1">10890875</PMID>
       <Journal>
           <ISSN IssnType="Print">0143-005X</ISSN>
            <Title>Journal of epidemiology and community health</Title>
       </Journal>
       <ArticleTitle>Sources of influence on medical practice. 
       </ArticleTitle>
       <Abstract>
          <AbstractText Label="OBJECTIVES" NlmCategory="OBJECTIVE">
             To explore the opinion of general practitioners on the 
             importance and legitimacy of sources of influence on 
             medical practice.
          </AbstractText>
          <AbstractText Label="METHODS" NlmCategory="METHODS">
             General practitioners (n=723) assigned to Primary Care 
             Teams (PCTs) in two Spanish regions were randomly selected 
             to participate in this study. 
          </AbstractText>
          <AbstractText Label="RESULTS" NlmCategory="RESULTS">
The most important and legitimate sources of influence according to general practitioners were: training courses and scientific articles, designing self developed protocols and discussing with colleagues. 
          </AbstractText>
          <AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">
The development of medical practice is determined by many factors, grouped around three big areas: organisational setting, professional system and social setting. </AbstractText>
        </Abstract>
        <Language>eng</Language>
        <PublicationTypeList>
           <PublicationType UI="D016428">Journal Article 
           </PublicationType>
           <PublicationType UI="D013485">Research Support, Non-U.S.Gov't </PublicationType>
        </PublicationTypeList>
    <PubmedData>
         <PublicationStatus>ppublish</PublicationStatus>
         <ArticleIdList>
            <ArticleId IdType="pubmed">10890875</ArticleId>
            <ArticleId IdType="pmc">PMC1731730</ArticleId>
         </ArticleIdList>
     </PubmedData>
   </PubmedArticle>
</PubmedArticleSet>

The result I would like is: for each AbstractText "Label", get the text that belongs to that label.

Try the following code, which uses CSS selectors.

from bs4 import BeautifulSoup

html='''<PubmedArticleSet>
   <PubmedArticle>
       <PMID Version="1">10890875</PMID>
       <Journal>
           <ISSN IssnType="Print">0143-005X</ISSN>
            <Title>Journal of epidemiology and community health</Title>
       </Journal>
       <ArticleTitle>Sources of influence on medical practice. 
       </ArticleTitle>
       <Abstract>
          <AbstractText Label="OBJECTIVES" NlmCategory="OBJECTIVE">
             To explore the opinion of general practitioners on the 
             importance and legitimacy of sources of influence on 
             medical practice.
          </AbstractText>
          <AbstractText Label="METHODS" NlmCategory="METHODS">
             General practitioners (n=723) assigned to Primary Care 
             Teams (PCTs) in two Spanish regions were randomly selected 
             to participate in this study. 
          </AbstractText>
          <AbstractText Label="RESULTS" NlmCategory="RESULTS">
The most important and legitimate sources of influence according to general practitioners were: training courses and scientific articles, designing self developed protocols and discussing with colleagues. 
          </AbstractText>
          <AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">
The development of medical practice is determined by many factors, grouped around three big areas: organisational setting, professional system and social setting. </AbstractText>
        </Abstract>
        <Language>eng</Language>
        <PublicationTypeList>
           <PublicationType UI="D016428">Journal Article 
           </PublicationType>
           <PublicationType UI="D013485">Research Support, Non-U.S.Gov't </PublicationType>
        </PublicationTypeList>
    <PubmedData>
         <PublicationStatus>ppublish</PublicationStatus>
         <ArticleIdList>
            <ArticleId IdType="pubmed">10890875</ArticleId>
            <ArticleId IdType="pmc">PMC1731730</ArticleId>
         </ArticleIdList>
     </PubmedData>
   </PubmedArticle>
</PubmedArticleSet>'''

soup = BeautifulSoup(html, 'lxml')

maintag=soup.select_one('Abstract')
for childtag in maintag.select('AbstractText'):
    print(childtag.text.strip())

print(soup.select_one('ArticleId[IdType="pmc"]').text)

Output:

To explore the opinion of general practitioners on the 
             importance and legitimacy of sources of influence on 
             medical practice.
General practitioners (n=723) assigned to Primary Care 
             Teams (PCTs) in two Spanish regions were randomly selected 
             to participate in this study.
The most important and legitimate sources of influence according to general practitioners were: training courses and scientific articles, designing self developed protocols and discussing with colleagues.
The development of medical practice is determined by many factors, grouped around three big areas: organisational setting, professional system and social setting.
PMC1731730

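In general, I often use the .find() method to look through XML files that have been parsed with ElementTree. For whatever you find, you can then use element.text, element.attrib and element.tag to get the text, the attribute dictionary and the element name, respectively.

Combine that with list comprehensions, and it sounds like that is what you are looking for.

As an example, say you have the XML file saved as 'publications.xml':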

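import xml.etree.ElementTree as ET

filename = 'publications.xml'
content = ET.parse(filename)  # parse the file into an ElementTree
root = content.getroot()      # root element: <PubmedArticleSet>

# Iterating over the <Abstract> element yields its <AbstractText> children.
abstracts = [a.text for a in root.find('PubmedArticle/Abstract')]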

This gives you a list with the text of the 4 abstracts.
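Since the question asks for the text of each AbstractText together with its Label, the same idea can be sketched roughly as follows. This is only an illustration under a few assumptions: the XML is held in a string (`xml_text`, a name chosen here) instead of a file, the sample is shortened to two abstracts with abbreviated text, and surrounding whitespace is stripped, which the list comprehension above does not do.

```python
import xml.etree.ElementTree as ET

# Shortened, abbreviated copy of the sample from the question (illustrative only).
xml_text = """<PubmedArticleSet>
  <PubmedArticle>
    <Abstract>
      <AbstractText Label="OBJECTIVES" NlmCategory="OBJECTIVE">
        To explore the opinion of general practitioners.
      </AbstractText>
      <AbstractText Label="METHODS" NlmCategory="METHODS">
        General practitioners (n=723) were randomly selected.
      </AbstractText>
    </Abstract>
  </PubmedArticle>
</PubmedArticleSet>"""

# fromstring() returns the root element (<PubmedArticleSet>) directly.
root = ET.fromstring(xml_text)

# Map each Label attribute to the whitespace-stripped text of its element.
abstract_by_label = {
    elem.get("Label"): (elem.text or "").strip()
    for elem in root.findall("PubmedArticle/Abstract/AbstractText")
}

for label, text in abstract_by_label.items():
    print(f"{label}: {text}")
```

A dict is used here on the assumption that each Label occurs only once per abstract; a list of (label, text) tuples would be the safer choice if labels can repeat.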

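All the IDs can be accessed in a similar manner, adding a check for the correct IdType. With the method mentioned above you can similarly get a list of all elements named 'ArticleId', and then access the IdType using

element.attrib['IdType']

for each element in the given list.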
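A minimal sketch of that ID lookup might look like this. It assumes the XML is available as a string (`xml_text`, a name chosen for illustration) containing only the relevant part of the sample, and it shows two options: filtering on the attribute in Python, and using the attribute predicate from ElementTree's limited XPath support.

```python
import xml.etree.ElementTree as ET

# Only the relevant part of the sample from the question (illustrative only).
xml_text = """<PubmedArticleSet>
  <PubmedArticle>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">10890875</ArticleId>
        <ArticleId IdType="pmc">PMC1731730</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(xml_text)

# iter() walks the whole tree, so the exact nesting does not matter.
article_ids = list(root.iter("ArticleId"))

# Option 1: keep only the IDs whose IdType attribute is "pmc".
pmc_ids = [e.text for e in article_ids if e.attrib["IdType"] == "pmc"]
print(pmc_ids)

# Option 2: let the XPath attribute predicate do the filtering.
pmc = root.find(".//ArticleId[@IdType='pmc']")
print(pmc.text)
```

Note that find() returns None when nothing matches, so real code would want to check the result before reading .text.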

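For the last request, I am not entirely sure what you mean by retrieving the UI value first. If you just want to make sure that both values are retrieved, you can loop through all the elements in

root.find('PubmedArticle/PublicationTypeList')

and save both element.attrib['UI'] and element.text.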
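As a rough sketch of that loop, assuming again that the relevant part of the sample is held in a string (`xml_text`, a hypothetical name) and that trailing whitespace in the text should be stripped:

```python
import xml.etree.ElementTree as ET

# Only the relevant part of the sample from the question (illustrative only).
xml_text = """<PubmedArticleSet>
  <PubmedArticle>
    <PublicationTypeList>
      <PublicationType UI="D016428">Journal Article
      </PublicationType>
      <PublicationType UI="D013485">Research Support, Non-U.S.Gov't </PublicationType>
    </PublicationTypeList>
  </PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(xml_text)

# Iterating over <PublicationTypeList> yields its <PublicationType> children;
# store the UI attribute and the stripped text of each one as a tuple.
publication_types = [
    (elem.attrib["UI"], (elem.text or "").strip())
    for elem in root.find("PubmedArticle/PublicationTypeList")
]

for ui, name in publication_types:
    print(ui, name)
```

Keeping the pairs in a list preserves document order, so whichever value is needed "first" can simply be read from the front of each tuple.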