lxml: Get all leaf nodes?

Question

Given an XML file, is there a way to use lxml to get the names and attributes of all leaf nodes?

Here is the XML file of interest:

<?xml version="1.0" encoding="UTF-8"?>
<clinical_study>
  <!-- This xml conforms to an XML Schema at:
    http://clinicaltrials.gov/ct2/html/images/info/public.xsd
 and an XML DTD at:
    http://clinicaltrials.gov/ct2/html/images/info/public.dtd -->
  <id_info>
    <org_study_id>3370-2(-4)</org_study_id>
    <nct_id>NCT00753818</nct_id>
    <nct_alias>NCT00222157</nct_alias>
  </id_info>
  <brief_title>Developmental Effects of Infant Formula Supplemented With LCPUFA</brief_title>
  <sponsors>
    <lead_sponsor>
      <agency>Mead Johnson Nutrition</agency>
      <agency_class>Industry</agency_class>
    </lead_sponsor>
  </sponsors>
  <source>Mead Johnson Nutrition</source>
  <oversight_info>
    <authority>United States: Institutional Review Board</authority>
  </oversight_info>
  <brief_summary>
    <textblock>
      The purpose of this study is to compare the effects on visual development, growth, cognitive
      development, tolerance, and blood chemistry parameters in term infants fed one of four study
      formulas containing various levels of DHA and ARA.
    </textblock>
  </brief_summary>
  <overall_status>Completed</overall_status>
  <phase>N/A</phase>
  <study_type>Interventional</study_type>
  <study_design>N/A</study_design>
  <primary_outcome>
    <measure>visual development</measure>
  </primary_outcome>
  <secondary_outcome>
    <measure>Cognitive development</measure>
  </secondary_outcome>
  <number_of_arms>4</number_of_arms>
  <condition>Cognitive Development</condition>
  <condition>Growth</condition>
  <arm_group>
    <arm_group_label>1</arm_group_label>
    <arm_group_type>Experimental</arm_group_type>
  </arm_group>
  <arm_group>
    <arm_group_label>2</arm_group_label>
    <arm_group_type>Experimental</arm_group_type>
  </arm_group>
  <arm_group>
    <arm_group_label>3</arm_group_label>
    <arm_group_type>Experimental</arm_group_type>
  </arm_group>
  <arm_group>
    <arm_group_label>4</arm_group_label>
    <arm_group_type>Other</arm_group_type>
    <description>Control</description>
  </arm_group>
  <intervention>
    <intervention_type>Other</intervention_type>
    <intervention_name>DHA and ARA</intervention_name>
    <description>various levels of DHA and ARA</description>
    <arm_group_label>1</arm_group_label>
    <arm_group_label>2</arm_group_label>
    <arm_group_label>3</arm_group_label>
  </intervention>
  <intervention>
    <intervention_type>Other</intervention_type>
    <intervention_name>Control</intervention_name>
    <arm_group_label>4</arm_group_label>
  </intervention>
</clinical_study>

What I want is a dictionary like this:

{
   'id_info_org_study_id': '3370-2(-4)', 
   'id_info_nct_id': 'NCT00753818', 
   'id_info_nct_alias': 'NCT00222157', 
   'brief_title': 'Developmental Effects...'
}

Is this feasible with lxml, or with any other Python library?

Update:

This is what I ended up doing:

response = requests.get(url)
tree = lxml.etree.fromstring(response.content)
mydict = self._recurse_over_nodes(tree, None, {})

def _recurse_over_nodes(self, tree, parent_key, data):
    for branch in tree:
        if not isinstance(branch.tag, str):
            continue  # skip comments and processing instructions
        key = branch.tag
        if parent_key:
            key = '%s_%s' % (parent_key, key)
        if len(branch):
            # element has children: recurse with the extended key
            data = self._recurse_over_nodes(branch, key, data)
        elif key in data:
            # repeated leaf: join the values with a comma
            data[key] = data[key] + ', %s' % branch.text
        else:
            data[key] = branch.text
    return data

Answer 1

Assuming you have already called getroot(), something as simple as the following builds a dictionary like the one you expect:

import lxml.etree

tree = lxml.etree.parse('sample_ctgov.xml')
root = tree.getroot()

d = {}
for node in root:
    if not isinstance(node.tag, str):
        continue  # skip comments and processing instructions
    if len(node):
        # one level deep only: prefix each child tag with its parent tag
        for child in node:
            d[node.tag + '_' + child.tag] = child.text
    else:
        d[node.tag] = node.text

That should do the trick. It is not optimized and it does not search all child nodes recursively, but it shows you where to start.

Answer 2

Try this:

from xml.etree import ElementTree

def crawl(root, prefix='', memo=None):
    # create a fresh dict per top-level call; a mutable default would be shared between calls
    if memo is None:
        memo = {}
    new_prefix = root.tag
    if prefix:
        new_prefix = prefix + "_" + new_prefix
    for child in root:
        crawl(child, new_prefix, memo)
    if len(root) == 0:
        # leaf element: store its text under the full tag path
        memo[new_prefix] = root.text
    return memo

e = ElementTree.parse("data.xml")
nodes = crawl(e.getroot())
for k, v in nodes.items():
    print(k, v)

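crawl initially takes the root of the XML tree. It then walks all of its children (recursively), keeping track of every tag it passed through to get there (that is what the prefix is for). When it finally finds an element with no children, it saves that element's text in memo.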

Partial output:

clinical_study_intervention_intervention_name Control
clinical_study_phase N/A
clinical_study_arm_group_arm_group_type Other
clinical_study_id_info_nct_id NCT00753818
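
Note that these keys start with the root tag (clinical_study_), while the dictionary asked for in the question does not. A minimal sketch of one way to drop the root prefix, assuming the same tag-path idea: start the recursion from the root's children instead of the root itself. The name flatten_leaves and the shortened sample document are illustrative only, and repeated tags (such as condition in the full file) would still overwrite each other here.

```python
from xml.etree import ElementTree

def flatten_leaves(element, prefix='', result=None):
    """Map 'parent_child' tag paths to leaf text, without the root tag (hypothetical helper)."""
    if result is None:
        result = {}
    for child in element:
        key = prefix + '_' + child.tag if prefix else child.tag
        if len(child):
            # child has children of its own: descend with the extended key
            flatten_leaves(child, key, result)
        else:
            # leaf element: store its text (a repeated key overwrites the earlier value)
            result[key] = child.text
    return result

# shortened excerpt of the question's document, for illustration
sample = '''<clinical_study>
  <id_info>
    <org_study_id>3370-2(-4)</org_study_id>
    <nct_id>NCT00753818</nct_id>
  </id_info>
  <phase>N/A</phase>
</clinical_study>'''

flat = flatten_leaves(ElementTree.fromstring(sample))
print(flat)
```

If repeated leaves matter, the comma-joining approach from the question's update could be applied where the value is stored.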

Answer 3

Use the iter method.

http://lxml.de/api/lxml.etree._Element-class.html#iter

Here is a working example.

#!/usr/bin/python
from lxml import etree

xml='''
<book>
    <chapter id="113">

        <sentence id="1" drums='Neil'>
            <word id="128160" bass='Geddy'>
                <POS Tag="V"/>
                <grammar type="STEM"/>
                <Aspect type="IMPV"/>
                <Number type="S"/>
            </word>
            <word id="128161">
                <POS Tag="V"/>
                <grammar type="STEM"/>
                <Aspect type="IMPF"/>
            </word>
        </sentence>

        <sentence id="2">
            <word id="128162">
                <POS Tag="P"/>
                <grammar type="PREFIX"/>
                <Tag Tag="bi+"/>
            </word>
        </sentence>

    </chapter>
</book>
'''

filename='/usr/share/sri/configurations/saved/test1.xml'

if __name__ == '__main__':
    root = etree.fromstring(xml)

    # iter('*') yields every element in the document, in document order
    for node in root.iter('*'):

        # an element with no children (length zero) is a leaf node
        if len(node) == 0:
            print(node)

Here is the output:

$ ./verifyXmlWithDirs.py
<Element POS at 0x176dcf8>
<Element grammar at 0x176da70>
<Element Aspect at 0x176dc20>
<Element Number at 0x176dcf8>
<Element POS at 0x176dc20>
<Element grammar at 0x176dcf8>
<Element Aspect at 0x176da70>
<Element POS at 0x176da70>
<Element grammar at 0x176dc20>
<Element Tag at 0x176dcf8>
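
Since the question asks for the names and attributes of the leaves rather than the element objects, the same loop can collect node.tag and node.attrib. The following is only a sketch under that assumption: leaf_names_and_attributes is a made-up helper name, the XML is a cut-down version of the example above, and the import falls back to the standard library (which provides the same iter, tag and attrib calls used here) in case lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # the standard library offers the same calls used below
    import xml.etree.ElementTree as etree

xml = '''
<sentence id="1">
    <word id="128160">
        <POS Tag="V"/>
        <grammar type="STEM"/>
    </word>
</sentence>
'''

def leaf_names_and_attributes(root):
    """Return (tag, attributes) pairs for every element without children (hypothetical helper)."""
    leaves = []
    for node in root.iter('*'):
        if len(node) == 0:
            # copy the attributes into a plain dict
            leaves.append((node.tag, dict(node.attrib)))
    return leaves

root = etree.fromstring(xml)
leaves = leaf_names_and_attributes(root)
for tag, attrs in leaves:
    print(tag, attrs)
```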
