lxml: Get all leaf nodes?
Given an XML file, is there a way to use lxml to get the names and attributes of all leaf nodes?
Here is the XML file of interest:
<?xml version="1.0" encoding="UTF-8"?>
<clinical_study>
<!-- This xml conforms to an XML Schema at:
http://clinicaltrials.gov/ct2/html/images/info/public.xsd
and an XML DTD at:
http://clinicaltrials.gov/ct2/html/images/info/public.dtd -->
<id_info>
<org_study_id>3370-2(-4)</org_study_id>
<nct_id>NCT00753818</nct_id>
<nct_alias>NCT00222157</nct_alias>
</id_info>
<brief_title>Developmental Effects of Infant Formula Supplemented With LCPUFA</brief_title>
<sponsors>
<lead_sponsor>
<agency>Mead Johnson Nutrition</agency>
<agency_class>Industry</agency_class>
</lead_sponsor>
</sponsors>
<source>Mead Johnson Nutrition</source>
<oversight_info>
<authority>United States: Institutional Review Board</authority>
</oversight_info>
<brief_summary>
<textblock>
The purpose of this study is to compare the effects on visual development, growth, cognitive
development, tolerance, and blood chemistry parameters in term infants fed one of four study
formulas containing various levels of DHA and ARA.
</textblock>
</brief_summary>
<overall_status>Completed</overall_status>
<phase>N/A</phase>
<study_type>Interventional</study_type>
<study_design>N/A</study_design>
<primary_outcome>
<measure>visual development</measure>
</primary_outcome>
<secondary_outcome>
<measure>Cognitive development</measure>
</secondary_outcome>
<number_of_arms>4</number_of_arms>
<condition>Cognitive Development</condition>
<condition>Growth</condition>
<arm_group>
<arm_group_label>1</arm_group_label>
<arm_group_type>Experimental</arm_group_type>
</arm_group>
<arm_group>
<arm_group_label>2</arm_group_label>
<arm_group_type>Experimental</arm_group_type>
</arm_group>
<arm_group>
<arm_group_label>3</arm_group_label>
<arm_group_type>Experimental</arm_group_type>
</arm_group>
<arm_group>
<arm_group_label>4</arm_group_label>
<arm_group_type>Other</arm_group_type>
<description>Control</description>
</arm_group>
<intervention>
<intervention_type>Other</intervention_type>
<intervention_name>DHA and ARA</intervention_name>
<description>various levels of DHA and ARA</description>
<arm_group_label>1</arm_group_label>
<arm_group_label>2</arm_group_label>
<arm_group_label>3</arm_group_label>
</intervention>
<intervention>
<intervention_type>Other</intervention_type>
<intervention_name>Control</intervention_name>
<arm_group_label>4</arm_group_label>
</intervention>
</clinical_study>
What I want is a dictionary like this:
{
'id_info_org_study_id': '3370-2(-4)',
'id_info_nct_id': 'NCT00753818',
'id_info_nct_alias': 'NCT00222157',
'brief_title': 'Developmental Effects...'
}
Is this possible with lxml or any other Python library?
UPDATE:
I ended up doing it this way:
    response = requests.get(url)
    tree = lxml.etree.fromstring(response.content)
    mydict = self._recurse_over_nodes(tree, None, {})

    def _recurse_over_nodes(self, tree, parent_key, data):
        for branch in tree:
            key = branch.tag
            # len(branch) is the number of child elements
            if len(branch):
                if parent_key:
                    key = '%s_%s' % (parent_key, key)
                data = self._recurse_over_nodes(branch, key, data)
            else:
                if parent_key:
                    key = '%s_%s' % (parent_key, key)
                if key in data:
                    # repeated leaf tags are joined into one comma-separated string
                    data[key] = data[key] + ', %s' % branch.text
                else:
                    data[key] = branch.text
        return data
Assuming you have already called getroot(), something as simple as the following can build a dictionary of the kind you expect:
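    import lxml.etree

    tree = lxml.etree.parse('sample_ctgov.xml')
    root = tree.getroot()

    d = {}
    for node in root:
        # lxml also yields comments here; their .tag is not a string
        if not isinstance(node.tag, str):
            continue
        key = node.tag
        if len(node):
            for child in node:
                # build each child's key from the parent tag, without accumulating siblings
                d[key + '_' + child.tag] = child.text
        else:
            d[key] = node.text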
That should do the trick. It is not optimized and does not search all child nodes recursively, but it shows you where to start.
Try this:
    from xml.etree import ElementTree

    def crawl(root, prefix='', memo=None):
        # create a fresh dict per top-level call; a mutable default would be shared between calls
        if memo is None:
            memo = {}
        new_prefix = root.tag
        if len(prefix) > 0:
            new_prefix = prefix + "_" + new_prefix
        for child in root:
            crawl(child, new_prefix, memo)
        # an element with no children is a leaf
        if len(root) == 0:
            memo[new_prefix] = root.text
        return memo

    e = ElementTree.parse("data.xml")
    nodes = crawl(e.getroot())
    for k, v in nodes.items():
        print(k, v)
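crawl initially takes the root of the XML tree. It then walks all of its children recursively, keeping track of every tag it passed through to get there (that is what the prefix is for). When it finally finds an element with no children, it saves the data in memo.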
Partial output:
    clinical_study_intervention_intervention_name Control
    clinical_study_phase N/A
    clinical_study_arm_group_arm_group_type Other
    clinical_study_id_info_nct_id NCT00753818
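One thing to keep in mind with a plain dictionary is that repeated leaf paths (for example the two condition elements in the question's file) overwrite each other. Below is a minimal sketch of one way around that, collecting the texts of each path into a list. The function name collect_leaves and the trimmed inline sample are assumptions made for illustration, not part of the original answer.

```python
from collections import defaultdict
from xml.etree import ElementTree

# Trimmed sample for illustration only (an assumption, not the full file).
SAMPLE = """<clinical_study>
  <id_info>
    <nct_id>NCT00753818</nct_id>
  </id_info>
  <condition>Cognitive Development</condition>
  <condition>Growth</condition>
</clinical_study>"""


def collect_leaves(element, prefix="", out=None):
    """Map each underscore-joined leaf path to the list of texts found there."""
    if out is None:
        out = defaultdict(list)
    path = f"{prefix}_{element.tag}" if prefix else element.tag
    if len(element) == 0:
        # No child elements: this is a leaf, so record its (stripped) text.
        out[path].append((element.text or "").strip())
    for child in element:
        collect_leaves(child, path, out)
    return out


root = ElementTree.fromstring(SAMPLE)
leaves = dict(collect_leaves(root))
for path, texts in leaves.items():
    print(path, texts)
```

Whether to keep lists, join the values into one string (as in the question's update), or keep only the last value depends on how the dictionary is used afterwards.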
Use the iter method.
http://lxml.de/api/lxml.etree._Element-class.html#iter
Here is a working example.
#!/usr/bin/python
from lxml import etree
xml='''
<book>
<chapter id="113">
<sentence id="1" drums='Neil'>
<word id="128160" bass='Geddy'>
<POS Tag="V"/>
<grammar type="STEM"/>
<Aspect type="IMPV"/>
<Number type="S"/>
</word>
<word id="128161">
<POS Tag="V"/>
<grammar type="STEM"/>
<Aspect type="IMPF"/>
</word>
</sentence>
<sentence id="2">
<word id="128162">
<POS Tag="P"/>
<grammar type="PREFIX"/>
<Tag Tag="bi+"/>
</word>
</sentence>
</chapter>
</book>
'''
filename='/usr/share/sri/configurations/saved/test1.xml'
if __name__ == '__main__':
    root = etree.fromstring(xml)
    # iter('*') yields every element in the document
    for node in root.iter('*'):
        # elements of length zero have no children, so they are leaf nodes
        if 0 == len(node):
            print(node)
Here is the output:
$ ./verifyXmlWithDirs.py
<Element POS at 0x176dcf8>
<Element grammar at 0x176da70>
<Element Aspect at 0x176dc20>
<Element Number at 0x176dcf8>
<Element POS at 0x176dc20>
<Element grammar at 0x176dcf8>
<Element Aspect at 0x176da70>
<Element POS at 0x176da70>
<Element grammar at 0x176dc20>
<Element Tag at 0x176dcf8>
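The output above shows the element objects themselves, while the question asked for names and attributes. As a rough sketch of that last step, each leaf's tag and attrib can be read directly. This version uses the standard library's xml.etree.ElementTree so it runs without lxml installed, and the shortened XML string and the leaf_info name are assumptions for illustration. With lxml, root.iter('*') as in the answer should serve the same purpose.

```python
from xml.etree import ElementTree

# Shortened version of the answer's sample document (illustration only).
XML = """<book>
  <chapter id="113">
    <sentence id="2">
      <word id="128162">
        <POS Tag="P"/>
        <grammar type="PREFIX"/>
        <Tag Tag="bi+"/>
      </word>
    </sentence>
  </chapter>
</book>"""

root = ElementTree.fromstring(XML)

# Keep only elements without child elements; record tag name and attributes.
leaf_info = [(node.tag, dict(node.attrib)) for node in root.iter() if len(node) == 0]

for tag, attrs in leaf_info:
    print(tag, attrs)
```

The leaves come out in document order, so the list can be turned into whatever structure is needed afterwards.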