ElementTree: a better way to search for nodes (XPath) using 'and' and 'parent'
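I need to find the ITEM tags that match two conditions and then, based on that match, get the name attribute of the parent NODE tag.
Two issues:
I can't find a way to make XPath perform an 'and', e.g.
item = node.findall('./ITEM[@name="toppas_type" and @value="output file list"]')
Getting the parent NODE's information without explicitly searching for it and saving it before looking up the ITEM, e.g.
parent_name = item.parent.attrib['name']
This is my current code: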
node_names = []
for node in tree.findall('NODE[@name="vertices"]/NODE'):
    for item in node.findall('./ITEM[@name="toppas_type"]'):
        if item.attrib['name'] == 'toppas_type' and item.attrib['value'] == 'output file list':
            node_names.append(node.attrib['name'])
...to parse a file like this (excerpt only)...
<?xml version="1.0" encoding="ISO-8859-1"?>
<PARAMETERS version="1.6.2" xsi:noNamespaceSchemaLocation="http://open-ms.sourceforge.net/schemas/Param_1_6_2.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<NODE name="vertices" description="">
<NODE name="23" description="">
<ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
<ITEM name="toppas_type" value="tool" type="string" description="" required="false" advanced="false" />
<ITEM name="tool_name" value="FileConverter" type="string" description="" required="false" advanced="false" />
<ITEM name="tool_type" value="" type="string" description="" required="false" advanced="false" />
<ITEM name="x_pos" value="-620" type="double" description="" required="false" advanced="false" />
<ITEM name="y_pos" value="-1380" type="double" description="" required="false" advanced="false" />
</NODE>
<NODE name="24" description="">
<ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
<ITEM name="toppas_type" value="output file list" type="string" description="" required="false" advanced="false" />
<ITEM name="x_pos" value="-440" type="double" description="" required="false" advanced="false" />
<ITEM name="y_pos" value="-1480" type="double" description="" required="false" advanced="false" />
<ITEM name="output_folder_name" value="" type="string" description="" required="false" advanced="false" />
</NODE>
<NODE name="33" description="">
<ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
<ITEM name="toppas_type" value="merger" type="string" description="" required="false" advanced="false" />
<ITEM name="x_pos" value="-620" type="double" description="" required="false" advanced="false" />
<ITEM name="y_pos" value="-1540" type="double" description="" required="false" advanced="false" />
<ITEM name="round_based" value="false" type="string" description="" required="false" advanced="false" />
</NODE>
<!--(snip)-->
</NODE>
</PARAMETERS>
Update:
@Mathias Müller
Great suggestion. Unfortunately, I get an error when I try to load the XML file. I'm not familiar with lxml, so I'm not sure I'm using it correctly.
from lxml import etree
root = etree.DTD("/Users/mikes/Documents/Eclipseworkspace/Bioproximity/Assay-Workflows-Mikes/protein_lfq/protein_lfq-1.1.2.toppas")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "src/lxml/dtd.pxi", line 294, in lxml.etree.DTD.__init__ (src/lxml/lxml.etree.c:187024)
lxml.etree.DTDParseError: Content error in the external subset, line 2, column 1
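Unfortunately, ElementTree does not accept this XPath in tree.find(xpath) or tree.findall(xpath); it supports only a limited subset of XPath.
Perhaps you do not need the nested loops at all and a single XPath expression is enough. I am not quite sure what you want the final result to be, but here is an example with lxml: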
>>> import lxml.etree
>>> s = '''<NODE name="vertices" description="">
...
... <NODE name="23" description="">
... <ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
... <ITEM name="toppas_type" value="tool" type="string" description="" required="false" advanced="false" />
... <ITEM name="tool_name" value="FileConverter" type="string" description="" required="false" advanced="false" />
... <ITEM name="tool_type" value="" type="string" description="" required="false" advanced="false" />
... <ITEM name="x_pos" value="-620" type="double" description="" required="false" advanced="false" />
... <ITEM name="y_pos" value="-1380" type="double" description="" required="false" advanced="false" />
... </NODE>
...
... <NODE name="24" description="">
... <ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
... <ITEM name="toppas_type" value="output file list" type="string" description="" required="false" advanced="false" />
... <ITEM name="x_pos" value="-440" type="double" description="" required="false" advanced="false" />
... <ITEM name="y_pos" value="-1480" type="double" description="" required="false" advanced="false" />
... <ITEM name="output_folder_name" value="" type="string" description="" required="false" advanced="false" />
... </NODE>
...
... <NODE name="33" description="">
... <ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
... <ITEM name="toppas_type" value="merger" type="string" description="" required="false" advanced="false" />
... <ITEM name="x_pos" value="-620" type="double" description="" required="false" advanced="false" />
... <ITEM name="y_pos" value="-1540" type="double" description="" required="false" advanced="false" />
... <ITEM name="round_based" value="false" type="string" description="" required="false" advanced="false" />
... </NODE>
... <!--(snip)-->
... </NODE>'''
>>> root = lxml.etree.fromstring(s)
>>> root.xpath('/NODE[@name="vertices"]/NODE/ITEM[@name = "toppas_type" and @value = "output file list"]')
[<Element ITEM at 0x102b5f788>]
If you really need the name of the parent element, you can step up to the parent node with '..':
>>> root.xpath('/NODE[@name="vertices"]/NODE/ITEM[@name = "toppas_type" and @value = "output file list"]/../@name')
['24']
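As an aside, if switching to lxml is not an option, the standard library's xml.etree.ElementTree may get close. Its limited XPath subset has no 'and', but predicates can be chained, and '..' selects the parent element. The following is a minimal sketch, assuming a trimmed stand-in for the file from the question; the SAMPLE string and the variable names are illustrative and not part of the original post.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the .toppas file from the question (illustrative only).
SAMPLE = """<PARAMETERS version="1.6.2">
  <NODE name="vertices" description="">
    <NODE name="23" description="">
      <ITEM name="toppas_type" value="tool" type="string" />
    </NODE>
    <NODE name="24" description="">
      <ITEM name="toppas_type" value="output file list" type="string" />
    </NODE>
    <NODE name="33" description="">
      <ITEM name="toppas_type" value="merger" type="string" />
    </NODE>
  </NODE>
</PARAMETERS>"""

root = ET.fromstring(SAMPLE)  # root is the PARAMETERS element

# Two chained predicates behave like 'and'; the trailing '..' steps up
# from the matching ITEM to its parent NODE.
path = ('./NODE[@name="vertices"]/NODE'
        '/ITEM[@name="toppas_type"][@value="output file list"]/..')
node_names = [node.get('name') for node in root.findall(path)]
print(node_names)  # ['24']
```

Note that ElementTree paths are relative to the element they are called on, so the path starts below PARAMETERS rather than with '/PARAMETERS'.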
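Parsing an XML document from a file
If you want to parse an XML document from a file, the function etree.DTD is the wrong choice. A DTD is not an XML document. Here is how to do it with lxml: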
>>> import lxml.etree
>>> root = lxml.etree.parse("example.xml")
>>> root
<lxml.etree._ElementTree object at 0x106593b00>
Second update
If the outermost element is PARAMETERS, you need to search like this:
>>> root.xpath('/PARAMETERS/NODE[@name="vertices"]/NODE/ITEM[@name = "toppas_type" and @value = "output file list"]')
[<Element ITEM at 0x106593e18>]
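Regarding the item.parent idea from the question: elements in the standard library's ElementTree do not keep a reference to their parent (lxml elements offer getparent() for this). One common workaround is to build a child-to-parent dictionary once and look parents up in it. This is only a sketch under assumptions: SAMPLE is a trimmed, illustrative copy of the document, and parent_of is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the .toppas file from the question (illustrative only).
SAMPLE = """<PARAMETERS version="1.6.2">
  <NODE name="vertices" description="">
    <NODE name="23" description="">
      <ITEM name="toppas_type" value="tool" type="string" />
    </NODE>
    <NODE name="24" description="">
      <ITEM name="toppas_type" value="output file list" type="string" />
    </NODE>
  </NODE>
</PARAMETERS>"""

root = ET.fromstring(SAMPLE)

# Map every element to its parent in a single pass over the tree.
parent_of = {child: parent for parent in root.iter() for child in parent}

node_names = []
for item in root.findall('NODE[@name="vertices"]/NODE/ITEM'):
    # Same two conditions as in the question's loop.
    if item.get('name') == 'toppas_type' and item.get('value') == 'output file list':
        node_names.append(parent_of[item].get('name'))

print(node_names)  # ['24']
```

The dictionary costs one extra traversal, which is usually acceptable for parameter files of this size; it has to be rebuilt if the tree is modified.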