Python lxml not reading XML properly
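I'm using Python 2.7 (unfortunately I can't upgrade to a newer version) and I'm trying to parse two XML files with lxml, but something is off and I'm not sure what I'm doing wrong:
Code: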
from lxml import etree as ET
def string_to_lxml(string):
    xml_file = bytes(bytearray(string, encoding='utf-8'))
    return ET.XML(xml_file)
def find_all(tag, atr):
    return tag.xpath("//%s" % atr)
xml_str_1 = """<?xml version="1.0" encoding="UTF-8"?>
<A xmlns="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.0">
<B name="SOME_NAME_0">
<C/>
<D>SOME NAME</D>
<AA>
<dir name="include" filters="*.h *.hpp *.tpp *.i"/>
</AA>
<H>
<TAG_1 name="main" default="true"/>
</H>
</B>
<TT>
<GG>
<FF configs="main">
<TAG_2 name="NAME_1"/>
<TAG_2 name="NAME_2"/>
<TAG_3 name="NAME_3"/>
<TAG_3 name="NAME_4"/>
<TAG_3 name="NAME_5"/>
</FF>
</GG>
</TT>
</A>"""
xml_str_2 = """<?xml version='1.0' encoding='UTF-8'?>
<A xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://obe.nce.amadeus.net/bms/metadata/1-0/">
<B name="NAME" version="VERSION">
<AA>SOME NAME</AA>
<CC>SOME OTHER NAME</CC>
</B>
<C>
<TAG_3 name="NAME_1" path="path_1"/>
<TAG_3 name="NAME_2" path="path_2"/>
<TAG_3 name="NAME_3" path="path_3"/>
</C>
<D>
<TAG_3 type="type" name="NAME_1" version="version_1"/>
<TAG_3 type="type" name="NAME_2" version="version_2"/>
<TAG_3 type="type" name="NAME_3" version="version_3"/>
</D>
</A>
"""
root = string_to_lxml(xml_str_1)
print(find_all(root, "TAG_3"))
root = string_to_lxml(xml_str_2)
print(find_all(root, "TAG_3"))
Output:
[]
[<Element TAG_3 at 0x7f257c126640>, <Element TAG_3 at 0x7f257c126be0>, <Element TAG_3 at 0x7f257c126b90>, <Element TAG_3 at 0x7f257c126e10>, <Element TAG_3 at 0x7f257c128730>, <Element TAG_3 at 0x7f257c128640>]
Am I parsing the XML the wrong way?
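The first XML declares a default (unprefixed) namespace that has to be taken into account:
xmlns="http://www.w3.org/2001/XMLSchema-instance"
Every unprefixed element in that document, including TAG_3, therefore belongs to this namespace, while an unprefixed name in an XPath 1.0 expression such as //TAG_3 only matches elements in no namespace. To handle this, the XPath expression can compare only the local name: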
def find_all(tag, atr):
    return tag.xpath("//*[local-name()= '%s']" % atr)
Result:
[<Element {http://www.w3.org/2001/XMLSchema-instance}TAG_3 at 0x7f39cf73de88>, <Element {http://www.w3.org/2001/XMLSchema-instance}TAG_3 at 0x7f39cf73df88>, <Element {http://www.w3.org/2001/XMLSchema-instance}TAG_3 at 0x7f39cf73dfc8>]
[<Element TAG_3 at 0x7f39cf73df88>, <Element TAG_3 at 0x7f39cf73dfc8>, <Element TAG_3 at 0x7f39cf73dec8>, <Element TAG_3 at 0x7f39cf762048>, <Element TAG_3 at 0x7f39cf762088>, <Element TAG_3 at 0x7f39cf762108>]
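If matching on the local name alone is too loose (it accepts TAG_3 in any namespace), one alternative is to bind the namespace URI to a prefix and pass it to xpath(). The following is only a sketch: the helper name find_all_ns, the prefix "ns", and the shortened XML_1 sample are assumptions made for illustration, not part of the original code. It should run unchanged on Python 2.7 and 3.

```python
from lxml import etree as ET

# Namespace URI taken from the default xmlns declaration in the first document.
NS_URI = "http://www.w3.org/2001/XMLSchema-instance"

# Shortened copy of the first document (illustrative). A bytes literal is used
# because lxml rejects unicode strings that carry an encoding declaration.
XML_1 = b"""<?xml version="1.0" encoding="UTF-8"?>
<A xmlns="http://www.w3.org/2001/XMLSchema-instance" version="1.0">
  <TT><GG><FF configs="main">
    <TAG_2 name="NAME_1"/>
    <TAG_3 name="NAME_3"/>
    <TAG_3 name="NAME_4"/>
    <TAG_3 name="NAME_5"/>
  </FF></GG></TT>
</A>"""


def find_all_ns(node, name, uri=NS_URI):
    """Return every element in node's document called `name` in namespace `uri`."""
    # "ns" is an arbitrary prefix that exists only for this query.
    return node.xpath("//ns:%s" % name, namespaces={"ns": uri})


root = ET.XML(XML_1)

# Namespace-qualified query: finds the three TAG_3 elements.
names = [el.get("name") for el in find_all_ns(root, "TAG_3")]
print(names)  # ['NAME_3', 'NAME_4', 'NAME_5']

# Unqualified query: matches nothing, as in the question.
print(root.xpath("//TAG_3"))  # []

# Same lookup without XPath, using Clark notation ({uri}localname) and iter().
clark = [el.get("name") for el in root.iter("{%s}TAG_3" % NS_URI)]
```

Note that the second document declares no default namespace, so there the original "//TAG_3" query is the one that matches. The local-name() version above is the variant that handles both files with a single expression.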