Iterate on XML tags and get elements' xpath in Python
I want to iterate over every "p" tag in an XML document and get the XPath of the current element, but I can't find anything that does this.
This is the kind of code I tried:
from bs4 import BeautifulSoup

xml_file = open("./data.xml", "rb")
soup = BeautifulSoup(xml_file, "lxml")
for i in soup.find_all("p"):
    print(i.xpath)  # xpath doesn't work here (None)
    print("\n")
This is the sample XML file I am trying to parse:
<?xml version="1.0" encoding="UTF-8"?>
<article>
  <title>Sample document</title>
  <body>
    <p>This is a <b>sample document.</b></p>
    <p>And there is another paragraph.</p>
  </body>
</article>
I would like my code to output:
/article/body/p[0]
/article/body/p[1]
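You can use lxml's getpath() to get an element's XPath. Here tree is the parsed lxml ElementTree and root is its root element:
from lxml import etree
tree = etree.parse("./data.xml")
root = tree.getroot()
result = root.xpath('//*[. = "XML"]')  # any XPath query works here, e.g. '//p'
for r in result:
    print(tree.getpath(r))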
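Note that getpath() numbers same-named siblings from 1, as XPath does, so for the question's sample it reports /article/body/p[1] and /article/body/p[2] rather than p[0] and p[1]. If lxml is not available, similar paths can be approximated with the standard library by recording each element's parent first. This is only a sketch: the helper name `xpath_of` and the inline sample string are assumptions made for illustration, not part of any library.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<article>
  <title>Sample document</title>
  <body>
    <p>This is a <b>sample document.</b></p>
    <p>And there is another paragraph.</p>
  </body>
</article>"""


def xpath_of(elem, parent_map):
    """Build an absolute path for elem, indexing same-named siblings from 1."""
    steps = []
    while elem is not None:
        parent = parent_map.get(elem)
        step = elem.tag
        if parent is not None:
            same_named = [c for c in parent if c.tag == elem.tag]
            if len(same_named) > 1:
                # XPath positions are 1-based
                position = next(i for i, c in enumerate(same_named, 1) if c is elem)
                step += f"[{position}]"
        steps.append(step)
        elem = parent
    return "/" + "/".join(reversed(steps))


root = ET.fromstring(SAMPLE)
# ElementTree elements do not know their parent, so record it up front.
parent_map = {child: parent for parent in root.iter() for child in parent}
paths = [xpath_of(p, parent_map) for p in root.iter("p")]
print("\n".join(paths))
```

The parent map costs one extra pass over the tree, which is fine for documents that fit in memory.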
You could try the following:
import io
from lxml import etree

# `xml` holds the document as bytes; this small value is only an illustration
xml = b"<a><b>one</b><b>two</b></a>"

# Simple approach: parse the whole document, then query it with XPath
doc = etree.fromstring(xml)
btags = doc.xpath('//a/b')
for b in btags:
    print(b.text)


def fast_iter(context, func, *args, **kwargs):
    """
    fast_iter is useful if you need to free memory while iterating through a
    very large XML file.
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context


def process_element(elt):
    print(elt.text)


# Streaming approach: only 'end' events for <b> elements are delivered
context = etree.iterparse(io.BytesIO(xml), events=('end',), tag='b')
fast_iter(context, process_element)
For more details, see https://newbedev.com/efficient-way-to-iterate-through-xml-elements
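The same streaming idea can be sketched with the standard library's xml.etree.ElementTree.iterparse. It has no tag= filter and its elements have no getprevious() or getparent(), so this version filters on elem.tag itself and only clears the element it has just processed. The generator name `iter_texts` and the inline sample are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

XML = b"<a><b>first</b><c>skip</c><b>second</b></a>"


def iter_texts(source, tag):
    """Yield the text of each `tag` element, clearing it afterwards."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem.text
            # The element has been fully processed, so drop its content.
            elem.clear()


texts = list(iter_texts(io.BytesIO(XML), "b"))
print(texts)
```

Cleared elements stay attached to their parent as empty shells, so this frees less memory than the lxml version, which also deletes preceding siblings.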
Here is how to do it with Python's ElementTree class.
It uses a simple list to track the iterator's current path through the XML. Whenever you need an element's XPath, call gen_xpath()
to turn that list into the element's XPath, with logic that handles same-named siblings (absolute position). Positions are 0-based to match the output requested in the question; standard XPath counts from 1.
from xml.etree import ElementTree as ET

# A list of elements pushed and popped by the iterator's start and end events
path = []


def gen_xpath():
    '''Start at the root of `path` and figure out if the next child is alone, or is one of many siblings named the same. If the next child is one of many same-named siblings, determine its position.

    Returns the full XPath up to the element the iterator is on when this function is called.
    '''
    full_path = '/' + path[0].tag
    for i, parent_elem in enumerate(path[:-1]):
        next_elem = path[i+1]
        pos = -1  # acts as counter for all children named the same as next_elem
        next_pos = None  # the position we care about
        for child_elem in parent_elem:
            if child_elem.tag == next_elem.tag:
                pos += 1
            # Compare Element identity, not equality
            if child_elem is next_elem:
                next_pos = pos
            # next_pos may legitimately be 0, so test against None explicitly
            if next_pos is not None and pos > 0:
                # We know where next_elem is, and that there are many same-named siblings, no need to count others
                break
        # Use next_elem's pos only if there are other same-named siblings
        if pos > 0:
            full_path += f'/{next_elem.tag}[{next_pos}]'
        else:
            full_path += f'/{next_elem.tag}'
    return full_path


# Iterate the XML
for event, elem in ET.iterparse('input.xml', ['start', 'end']):
    if event == 'start':
        path.append(elem)
        if elem.tag == 'p':
            print(gen_xpath())
    if event == 'end':
        path.pop()
When I run it on this modified sample XML, input.xml:
<?xml version="1.0" encoding="UTF-8"?>
<article>
  <title>Sample document</title>
  <body>
    <p>This is a <b>sample document.</b></p>
    <p>And there is another paragraph.</p>
    <section>
      <p>Parafoo</p>
    </section>
  </body>
</article>
I get:
/article/body/p[0]
/article/body/p[1]
/article/body/section/p
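A variant worth considering emits 1-based positions, which is what XPath engines expect, and counts same-named siblings as they stream past instead of inspecting the parent's children. That matters because during a start event ElementTree does not guarantee that an element's later siblings have been parsed yet. The sketch below is an assumption-laden illustration: the generator name `iter_paths` is hypothetical, the sample is inlined as bytes, and every step carries an index (even for only children) because the code cannot know in advance whether more siblings will follow.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

SAMPLE = b"""<article>
  <title>Sample document</title>
  <body>
    <p>This is a <b>sample document.</b></p>
    <p>And there is another paragraph.</p>
    <section>
      <p>Parafoo</p>
    </section>
  </body>
</article>"""


def iter_paths(source, tag):
    """Yield a 1-based absolute path for every `tag` element in source."""
    steps = []              # path steps from the root to the current element
    counters = [Counter()]  # top of the stack counts tags seen under the open element
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            counters[-1][elem.tag] += 1
            steps.append(f"{elem.tag}[{counters[-1][elem.tag]}]")
            counters.append(Counter())  # fresh counts for this element's children
            if elem.tag == tag:
                yield "/" + "/".join(steps)
        else:
            steps.pop()
            counters.pop()


paths = list(iter_paths(io.BytesIO(SAMPLE), "p"))
print("\n".join(paths))
```

For the sample this prints /article[1]/body[1]/p[1], /article[1]/body[1]/p[2] and /article[1]/body[1]/section[1]/p[1]. The always-indexed form is more verbose than the output above, but each path is unambiguous and can be computed in a single pass.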